Richard Sutton has escalated his critique of the current AI focus, arguing that the industry’s fixation on LLMs misses the real frontier: continual reinforcement-learning agents trained on their own experience. On April 26, 2025, Sutton and collaborator David Silver outlined an “era of experience,” warning that human-generated text data is plateauing and that scaling LLMs alone will stall substantive progress in math, coding, and science tasks [4]. Ten days later, Sutton called LLMs a “momentary fixation” and forecast a shift to experience-first AI [1].
Key Takeaways
– Shows two 2025 releases, April 26 and May 6, in which Sutton and Silver predict LLM plateaus and urge a pivot to experience-driven RL agents over scaling alone.
– Reveals Sutton’s 2025 view that human-generated datasets are nearing their limits, with major progress requiring continuous, agent-generated experience rather than more internet-scale text.
– Demonstrates May 23, 2025 arXiv evidence across two games reporting better generalization without task prompts, supporting bottom-up skill evolution over prompt workflows.
– Indicates Sutton’s post-DeepMind push for the Alberta Plan, warning that capital concentration in LLMs dilutes foundational RL research priorities and slows true understanding of intelligence.
– Suggests LLM hallucinations fuel public gullibility; Sutton calls doom fears “out of line,” urging training and policy responses to job shifts instead of apocalyptic narratives.
Why the Sutton LLMs critique is gaining momentum
Sutton’s core contention is that the industry’s singular focus on language models is a strategic misallocation. In a May 6, 2025 conversation, he described LLMs as a “momentary fixation,” arguing that AI’s next phase will be defined by agents that learn from interaction, not static corpora [1]. This sentiment is reinforced by his April 26, 2025 essay with David Silver, which predicts LLM performance will plateau in math, coding, and science if trained solely on human text [4].
Critically, Sutton is not dismissing LLMs as useless—he’s characterizing their current, corpus-bound paradigm as insufficient for fundamental progress. He frames the breakthrough path as continuous learning from agent-generated experience, a method that scales with time and interaction rather than with finite human-written data [4]. The thrust is that intelligence deepens in environments where agents can accumulate, compress, and refine their own data distributions [1].
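To make the contrast concrete, here is a minimal, hypothetical sketch of the loop this describes: an agent that generates its own training data by acting, rather than consuming a fixed corpus. Nothing below comes from Sutton or Silver; the environment, policy, and update rule are illustrative stand-ins.

```python
# Minimal sketch of an experience-first learning loop (illustrative only).
# The environment, policy, and update rule are hypothetical stand-ins for the
# general pattern: act -> observe -> store experience -> update -> repeat.

import random
from collections import deque

class ToyEnvironment:
    """Stand-in environment: reward favors one of two actions."""
    def reset(self):
        self.state = 0
        return self.state
    def step(self, action):
        self.state += action
        reward = 1.0 if action == 1 else 0.0
        return self.state, reward

class ExperienceAgent:
    """Agent that improves only from the data it generates itself."""
    def __init__(self):
        self.q = {0: 0.0, 1: 0.0}           # value estimate per action
        self.replay = deque(maxlen=10_000)  # self-generated experience

    def act(self, epsilon=0.1):
        if random.random() < epsilon:
            return random.choice([0, 1])     # explore
        return max(self.q, key=self.q.get)   # exploit current knowledge

    def update(self, action, reward, lr=0.05):
        self.q[action] += lr * (reward - self.q[action])

env, agent = ToyEnvironment(), ExperienceAgent()
env.reset()
for step in range(10_000):                   # the loop never needs a fixed corpus:
    action = agent.act()                     # more interaction yields more data
    _, reward = env.step(action)
    agent.replay.append((action, reward))    # experience accumulates over time
    agent.update(action, reward)
print(agent.q)                               # values learned purely from interaction
```

The point of the sketch is the shape of the loop, not the toy task: the data the agent learns from grows with every step it takes, which is the scaling property the essay emphasizes.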
The timing of these interventions matters. Sutton and Silver’s April 26 essay was followed by a public interview on May 6, and then, 17 days later, a supporting experimental paper appeared on May 23 emphasizing bottom-up skill emergence in agents [5]. In the span of 27 days, the narrative moved from thesis to thesis backed by experimental evidence, giving the critique operational credibility beyond opinion pieces [4].
Sutton’s authority in reinforcement learning also amplifies the message. Fresh off a 2024 Turing Award, he has both historical and technical legitimacy to argue that the field should reorient around continual RL rather than doubling down on static-data LLM scaling [3]. That pedigree frames his critique as a call for discipline-level course correction, not a contrarian take [3].
From LLM scaling to an era of experience
In their April 26 essay, Silver and Sutton argue that LLMs trained on human-generated data are running up against the ceiling of that data source for domains requiring rigorous reasoning and exploration [4]. The claim is not that language or tokens are irrelevant, but that the statistical regularities of human text have diminishing returns for mastering complex, interactive problem spaces [4].
They project an “era of experience” in which agent-generated interaction data surpasses static human datasets in both volume and relevance for learning, creating a feedback loop: more interaction begets more useful data, which improves agents and accelerates further data generation [4]. That dynamic is distinct from web crawling or curated corpora because it’s tailored to the tasks and environments agents must master [4].
Sutton reiterated on May 6 that major progress requires agents learning continuously—gathering, evaluating, and reusing their experiences over time [1]. He contends that organizations benchmarked primarily on LLM outputs risk missing the longer arc of progress, which hinges on designing systems that grow their own data and skills across months and years of interaction [1]. It’s a call to reweight research portfolios toward continual RL infrastructure and away from one-off model scaling.
The argument also anticipates a cultural shift: success metrics moving from leaderboard snapshots to longitudinal competence growth. In this framing, agents that can reflect, update policies, and carry learning across tasks and environments will outpace prompt-bound LLM workflows [1]. Silver and Sutton’s thesis places the locus of innovation in agent design and learning loops, not just in parameter counts [4].
Evidence from games: bottom-up skills without prompts
Early empirical backing for this thesis arrived May 23, 2025, with “Rethinking Agent Design,” which tested agents in Civilization V and Slay the Spire [5]. The paper reports autonomous skill acquisition without task-specific prompts, arguing that experience-driven agents generalize better across tasks than top-down, prompt-engineered workflows [5]. While games are sandboxes, they provide controlled, measurable evidence of bottom-up competence emerging from interaction [5].
The result matters for the Sutton LLMs debate because it demonstrates a concrete regime where experience-first methods deliver generalization with less reliance on handcrafted prompts [5]. By subordinating prompt design to learned behavior, the experiments echo the essay’s forecast that continual, interaction-derived data will unlock broader capabilities than static text alone can support [4].
Crucially, the research logic shifts experimental burden from prompt curation to environment design and reward shaping—classic RL territory [5]. That resonates with Sutton’s long-standing emphasis on agents that learn over time, and it offers a reproducible path for researchers to probe generalization without tethering progress to ever-larger text corpora [1].
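As an illustration of what “reward shaping” means in practice, here is a hedged sketch of potential-based shaping, a standard RL technique (not a method reported in [5]): a sparse task reward is augmented with a bonus derived from a progress heuristic. The state format and potential function are hypothetical.

```python
# Illustrative potential-based reward shaping (standard RL technique, not from [5]).
# The state dictionary and potential heuristic are hypothetical examples.

def potential(state):
    """Heuristic progress estimate; in a game this might be map explored or HP lead."""
    return float(state.get("progress", 0.0))

def shaped_reward(task_reward, prev_state, next_state, gamma=0.99):
    """Add a potential-based bonus: gamma * phi(s') - phi(s).
    Potential-based shaping densifies feedback without changing the optimal policy."""
    return task_reward + gamma * potential(next_state) - potential(prev_state)

# Example: the agent made progress even though the sparse task reward is still zero.
prev = {"progress": 0.2}
nxt = {"progress": 0.5}
print(shaped_reward(0.0, prev, nxt))  # small positive signal guides learning
```

Designing that potential function and the environment it measures is exactly the experimental burden the paper shifts onto researchers in place of prompt curation.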
Funding, policy, and public risk: Sutton’s 2025 view
Sutton’s criticism of the industry’s LLM-first default is also about incentives. In a 2025 interview after leaving DeepMind, he argued that capital concentration around LLMs dilutes foundational RL research and obstructs a deeper understanding of intelligence [2]. He favored the Alberta Plan, a push for continual RL research ecosystems, as a way to align funding and talent with the experience-driven trajectory he and Silver outlined [2].
This is not a wholesale repudiation of LLMs but a rebalancing pitch: treat them as components rather than the core engine of future AI [2]. In his view, over-indexing on the static-data paradigm risks delaying the engineering of systems capable of lifelong learning and robust generalization under uncertainty [1]. The implication for labs and investors is to diversify portfolios toward agent infrastructure, evaluation environments, and long-horizon learning [2].
Public discourse, meanwhile, is distorted by LLM limitations. Sutton notes that hallucinations can generate gullibility in users, but he labels hardcore AI doom narratives “out of line” for 2025, preferring practical responses like training and policy supports for job transitions [3]. This stance separates misuse risks and model reliability from existential predictions, advocating immediate governance and workforce upskilling rather than speculative bans [3].
He also opposes directing AI toward military applications, suggesting policy effort should channel AI toward broad civil benefits while managing disruption responsibly [3]. That position extends his technical argument into ethics and governance: better RL systems plus better social scaffolding, rather than fear-driven paralysis or narrow defense priorities [3].
Why the Sutton LLMs argument targets data limits
A central pillar of Sutton’s case is that the finite stock of high-quality human text caps the benefits of pure scaling for many reasoning domains [4]. By contrast, an agent interacting in rich environments can accumulate effectively unbounded experience, which becomes the primary driver of skill growth over time [4]. The dynamics resemble compounding returns: experience generates competence that unlocks more experience, and so on [1].
This view also reframes evaluation. If progress comes from continuous interaction, then benchmarks should track learning curves across months and across environments, not just one-off zero-shot scores [1]. Sutton’s argument invites the community to measure how quickly and reliably agents improve, transfer, and stabilize skills under distribution shift—metrics LLM leaderboard culture rarely captures [4].
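A hedged sketch of what such longitudinal evaluation could look like in code: instead of a single zero-shot score, keep a learning curve per environment and summarize final competence, improvement rate, and stability. These particular metrics are illustrative choices, not a standard taken from the cited works.

```python
# Illustrative longitudinal metrics (assumed, not taken from the cited sources):
# summarize a learning curve by final competence, improvement rate, and stability.

import statistics

def summarize_learning_curve(scores):
    """scores: periodic evaluation results for one agent in one environment."""
    n = len(scores)
    improvement_rate = (scores[-1] - scores[0]) / n   # average gain per checkpoint
    tail = scores[-max(1, n // 4):]                   # last quarter of the run
    stability = statistics.pstdev(tail)               # low value = skills have stabilized
    return {"final": scores[-1], "rate": improvement_rate, "stability": stability}

# Example: eight periodic evaluations in one environment.
curve = [0.10, 0.22, 0.31, 0.45, 0.52, 0.55, 0.56, 0.57]
print(summarize_learning_curve(curve))
```

Tracking a handful of such curves across environments and under distribution shift is the kind of measurement leaderboard snapshots do not capture.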
What this means for builders and the next 12 months
If Sutton is right, 2025–2026 will reward teams that build experience pipelines: simulators, task suites, data aggregation loops, and continual learning frameworks that keep agents improving across weeks and months [1]. Engineering priorities shift from prompt catalogs to curriculum design, reward modeling, and safe, scalable interaction environments tuned to compound useful experience [5]. The empirical target becomes generalization without manual prompt scaffolding [5].
Expect more hybrid systems: LLMs as perception, language, or planning modules inside agents that learn from interaction, rather than monolithic text predictors [4]. The near-term pragmatic move is to instrument agents that test how far interaction-derived data can substitute for prompt engineering, starting with constrained domains where safety and measurement are tractable [5]. Evidence from two games is not dispositive, but it’s a replicable template for broader studies in robotics, ops, and software automation [5].
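One hedged way to picture such a hybrid is the sketch below: an agent loop that consults an LLM for candidate high-level plans but still learns from interaction which plans pay off. The llm_propose_plan call is a placeholder for any real model API, and the value table is a deliberately simple stand-in for a learned policy.

```python
# Hypothetical hybrid agent: an LLM proposes candidate plans, a learned value
# estimate picks among them, and outcomes feed back into learning.
# llm_propose_plan and the value table are placeholders, not real APIs.

import random

def llm_propose_plan(observation):
    """Placeholder for an LLM call that returns candidate high-level plans."""
    return ["explore", "exploit", "retreat"]

class HybridAgent:
    def __init__(self):
        self.value = {}                            # learned plan values, keyed by plan

    def choose(self, observation, epsilon=0.1):
        plans = llm_propose_plan(observation)       # LLM as one module, not the whole agent
        if random.random() < epsilon:
            return random.choice(plans)             # occasional exploration
        return max(plans, key=lambda p: self.value.get(p, 0.0))

    def learn(self, plan, outcome, lr=0.1):
        v = self.value.get(plan, 0.0)
        self.value[plan] = v + lr * (outcome - v)   # interaction data updates the agent

agent = HybridAgent()
for episode in range(100):
    plan = agent.choose(observation={"turn": episode})
    outcome = random.random()                       # stand-in for environment feedback
    agent.learn(plan, outcome)
print(agent.value)
```

The design choice to probe here is the division of labor: language and planning priors come from the LLM module, while long-horizon competence comes from the agent’s own accumulated experience.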
On the organizational side, Sutton’s Alberta Plan signals an emerging cluster strategy: anchor institutions, long-horizon funding, and talent pipelines aligned with continual RL [2]. Given Sutton’s 2024 Turing recognition and the April–May 2025 cadence of publications, the alignment of intellectual, institutional, and experimental signals suggests the debate will intensify as more labs report longitudinal agent results through 2025 [3].
Sources:
[1] Intrepid Growth Partners / Derby Mill – Welcome to the Era of Experience (Derby Mill episode transcript): https://insights.intrepidgp.com/p/welcome-to-the-era-of-experience
[2] The Logic – AI pioneer eyes new Alberta-based venture after parting ways with Google’s DeepMind: https://thelogic.co/news/exclusive/ai-pioneer-eyes-new-alberta-based-venture-after-parting-ways-with-googles-deepmind/
[3] BetaKit – New Turing Award winner Richard Sutton calls doomers “out of line,” talks path to human-like AI: https://betakit.com/new-turing-award-winner-richard-sutton-calls-doomers-out-of-line-talks-path-to-human-like-ai/
[4] The AI Innovator – Welcome to the Era of Experience (essay by David Silver and Richard S. Sutton): https://theaiinnovator.com/welcome-to-the-era-of-experience/
[5] arXiv – Rethinking Agent Design: From Top-Down Workflows to Bottom-Up Skill Evolution: https://arxiv.org/abs/2505.17673