Industry Analysis

Understanding Dwarkesh Patel's AI Scaling Thesis: AGI Timelines, Compute, and What He Actually Believes

Jon Sinclair using Luminix AI Strategic Research
Key Takeaway

Dwarkesh Patel's AI scaling thesis treats compute as the primary driver of frontier AI progress, but with a hard wall around 2030 as chip supply, power, and capex limits bind. He now names continual learning, not raw scale, as the binding constraint on AGI; his timelines lengthened across 2025 to a median of roughly 2030-2032, with 10-20 years to AGI that automates most knowledge work. His views, synthesized from multi-hour interviews with top researchers, are notably more cautious about rapid capability jumps than the forecasts of many frontier-lab insiders.

Latest from the conversation on X
May 5, 2026
  • 01 Matthew Berman critiques Dwarkesh Patel's article on slower AI progress, arguing that Patel overly applies human-like learning benchmarks to AI, ignoring AI's unique strengths in scale, parallelism, and RL that enable faster timelines without mimicking biology.
  • 02 Dwarkesh Patel pushes back on Leopold Aschenbrenner's scaling laws implying 2027 AGI, citing the data wall that limits scaling curves beyond GPT-5 and uncertainty about whether benchmarks truly measure intelligence.
  • 03 Dwarkesh Patel outlines longer AGI timelines due to continual learning bottlenecks, predicting years for reliable computer-use agents like end-to-end tax handling, despite current models showing AGI potential.
  • 04 Sriram Krishnan observes reactions to Dwarkesh's Jensen Huang podcast split by AGI timeline beliefs: short-timeline believers favor Dwarkesh's frame, while skeptics prefer Jensen's hardware-focused view.
  • 05 In a debate format, Dwarkesh Patel weighs scaling's viability for AGI, assigning 70% odds by 2040 via scaling plus algorithmic and hardware improvements, but notes skeptics' concerns such as data limits and uncertainty over whether next-token prediction amounts to intelligence.

1. Dwarkesh in Context

Dwarkesh Patel has become the most consequential interviewer in frontier AI not by being a researcher, but by being a relentless synthesizer who earns multi-hour conversations with people who rarely grant them. His guest roster—Ilya Sutskever (Nov 2025), Demis Hassabis (Feb 2024), Dario Amodei (Feb 2026), Jensen Huang (Apr 2026), Satya Nadella (Nov 2025), Richard Sutton (Sep 2025), Andrej Karpathy (Oct 2025), Dylan Patel of SemiAnalysis (Mar 2026)—constitutes a near-complete map of the people actually building or funding the frontier (Report 1). His 2025 book Scaling Era compiles these priors into a coherent framework (Report 1).

The reason frontier researchers talk to him is visible in the transcripts: he arrives having internalized their prior work, asks questions that force them off-script, and then publicly updates his own views when the evidence warrants it. His December 2025 blog post openly walked back earlier optimism about RL scaling, writing that RLVR "lacks trend" and that "pre-baking" skills via RL is "pointless if on-job learners emerge soon" (Report 1). A New York Times profile (April 2026) treated him as a credible aggregator of insider views (Report 1).

Why this matters for capital allocation: Dwarkesh's synthesis sits at the intersection of $725 billion in 2026 hyperscaler capex (Report 2) and genuine uncertainty about whether that spend produces transformative capability or expensive benchmarks. His framework—compute-driven scaling with a hard wall around 2030, a pivot to post-training/RL whose scaling laws remain unproven, and continual learning as the binding constraint on AGI—is the mental model that many technical operators and investors are implicitly running. Understanding where it holds, where it breaks, and where it has already shifted is worth doing rigorously.

2. The Full Thesis in His Framing

Compute as the primary driver. Dwarkesh's core claim is that training compute grew at >4x per year across the scaling era, and this—not algorithmic breakthroughs alone—explains frontier model progress. His June 2025 essay states this directly: training compute growth "cannot continue beyond this decade" due to bottlenecks including ASML EUV tools (only ~70-100 by 2030) and power constraints (Report 1). In the Dylan Patel episode (March 2026), he probes this further: labs like Anthropic at 2-2.5 GW now need 5+ GW by year-end, with H100s actually appreciating in value as AI demand outpaces depreciation (Report 1).
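
To make the growth claim concrete, here is a minimal sketch of what a sustained >4x-per-year rate implies. The starting point (a ~2e25 FLOP frontier run in 2023, roughly GPT-4 scale per Report 2) and the constant growth rate are illustrative assumptions; the 2030 endpoint is where Dwarkesh argues the physical bottlenecks bind, not an output of the arithmetic.

```python
# Minimal sketch: what ">4x/year training compute growth" implies if it ran
# unchecked to 2030. Assumptions (ours, for illustration): a ~2e25 FLOP frontier
# run in 2023 (GPT-4 scale per Report 2) and a constant 4x/year growth rate.

BASE_YEAR, BASE_FLOP = 2023, 2e25
GROWTH_PER_YEAR = 4.0

for year in range(BASE_YEAR, 2031):
    flop = BASE_FLOP * GROWTH_PER_YEAR ** (year - BASE_YEAR)
    print(f"{year}: ~{flop:.1e} FLOP frontier training run")

# The extrapolation passes ~1e27 FLOP around 2026 (roughly the Grok 4 scale
# cited in Report 2) and exceeds 1e29 FLOP by 2030 -- the regime where, on
# Dwarkesh's account, EUV tool counts, power, and capex-as-share-of-GDP
# stop the curve.
```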

Scaling laws: operative but regime-shifting. He endorses pre-training scaling laws as empirically robust—smooth power laws on loss versus compute, data, and parameters across orders of magnitude. But he questions their extension to RL and post-training. In his December 2023 piece "Will Scaling Work?", he debates via Socratic dialogue whether laws sustain to AGI or hit a data wall after GPT-5 (Report 1). By the November 2025 Sutskever episode, he notes the transition "from pre-training to RL" scaling but observes there is no "law of physics" governing RL the way power laws govern pre-training loss (Report 1). In the February 2026 Amodei interview, he challenges Amodei on the "end of exponential"—probing whether RL scaling laws exist at all versus pre-training's clean power laws (Report 1).

Pre-training vs. post-training/RL. His view has evolved into a layered model: pre-training provides foundational priors and world models (trillions of tokens yielding broad knowledge), increasingly augmented by RL/post-training for specific skills and generalization. He frames pre-training as analogous to "school" (imitation prior), then RL and on-the-job learning as analogous to human career development—a framing crystallized in the Sutton debate (September 2025) (Report 1). But he stresses that LLMs' poor sample efficiency and generalization versus humans remain a "fundamental" gap, pushing for experiential/continual learning atop pre-training (Report 1).

The bitter lesson and intelligence transfer. Dwarkesh defends LLMs as "Bitter Lesson-pilled" in the sense that they scalably incorporate human knowledge, and that next-token prediction builds world models that can scaffold RL (Report 1). But after the Sutton interview, he acknowledged the tension: Sutton argues LLMs embed human knowledge rather than learning from experience, which history shows gets superseded (Report 5). Dwarkesh's resolution is that LLMs provide a necessary prior for RL, not a replacement for it—a position that sits between Sutton's experiential purism and pure scaling optimism.

AGI timelines. His timelines lengthened visibly across 2025. The June 2025 essay gives a lognormal distribution: 50% chance of end-to-end agentic tasks (like doing taxes) by 2028, but human-like on-the-job learning (like a video editor's tacit knowledge) not until 2032. By December 2025, he writes "10-20 years to actual AGI" defined as automating 95% of knowledge work, and notes that "yearly probability of AGI craters post-2030" as compute scaling exhausts (Report 1). This puts his median at roughly 2030-2032, with wide uncertainty, and contrasts with guests like Dario Amodei (who gives a "hunch" of 1-3 years for a "country of geniuses") and Sutskever (5-20 years) (Report 1).

Takeoff dynamics. He expects slow takeoff even to singularity: "1% GDP on AI feels normal," with no moonshot to ASI in a single year but decades of infrastructure buildout (Report 1). Gigawatt clusters take time. Post-AGI, diffusion limits speed—robotics adds 1-2 years, recursive self-improvement via millions of copies faces diminishing returns from parallel identical thinkers. Satya Nadella, in the November 2025 episode, frames it as compressing the Industrial Revolution into 20-25 years via 10% annual growth (Report 1). Dwarkesh's own estimate is 10-20% annual GDP growth post-AGI, not an overnight explosion.
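
The growth-rate framing can be made concrete with compound-interest arithmetic. The sketch below assumes, purely for illustration, that "an Industrial Revolution's worth" of growth means roughly a 30x expansion in output; the growth rates are the ones quoted above.

```python
# Minimal sketch of the compound-growth arithmetic behind Nadella's framing and
# Dwarkesh's 10-20% post-AGI estimate. The ~30x target for "an Industrial
# Revolution's worth of growth" is an illustrative assumption, not a sourced figure.
import math

def years_to_multiply(target_multiple, annual_growth):
    """Years needed for output to grow by target_multiple at a constant rate."""
    return math.log(target_multiple) / math.log(1 + annual_growth)

TARGET = 30.0  # assumed cumulative output multiple, for illustration only
for g in (0.02, 0.10, 0.20):
    print(f"{g:.0%} annual growth -> ~{years_to_multiply(TARGET, g):.0f} years to {TARGET:.0f}x")

# At ~2% (today's rich-world baseline) a 30x expansion takes roughly 170 years;
# at 10% it compresses to ~36 years; at 20% to ~19 years -- the quantitative
# sense in which "slow takeoff" still implies historically unprecedented growth.
```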

Bottleneck prioritization. By mid-2025, continual learning emerges as his top bottleneck. The June 2025 essay calls it a "huge bottleneck"—models cannot adapt on-the-job like humans, and there is "no obvious way" to solve it in 7 years (Report 1). Beyond that, he stacks bottlenecks as: memory crunch (30% of 2026 capex), ASML lithography as the #1 constraint by 2030, and inference cost post-AGI (Report 1). In the Dylan Patel episode, he walks through the physical supply chain: logic, memory, and power, with HBM prices tripling and consumer demand destruction freeing supply (Report 1).

3. How His Views Have Updated

Update 1: From "AI 2027" optimism to a 2030-2032 median (June-December 2025). The June 2025 essay marks a visible shift. Before it, Dwarkesh's framing was compatible with aggressive short timelines. Afterward, he explicitly names continual learning as a "huge bottleneck" that could delay human-like AGI to 2032, writing "50/50 human-like on-job by 2032"—a 1-year delay from his prior median (Report 1). By December 2025, the shift deepens: "RL scaling lacks trend... pre-baking pointless if on-job learners emerge soon" (Report 1). This update was catalyzed by his own observation that benchmark progress failed to translate into deployment capabilities—models improved on tests but not at real-world tasks requiring adaptation.

Update 2: Post-Sutskever (November 2025)—the end of the pre-training era. Sutskever's interview directly reframed Dwarkesh's understanding of where scaling stands. Sutskever stated the 2020-2025 "scaling age" was ending due to finite data and jagged generalization, and that RL/post-training would differentiate labs going forward but required genuine research, not just more compute (Report 1). Dwarkesh pushed on compute needs for Sutskever's new venture SSI ($3 billion, which he questioned as underfunding versus rivals' billions), revealing his prior assumption that compute remained the binding constraint. Sutskever's view—that "research taste" now matters more than flops—forced Dwarkesh to reconsider whether ideas, not infrastructure, were the real bottleneck (Report 1, Report 6).

Update 3: Post-Sutton (September 2025)—taking the bitter lesson seriously. Before the Sutton episode, Dwarkesh treated LLMs as essentially Bitter Lesson-compliant. Sutton challenged this directly: LLMs scale compute but inject human knowledge, which history shows gets superseded by methods that learn from experience with ground truth feedback (Report 5). Dwarkesh's follow-up (October 2025) showed "better grasp of RL vision," and his subsequent writing increasingly frames continual/experiential learning as necessary for AGI rather than optional (Report 1). This was not a 180-degree reversal but a meaningful recalibration of what "scaling" needs to mean post-pre-training.

Update 4: Post-Dylan Patel (March 2026)—compute constraints are physical, not theoretical. Dylan Patel's supply chain analysis forced Dwarkesh to confront specific hardware math: ASML's EUV tool production (~3.5 tools per GW) caps total global compute growth, memory prices triple under HBM crunch, and the US lead depends on short timelines because China indigenizes by 2030 on DUV (Report 1, Report 5). Dwarkesh's questions in this episode reveal the shift: "Growth rate in AI compute has to slow... 2x EUV year-over-year?" and "Fast timelines, US wins" (Report 1). Before this, he treated power as the main physical constraint; afterward, lithography and memory joined his framework as co-equal bottlenecks.

Update 5: Post-Amodei (February 2026)—probing the gap between lab confidence and evidence. Dario Amodei gave a "hunch" of 1-3 years for a "country of geniuses," with 90% confidence in 10 years. Dwarkesh pushed back on the conservatism: if the TAM is trillions and AGI is 1-3 years, why only 3x annual compute growth? He probed continual learning needs (10M-100M contexts for months of learning) and challenged whether diffusion was "cope" versus genuine human advantages (Report 1). This episode crystallized Dwarkesh's growing skepticism of insider timelines—he treats lab CEO predictions as motivated reasoning, not calibrated forecasts.

These updates are evidence of genuine thinking. Each moved his framework in a less convenient direction (longer timelines, harder bottlenecks, less certainty), which is the hallmark of someone updating on evidence rather than defending a brand.

4. Where He Is Well-Supported

Published scaling law results remain robust for pre-training. Chinchilla's balanced exponents (α ≈ 0.34 for N, β ≈ 0.28 for D) hold with R² = 0.997 fits through 2026 analyses, and Pearce & Song (2024) reconciled the Kaplan-Chinchilla discrepancy by showing Kaplan's bias stemmed from excluding embedding parameters (Report 2). The power-law relationship between loss and compute has not broken. Epoch AI tracks 30+ models at GPT-4 scale (~2×10²⁵ FLOPs), with xAI Grok 4 at ~5×10²⁶ FLOPs representing the current frontier—capabilities grow at ~15.5 ECI/year, outpacing hardware improvement alone (Report 2).
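
What "smooth power law with R² = 0.997" means operationally is that loss falls on a straight line against compute in log-log space. The sketch below fits such a line to made-up illustrative points (not the Hoffmann et al. data) simply to show the shape of the claim.

```python
# Minimal sketch of the kind of fit behind "R^2 = 0.997": a power law
# L(C) = a * C^(-alpha) + E is approximately linear in log-log space when the
# irreducible term E is small. The data points below are synthetic and
# illustrative, not the actual Chinchilla measurements.
import numpy as np

compute = np.array([1e19, 1e20, 1e21, 1e22, 1e23, 1e24, 1e25])  # training FLOPs
loss    = np.array([3.20, 2.85, 2.55, 2.30, 2.08, 1.90, 1.74])  # made-up losses

slope, intercept = np.polyfit(np.log10(compute), np.log10(loss), 1)
pred = slope * np.log10(compute) + intercept
r2 = 1 - np.sum((np.log10(loss) - pred) ** 2) / np.sum(
    (np.log10(loss) - np.log10(loss).mean()) ** 2
)
print(f"fitted exponent alpha ~= {-slope:.3f}, R^2 = {r2:.4f}")

# A clean straight line in log-log space (high R^2) is the empirical content of
# "pre-training scaling laws remain robust"; a break in the trend would show up
# as curvature or a falling R^2 at the high-compute end.
```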

Hyperscaler capex validates the thesis directionally. The money is flowing exactly where Dwarkesh predicted it would:

  • Stargate (Microsoft/OpenAI): $500 billion over 4 years, 10 GW target, Abilene 1.2 GW online 2026 (Report 2).
  • xAI Colossus: 2 GW, 555,000 GPUs, ~$18 billion in chips, operational 2025 (Report 2).
  • Amazon Project Rainier: $11 billion, 2.2 GW, 500,000+ Trainium2 chips, operational October 2025 (Report 2).
  • Meta Hyperion: 5 GW target by 2030, $125-145 billion 2026 capex (Report 2).
  • Total Big Tech 2026 capex: $725 billion, up 60-83% year-over-year, ~75% AI infrastructure (Report 2).

These are not plans; they are contracts, purchase orders, and construction sites. Aggregate 2025-2029 bets exceed $3 trillion (Report 2).

Capability gains per OOM continue, especially in reasoning. MMLU rose from ~86% (GPT-4) to ~91% (Claude 4.6/Gemini 3 Pro) over roughly 1-2 OOM of compute. More importantly, reasoning benchmarks jumped: GPQA from ~50% to 90%, ARC-AGI-2 from ~30% to 45% via RL/inference compute (Report 2). OpenAI's o3 pushed AIME from 83% (o1) to 96%+ with 10x RL training compute, maintaining log-linear gains (Report 3). DeepSeek-R1 matched o1-preview on AIME (79.8% vs. 83.3%) using just 5% of its base model's pre-training compute for RL (Report 3).

RL/post-training is delivering real gains. Dwarkesh's framing of RL as the "new scaling" phase finds support: OpenAI's o1/o3 demonstrate log-linear test-time scaling laws where performance improves predictably with additional inference compute (Report 3). DeepSeek-R1's GRPO method elicits emergent chain-of-thought from base capabilities at ~$1 million post V3's $5.6 million pre-train—proving RL extracts latent reasoning cheaply when verifiable rewards exist (Report 3). Post-training now accounts for ~40%+ of total compute, up from <10% (Report 3).
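
For readers unfamiliar with GRPO, the core trick Report 3 describes (sample a group of answers per prompt, score each with a verifiable reward, and normalize rewards within the group so no separate learned reward or value model is needed) can be sketched in a few lines. This is illustrative pseudologic under those stated assumptions, not DeepSeek's implementation.

```python
# Minimal sketch of the group-relative advantage at the core of GRPO, as
# described in Report 3: sample a group of responses per prompt, score each
# with a verifiable outcome reward, and standardize rewards within the group.
# Illustrative only -- not DeepSeek's implementation.
from statistics import mean, pstdev

def verifiable_reward(response: str, reference_answer: str) -> float:
    """Binary outcome reward: 1.0 if the extracted final answer matches."""
    final = response.strip().split()[-1] if response.strip() else ""
    return 1.0 if final == reference_answer else 0.0

def group_relative_advantages(responses, reference_answer):
    rewards = [verifiable_reward(r, reference_answer) for r in responses]
    mu, sigma = mean(rewards), pstdev(rewards)
    # Each response's advantage is its reward standardized against the group.
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

group = [
    "Reasoning... the answer is 42",
    "Reasoning... the answer is 41",
    "Different derivation... the answer is 42",
    "Gave up, no idea",
]
print(group_relative_advantages(group, "42"))

# Positive-advantage samples (correct answers) get upweighted in the policy
# gradient; negative ones get downweighted. That is the whole "rank by
# verifiable outcome" loop, minus the actual LLM and optimizer.
```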

Talent flows confirm the direction. While not quantified in specific headcount numbers across reports, the creation of RL-focused efforts at every frontier lab (OpenAI's o-series, DeepSeek's GRPO pipeline, Anthropic's Constitutional AI evolution, Google's Deep Think) and Sutskever's founding of SSI with $3 billion to pursue post-pre-training research (Report 1) all validate Dwarkesh's thesis that the field's center of gravity is moving toward post-training and agents.

5. Where He Overstates or Evidence Is Weaker

Pre-training plateau signals are stronger than his framework acknowledges. GPT-5 launched in August 2025 after ~10x GPT-4's compute and was widely described as "underwhelming"—users reported it performed worse than GPT-4 on coding, instruction-following, and creative tasks despite benchmark improvements (Report 6). GPT-4.5 at 10x GPT-4o's compute was "only marginally better," prompting commentary about a "scaling wall" (Report 6). Benchmarks like MMLU saturate above 90%, and perplexity improvements weakly correlate with reasoning gains (Report 2). Dwarkesh acknowledges this trend but may underweight how much it undermines the core thesis: if 10x compute yields marginal perceived improvement, the economic case for $500 billion infrastructure programs weakens.

The gap between benchmark gains and real-world deployment is enormous. The Remote Labor Index finds frontier agents automate only 4.17% of real freelance projects (Claude Opus 4.6, the best performer), versus humans completing ~100%—a 96% gap (Report 6). GDPval shows models at ~50% expert-parity on 220 tasks but only for one-shot, non-iterative work (Report 6). The "jagged frontier" phenomenon—gold-medal math Olympiads yet ~50% accuracy reading analog clocks—suggests scaling produces uneven capabilities that resist smooth deployment (Report 6). Dwarkesh's June 2025 essay gestures at this with the video-editor example, but his framework still implicitly assumes capability improvements translate to economic value on a reasonable timeline.

Energy and grid constraints are more severe than his framing implies. Only ~5 GW of the 12-16 GW planned for 2026 is under active construction, with 30-50% projected delayed or canceled (Report 6). Transformer (electrical) demand exceeds supply by 30% in 2026, and transmission lines take 15-30 years to permit (Report 6). Gartner predicts 40% of AI data centers will be power-constrained by 2027 (Report 6). Dwarkesh's June 2025 essay treats power as scalable via turbines to 200+ GW, but the actual bottleneck is grid interconnection and permitting, not generation capacity. Report 1 shows he began incorporating this after the Dylan Patel episode, but his public framework still underweights it.

Chinese open-weight progress at lower cost directly challenges the compute moat thesis. DeepSeek-V3 (671B MoE, 37B active) matched or exceeded Llama 3.1 405B on key benchmarks using 2.8 million H800 hours versus 30.8 million H100 hours—roughly 10x less compute (Report 4). DeepSeek-V4-Pro (April 2026, 1.6T parameters, 49B active) achieves 80.6% SWE-bench Verified versus Claude Opus 4.6's 80.8%, at approximately 1/6 the API cost (Report 4). Kimi K2.6 (1T MoE, ~32B active) tops open-weight leaderboards with an Intelligence Index of 54, beating GPT-5.4 and Claude 4.6 on reasoning and coding (Report 4). These labs achieved this despite US export controls restricting them to H800s with 44% less memory bandwidth than H100s, using architectural innovations (MoE, compressed sparse attention, FP8 training) that Dwarkesh's compute-centric framework systematically undervalues.

Data quality and contamination risks are underexplored in his framework. Public high-quality text data exhausts by 2026 at 10-50 trillion tokens, forcing reliance on synthetic sources that risk "model collapse" (Report 6). As few as 250 poisoned data points (<1% of training data) can cripple billion-parameter models (Report 6). Scale AI's CEO has stated "the data wall is real" and synthetic data underperforms without human anchoring (Report 6). Dwarkesh treats synthetic data as an unlocking mechanism (per Dylan Patel's December 2024 view) but does not engage deeply with contamination risks or the quality ceiling.

6. The Strongest Counterarguments, Steelmanned

(a) Scaling has hit diminishing returns and the next OOM will not deliver the next capability jump. The strongest version of this argument is not that scaling laws have broken—they haven't, for loss—but that loss improvements no longer translate reliably to capability improvements that users value. GPT-5's 10x compute over GPT-4 produced widespread user disappointment and coding regressions (Report 6). MMLU saturates above 90% while real freelance automation sits at 4.17% (Report 6). The "densing law" shows equivalent performance requiring half the parameters every 3.5 months (Report 2), meaning the frontier is increasingly about efficiency, not scale. If true, the $725 billion 2026 capex is systematically mispriced—building cathedrals for a religion whose miracles are getting smaller.

(b) Post-training and agents are the real frontier, and compute is no longer the bottleneck. DeepSeek-R1 matched o1-preview using 5% of base model pre-training compute for RL (Report 3). OpenAI's o3-mini delivers 87.3% AIME at 6x cheaper inference than o3 (Report 3). Post-training now accounts for 40%+ of total compute but delivers disproportionate capability gains on reasoning tasks (Report 3). The binding constraints are now verifiable reward design, agent scaffolding, continual learning architecture, and real-world deployment reliability—none of which are primarily compute problems. If the next breakthrough comes from someone designing a better GRPO variant or solving persistent memory for agents, raw flops become a commodity rather than a moat.

(c) Chinese algorithmic efficiency invalidates "more compute wins." DeepSeek-V4-Pro achieves GPT-5.4 parity at 1/6 the API cost and trains on Huawei Ascend NPUs with 1.5-1.73x speedup via fine-grained expert parallelism, signaling reduced Nvidia reliance (Report 4). Qwen3.6-35B-A3B (3B active parameters) beats the prior Qwen3-235B-A22B on agentic coding benchmarks (Report 4). Export controls, far from crippling Chinese labs, forced architectural innovations (MoE routing, compressed sparse attention, INT4 quantization-native training) that now constitute a genuine efficiency advantage (Report 4). The implication: "more compute wins" is true for discovery but false for deployment, and deployment is where economic value accrues. If Chinese labs can match frontier capabilities at 10-30x less training cost and 6x less inference cost, the US compute advantage is a depreciating asset.

(d) Interviewee selection bias skews his synthesis. Dwarkesh's guest list is overwhelmingly US frontier-lab insiders: OpenAI-adjacent (Sutskever), Anthropic (Amodei, Sholto/Trenton), Google DeepMind (Hassabis), Meta (Zuckerberg), Microsoft (Nadella), Nvidia (Huang), xAI-adjacent. Not a single Chinese lab researcher, no European AI safety researcher, no economist studying AI's actual productivity impact, no roboticist working on embodied deployment. This creates a systematic bias toward believing (a) pre-training compute is the primary driver, (b) US labs lead, and (c) the gap between benchmarks and deployment will close. The people telling Dwarkesh "scaling works" are the people whose careers and equity depend on scaling working. LeCun, who rejects LLM scaling entirely and left Meta to found AMI Labs with $1.03 billion for JEPA world models (Report 5), represents a paradigm Dwarkesh's guest list structurally excludes from adequate representation.

(e) Confident AGI timelines are epistemically indefensible. The 2023 AI Impacts survey of 2,778 AI researchers gave a median HLMI of 2040, with 50% by 2059—far longer than lab CEO predictions (Report 6). Historical expert predictions (AGI by 1980-1990 in the 1970s) systematically overshoot (Report 6). Benchmark saturation makes calibration nearly impossible: MMLU loses discriminative power above 90%, and new benchmarks like Humanity's Last Exam produce 30% accuracy scores that resist clear interpretation (Report 2). The RAND Corporation notes that AGI forecasts suffer from definitional ambiguity, Goodhart's Law (targets cease to be good measures), and reflexivity (predictions affect what gets built) (Report 6). Dwarkesh's own lognormal distribution—50% by 2028 for taxes, 2032 for on-the-job learning—looks more calibrated than Amodei's 1-3 year hunch, but any point estimate on AGI arrival remains theater dressed up as analysis. The honest answer is that we do not know, and confident timelines serve fundraising narratives more than epistemic clarity.

7. Where Dwarkesh Diverges from Adjacent Voices

Versus Richard Sutton: the role of human data. Sutton, in the September 2025 episode, argues LLMs violate the Bitter Lesson by embedding human knowledge rather than learning from raw experience with ground truth feedback (Report 5). LLMs have no goals, no verification of real-world outcomes, no on-the-fly adaptation. Dwarkesh's resolution—that LLMs provide a necessary prior for RL—is a genuine disagreement, not a synthesis. Sutton would say the prior itself is the problem, because it locks the system into human-data distributions rather than discovering better representations from scratch, as happened in chess (AlphaZero) and Go (AlphaGo Zero). The disagreement reveals Dwarkesh's implicit assumption that human knowledge is a useful scaffold rather than a scaling trap. If Sutton is right, the multi-trillion-dollar pre-training enterprise is building the wrong foundation.

Versus Andrej Karpathy: timeline and takeoff speed. Karpathy, in the October 2025 episode, places AGI (defined as a human-level remote worker) "a decade away" and describes the path as "tractable but difficult" engineering—not paradigm shifts (Report 5). He calls RL "terrible" due to noise and inefficiency, preferring small "cognitive core" models with tool use. Progress "blends into 2% GDP growth without explosion" (Report 5). Dwarkesh's framework allows for explosive takeoff if continual learning is solved (10-20% annual GDP growth), while Karpathy sees no mechanism for that explosion. The disagreement is about whether a single breakthrough (continual learning) can unlock nonlinear economic impact, or whether deployment friction ensures gradual absorption. Karpathy's January 2026 nanochat experiments validate Chinchilla-optimal scaling at small scale (8:1 tokens:params), aligning with Dwarkesh's core scaling claim while disagreeing on what it implies (Report 5).

Versus Yann LeCun: the architecture question. LeCun rejects the entire LLM paradigm. Autoregressive token prediction cannot build world models, plan, or understand causal physics (Report 5). His proposed alternative—Joint Embedding Predictive Architectures (JEPA), energy-based models, model-predictive control—represents a fundamentally different bet on what intelligence requires. He left Meta and launched AMI Labs in January 2026 with $1.03 billion specifically to pursue this (Report 5). His March 2026 LeWorldModel paper stabilizes JEPA end-to-end from pixels (15 million parameters, single GPU), enabling 48x faster planning than giant models (Report 5). Dwarkesh's framework essentially ignores the possibility that the entire transformer/LLM stack is a local optimum. LeCun would say Dwarkesh is the world's most sophisticated analyst of a dead end.

Versus Dylan Patel: where the wall hits. Dylan Patel shares Dwarkesh's scaling optimism short-term but operationalizes the constraints with hardware-level specificity that Dwarkesh lacks. Dylan's ASML math (3.5 tools per GW, ~100 tools by 2030) gives a hard ceiling that Dwarkesh cites but doesn't fully integrate (Report 5). Dylan provides no explicit AGI timelines but implies that "fast timelines, US wins" and slow timelines favor China—a framing that treats AGI arrival as primarily a supply-chain variable rather than an algorithmic one (Report 1). The key divergence: Dwarkesh believes continual learning is the binding constraint on AGI; Dylan believes lithography and memory are. If Dylan is right, the path to AGI is less about research breakthroughs and more about who secures TSMC wafer starts.

What these four disagreements collectively reveal is that Dwarkesh occupies a specific position in idea-space: he believes the LLM foundation is correct (contra LeCun), that RL atop it can scale (contra Sutton's purism and Karpathy's skepticism), that the timeline is compressed enough for current infrastructure bets to matter (contra Karpathy's decade), but long enough that physical constraints bind (per Dylan). Each disagreement represents a failure mode for his framework.

8. Implications for AI Power Users, Builders, and Investors

On model release pace. If Dwarkesh is directionally right, expect a shift from blockbuster pre-training releases (GPT-5 was likely the last to generate genuine surprise) toward continuous post-training and inference improvements. The action moves to RL recipes, test-time compute scaling, and agent scaffolding—released as capability updates to existing models rather than named generations. Report 3 shows this is already happening: Claude's Constitutional AI evolution and OpenAI's o-series are post-training improvements, not new pre-trained bases. Power users should expect models to get measurably better at specific tasks (coding, math, structured reasoning) while remaining frustratingly unreliable at open-ended, multi-step work requiring adaptation.

On compute and power infrastructure. The capex is real ($725 billion in 2026 alone), but execution risk is severely underappreciated. Report 6 documents that 30-50% of planned 2026 data center capacity faces delays from grid constraints, transformer shortages, and permitting. Dwarkesh's post-Dylan Patel framing correctly identifies lithography and memory as binding alongside power, but even his updated view may underweight the interconnection bottleneck. Builders should assume GW-scale clusters will arrive 12-24 months later than announced and plan inference strategies accordingly. The "densing law"—equivalent performance at half the parameters every 3.5 months (Report 2)—suggests that efficiency gains may outrun infrastructure buildout, making inference optimization the higher-ROI bet.
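
The densing-law arithmetic is worth making explicit, since it drives the inference-optimization conclusion. A minimal sketch, assuming the ~3.5-month halving period from Report 2 holds and using an arbitrary 70B-parameter starting point:

```python
# Minimal sketch of the "densing law" arithmetic from Report 2: parameters
# needed for equivalent capability halve roughly every 3.5 months. The 70B
# starting point is an arbitrary placeholder, not a claim about any model.
HALVING_MONTHS = 3.5
START_PARAMS_B = 70.0  # billions of parameters, placeholder

for months in range(0, 25, 6):
    equivalent = START_PARAMS_B * 0.5 ** (months / HALVING_MONTHS)
    print(f"after {months:2d} months: ~{equivalent:5.1f}B params for the same capability")

# Over ~24 months the same capability notionally fits in roughly 1% of the
# parameters -- which is why the text argues inference optimization may be the
# higher-ROI bet than racing infrastructure buildout, if the trend holds.
```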

On which lab strategies look most defensible. If Dwarkesh is right that pre-training plateaus but post-training/RL scales:

  • Anthropic's position is strongest: deep RL/RLAIF expertise (Constitutional AI), leading SWE-bench scores (80.8%), and a safety-forward approach that may prove regulatory advantage (Reports 3, 4).
  • Google DeepMind's position is underrated: custom silicon (TPU), multimodal-native architecture, and Deep Think's adaptive test-time compute scaling (84.6% ARC-AGI-2) suggest they can compete on both compute and algorithms (Report 3).
  • OpenAI's position is most exposed: GPT-5 disappointed, the o-series relies on proprietary RL data that Chinese labs have replicated cheaply, and Stargate's $500 billion commitment is the largest single bet on raw scaling (Reports 2, 6).
  • Chinese open-weight labs (DeepSeek, Qwen, Kimi) are the wild card: if efficiency continues outpacing scale, their 1/6-cost parity (Report 4) could commoditize closed-model APIs within 12-18 months.

On what 2026 data points would falsify or confirm the thesis. Dwarkesh's framework generates specific testable predictions:

  • Falsifying evidence: If Anthropic's or OpenAI's next major model release (on significantly more compute) fails to advance frontier reasoning benchmarks like ARC-AGI-2 or FrontierMath by >10 percentage points, the compute-drives-capability thesis weakens decisively. If DeepSeek-V4 or Kimi K2.6 achieves parity on agentic benchmarks (e.g., >50% Remote Labor Index) at 1/10 the compute, the US infrastructure bet becomes indefensible. If no lab demonstrates meaningful continual learning (agent improves at a task over days without retraining) by end of 2026, his 2028 agentic milestone becomes implausible.

  • Confirming evidence: If test-time compute scaling maintains log-linear gains on reasoning benchmarks through another OOM of inference compute, RL-as-the-new-scaling is validated. If Stargate's Abilene or xAI's Colossus 2 come online on schedule and power training runs that produce measurable capability jumps over current frontier, the infrastructure thesis holds. If any lab demonstrates an agent that learns and improves at a real-world task (e.g., customer support, code review) through deployment feedback without explicit retraining, Dwarkesh's continual learning bottleneck dissolves and his shorter timelines come back into play.

The deepest insight from synthesizing Dwarkesh's framework is this: he has correctly identified that the era of "just scale pre-training" is ending, correctly named continual learning as the next binding constraint, and correctly noted that the evidence for RL scaling laws is far weaker than for pre-training. But his framework still implicitly assumes that whoever spends the most on compute will lead—an assumption that Chinese labs are actively falsifying with architectural efficiency gains at 10-30x lower cost. The most important question for the next 18 months is not whether scaling works, but whether scale or efficiency determines who captures the economic value of AI. On that question, Dwarkesh's interview library—heavily weighted toward the people building the cathedrals—may be the wrong sample for finding the answer.


Source Research Reports

The full underlying research reports cited throughout this analysis.

Report 1 Research Dwarkesh Patel's publicly stated views on AI scaling, compute, and AGI timelines across his podcast episodes from 2023–2026. Identify specific dated quotes and arguments from episodes with Ilya Sutskever, Demis Hassabis, Sholto Douglas & Trenton Bricken, Dario Amodei, Mark Zuckerberg, Satya Nadella, Jensen Huang, Dylan Patel/SemiAnalysis, John Carmack, and Richard Sutton. Produce a structured timeline of his stated positions on: (a) compute as the primary capability driver, (b) scaling law validity, (c) pre-training vs. post-training/RL emphasis, (d) AGI timelines, (e) takeoff dynamics, and (f) bottleneck prioritization. Note where his framing visibly shifted between episodes and what triggered each update.

Compute as the Primary Capability Driver

Dwarkesh Patel consistently frames compute as the dominant driver of AI capabilities over the past decade, attributing frontier model progress primarily to ~4x annual increases in training compute rather than algorithmic breakthroughs alone; this creates a "scaling era" moat where labs like OpenAI and Anthropic compete via massive CapEx commitments (e.g., hyperscalers forecasting $600B in 2026, equivalent to ~50GW online over years), but he warns of impending physical limits like chips, power, and GDP fraction that cap this at ~2030, forcing a paradigm shift to efficiency gains.[1][2]
- In his June 2025 essay, Patel notes training compute grew >4x/year, driving all recent gains, but "cannot continue beyond this decade" due to bottlenecks like ASML EUV tools (only ~70-100 by 2030) and power (US scalable to 200+GW via turbines, but grid idle capacity unlock needed).[2][3]
- During Dylan Patel episode (Mar 2026), he probes compute economics: labs like Anthropic at 2-2.5GW now need 5+GW by year-end for revenue, with H100s appreciating in value as AI demand outpaces depreciation.[3]
- Implication: Short-term US lead via Nvidia/TSMC allocation (Google squeezed), but long timelines favor China indigenization by 2030; space GPUs "not happening this decade."[3]

For competitors: Prioritize securing TSMC/ASML slots early (Nvidia did) and diversify power (turbines over grid); post-2030, win via inference efficiency as training plateaus.

Scaling Law Validity

Patel endorses scaling laws as empirically robust for pre-training (smooth power laws on loss vs. compute/data/parameters across orders of magnitude), but questions their extension to RL/post-training regimes lacking "clean" public laws; he sees RL scaling as promising (e.g., "same scaling in RL that we saw for pre-training" per Dario Amodei interview) yet unproven at frontier scales, with no clear y-axis for "usefulness" beyond benchmarks.[4][5]
- In "Will Scaling Work?" (Dec 2023), he debates via Socratic dialogue if laws sustain to AGI or hit data wall post-GPT-5.[6]
- Nov 2025 Ilya Sutskever episode: Notes transition "from pre-training to RL" scaling, but no "law of physics" like pre-training's power law; probes RL efficiency (value functions optional, just slower without).[5]
- Defends vs. skeptics like Richard Sutton (Sep 2025): LLMs are "Bitter Lesson-pilled" by scalably incorporating human knowledge; next-token prediction builds world models for RL scaffolding.[7]

Entrants: Test RL scaling privately at small scales; public laws undervalue private gains like inference optimizations.

Pre-Training vs. Post-Training/RL Emphasis

Patel views pre-training as foundational (trillions of tokens yielding broad priors/world models) but increasingly augmented by RL/post-training for skills/generalization; he highlights RL's rise as the "new scaling" phase (e.g., task-specific RL leading to generalization), but stresses LLMs' poor sample efficiency/generalization vs. humans as a "fundamental" gap, pushing for experiential/continual learning atop pre-training.[5][4]
- Feb 2024 Demis Hassabis: Questions strong scaling hypothesis ("throw compute at wide data for intelligence").[8]
- Feb 2026 Dario Amodei: Notes RL phase atop pre-training shows similar scaling; continual learning "might not be a barrier" via generalization.[4]
- Sutton debate: Pre-training like "school" (imitation prior), then RL/on-job like humans; rejects pure LLMs as dead-end without experience.[7]

To compete: Hybrid stacks—pre-train broad, RL narrow skills; solve continual learning for deployment feedback loops.

AGI Timelines

Patel's timelines lengthened visibly post-2025: lognormal/bimodal distribution ("this decade or bust," 50% by 2028 for end-to-end taxes, 2032 for human-like on-job learning); by Dec 2025, "10-20 years to actual AGI" (human-like learning/sharing knowledge, automating 95% knowledge work); wide distributions but median ~2030-2032, driven by scaling exhaustion.[2][9]
- Jun 2025 essay: "Yearly probability of AGI craters post-2030"; 50/50 taxes 2028, video editor tacit knowledge 2032.[2]
- Dec 2025: Impressive benchmarks but "more useful at long timelines rate"; RLVR won't deliver without child-like learning first.[9]
- Guests contrast: Dario (90% "country of geniuses" in 10y, hunch 1-3y); Ilya (5-20y); shifted bearish after RL skepticism.

New labs: Bet on 2030-2040 window; prep for gradual diffusion, not 2027 explosion.

Takeoff Dynamics

Patel expects "slow takeoff" even to singularity: 1% GDP on AI feels normal; no "moonshot to ASI next year," but decades of infra buildout (gigawatt clusters take time); post-AGI, diffusion limits speed (e.g., robotics +1-2y); recursive self-improvement via millions of copies, but diminishing returns from parallel identical thinkers.[5][10]
- Satya Nadella (Nov 2025): Compresses Industrial Revolution into 20-25y via 10% growth.[10]
- Dec 2025: Agents learn via deployment/hive mind, but competition prevents runaway; economy 10-20%/yr growth.

Infrastructure players: Win via modular power/data centers for jagged rollout.

Bottleneck Prioritization

Continual learning emerges as Patel's top bottleneck: LLMs lack on-the-job adaptation (e.g., the tacit knowledge a human video editor builds over ~6 months on the job); no high-level feedback loop; humans excel via context and failure interrogation; solving "organic" learning would unlock superintelligence rapidly. Other: memory crunch (30% 2026 CapEx), ASML lithography #1 by 2030, inference post-AGI.[2][3]
- Jun 2025: "Huge bottleneck"; 7y plausible but no "obvious way" to slot in.[2]
- Dylan Patel: Logic/memory/power; HBM prices 3x, consumer demand destruction frees supply.[3]
- RLVR skepticism: No clear trend to AGI.[9]

To enter: Target continual learning (e.g., online RL atop LLMs); inference optimization for post-training world.


Recent Findings Supplement (May 2026)

No Major New Episodes with Specified Guests Post-May 2025, But Dwarkesh's Views Evolve via Blogs and Probing Questions

Dwarkesh Patel has not released new podcast episodes with the listed guests (Sutskever, Hassabis, Douglas/Bricken, Amodei, Zuckerberg, Nadella, Huang, Dylan Patel, Carmack, Sutton) strictly after May 5, 2025 that introduce fresh quotes on the core topics—earlier 2025/2026 episodes like Dylan Patel (Mar 2026), Jensen Huang (Apr 2026), Dario Amodei #2 (Feb 2026), Richard Sutton (Sep 2025), and Satya Nadella (Nov 2025) build on prior discussions without dated guest quotes shifting paradigms.[1][2][3] His own blog posts and interview challenges reveal a consistent bearish pivot: scaling hits compute walls by 2030, continual learning remains unsolved (delaying AGI to 2028-2032 median), and RL/post-training hype lacks scaling laws, prioritizing bottlenecks like ASML/EUV over pure compute moats.[4][5]

Continual Learning Emerges as Core Bottleneck, Lengthening Timelines (Jun-Dec 2025 Blogs)

Patel's June 2025 essay marks a visible shift from his "AI 2027" optimism: models can't adapt on-the-job like humans (e.g., 6 months learning video editing), stalling white-collar automation despite scaling; he forecasts 50% chance of end-to-end agentic tasks (taxes) by 2028 but human-like learning by 2032 only, as pre-training/RL can't bridge sample efficiency gaps without new paradigms—compute growth (4x/year) plateaus post-2030 on power/chips/GDP limits.[4] By Dec 2025, he critiques RL "pre-baking" (e.g., browser/Excel skills) as inefficient laundering of pre-training prestige—no RL scaling laws exist (needs 1M x compute for GPT-like gains), implying AGI not imminent if self-directed learning fails; long-term bullish on hive-mind AGI (2030s, trillions revenue) post-continual breakthrough.[5]
- Jun 2: "Continual learning is a huge bottleneck... 50/50 human-like on-job by 2032" (1-year delay from prior median).[4]
- Dec 23: "RL scaling lacks trend... pre-baking pointless if on-job learners emerge soon."[5]

Compute No Longer Primary Driver—Bottlenecks Shift to Lithography/Memory/Power (Dylan Patel Mar 2026)

In probing Dylan Patel, Patel reveals supply chain realism: compute scaling slows as Nvidia locks TSMC N3 (70% by 2027), ASML EUV (100 tools by 2030 caps ~200GW), memory (30% Big Tech 2026 CapEx, HBM crunch triples prices); H100s appreciate (longer depreciation, higher token value); power scales via turbines/batteries (200GW feasible), but logic/memory dominate—US wins short timelines (labs 10GW/year), China long (indigenized DUV).[3]
- Questions expose view: "Growth rate in AI compute has to slow... 2x EUV year-over-year?"; "Fast timelines, US wins."[3]

Scaling Laws Questioned in RL Era, Pre-Training Limits Exposed (Amodei Feb 2026, Sutskever Nov 2025)

Patel challenges Amodei on "end of exponential": no RL scaling laws (vs. pre-training), diffusion "cope" vs. human advantages; probes conservatism (3x compute/year despite trillion TAM, AGI 1-3 years?), continual needs (10M-100M contexts for months learning).[6] Echoes Sutskever: 2020-2025 "scaling age" ends (finite data, jagged generalization); RL/post-training differentiates but needs research (e.g., value functions for robustness)—Patel pushes compute needs for SSI ($3B underfunds vs. rivals billions).[7]

Nvidia Supply Moat Validates Bottleneck Focus, Not Infinite Compute (Huang Apr 2026)

Patel presses Huang: Nvidia's edge is locked supply (TSMC N3 majority, $250B commitments), not CUDA/specs—TPUs compete but GPUs flexible; scaling limits (plumbers > chips?); China sales ok? (flops gap: China 1/10 US).[8]

RL Emphasis Grows, But Sutton Challenges LLM Path (Sep 2025)

Sutton: LLMs not "Bitter Lesson" (over-pre-training, ignore RL/experience); Patel follow-up (Oct): better grasp of RL vision—no shift, aligns with post-training pivot.[1]

Implications for Competitors/Entrants: Pivot to Bottleneck Plays

Patel's framing—continual unsolved, compute walls 2030, RL unproven—means labs waste trillions pre-baking without on-job learners; entrants target ASML alternatives, 3D DRAM, modular power (e.g., turbines); US compute edge erodes long-term (China DUV scale); compete via interpretability/RL recipes, not raw flops—his 2028 explosion median gives 2-3 years to build moats before hive-mind AGI diffs explode.[4][5] No recent policy/regulatory shifts noted; book "Scaling Era" (2025) compiles priors.[9]

Report 2 Research the published academic and industry evidence for and against AI scaling laws as of 2025–2026. Include Chinchilla, Kaplan et al., and subsequent papers that update or challenge them; documented capability gains per order-of-magnitude of compute (OOM) for GPT-4, Claude 3/4, Gemini 1.5/2/3, Llama 3/4, and Grok models; hyperscaler capex commitments (Microsoft/OpenAI Stargate, Google Project Rainier, xAI Colossus, Meta's 2025–2026 infrastructure plans) with specific dollar figures, announced dates, and MW/GW power targets. Distinguish between pre-training scaling evidence and post-training/RL scaling evidence. Quantify where benchmark gains have or have not kept pace with compute investment.

Foundational Pre-Training Scaling Laws: Kaplan vs. Chinchilla and Modern Reconciliations

Kaplan et al. (2020) established early neural scaling laws by fitting power-law relationships between language model test loss and compute (L ∝ C^{-0.050}), dataset size (L ∝ D^{-0.095}), and non-embedding parameters (L ∝ N^{-0.076}), implying optimal models prioritize parameters over data (N_opt ∝ C^{0.73}). DeepMind's Chinchilla (Hoffmann et al., 2022) challenged this by training 400+ transformers up to 16B parameters on 5-500B tokens, deriving balanced exponents (α ≈ 0.34 for N, β ≈ 0.28 for D) and a 20 tokens/parameter ratio for compute-optimal training—demonstrating prior models like GPT-3 (175B params, ~300B tokens) were undertrained by ~10x data, as Chinchilla's 70B model outperformed Gopher's 280B.[1][2][3]

Pearce & Song (2024, arXiv:2406.12907) reconciled the discrepancy: Kaplan's bias stemmed from excluding embedding parameters (~30% of total N) and small-scale analysis (<1B params), reproducing Kaplan-like exponents (N_opt ∝ C^{0.73}) when simulating Chinchilla under those constraints; using total N and larger scales reaffirms Chinchilla's balance (N_opt ∝ C^{0.50}). This holds across dense transformers, with R² = 0.997 fits up to 2026 analyses.[3]

  • Chinchilla-optimal ~20 tokens/param validated in Llama series (e.g., Llama 3 8B on 15T tokens = 1,875:1 ratio outperforms prior dense models).[4]
  • Inference-aware extensions (Sardana et al., 2023/2025 "Beyond Chinchilla-Optimal") shift optima: for 1B+ inference requests, smaller models + longer training minimize lifetime cost (training + inference), as inference scales linearly with N but quadratically with deployment volume.[5]

Implications for competitors: Data moats (e.g., Meta's 15T+ for Llama) and inference optimization favor incumbents; new entrants need proprietary data or inference-efficient architectures (e.g., MoE, where active params < total N) to match without 10x compute.
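
To see what the exponent disagreement means in practice, the sketch below allocates a fixed compute budget under the standard C ≈ 6·N·D approximation and Chinchilla's ~20 tokens-per-parameter rule, then contrasts how the optimal parameter count scales under the Kaplan-style (C^{0.73}) versus Chinchilla-style (C^{0.50}) exponents. The budgets and anchor point are placeholders chosen for illustration.

```python
# Minimal sketch: compute-optimal allocation under Chinchilla's rule of thumb,
# and how the Kaplan vs. Chinchilla exponents diverge as the budget grows.
# Uses the standard C ~= 6*N*D FLOP approximation; budgets are placeholders.

def chinchilla_allocation(compute_flops, tokens_per_param=20.0):
    """Compute-optimal (params, tokens) under Chinchilla: C = 6*N*D with D = 20*N."""
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    return n_params, tokens_per_param * n_params

def n_opt_scaled(compute_flops, exponent, ref_c, ref_n):
    """Scale a reference-optimal parameter count as N_opt ∝ C^exponent."""
    return ref_n * (compute_flops / ref_c) ** exponent

REF_C = 1e21                              # common anchor budget (placeholder)
REF_N, _ = chinchilla_allocation(REF_C)   # both curves agree here by construction

for c in (1e22, 1e24, 1e26):
    n_ch, d_ch = chinchilla_allocation(c)
    n_kaplan = n_opt_scaled(c, 0.73, REF_C, REF_N)   # Kaplan-style exponent
    n_chin = n_opt_scaled(c, 0.50, REF_C, REF_N)     # Chinchilla-style exponent
    print(f"C={c:.0e}: Chinchilla-optimal ~{n_ch:.1e} params on {d_ch:.1e} tokens; "
          f"N_opt scaling -> Kaplan ~{n_kaplan:.1e} vs Chinchilla ~{n_chin:.1e}")

# The divergence at large budgets is the practical content of the dispute:
# Kaplan's exponent says spend most of the extra compute on parameters,
# Chinchilla's says split it roughly evenly between parameters and data.
```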

Evidence For Continued Pre-Training Scaling in 2025-2026

Post-Chinchilla laws predict smooth loss reductions via power laws (L(C) ∝ C^{-0.05} to C^{-0.1}), holding across 10^{17}-10^{20} FLOPs in DiT diffusion models and xLSTM (2025 arXivs). Epoch AI tracks 30+ models at GPT-4 scale (~2e25 FLOPs): GPT-4 (2.1e25), Claude 3 Opus (1.6e25), Gemini 1.5 Pro (1.6e25); Llama 3.1 405B inferred ~5e25 FLOPs (15T tokens on 32k H100s). Gains persist: Llama 4 Maverick (17B active MoE) beats GPT-4o/Gemini 2.0 on reasoning/coding via distillation from 2T-param teacher.[6]

  • MMLU: GPT-4 ~86% → Claude 4.6/Gemini 3 Pro ~91% (+5-6% pts over ~1-2 OOM compute).[7]
  • No full OOM quantifications for 2025 models (e.g., Claude 4, Gemini 3 ~10^{26} FLOPs inferred), but Arena ELO rises ~100 pts/year despite saturation signals.

Implications: Hyperscalers' $700B+ 2026 capex (Google $180-190B, MSFT $190B, Amazon $200B, Meta $125-145B) bets on ~0.5-1 OOM/year gains; entrants face $100B+ barriers without efficiency (e.g., FP8 training in Llama 4 saves 2x compute).[8]

Challenges and Diminishing Returns: Saturation and Data Limits

2025-2026 evidence shows pre-training laws weakening: benchmarks saturate (MMLU-Pro 90%+ plateau), capabilities jump discontinuously (not smooth power laws), perplexity weakly correlates with reasoning.[9] Data exhaustion looms (public ~10-15T tokens; synthetic recursion risks model collapse); sub-scaling observed (performance decelerates > predicted at >10T tokens). Chinchilla ratios evolve to 80k:1 in tiny models (2026), but dense scaling hits "architecture saturation."[10]

  • Quantitative: ~4x YoY compute (70 years) yields ~2x benchmark gains (e.g., security tasks), implying <0.3 log-loss/OOM.[11]
  • Inference-heavy favors smaller models (Sardana); "Chinchilla Trap" overparameterizes for deployment.

Implications: Challengers pivot to MoE/synthetic data (DeepSeek V3.1, Qwen3 near-frontier at <1e25 FLOPs); pure scale favors xAI/OpenAI with Colossus/Stargate.

Post-Training/RL Scaling: A New Frontier with Predictable but Diminishing Gains

Pre-training dominates (~90% compute), but RL/post-training now rivals it (2025: RL ~pre-train cost). Tan et al. (2025, arXiv:2509.25300) fit RL power laws on Qwen-2.5 (0.5-72B): test loss ∝ (N·K(N)·C)^{-α} + E, with α ≈ 0.1-0.2 for math reasoning (GRPO); larger N yields 2-3x compute efficiency. S-curve learning efficiency caps gains; 100x RL compute doubles accuracy (33→66%).[12]

  • Distinguish: Pre-train smooth next-token loss; RL extrapolates reasoning (e.g., o1-like: test-time compute scales better than RL for 20-80% accuracy).[13]
  • Benchmarks: RL boosts MMLU +5-10% pts/OOM RL compute, but saturates faster than pre-train.

Implications: Open-source (Llama 4) closes gaps via RL distillation; closed labs (Anthropic) lead agentic tasks, but scaling RL 10x costs ~full pre-train rerun.
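
The functional form quoted above is easiest to read as a saturating power law. The sketch below evaluates a toy version, loss ≈ A·C_RL^(-α) + E with α = 0.15, using placeholder constants and relative compute units, to show why each additional order of magnitude of RL compute buys less.

```python
# Minimal sketch of the saturating power-law form for RL post-training:
# test loss ~= A * C_RL^(-alpha) + E, with alpha in the 0.1-0.2 range the
# report cites. A, E, and the compute grid are illustrative placeholders.
A, ALPHA, E = 2.0, 0.15, 0.8   # E is the irreducible-loss floor

def rl_test_loss(rel_rl_compute):
    """Test loss as a function of RL compute, in multiples of a base budget."""
    return A * rel_rl_compute ** (-ALPHA) + E

prev = None
for c in (1, 10, 100, 1000, 10000):
    loss = rl_test_loss(c)
    note = "" if prev is None else f"  (gain from last 10x: {prev - loss:.2f})"
    print(f"{c:>6}x base RL compute: loss {loss:.2f}{note}")
    prev = loss

# Each extra order of magnitude of RL compute buys a smaller absolute
# improvement as the curve approaches the floor E -- the "saturates faster
# than pre-training" pattern the report describes.
```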

Hyperscaler Capex Commitments Fueling the Scale Hypothesis

Microsoft/OpenAI Stargate (announced Jan 21, 2025): $500B over 4 years, 10GW target (Abilene 1.2GW online 2026; 7GW pipeline by 2025-end).[14] xAI Colossus (announced Jun 2024): 2GW (555k GPUs, ~$18B chips), operational 2025.[15] Amazon Project Rainier (announced 2024, operational Oct 2025): $11B, 2.2GW (500k+ Trainium2).[16] Google "Rainier" unconfirmed (possibly Amazon mixup); overall capex $725B 2026 (Google $180-190B). Meta: $125-145B 2026 capex (Hyperion 5GW by 2030).[8]

  • Stargate (Microsoft/OpenAI): announced Jan 2025; $500B over 4 years; 10 GW target.[14]
  • Colossus (xAI): announced Jun 2024; ~$18B+ in chips; 2 GW.[15]
  • Rainier (Amazon): announced 2024; $11B; 2.2 GW.[16]
  • Meta Hyperion: announced Jun 2025; $10B+; 5 GW target.[17]

Implications: $3T+ aggregate 2025-2029 bets ~1-2 OOM/year; power (GW-scale) > chips as bottleneck—new entrants locked out without nuclear/gas deals.

Benchmark Gains vs. Compute: Partial Saturation, Non-Obvious Shifts

~1-2 OOM compute (1e25→1e26 FLOPs, GPT-4 to Claude 4/Gemini 3) yields ~5% MMLU pts, but reasoning (GPQA 50→90%, ARC-AGI-2 30→45%) jumps via RL/inference compute—not pure pre-train.[18] Saturation: MMLU 90%+ plateau, but new evals (Humanity's Last Exam 30%) emerge; gains ~2x benchmark/OOM total compute, lagging early 4x.[11]

Implications: Investors demand ROI proofs (e.g., OpenAI $1.7B/mo rev vs. $4B inference); compete via post-train (cheaper) or efficiency (MoE 2-4x inference savings).


Recent Findings Supplement (May 2026)

Pre-Training Scaling: Evidence Persists with Overtraining Shift

Frontier models in 2026 continue to validate Kaplan/Chinchilla-style power-law improvements in loss with compute, but optimal token-to-parameter ratios have shifted dramatically toward massive overtraining—often 100-185x beyond Chinchilla's 20:1 recommendation—driven by better optimizers (e.g., Muon), synthetic data, and architectures like MoE that unlock gains post-optimal point. This implies pre-training scaling remains predictable but compute-optimal allocation now favors data-heavy regimes for reasoning-heavy models, reducing effective parameter needs via "densing laws."[1][2][3]
- Epoch AI (Feb 2026): Largest run is xAI Grok 4 at ~5e26 FLOP (~24x GPT-4's ~2e25 FLOP), with training costs up 3.5x/year; capabilities grow ~15.5 ECI/year, outpacing hardware alone.[4]
- arXiv (Mar-Apr 2026): Nano-scale experiments (e.g., Karpathy's nanochat) fit steeper curves at 8:1 tokens/param vs. Chinchilla's 20:1; SmolLM3 (3B) uses 11.2T tokens (~3700:1), extrapolating to GPT-3-level at ~91B params/734B tokens (~$1M cost).[5]
- Nature Machine Intelligence (2025, cited 2026): "Densing law" shows max capability density doubles every ~3.5 months; equivalent performance now needs ~half params every 3.5 months (e.g., 2.4B MiniCPM matches 7B Mistral).[3]

For competitors: Pre-training scaling favors data moats + efficiency; small entrants can match via overtraining/synthetics but lack frontier compute (e.g., 5e26 FLOP needs GW-scale clusters).

Post-Training/RL Scaling: Diminishing Returns but Predictable Power Laws

RL/post-training unlocks "latent skills" via test-time compute (e.g., repeated sampling, chain-of-thought), following power laws but with steeper diminishing returns than pre-training—100x RL compute yields ~2x reasoning gains vs. inference's smoother scaling. 2025-26 papers quantify RL loss ∝ (model size)^α × (compute)^β × (data)^γ, but optimal shifts to inference-heavy (T2T laws recommend smaller/overtrained bases).[6][2][7]
- arXiv (Apr 2026): RL post-training on math shows power-law test loss scaling; RL compute now ~matches pre-training cost, but inference cheaper per gain (e.g., 100x RL: 20-80% benchmark jump costs like full pre-train).[6]
- Epoch AI (Feb 2026): Fixed-capability inference costs drop 5-10x/year via distillation (e.g., FrontierMath 27% needs 43M→5M tokens, 3x cheaper in 8 months); RL slopes vary (Scaled RL > GRPO).[8]

Entrants: RL democratizes via open bases, but proprietary post-training (e.g., o1-style) creates moats; focus inference scaling for cost-competitive agents.

Model-Specific Gains per OOM Compute

No direct per-OOM breakdowns for all requested models post-Nov 2025, but Epoch tracks aggregate: Grok 4 at 5e26 FLOP (~1.7 OOM > GPT-4) shows benchmark jumps (e.g., Intelligence Index 53), though open MoEs (DeepSeek V4 Pro 1.6T/49B active) close gaps on non-agentic tasks. Capabilities saturate standards (SWE-bench ~80-100%) but expand horizons (e.g., 7hr agents).[9][4]
- Grok 4: 5e26 FLOP; ~24x GPT-4 compute yields frontier (e.g., GDPval 1500 Elo).[4]
- Gemini 3/Claude 4/Llama 4: MoE shift (e.g., Llama 4 sparse); benchmarks like AIME 2025 ~90-100%, SWE-bench 76-81%, but no FLOP disclosed; open variants match closed on MMLU/GPQA.[9]

Competitors: Track Epoch dashboard; gains slowing on easy benchmarks (diminishing returns), accelerating on agentic (e.g., OSWorld >human baseline).

Hyperscaler Capex Commitments: Explosive GW-Scale Buildout

2026 capex surges to $650-805B (60-83% YoY), ~75% AI infra, powering 10s GW but hitting power walls (e.g., 7GW US delays). Meta leads transparency: $125-145B (up from $72B 2025), funding Llama 4 + 1-5GW sites (Prometheus/Hyperion). Stargate/Colossus/Rainier vague post-Nov 2025—no new $MW dates—but aggregate implies mid-teens GW online.[10][11][4]
- Meta: $125-145B 2026 capex (Q1 call); Prometheus (1GW Ohio, 2026), Hyperion (5GW LA, 2028).[10]
- xAI Colossus 2: 1GW target mid-2026 (Memphis); $20B MS site.[12]
- Stargate: $500B multi-year (phased; Abilene 1.2GW partial 2026, delays/expansions); MS $120B+ FY26.[13]

Entrants: Partner for capacity (e.g., Oracle/OpenAI remnants); decentralized compute viable amid delays.

Benchmark Gains vs. Compute: Gains Keep Pace on Frontiers

Benchmarks saturate easy tasks (SWE-bench 60%→~100% 2024-25) but expand agentic horizons (OSWorld >human; 7hr autonomy), aligning with ~1-1.7 OOM compute jumps yielding 2-5x capabilities via post-training. No evidence gains lag investment; Epoch projects continuation to 2030.[14][4]

Competitors: Prioritize unsaturated evals (e.g., FrontierMath, TerminalBench); open models near-parity eases entry.

Report 3 Research the current state of evidence (2024–2026) that post-training methods — RLHF, RLAIF, reinforcement learning from outcome feedback, chain-of-thought, test-time compute scaling — are delivering capability gains that rival or exceed pre-training scaling. Pull from published papers (DeepSeek-R1, OpenAI o1/o3, Gemini 2.0 Flash Thinking, Anthropic's Constitutional AI updates), benchmark results on reasoning tasks (AIME, ARC-AGI, SWE-bench, GPQA), and public commentary from researchers. Assess whether the "post-training is the new frontier" thesis changes the compute bottleneck argument or merely shifts where compute matters. Conclude with a structured comparison of pre-training vs. post-training as capability levers.

OpenAI o1/o3: RL Trains Models to Scale Inference Compute Like Pretraining Scales Parameters

OpenAI's o1/o3 series demonstrates how reinforcement learning (RL) on chain-of-thought reasoning creates log-linear "test-time scaling laws," where performance on reasoning benchmarks improves predictably with additional inference compute—mirroring pretraining's parameter-data scaling but shifting compute from training to deployment. The mechanism: RL trains the model to generate longer, self-correcting reasoning traces during inference, effectively turning extra tokens into "thinking time" that boosts accuracy (e.g., o1's AIME score rises from ~50% at low compute to 74% at high).[1][2]
- o3 scaled RL training compute 10x over o1, pushing AIME from 83% (o1) to 96%+ while maintaining log-linear gains; both show train-time RL compute yielding similar curves to GPT pretraining.[3]
- Benchmarks: o3 hits 87.5% ARC-AGI (private eval), 87.7% GPQA Diamond (exceeding PhD experts at 65%), 71.7% SWE-bench; o1-mini at 83.3% AIME 2024.[4]
For competitors: OpenAI's RL moat lies in proprietary synthetic reasoning data from massive pretraining, enabling efficient scaling without public data exhaustion.
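The log-linear test-time scaling claim can be made concrete with a toy curve: accuracy rises roughly linearly in log(inference compute) until it nears a ceiling. A minimal sketch anchored to the o1 AIME figures quoted above; the fitted constants are illustrative assumptions, not OpenAI disclosures.

```python
import math

def accuracy_vs_inference_compute(c, a=0.50, b=0.12, c0=1.0, ceiling=0.95):
    """Toy log-linear test-time scaling curve: acc = a + b*log10(c/c0), capped at a ceiling."""
    return min(ceiling, a + b * math.log10(c / c0))

# Relative inference compute: 1x, 10x, and 100x the low-compute setting.
for c in (1, 10, 100):
    print(f"{c:>4}x inference compute -> ~{accuracy_vs_inference_compute(c):.0%} AIME")
# ~50% at 1x, ~62% at 10x, ~74% at 100x: each 10x of thinking buys a similar absolute gain
```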

DeepSeek-R1: Pure RL Unlocks Reasoning from Base Pretraining (5% Compute Fraction)

DeepSeek-R1 applies RL from verifiable rewards (RLVR) directly to a V3 base model (2.8M H800-hours pretrain), using just 147K hours (~5%) for multi-stage RL—yet matches o1-preview on AIME (79.8% vs 83.3%), GPQA (71.5%), ARC-AGI (72.6%), and SWE-bench (49.2%). Mechanism: Group Relative Policy Optimization (GRPO) generates response groups per prompt, ranks by verifiable outcomes (e.g., code execution, math solvers), and optimizes without a separate reward model, eliciting emergent chain-of-thought from base capabilities.[5][6]
- RLVR stages: Cold-start SFT on synthetic traces, pure RL (R1-Zero: 71% AIME from 15.6% base), rejection sampling, final RL; total ~$1M post V3's $5M pretrain.[7]
- Open-sourced (MIT), beats o1-mini on math/code; distillation to 32B retains most gains, proving RL teaches transferable reasoning patterns.[8]
Implication: Strong pretraining provides "latent reasoning potential"; RL extracts it cheaply, challenging pretrain dominance but requiring verifiable tasks.
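The GRPO step described above (sample a group of responses per prompt, score each with a verifiable checker, and baseline against the group statistics instead of a learned reward or value model) reduces to a simple advantage calculation. A minimal sketch; variable names are illustrative, not taken from the DeepSeek codebase.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantages: z-score each sampled response against its own group.

    `rewards` holds verifiable outcome scores (e.g., 1.0 if the final answer or the
    unit tests pass, else 0.0) for G responses sampled from the same prompt. The
    group mean serves as the baseline, so no separate reward/value model is needed.
    """
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: 8 sampled solutions to one math prompt, 3 verified correct.
print(grpo_advantages([1, 0, 0, 1, 0, 0, 1, 0]))
# Correct responses get positive advantages, incorrect ones negative; the policy
# gradient then upweights tokens from the winning traces.
```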

Gemini 2.0/3 Deep Think & Anthropic Claude: Hybrid RL/CoT for Frontier Benchmarks

Gemini 3 Deep Think uses "thinking levels" (low/medium/high) via internal CoT+search, hitting 84.6% ARC-AGI-2 (verified), 94.3% GPQA, while Claude Opus 4.7 reaches 94.6% GPQA, 87.6% SWE-bench via Constitutional AI (RLAIF+RLHF on 80-page principles).[9][10]
- Gemini: Adaptive test-time compute scales inference like o1 (e.g., 77.1% ARC-AGI-2 for 3.1 Pro); no public RL details, but "Deep Think" implies RL-trained traces.[11]
- Anthropic: 2026 constitution expands RLAIF (AI self-critique+revision per principles), combined with RLHF; Claude leads SWE-bench (76.8%), AIME ~83%.[12]
Non-obvious: These rival o1/o3 on reasoning but lag coding (Gemini 63.8% SWE vs o3 71.7%), showing RL specialization matters.

Post-Training Scaling Laws: Log-Linear Gains, But Cheaper Than Pretraining

2025 surveys confirm post-training (SFT+RLxF+TTC) follows Chinchilla-like laws: reward/accuracy ~ log(RL compute), robust across RLHF/RLAIF/DPO/GRPO, but saturates faster than pretrain (e.g., RL needs ~20x less data for same delta).[13][14]
- Nathan Lambert (2025): Post-training now accounts for 40%+ of total compute (up from <10%).[13]


Recent Findings Supplement (May 2026)

DeepSeek-R1 Pioneers RLVR as Post-Training Frontier

DeepSeek-R1 deploys Reinforcement Learning with Verifiable Rewards (RLVR) directly on base models without supervised fine-tuning (SFT), using Group Relative Policy Optimization (GRPO) to reward only final-answer correctness on math/code tasks; this elicits emergent chain-of-thought (CoT) reasoning—longer traces with verification/reflection—yielding o1-level performance at ~1/10th compute cost, as the policy explores freely rather than imitating human patterns.[1][2]
- RLVR paper updated Jan 2026 (86 pages): Details R1-Zero (pure RL from V3-Base) hitting 71% AIME 2024 pass@1 (from 15.6%), via self-evolution on MATH levels 4-5 (55%→90%); R1-0528 update May 2025 boosts AIME 2025 to 87.5% (+17.5pts), GPQA 81% (+9.5pts).[3][4]
- Benchmarks: 79.8% AIME 2024 (beats o1-mini), 80.2% HumanEval; distills to 7B/14B models rivaling 235B thinkers on MATH-500 (92.8%).[5][6]
For competitors: RLVR scales reasoning 10-20x cheaper than pre-training equivalents; open-weights democratize it, but lacks o3's multimodal/generalization edge—focus on verifiable domains.
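What distinguishes RLVR is that the reward comes from a checker rather than a learned model. A minimal sketch of such an outcome reward for math-style answers; the boxed-answer extraction convention is an assumption for illustration, not DeepSeek's exact pipeline.

```python
import re

def verifiable_math_reward(response: str, gold_answer: str) -> float:
    """Binary outcome reward: 1.0 if the extracted final answer matches, else 0.0.

    Assumes the model emits its final answer inside \\boxed{...}; any equivalent
    extraction convention works. Because no reward model is involved, the signal
    cannot be gamed by fluent-but-wrong text.
    """
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if not match:
        return 0.0
    return 1.0 if match.group(1).strip() == gold_answer.strip() else 0.0

print(verifiable_math_reward(r"...so the result is \boxed{42}", "42"))  # 1.0
print(verifiable_math_reward("I think the answer is 42", "42"))         # 0.0
```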

OpenAI o3/o3-mini: RL-Enhanced Test-Time Scaling Hits Saturation

OpenAI's o3 (Apr 2025) internalizes RL-trained CoT via "reasoning effort" levels (low/medium/high), auto-allocating test-time compute for ~2x prior accuracy on hard tasks; o3-mini (Jan 2025, $1.1/$4.4/M tokens) optimizes STEM at 10x o1 cost-efficiency, but gains diminish on saturated evals like GPQA (near-PhD ceiling).[7][8]
- Benchmarks: o3 87.7% GPQA Diamond (+human experts), 71.7% SWE-bench Verified (+23pts o1), 83.3% AIME 2024, 2727 Codeforces Elo; o3-mini-high 87.3% AIME, 79.7% GPQA, but ARC-AGI-2 ~53% (behind Gemini).[9][10]
- Feb 2026 evals: o3-mini trails GPT-5.2 (100% AIME 2025, 92.4% GPQA) by 5-15pts on reasoning, but 6x cheaper inference.[11]
Implication: Shifts compute from pre-train FLOPs to inference tokens (billed as output), rivaling pre-training yields but exposing users to variable latency/cost; proprietary black-box limits replication.
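Because reasoning traces are billed as output tokens, the economics of the shift are easy to quantify. A minimal sketch using the o3-mini list prices quoted above ($1.1/$4.4 per million tokens); the trace lengths are illustrative assumptions.

```python
def query_cost(input_tokens: int, output_tokens: int,
               in_price: float = 1.10, out_price: float = 4.40) -> float:
    """Cost in USD for one query at per-million-token list prices."""
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# Same 2K-token prompt, answered with and without a long hidden reasoning trace.
print(f"short answer (500 output tokens): ${query_cost(2_000, 500):.4f}")
print(f"20K-token reasoning trace:        ${query_cost(2_000, 20_000):.4f}")
# The thinking variant costs ~20x more per request: compute moves from a one-time
# training budget to a recurring, per-query inference bill.
```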

Gemini 2.0 Flash Thinking: Efficient Inference Scaling via CoT Internalization

Gemini 2.0 Flash Thinking (exp. Feb-Mar 2025) embeds post-training CoT into fast MoE architecture, boosting reasoning ~20% over base Flash at 10x lower cost than GPT-4o ($0.25/$1.50/M); shows thought traces for transparency, but trails o3 on pure math (73.3% AIME 2024 vs 83.3%).[12][13]
- Benchmarks: 74.2% GPQA, strong multilingual/zero-shot (0.98 F1 NER); Deep Think mode (Gemini 3 lineage) hits 77.1% ARC-AGI-2 (2x prior), but SWE-bench ~76%.[11][14]
For entrants: Multimodal-native + cheap API excels agentic tasks; post-training amortizes test-time compute into base speed, but open benchmarks lag US leaders by 5-10pts.

Anthropic Constitutional AI Evolves to Value-Based RL

Anthropic's Jan 2026 Claude Constitution (79 pages, CC0) shifts RLHF/RLAIF from rule-lists to explained principles (e.g., "prioritize safety over usefulness"), generating synthetic data for RL that internalizes "character" (honest/curious/prosocial); reduces reward hacking vs. prior versions.[15]
- Claude 4.x (2026): Opus 4.6 91.3% GPQA, 80.8% SWE-bench Verified, 68.8% ARC-AGI-2 (leads commercial); Sonnet 4.6 89.9% GPQA w/adaptive thinking.[11][16]
Rivals RLVR by scaling AI feedback (RLAIF) on principles, not outcomes; strong safety (low jailbreaks) but verbose CoT hurts speed.

Post-Training Scaling Laws: Saturating Pre-Training Returns

Post-training (RLHF/RLAIF/RLVR + test-time compute) follows power-law compute-performance curves like pre-training but with distinct optima: RL post-training hits knee at ~10-100x less FLOPs for reasoning (e.g., GenRMs scale rewards but evaluator gap closes post-optimization); inference scaling (longer CoT) adds log-linear gains, shifting bottleneck to runtime energy/memory.[17][18]
- Evidence: ScaleRL (2025) shows sigmoidal RL curves (low-gain→sharp→saturate); Qwen3 experiments: thinking GenRMs +1-2% validation but reverse in policy opt; DeepSeek R1: RL alone matches o1 sans SFT data moat.[19]
Doesn't negate pre-training (base knowledge moat persists), but reallocates ~80% frontier compute to post/inference; entrants prioritize verifiable RLVR for 10x efficiency over data-hungry pre-train.
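The low-gain, sharp, then saturating shape that ScaleRL reports is a sigmoid in log-compute. A minimal sketch with assumed constants (floor, ceiling, knee, and steepness are all illustrative), useful for seeing why RL post-training reaches its knee at far fewer FLOPs than pre-training:

```python
import math

def rl_performance(flops: float, floor: float = 0.30, ceiling: float = 0.90,
                   knee_flops: float = 1e23, steepness: float = 1.5) -> float:
    """Toy sigmoidal RL post-training curve in log10(compute); all constants are assumed."""
    x = steepness * (math.log10(flops) - math.log10(knee_flops))
    return floor + (ceiling - floor) / (1 + math.exp(-x))

for f in (1e21, 1e22, 1e23, 1e24, 1e25):
    print(f"{f:.0e} RL FLOPs -> ~{rl_performance(f):.0%}")
# Gains are slow below the knee, steep around it, and flatten above it,
# which is why the marginal RL FLOP stops paying off long before pre-training does.
```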

Lever | Pre-Training | Post-Training (RLHF/RLVR + Test-Time)
Compute Scaling | FLOPs → params/data (diminishing post-10e24); knowledge ceiling GPQA ~65% human.[11] | Inference tokens → CoT depth (log-linear, 10x cheaper); reasoning saturation AIME 95-100%.[9]
Key Gains | Broad capabilities (MMLU 90%+). | Specialized reasoning (SWE 80%, ARC-AGI-2 77%); emergent verification.[20]
Bottlenecks | Data/energy (Chinchilla-optimal hit). | Runtime latency/cost; unverifiable domains (RLVR limits).[21]
2026 Frontier | ~1e26 FLOPs base (GPT-5/Gemini 3). | 50-80% gains RL/inference (o3/R1); hybrid wins (Claude Const. AI).
Report 4 Research the public evidence that Chinese AI labs — specifically DeepSeek, Alibaba Qwen, Moonshot Kimi, and others — are achieving frontier or near-frontier performance at significantly lower compute and cost than US labs. Include DeepSeek-V3 and R1 training cost claims, Qwen 2.5/3 benchmark comparisons, public analysis of their architectural and algorithmic innovations (MoE efficiency, inference optimization, distillation), and what export controls on H100/H800 chips have or have not constrained. Assess how this evidence challenges or qualifies the "more compute wins" thesis and what it implies for the US compute-scaling strategy. Produce a side-by-side capability-vs-compute table where data is publicly available.

DeepSeek-V3: MoE Architecture Delivers Frontier Performance on 10x Less Compute Than Llama 3.1 405B

DeepSeek-V3, a 671B-parameter Mixture-of-Experts (MoE) model activating only 37B parameters per token, matches or exceeds Llama 3.1 405B (a dense 405B model) on key benchmarks like MATH-500, AIME 2024, Codeforces, and SWE-bench Verified—while using just 2.8M H800 GPU hours versus Llama's 30.8M H100 hours for similar token counts (~15T). This efficiency stems from multi-head latent attention (MLA) to compress KV cache, multi-token prediction for denser learning, mixed FP8/BF16 precision (doubling FFN speed), and custom MoE routing with fine-grained experts (reducing all-to-all communication overhead by 50%+). The result: ~250 GFLOPs/token versus 2,448 for Llama 405B, enabling training in under 2 months on a modest 2K-GPU cluster.[1][2][3]
- DeepSeek-V3: 14.8T tokens, 2.788M H800 hours (~$5.6M at $2/GPU-hr, final run only; excludes R&D ~2-4x more).[4]
- Llama 3.1 405B: ~15T tokens, 30.84M H100 hours (~$90M+ at $3/GPU-hr equivalent).[3]
- Benchmarks: DeepSeek-V3 tops Chatbot Arena top-10, beats GPT-4o/Claude 3.5 Sonnet pairs on hard evals; Llama lags on coding/math.[2]
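The per-token figures above follow from the standard approximation of ~6 x (active parameters) training FLOPs per token. A minimal sanity check; the 6N rule is a common community approximation layered onto the parameter and token counts quoted in this section, not a number from either lab's report.

```python
# Training FLOPs per token ~= 6 * active parameters (common approximation).
def train_flops_per_token(active_params: float) -> float:
    return 6 * active_params

deepseek_v3_active = 37e9    # MoE: 37B active of 671B total
llama_405b_active = 405e9    # dense: all 405B parameters active

ds_per_tok = train_flops_per_token(deepseek_v3_active)   # ~2.2e11, i.e. ~222 GFLOPs/token
ll_per_tok = train_flops_per_token(llama_405b_active)    # ~2.4e12, i.e. ~2,430 GFLOPs/token

ds_total = ds_per_tok * 14.8e12   # 14.8T tokens -> ~3.3e24 FLOP
ll_total = ll_per_tok * 15e12     # ~15T tokens  -> ~3.6e25 FLOP
print(f"DeepSeek-V3 ~{ds_total:.1e} FLOP vs Llama 3.1 405B ~{ll_total:.1e} FLOP "
      f"(~{ll_total / ds_total:.0f}x gap)")
# Close to the ~250 vs ~2,448 GFLOPs/token and ~10x compute gap quoted above
# (attention FLOPs account for most of the remainder).
```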

Implications for Competitors: US labs can't replicate this without MoE retrofits (data moats don't transfer), but owning H100s allows 10x more experiments—key for discovery. New entrants need $1B+ CapEx for clusters; rent via cloud but face 2-3x markups.

Model | Total Params | Active Params | Tokens (T) | GPU Hours (M) | GPU Type | Est. Cost (Final Run, USD) | Key Benchmarks (e.g., MMLU-Pro/MATH)
DeepSeek-V3 | 671B (MoE) | 37B | 14.8 | 2.79 | H800 | $5.6M | 71%/83% [2]
Llama 3.1 405B | 405B (Dense) | 405B | ~15 | 30.8 | H100 | ~$92M | 65%/lower [3]

Alibaba Qwen 2.5: Dense Models Punch Above Weight via Data-Centric Scaling, No Compute Disclosure

Qwen 2.5-72B (dense) rivals GPT-4o on MMLU (86.1%), HumanEval (86.6%), and MATH (83.1%) through post-training on 18T tokens emphasizing code/math synthetics, multilingual data (29+ languages), and structured outputs—outpacing Llama 3.1-70B by 4-10pp on most evals without MoE. Efficiency comes from density improvements (e.g., 32B matches prior 72B), long-context packing (128K), and prompt resilience, but no GPU hours/costs released (unlike DeepSeek). Smaller variants (e.g., 7B Coder) beat larger peers via targeted distillation.[5]
- Qwen2.5-72B: Beats Llama-3-405B base on knowledge/coding; Qwen2.5-Math-72B tops GPT-4o on math via CoT/PoT.
- API inference: ~$0.23-0.40/M tokens (10x cheaper than GPT-4o), 200+ t/s on optimized setups.[6]

Implications for Competitors: Proves data quality > raw scale for mid-tier; US firms must match synthetic pipelines. Entrants: Leverage Qwen for cheap coding/math agents, but fine-tune for proprietary data.

Moonshot Kimi: Sparse MoE Scales to 1T Params on H800s, Matching DeepSeek Efficiencies

Moonshot's Kimi K2 series (1T MoE, 32B active) claims ~$4.6M training (unverified, similar to DeepSeek-V3's $5.6M), rivaling GPT-5/Claude on agentic benchmarks (HLE 44.9%, SWE-Bench 71%) via hybrid linear attention, INT4 quantization-native training, and 384-expert routing. Like DeepSeek, uses H800s; no exact GPU hours, but inference at 12% of dense peers via sparse activation.[7]
- Kimi K2.5: GSM8K 94.4% (beats GPT-4.1); API $0.60/$3.00 per M tokens (5-10x cheaper).
- Vs DeepSeek: Similar cost/performance; Kimi edges agent/tools.

Implications for Competitors: Validates MoE for constrained hardware. US: Adopt for inference savings; entrants: Run quantized on consumer GPUs.

Export Controls: H800 Loophole Closed, But Spurred Efficiency—Now Huawei Ascends

US controls (2022: block H100; 2023: H800/A800) forced DeepSeek/Qwen/Kimi to H800s (400GB/s vs H100's 900GB/s), yet they innovated (e.g., DeepSeek's comm protocols offset 44% bandwidth loss). DeepSeek claims pure H800 for V3; skeptics cite pre-ban A100 stockpiles (10K+) or smuggling (e.g., Malaysia shells for H100s). Controls accelerated MoE/FP8 (China leads open MoE), but tightened 2024-26 rules (H20 bans) delay R2/V4; Huawei chips enable GLM-5 (744B). Gap: US 74% global AI compute vs China's 14%.[8][9]
- Evidence: DeepSeek V3/R1 on H800; no H100 proof (Nvidia denies); delays noted for H20 shortage.
- Distillation: OpenAI/Anthropic accuse DeepSeek/Moonshot/MiniMax of API scraping (16M+ Claude queries).

Implications for Competitors: Controls buy time (China lags 7mo on ECI), but efficiency erodes lead—US must target distillation/IP.

Challenging "More Compute Wins": Algorithms + Data Now Paramount

Evidence qualifies the scaling laws: DeepSeek/Qwen achieve ~90% of frontier performance on 10-20% of the compute via MoE (3-5% of params active), distillation (R1 from V3), and synthetic data. "Compute wins" still holds for discovery (US labs run ~10x more experiments), but inference/deployment favors efficiency, which is China's edge. Implication: US strategy shifts toward software moats (e.g., o1-style reasoning), but open-source diffusion (Qwen/DeepSeek under Apache/MIT licenses) commoditizes those gains.[10]

For US Compute-Scalers: Double down on 100K+ H100 clusters for AGI breakthroughs; license MoE/distill to match efficiency. Entrants: Build on Chinese opens (e.g., Qwen for $0.2/M), focus verticals—risk: Security/distillation bans. Confidence: High on claims (papers verifiable); medium on full costs (excl. R&D); low on H100 evasion (anecdotal). More audits needed.


Recent Findings Supplement (May 2026)

Recent Model Launches and Efficiency Claims (April 2026)

DeepSeek released V4-Pro (1.6T total parameters, 49B active MoE) and V4-Flash (284B total, 13B active) on April 24, 2026, as open-weight models under MIT license with 1M token context. These use hybrid Compressed Sparse Attention (CSA) + Heavily Compressed Attention (HCA), reducing single-token inference FLOPs to 27% and KV cache to 10% of V3.2 at 1M context (V4-Flash: 10% FLOPs, 7% KV). Pre-trained on 32-33T tokens; post-training via two-stage SFT+GRPO then distillation. Benchmarks claim near-parity with GPT-5.4/Claude Opus 4.6/Gemini 3.1 Pro (e.g., 80.6% SWE-bench Verified vs. Claude 80.8%; Codeforces 3206 rating). API pricing: V4-Pro $1.74/1M input, $3.48/1M output (~1/6-1/7 US frontier cost); V4-Flash $0.14/$0.28.[1][2][3]
- V4 validated on Huawei Ascend NPUs (1.5-1.73x speedup via fine-grained Expert Parallelism), signaling reduced Nvidia reliance.[4]
- No direct training compute/FLOPs/cost disclosed in tech report; inference efficiency stems from MoE sparsity + attention compression, enabling frontier capability at lower runtime compute.

Moonshot AI's Kimi K2.6 (1T MoE, ~32B active) launched ~April 2026, topping open-weight leaderboards (Intelligence Index 54; beats GPT-5.4/Claude 4.6 on reasoning/coding). Supports 100+ agent swarms, multimodal; pricing $0.95/1M input, $4/1M output. Prior K2 trained on 15.5T tokens; no new compute details, but emphasizes low-bit quant (INT4) for edge deployment.[5]

Alibaba's Qwen3.6 series (e.g., 35B-A3B MoE, 3B active; 27B dense) in April 2026 beats prior Qwen3-235B-A22B on agentic coding (SWE-bench Verified 75%; Terminal-Bench 51.5%). Hybrid dense/MoE for balance; no training FLOPs/cost, but emphasizes "more intelligence, less compute" via architecture/RL/data.[6]

Implications for entrants: Open-weights + MoE enable self-hosting frontier models on 8x H100 (~$300K cluster) at 1/6 inference cost of US APIs, eroding closed-model moats for high-volume apps.

Architectural Innovations Driving Efficiency (Dec 2025-Apr 2026)

DeepSeek-V3.2 (Dec 2025 arXiv) introduced DeepSeek Sparse Attention (DSA): compresses fine-grained tokens to coarse (1:16 ratio) + sliding window, cutting 128K inference cost 60%+ (prefill $0.2/M vs $0.7/M tokens; decode $0.8/M vs $2.4/M on H800). Manifold-Constrained Hyper-Connections replace residuals for stable scaling; scalable RL with >10% pre-train compute budget yields GPT-5 parity (e.g., gold IMO/IOI 2025). V4 extends with CSA/HCA for 1M context at 27% prior FLOPs.[7][3]
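The attention savings from a coarse 1:16 compression plus a sliding window can be approximated by counting the positions each query token attends to. A minimal sketch; the 4K local-window size is an assumption for illustration (the report states the 1:16 ratio but not the window), so treat the output as indicative only.

```python
def attended_positions(context_len: int, compress_ratio: int = 16,
                       local_window: int = 4096) -> int:
    """Approximate keys per query under coarse (compressed) + sliding-window attention."""
    coarse = context_len // compress_ratio   # compressed "summary" positions
    local = min(local_window, context_len)   # fine-grained recent positions
    return coarse + local

dense = 128_000
sparse = attended_positions(128_000)
print(f"dense: {dense:,} keys/query; DSA-style: {sparse:,} keys/query "
      f"(~{sparse / dense:.0%} of dense attention)")
# ~12K vs 128K attended positions, roughly a 10x cut in attention FLOPs; the
# end-to-end 60%+ cost reduction is smaller because MLP and decode costs remain.
```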

Chinese labs favor MoE (top-10 open models; e.g., Kimi K2.6 1T/32B active, Qwen3.6-35B-A3B 35B/3B) for sparse activation (10-30B active vs dense equivalent), slashing inference FLOPs 70-95% while matching dense knowledge capacity. Distillation from large MoE "teachers" to small dense (e.g., DeepSeek-R1-Distill-Qwen-1.5B); RL scales post-training 10%+ pre-train budget.[8]

Implications: Export constraints forced MoE/attention optimizations; US labs must co-design hardware/software for equivalent efficiency, as raw scaling yields diminishing returns.

Export Controls: Limited Constraint on Progress (Ongoing 2026)

US tightened H100/H800 bans (Oct 2023+); H20 (15% H100 perf) licensed limitedly until Apr 2025 (~1.5M units to China, 224K H100e equiv.); H200 approvals paused/reversed amid smuggling ($92M H100 cases). Nvidia zero China share; labs use H800 stockpiles/Huawei Ascend. DeepSeek V4 optimizes Huawei (fine-grained EP speedup); no slowdown—V4 rivals GPT-5.4 despite constraints.[9][4]

Implications: Controls spurred efficiency (e.g., DeepSeek $5.6M V3 train on 2.8M H800-hours vs GPT-4 $100M+ est.), narrowing US lead; future US strategy needs algo/data focus over hardware denial.

Capability-vs-Compute Table (Public Data Only)

Model | Total Params | Active Params | Pre-train Tokens | Train Compute/Cost (est.) | Key Benchmarks (Recent) | US Frontier Comparison
DeepSeek-V3 (2024 base, referenced 2026) | 671B MoE | 37B | 14.8T | 2.8M H800-hours (~$5.6M USD)[10] | AIME 2025: 96%; GPQA: 81% | GPT-5 math parity; 10x cheaper[11]
DeepSeek-R1 (RL on V3) | ~70B | - | - | ~$294K (512 H800s)[12] | SWE-bench: ~71%; AIME: 87.5% | GPT-4o+; 25-30x cheaper tokens[13]
DeepSeek-V4-Pro (Apr 2026) | 1.6T MoE | 49B | 33T | Not disclosed | SWE-bench Verif: 80.6%; GPQA Diam: 90.1%; LiveCodeBench: 93.5%[2] | GPT-5.4/Claude 4.6 (3-6 mo lag est.)[1]
Kimi K2.6 (Apr 2026) | 1T MoE | ~32B | 15.5T (K2 base) | Not disclosed (~$4.6M prior K2 est.)[14] | Intelligence Index: 54; deep reasoning/coding top open[5] | GPT-5.4/Claude 4.6; 10x inference leap on GB200[15]
Qwen3.6-35B-A3B (Apr) | 35B MoE | 3B | Not disclosed | Not disclosed | SWE-bench Verif: 75%; Terminal 2.0: 51.5%[6] | Beats Qwen3-235B-A22B; dense 27B rival[16]

Notes: H100e equiv. est. H800~0.8-1x H100 perf; costs at ~$2/GPU-hr. No exact V4/Qwen3.6 FLOPs; table uses verified claims. US est. (e.g., GPT-4 $100M+) from prior reports.

Challenging "More Compute Wins" (2026 Consensus)

Chinese labs qualify scaling laws: MoE/DSA/HCA deliver GPT-5 parity at 1/10-1/30 train cost/inference via sparsity (active <5% total params) + RL (10%+ pre-train budget). E.g., DeepSeek-V3.2 Speciale gold IMO/IOI on scaled post-train; V4 1M context at 27% FLOPs. Controls accelerated this—US hyperscale CapEx ($600B 2026) vulnerable if efficiency > raw FLOPs.[7]

US Strategy Implications: Pivot to inference optimization (KV compression, test-time scaling); watch open-weight distillation risks; compute moats erode, so focus on data moats and agents. Confidence: high on benchmarks/pricing (verified); medium on undisclosed training compute (inferences marked). Further review of the V4 technical report is needed to pin down training FLOPs.

Sources:
- Fortune on the DeepSeek V4 launch and pricing
- arXiv: DeepSeek-V3.2 technical report
- Hugging Face: DeepSeek V4-Pro model card
- VentureBeat on V4 Huawei Ascend optimization
- IntuitionLabs on DeepSeek training costs
- Stanford DeepSeek report (V3/R1)
- Introl on DeepSeek-V3.2
- Qwen3.6-35B-A3B release blog

Report 5 Research the publicly stated positions of Richard Sutton, Andrej Karpathy, Yann LeCun, and Dylan Patel (SemiAnalysis) on scaling laws, AGI timelines, compute as bottleneck, and the bitter lesson — and map where each explicitly or implicitly diverges from Dwarkesh Patel's synthesis. Include dated blog posts, X/Twitter threads, interviews, and papers. For each figure, identify their single strongest argument that challenges Dwarkesh's framework and their single strongest point of agreement. Produce a structured comparison matrix covering: scaling optimism, timeline confidence, architecture bets, and deployment/economic gap views.

Richard Sutton: Author of the Bitter Lesson Advocates Experiential Learning Over LLM Scaling

Richard Sutton, via his seminal 2019 "Bitter Lesson" essay and a September 2025 Dwarkesh Patel interview, argues that AI progress hinges on general methods leveraging computation through search and learning from raw experience, not human-encoded knowledge like LLMs.[1][2] LLMs mimic human text but lack goals, ground truth feedback, and continual learning from action-reward streams, making them a "dead end" for true intelligence despite short-term gains. This directly challenges Dwarkesh's scaling optimism by predicting experiential RL systems will supersede LLMs, as human priors plateau while compute-leveraging methods scale indefinitely.[2]

  • Bitter Lesson (Mar 13, 2019): General computation-based methods outperform knowledge-built ones long-term, as seen in chess, Go, speech, and vision; warns against human knowledge plateaus.[1]
  • Dwarkesh interview (Sep 2025): LLMs embed human knowledge, violating Bitter Lesson scalability; true intelligence requires on-the-fly learning like animals, not static pretraining.[2]
  • Oak Architecture talk (Aug 2025): Proposes continual RL with meta-learned step-sizes and abstractions for superintelligence from experience alone.[3]

Implications for Competitors: Dwarkesh's LLM+RL scaffold risks lock-in to inefficient human-data dependency; entrants should pivot to raw experiential RL (e.g., robotics sims) for data-efficient scaling post-2030 compute limits.

Andrej Karpathy: Balanced Scaling Plus Agents, AGI in a Decade

Andrej Karpathy, in his October 2025 Dwarkesh interview, supports scaling's empirical success but tempers it: LLMs provide representations via imitation ("ghosts" of humanity), not animal-like evolution, with RL "terrible" due to noise and inefficiency.[4] AGI (~human remote worker) is "a decade away" as agents need multimodality, continual learning, and reliability; progress blends into 2% GDP growth without explosion. Diverges from Dwarkesh's potential short-term takeoff by emphasizing tractable but "difficult" engineering over paradigm shifts.[4]

  • AGI Timelines (Oct 17, 2025): "Tractable but difficult" problems like agent cognition yield AGI in ~10 years; rejects 1-5 year hype.[4]
  • Scaling/Compute: "Everything plus 20%" across data/hardware/algos; pretraining not dominant, models stay practical sizes amid flops budgets.[4]
  • NanoGPT experiments (Jan 2026): Reproduces Chinchilla-optimal scaling (8:1 tokens:params ratio), validating predictable compute-optimal families.[5]

Implications for Competitors: Dwarkesh's explosion odds underestimate agent scaffolding time; focus on "cognitive core" (small, tool-using models) for edge deployment, avoiding frontier compute races.

Yann LeCun: Architectural Paradigm Shift Needed Beyond LLM Scaling

Yann LeCun consistently rejects LLM scaling to AGI (e.g., 2024-2026 talks/papers), arguing autoregressive prediction lacks world models, planning, and common sense; needs joint-embedding predictive architectures (JEPA), energy-based models, and model-predictive control over RL.[6] No specific timelines, but "not in next 2 years... 5-6 years if everything goes well" (Dec 2024); scaling LLMs is inefficient pixel/token prediction vs. latent physics understanding. Implicitly diverges from Dwarkesh by dismissing compute-alone paths, favoring Meta's world-model focus.[7]

  • JEPA Path (2024-2026): Abandon generative/contrastive/RL for regularized embeddings predicting abstract states; LLMs "dead end" without world models.[6]
  • Timelines (Dec 2024): AGI hard, underestimated historically; not imminent via scaling.[7]

Implications for Competitors: Dwarkesh's bet on LLM extrapolation ignores architecture walls; invest in world models (e.g., robotics data) for post-scaling era.

Dylan Patel: Compute Hardware as the Hard Bottleneck to Scaling

Dylan Patel (SemiAnalysis), in a March 2026 Dwarkesh interview and related posts, details scaling's physical limits: logic (TSMC/ASML EUV, a ~200GW cap by 2030), memory (HBM crunch, ~30% of CapEx), and power (solvable). Synthetic data unlocks short-term gains, but supply chains cap compute growth at ~4-5x/year.[8] He gives no explicit AGI timelines, but implies the US wins short-term (fast scaling) and China long-term; he aligns with the Bitter Lesson in that infrastructure is what enables compute leverage. He diverges implicitly by quantifying ASML-driven constraints that bite earlier than Dwarkesh's assumed post-2030 "algorithm era."[8]

  • Bottlenecks (Mar 2026): ASML #1 by 2030 (3.5 tools/GW); memory prices triple; Nvidia dominates N3 wafers.[8]
  • Synthetic Data (Dec 2024): Unlocks rapid improvement next 6-12 months.[9]

Implications for Competitors: Dwarkesh's timelines assume smooth scaling; hedge with diversified supply (e.g., older nodes, non-US fabs) or inference optimization.

Category | Sutton (Bitter Lesson Proponent) | Karpathy (Balanced Scaler) | LeCun (Architecture Skeptic) | Patel (Compute Analyst) | Dwarkesh Synthesis (Scaling Optimist)
Scaling Optimism | Low: LLMs plateau; experiential RL scales better[2] | Medium: Continues incrementally ("+20% everything"); obeys Chinchilla-like laws[4] | Low: LLMs doomed; needs JEPA shift[6] | Medium: Synthetic data boosts short-term; hardware caps long-term[8] | High: Predictable till ~2030; 70% AGI by 2040 via scaling+algos[10]
Timeline Confidence | No dates; post-LLM era soon via RL[2] | ~10 years to AGI (remote worker)[4] | 5-6 years optimistic; hard problem[7] | Implicit: Fast US scaling wins short-term[8] | 50% taxes/computer-use AGI by 2028; lognormal, this decade or bust[10]
Architecture Bets | Experiential RL (Oak); no imitation priors[2] | LLM agents + cognitive core; RL poor[4] | JEPA/energy-based/MPC; abandon LLMs/RL[6] | N/A; infra-focused[8] | LLMs + RL/scaffolding; continual learning key bottleneck[10]
Deployment/Econ Gap | Human knowledge locks scalability[2] | Blends into 2% GDP; gradual diffusion[4] | World models enable efficient local AI[6] | ASML/memory/power cap growth; H100 value rises[8] | Explosive if continual learning solved; compute ends ~2030[10]

Strongest Challenges to Dwarkesh:
- Sutton: No goals/ground truth in LLMs blocks world-model RL.[2]
- Karpathy: Agents need decade of engineering; no explosion.[4]
- LeCun: Wrong architecture; scaling predicts tokens, not physics.[6]
- Patel: Hardware (ASML) caps scaling sooner than assumed.[8]

Strongest Agreements:
- All nod to Bitter Lesson/compute leverage; Sutton/Karpathy agree on continual learning need; Patel enables Dwarkesh's scaling short-term.[1][2]


Recent Findings Supplement (May 2026)

Richard Sutton: LLMs Lack Experiential Ground Truth for True Intelligence

Richard Sutton, in his September 26, 2025 Dwarkesh Patel podcast, argues LLMs fail the Bitter Lesson by relying on human data rather than scalable experiential learning with intrinsic goals and ground truth feedback.[1] This creates a non-scalable "prior" that plateaus, as LLMs predict human text without verifying outcomes or adapting via surprise—mechanisms essential for animal-like continual learning. No AGI timelines given, but superintelligence inevitable via RL from experience; compute scales methods, but architecture must enable on-the-fly world modeling first.[1]

  • Sutton clarifies LLMs are "kinda yes, kinda no" Bitter Lesson: they scale compute but inject human knowledge, which history shows gets superseded (e.g., chess, Go).[1]
  • Strongest challenge to Dwarkesh's scaling optimism: LLMs have no "ground truth" (no prediction of real-world response to actions), preventing true world models; building RL atop them repeats past errors where human priors inhibit scalability.[1]
  • Agreement: Continual learning is essential for AGI, as humans/animals learn on-the-job without special training phases.[1]

Implications for competitors: Dwarkesh's LLM+RL scaffold risks data exhaustion; pure experiential RL (e.g., Sutton's Oak architecture, presented Aug 2025) offers a compute-efficient path but needs breakthroughs in meta-learning abstractions.[2]

Andrej Karpathy: Agents Need a Decade for Cognitive Fixes Despite Scaling Gains

In his October 17, 2025 Dwarkesh interview, Karpathy predicts AGI (human-level knowledge work) ~10 years out, as current LLMs suffer "cognitive deficits" like no continual learning or reliable computer use—despite scaling across data/algorithms/compute yielding "everything plus 20%" progress.[3] Pre-training bootstraps representations (Bitter Lesson via internet-scale data as "crappy evolution"), but RL is "terrible" (noisy supervision); shift to inference/post-training dominates, with models shrinking for RL speed.[3]

  • Timeline from experience: past hype (Atari RL, Universe) failed without priors; agents are "slop" now, maturing over decade via multimodality/memory.[3]
  • Strongest challenge: LLMs over-memorize (hazy weights vs. crisp context), needing "cognitive cores" (small, memory-stripped models) for generalization; scaling amplifies bugs like adversarial RL failures.[3]
  • Agreement: LLMs analogize human cognition (context=working memory), enabling gradual agentic progress.[3]

Implications for entrants: Bitter Lesson holds (scale general methods), but prioritize post-training/agents over pre-training giants; compute not sole bottleneck—data quality/RL noise demands hybrid human-AI loops.[4]

Yann LeCun: World Models via JEPA Replace LLM Scaling Dead-End

LeCun, post-Meta (late 2025), launched AMI Labs (Jan 2026, $1.03B seed) to build JEPA world models, arguing LLMs cannot reach human-level AI via scaling: autoregressive prediction lacks causal physics and planning, hitting a "dead end" without world models (which predict latent states, not pixels/text).[5][6] The LeWorldModel paper (Mar 13, 2026) stabilizes JEPA end-to-end from pixels (15M params, single GPU), enabling 48x faster planning vs. giants.[6] He gives no firm timelines (rejects the AGI term; human-level 3-5+ years away via a paradigm shift) and argues compute is wasted on inefficient LLMs.[7]

  • Bitter Lesson divergence: scaling LLMs "bullshit" for intelligence; JEPA/model-predictive control scales better for reality.[8]
  • Strongest challenge: LLMs can't plan (no world model for "what-if"); JEPA learns causality from video/sensors, obsoleting token prediction.[5]
  • Agreement: Scaling compute/data drives progress, but needs architectural pivot (e.g., his 1989 CNN modernized via scale).[3]

Implications for rivals: LLM labs face $200B+ compute sunk cost fallacy; world models (e.g., V-JEPA 2.1) enable efficient robotics/healthcare, but require multimodal data moats.[6]

Dylan Patel: Compute Supply Chains Bottleneck Aggressive Scaling

In his March 13, 2026 Dwarkesh interview, Patel details compute as AI's trilemma: logic (ASML EUV caps ~200GW by 2030), memory (30% of $600B 2026 CapEx, prices 3x), power (scalable).[9] Scaling laws persist (models 10x cheaper/year), favoring early committers (Nvidia/OpenAI lock-ins); H100s appreciate as efficiency rises. No explicit timelines/AGI, but fast progress via RL/smaller models for research speed; China lags but closes if timelines >2035.[9]

  • Bitter Lesson alignment: Research flops push Pareto frontier; hardware follows (e.g., Blackwell TMA).
  • Strongest challenge: EUV math (~3.5 tools/GW) trumps power hype; older fabs/Taiwan risk limit alternatives.[9]
  • Agreement: Aggressive scaling (5-6GW labs by 2026) needed; power no issue.[9]

Implications for builders: Secure forward contracts; inference specialization (context processors) unlocks economics, but memory crunch favors diversified supply.

Comparison Matrix | Scaling Optimism | Timeline Confidence | Architecture Bets | Deployment/Economic Gap
Dwarkesh Synthesis (inferred: LLM+RL scaling) | High: Pretrain priors enable RL | Medium-short (guests vary; his longer ~decade?)[4] | Transformers + agents | Compute scales deployment; GDP blends[3]
Sutton | Low: LLMs plateau; experience scales | Inevitable, no date | RL/experiential (Oak) | Humans data-limited; on-fly learning economical[1]
Karpathy | Medium: Balanced scaling + fixes | ~10 years (decade of agents) | Cognitive cores + agents | Inference/RL dominate; gradual GDP 2%[3]
LeCun | Low: LLM dead-end | 3-5+ years paradigm shift | JEPA/world models | Robotics needs physics; local efficient[6]
Patel | High: Efficiency laws hold | Short-medium (chip-limited) | RL/smaller for speed | Supply chains cap; forward deals win[9]

No new policy/regulatory updates or stats post-May 2025 beyond compute forecasts; LeWorldModel (Mar 2026) is the key new publication.[6] Confidence is high on the interviews, medium on the LeCun synthesis (no direct Dwarkesh interview). Additional X/web searches for real-time posts surfaced no further divergences.

Report 6 Steelman and then document the strongest publicly available evidence that Dwarkesh Patel's scaling-and-compute thesis is wrong or overstated. Specifically research: (a) documented cases where large compute increases produced disappointing or sublinear capability gains (GPT-4 to GPT-5 trajectory, Gemini Ultra 1.0 launch reception); (b) the gap between benchmark performance and real-world economic deployment — evidence that frontier models are not yet automating knowledge work at predicted rates; (c) energy and grid constraint data showing physical limits on US data center build-out timelines; (d) data quality and contamination concerns limiting pre-training scaling; (e) the epistemics critique — academic literature on why AI timeline predictions systematically fail; and (f) interviewee selection bias — how relying on US frontier-lab insiders may systematically skew Dwarkesh's synthesis toward compute-optimism. Produce a ranked list of the five strongest falsifying arguments with supporting evidence and sources.

Physical Compute Constraints Severely Limit Frontier Model Scaling Timelines

US data center expansion for AI training and inference faces acute physical bottlenecks from power grid capacity, electrical equipment shortages (e.g., transformers with 2.5-5 year lead times), and interconnection queues, projecting 30-50% of 2026's planned 12-16 GW capacity delayed or canceled—only ~5 GW under active construction despite $650B+ Big Tech commitments.[1][2] This works via overloaded regional grids (e.g., PJM forecasting shortages by 2027) where hyperscalers compete for finite substation access, forcing project relocations or off-grid solutions like gas generators, but transmission lines take 15-30 years to permit. Non-obvious implication: even if chips scale, effective FLOPs stagnate as clusters idle without power, capping next-gen training runs at 10-20% below announcements.

  • Sightline Climate tracks 190 GW pipeline but only 5 GW of 2026's 12 GW under construction; Wood Mackenzie notes Q4 2025 pipeline halved to 25 GW due to grid brakes.[3]
  • Transformer demand exceeds supply by 30% in 2026 (up 21% YoY), mostly from China amid tariffs; Gartner predicts 40% of AI DCs power-constrained by 2027.[4]
  • For competitors: New entrants need 3-5 GW clusters but face 5+ year queues; incumbents like MSFT hoard via PPAs, widening moats but slowing ecosystem-wide scaling.

Implication for entering the space: Pure compute plays (e.g., building from scratch) fail without pre-existing grid deals; partner with utilities or pivot to edge inference where power is decentralized.

Real-World Knowledge Work Automation Lags Benchmarks by Orders of Magnitude

Frontier agents saturate academic benchmarks but automate only 4.17% of real remote freelance projects (Remote Labor Index: 240 Upwork tasks worth $140K+), vs. humans completing 100%—a 96% gap revealing failures in tool orchestration, error recovery, and end-to-end delivery despite benchmark mastery.[5] Mechanism: Benchmarks test isolated reasoning; RLI demands multi-hour workflows (e.g., game dev, architecture) where agents fail 78% on tool selection and task understanding, as models lack persistent state or economic incentives like commissions. Implication: Economic value accrues slowly, as GDPval shows models ~50% expert-parity on 220 tasks but 100x faster/cheaper only for narrow, non-iterative work—full automation requires human oversight, muting predicted R&D explosions.

  • RLI top: Claude Opus 4.6 (4.17%), GPT-5.2 (2.5%), humans baseline near 100% (6K+ hours value).[5]
  • GDPval: Claude 4.1 ties/wins humans 47.6% (aesthetics), GPT-5 39% strict wins (accuracy); doubled from GPT-4o in 1 year but one-shot only, ignores iteration.[6]
  • For deployment: Agents fail 97.5% on $1K+ gigs; benchmarks like MMLU saturate while real tasks expose "tool usage" as killer gap.

Implication for competitors: Benchmark-chasing wastes compute; build agent scaffolds with human-in-loop for 10-20x ROI before pure autonomy.

High-Quality Pre-Training Data Exhaustion Caps Capability Gains

Public high-quality text data exhausts by 2026 (10-50T tokens available vs. 20T+ needed for next frontiers), forcing reliance on low-quality/synthetic sources that risk "model collapse" via overfitting or degraded distributions—evidenced by GPT-4.5 (10x GPT-4o compute) yielding only marginal gains, signaling sublinear returns.[7][8] Mechanism: Scaling laws predict power-law loss drops, but data scarcity bends curves; repetition adds overfitting penalties growing with model size, while synthetic data follows "rectified" laws but lacks novelty. Non-obvious: Labs hoard private data (e.g., YouTube transcripts), but global exhaustion hits open-source hardest.

  • Epoch AI/Stanford: High-quality text gone by 2026; images 2030-2060.[9]
  • Scale AI's Wang: "Data wall is real," synthetic underperforms without human anchoring.[10]
  • X sentiment: GPT-5.x "diminishing returns," memory walls compound data limits.[11]

Implication for entering: Curate domain-specific human data; synthetic scales narrow tasks but general pre-training plateaus.

Sublinear Capability Gains from Massive Compute Jumps

GPT-5 (877 days post-GPT-4, ~10x compute) launched to "underwhelming" reception—marginal over GPT-4o despite hype, with regressions in reliability, hallucinations persisting, and no "PhD-level" leap; Gemini Ultra benchmarked well (30/32 vs GPT-4) but real use lagged (e.g., slower logic, context loss).[12][13] Mechanism: Power-law scaling slows at frontiers (log returns to compute), plus contamination/overfitting; GPT-4.5 (10x bigger) "only marginally better," hitting "scaling wall."[8] Implication: Bets on 100x clusters yield ~20-30% gains, not revolutions—test-time compute (e.g., o1 reasoning) extracts more from existing models.

  • GPT-5: "Overhyped/underwhelming," stability issues, long wait for small jumps.[14]
  • Gemini: Benchmarks close (e.g., GSM8K 94% vs GPT-4 92%), but "worse in practice."[15]
  • X: "Diminishing returns finally kicked in."[11]

Implication for competitors: Optimize inference scaling over pre-training; small models + reasoning beat giants.
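The "log returns to compute" mechanism maps onto the standard power-law loss curve: with a small exponent, a 10x compute jump moves loss only slightly once near the irreducible floor. A minimal sketch; the 1.69 floor echoes the Chinchilla irreducible-loss estimate, while the coefficient and exponent are illustrative assumptions.

```python
def loss(compute: float, floor: float = 1.69, a: float = 5.0, alpha: float = 0.05) -> float:
    """Toy power-law scaling curve: L(C) = floor + a * C^(-alpha); constants are assumed."""
    return floor + a * compute ** (-alpha)

for c in (1e25, 1e26, 1e27):
    print(f"{c:.0e} FLOP -> loss {loss(c):.3f}")
# ~1.97 -> ~1.94 -> ~1.91: each 10x of compute shaves a shrinking sliver off the
# reducible term, which is why a GPT-4 -> GPT-5 scale jump can feel underwhelming
# even when the curve behaves exactly as the scaling laws predict.
```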

Historical Expert Overoptimism Undermines Short-Timeline Confidence

AI researchers systematically overestimate progress: 2023 AI Impacts survey (2,778 experts) median HLMI 2040 (50% by 2059), vs. 1970s predictions of AGI by 1980-1990 all failed; modern labs' 2027 claims echo past hype without updating on errors.[16] Mechanism: Recency bias + benchmark saturation ignores deployment gaps; Grace et al. show predictions unchanged from non-experts/past flops. Implication: Compute-optimism risks overinvestment if timelines stretch.

  • Surveys: TOP100 median AGI 2040-2050; historical 20-50y misses.[17]
  • AI Impacts: Experts slower than lab CEOs (e.g., Altman 2027 vs median 2040).[18]

Implication for entering: Hedge with diversified bets; long timelines favor infra over models.

Ranked Falsifying Arguments

  1. Physical Compute Constraints (strongest: hard limits, quantified delays).
  2. Real-World Automation Gap (direct economic test, 96% failure).
  3. Data Exhaustion (2026 deadline, lab admissions).
  4. Sublinear Gains (GPT-5/Gemini cases).
  5. Expert Overoptimism (systematic bias in predictions).

Recent Findings Supplement (May 2026)

1. GPT-5 Launch: Massive Compute Yields Sublinear Gains and User Disappointment

OpenAI's GPT-5, released August 2025 after unprecedented compute scaling from GPT-4, triggered a "great AI hype correction": users reported it performed worse than GPT-4 on coding (introducing bugs, unnecessary error handling), instruction-following, and creative tasks, despite benchmark claims—exposing how eval optimization masks real capability plateaus.[1][2]
- GPT-5's router auto-switched models inconsistently, leading to "dumber" outputs; Altman admitted underestimating GPT-4o's "warmth."[1]
- Developers soured: "PhD-level intelligence" polluted code; migration from GPT-4 broke prompting playbooks as native reasoning conflicted with chain-of-thought.[3][4]
- Forums echoed: GPT-5.4/5.2 "disappointing," worse instruction-following than GPT-4; coding "downgraded," "disaster."[5][6]
For competitors: Prioritize architectural innovation over raw scale; pure compute bets risk commoditization as rivals like Claude leapfrog on reliability.

2. Jagged Frontier and Benchmark Saturation: Benchmarks Overstate Real-World Deployment

Frontier models saturate benchmarks (e.g., MMLU >90%, SWE-bench ~100%) via contamination/gaming, but reveal a "jagged frontier" in real tasks: gold-medal math Olympiads yet ~50% analog clock reading; 66% OSWorld agent success but 1/3 failures; <3% real freelance automation—exposing scaling's failure to smooth uneven capabilities for economic value.[7][8]
- AI agents automate 2.5% remote jobs max (Manus); GPT-5/Claude/Grok/Gemini 0.8-2.1% on freelance benchmarks vs. hype.[9]
- Jaggedness: BCG consultants +AI 25% faster/40% better inside frontier, 19% worse outside (complex strategy); robots 89% sim success, 12% real-world.[10][8]
- Benchmarks saturate months post-release; no translation to messy open-world (e.g., HLE low accuracy despite PhD-level claims).[11]
Entrants: Build hybrid systems (neurosymbolic/tools) targeting jagged gaps; pure LLMs commoditize on evals, fail deployment ROI.

3. Data Contamination and Quality Limits: Pretraining Hits Diminishing Returns

Public data walls (~15T tokens) force synthetic/AI-generated inputs, causing "model collapse": outputs degrade (bias amplification, lost edge cases), with 1% bad data breaking models; toxicity needs deliberate injection for detox, but overuse inverts gains—scaling compute can't fix poisoned representations.[12][13]
- 250 poisoned points (<1% data) cripple billion-param models; pipelines ingest slop, amplifying inaccuracies at scale.[14]
- Heterogeneity: High-quality data scarce; repeated/low-quality harms (e.g., PTX loss catastrophic forgetting); 10% toxic optimal, then diminishing.[15]
- Econ: Rivalry via consent/overuse; inverted-U returns under contamination.[12]
New players: Invest in provenance/expert-sourced data (e.g., on-chain verification); frontiers waste trillions on unfixable slop.

4. Expert Critiques: Scaling Flattens, Ideas/Research Now Bottleneck

Ilya Sutskever (ex-OpenAI): Scaling era over—GPT-5 evals disconnect from real-world (repeats, fails recovery); RL launders pretrain prestige sans laws; needs "inductive constraints" like innateness.[16] Gary Marcus: GPT-5 "overhyped/underwhelming," core issues (hallucinations, reasoning) persist post-trillion$ scale.[2]
- Sutskever/Patel: Pretrain power-law weakens; RL lacks trends; "research taste" > compute.[17]
- Benchmarks gameable/saturated; no AGI via scale alone.[16]
Indies: Pivot to post-scaling paradigms (agents, neurosymbolic); labs' compute moats erode as returns flatten.

5. Failed Timelines Forecasting: Benchmarks/Predictions Systematically Overoptimistic

AGI forecasts shift earlier but infrastructure fails: benchmarks saturate/gamed (e.g., 2yrs max viability); no calibration/oversight; definitional ambiguity hides gaps—ex-ante unpredictability dooms compute-centric timelines.[18]
- AIRDA metrics gap: Benchmarks overstate (jagged); real productivity inconclusive (e.g., scientists adopt sans evidence).[19]
- Reflexivity/Goodhart: Targets cease good measures; need dynamic evals.[18]
Outsiders: Use open-world evals (e.g., METR time-horizons: ~12-hour software tasks at 50% success); avoid insider-hype bubbles for grounded entry.

Report