Research Question

Research the published academic and industry evidence for and against AI scaling laws as of 2025–2026. Include Chinchilla, Kaplan et al., and subsequent papers that update or challenge them; documented capability gains per order-of-magnitude of compute (OOM) for GPT-4, Claude 3/4, Gemini 1.5/2/3, Llama 3/4, and Grok models; hyperscaler capex commitments (Microsoft/OpenAI Stargate, Google Project Rainier, xAI Colossus, Meta's 2025–2026 infrastructure plans) with specific dollar figures, announced dates, and MW/GW power targets. Distinguish between pre-training scaling evidence and post-training/RL scaling evidence. Quantify where benchmark gains have or have not kept pace with compute investment.

Foundational Pre-Training Scaling Laws: Kaplan vs. Chinchilla and Modern Reconciliations

Kaplan et al. (2020) established early neural scaling laws by fitting power-law relationships between language model test loss and compute (L ∝ C^{-0.05}), dataset size (L ∝ D^{-0.095}), and non-embedding parameters (L ∝ N^{-0.076}), implying optimal models prioritize parameters over data (N_opt ∝ C^{0.73}). DeepMind's Chinchilla (Hoffmann et al., 2022) challenged this by training 400+ transformers of up to 16B parameters on 5-500B tokens, deriving balanced exponents (α ≈ 0.34 for N, β ≈ 0.28 for D) and a ~20 tokens/parameter ratio for compute-optimal training, demonstrating that prior models such as GPT-3 (175B params, ~300B tokens) were undertrained on data by ~10x; Chinchilla's 70B model outperformed Gopher's 280B.[1][2][3]

Pearce & Song (2024, arXiv:2406.12907) reconciled the discrepancy: Kaplan's bias stemmed from excluding embedding parameters (~30% of total N) and from analysis at small scale (<1B params); simulating the Chinchilla setup under those constraints reproduces Kaplan-like exponents (N_opt ∝ C^{0.73}), while using total N at larger scales reaffirms Chinchilla's balanced allocation (N_opt ∝ C^{0.50}). This holds across dense transformers, with R² = 0.997 fits in analyses through 2026.[3]

  • The Chinchilla-optimal ~20 tokens/param ratio describes compute-optimal training only; the Llama series deliberately overtrains far past it (e.g., Llama 3 8B on 15T tokens, a ~1,875:1 ratio) and still outperforms prior dense models of similar size.[4]
  • Inference-aware extensions (Sardana et al., 2023/2025, "Beyond Chinchilla-Optimal") shift the optimum: for deployments serving 1B+ inference requests, smaller models trained longer minimize lifetime cost (training + inference), since inference cost grows linearly with N and with total deployment volume (see the sketch below).[5]
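To make the two allocation rules concrete, here is a minimal numerical sketch using the standard C ≈ 6·N·D approximation for training FLOPs and ~2·N FLOPs per inference token; the compute budget, lifetime inference volume, and the clean 20 tokens/param ratio are illustrative assumptions rather than fitted values from the cited papers.

```python
# Minimal sketch (illustrative constants): Chinchilla-style compute-optimal
# allocation (C ~ 6*N*D FLOPs, D ~ 20*N tokens) plus a Sardana-style lifetime
# cost that adds inference FLOPs (~2*N per token served).

def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Split a training budget into (params, tokens) at ~20 tokens/param."""
    # C = 6*N*D with D = r*N  =>  N = sqrt(C / (6*r))
    n = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    return n, tokens_per_param * n

def lifetime_flops(n_params, train_tokens, inference_tokens):
    """Training (~6*N*D) plus deployment (~2*N per inference token)."""
    return 6.0 * n_params * train_tokens + 2.0 * n_params * inference_tokens

if __name__ == "__main__":
    C = 2.1e25  # roughly the GPT-4-scale budget cited above
    n, d = chinchilla_optimal(C)
    print(f"compute-optimal: {n / 1e9:.0f}B params on {d / 1e12:.1f}T tokens")

    # Hold training compute fixed but shrink the model and train longer:
    # lifetime cost falls because inference gets cheaper. This ignores the
    # modest loss penalty of moving off the compute-optimal point.
    serve = 1e13  # assumed lifetime inference tokens
    for frac in (1.0, 0.5, 0.25):
        cost = lifetime_flops(frac * n, d / frac, serve)
        print(f"{frac:.2f}x params, {1 / frac:.0f}x tokens -> {cost:.2e} lifetime FLOPs")
```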

Implications for competitors: Data moats (e.g., Meta's 15T+ for Llama) and inference optimization favor incumbents; new entrants need proprietary data or inference-efficient architectures (e.g., MoE, where active params < total N) to match without 10x compute.

Evidence For Continued Pre-Training Scaling in 2025-2026

Post-Chinchilla laws predict smooth loss reductions via power laws (L(C) ∝ C^{-0.05} to C^{-0.1}, i.e., roughly a 10-20% loss reduction per OOM of compute), holding across 10^{17}-10^{20} FLOPs in DiT diffusion models and xLSTM (2025 arXiv papers). Epoch AI tracks 30+ models at GPT-4 scale (~2e25 FLOPs): GPT-4 (2.1e25), Claude 3 Opus (1.6e25), Gemini 1.5 Pro (1.6e25); Llama 3.1 405B is inferred at ~5e25 FLOPs (15T tokens on 32k H100s). Gains persist: Llama 4 Maverick (17B active params, MoE) beats GPT-4o/Gemini 2.0 on reasoning/coding via distillation from a ~2T-param teacher.[6]

  • MMLU: GPT-4 ~86% → Claude 4.6/Gemini 3 Pro ~91% (+5-6 points over ~1-2 OOM of compute).[7]
  • No full per-OOM quantifications exist for 2025 models (e.g., Claude 4 and Gemini 3 at ~10^{26} FLOPs, inferred), but Arena Elo rises ~100 points/year despite saturation signals.
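As a quick worked example of what the quoted exponents imply, the following snippet computes the fractional loss reduction from one additional OOM of pre-training compute under L(C) ∝ C^{-α}, ignoring any irreducible-loss term; the α values are the ones cited above, not new fits.

```python
# Loss ratio from one extra OOM of compute under L(C) ∝ C^(-alpha):
# L(10*C) / L(C) = 10^(-alpha). Ignores any irreducible-loss floor.
for alpha in (0.05, 0.10):
    ratio = 10 ** -alpha
    print(f"alpha={alpha:.2f}: loss shrinks to {ratio:.3f}x per OOM "
          f"({(1 - ratio) * 100:.0f}% reduction)")
```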

Implications: Hyperscalers' $700B+ 2026 capex (Google $180-190B, Microsoft $190B, Amazon $200B, Meta $125-145B) bets on ~0.5-1 OOM/year of compute growth; entrants face $100B+ barriers without efficiency gains (e.g., FP8 training in Llama 4, roughly 2x compute savings).[8]

Challenges and Diminishing Returns: Saturation and Data Limits

2025-2026 evidence shows pre-training laws weakening: benchmarks saturate (MMLU-Pro plateaus above 90%), capabilities jump discontinuously rather than following smooth power laws, and perplexity correlates only weakly with reasoning.[9] Data exhaustion looms (public text ~10-15T tokens; recursive training on synthetic data risks model collapse), and sub-scaling is observed: performance improves more slowly than predicted beyond ~10T tokens. Chinchilla ratios stretch to 80,000:1 in tiny models (2026), but dense scaling hits "architecture saturation."[10]

  • Quantitative: ~4x year-over-year compute growth yields only ~2x benchmark gains (e.g., on security tasks), implying <0.3 log-loss improvement per OOM.[11]
  • Inference-heavy deployment favors smaller models (Sardana et al.); the "Chinchilla trap" leaves models overparameterized for deployment.

Implications: Challengers pivot to MoE/synthetic data (DeepSeek V3.1, Qwen3 near-frontier at <1e25 FLOPs); pure scale favors xAI/OpenAI with Colossus/Stargate.

Post-Training/RL Scaling: A New Frontier with Predictable but Diminishing Gains

Pre-training still dominates compute budgets (~90%), but RL/post-training now rivals it (by 2025, RL spend approaches pre-training cost). Tan et al. (2025, arXiv:2509.25300) fit RL power laws on Qwen-2.5 models (0.5-72B): test loss ≈ E + (N · K(N) · C)^{-α}, with α ≈ 0.1-0.2 for math reasoning under GRPO; larger N yields 2-3x better compute efficiency. An S-curve in learning efficiency caps gains; ~100x more RL compute roughly doubles accuracy (33% → 66%).[12]

  • Distinguish the regimes: pre-training gives smooth next-token loss improvements, while RL extrapolates reasoning ability (e.g., o1-style models, where scaling test-time compute is more effective than scaling RL training in the 20-80% accuracy range).[13]
  • Benchmarks: RL boosts MMLU by +5-10 points per OOM of RL compute, but saturates faster than pre-training (see the sketch below).
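A hedged sketch of the saturating power-law shape described above, with reducible loss decaying toward an irreducible floor E as RL compute grows; the constants below are illustrative placeholders, not the fitted values from Tan et al.

```python
# Illustrative saturating power law for RL post-training:
# loss(C) = E + A * C^(-alpha). Constants are assumptions, not paper fits.
def rl_test_loss(rl_compute, E=0.25, A=750.0, alpha=0.15):
    """Irreducible floor E plus a power-law term decaying with RL compute."""
    return E + A * rl_compute ** -alpha

prev = None
for c in (1e20, 1e21, 1e22, 1e23):  # each step is one more OOM of RL compute
    loss = rl_test_loss(c)
    note = "" if prev is None else f"  (improvement vs. last OOM: {prev - loss:.3f})"
    print(f"C={c:.0e}  loss={loss:.3f}{note}")
    prev = loss
```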

Implications: Open-source models (Llama 4) close gaps via RL distillation; closed labs (Anthropic) lead on agentic tasks, but scaling RL by 10x costs roughly a full pre-training rerun.

Hyperscaler Capex Commitments Fueling the Scale Hypothesis

Stargate (OpenAI-led with SoftBank, Oracle, and MGX; Microsoft as technology partner; announced Jan 21, 2025): $500B over 4 years, 10GW target (Abilene 1.2GW online 2026; 7GW pipeline by end of 2025).[14] xAI Colossus (announced Jun 2024): 2GW (555k GPUs, ~$18B in chips), operational 2025.[15] Amazon Project Rainier (announced 2024, operational Oct 2025): $11B, 2.2GW (500k+ Trainium2 chips).[16] A Google "Project Rainier" is unconfirmed (likely a mix-up with Amazon's project); overall 2026 hyperscaler capex is ~$725B (Google $180-190B). Meta: $125-145B 2026 capex (Hyperion 5GW by 2030).[8]

Project                 Announced   Capex (USD)      Power Target
Stargate (MS/OpenAI)    Jan 2025    $500B (4 yrs)    10 GW [14]
Colossus (xAI)          Jun 2024    ~$18B+ (chips)   2 GW [15]
Rainier (Amazon)        2024        $11B             2.2 GW [16]
Meta Hyperion           Jun 2025    $10B+            5 GW [17]

Implications: Aggregate 2025-2029 commitments of $3T+ implicitly bet on ~1-2 OOM of additional compute per year; GW-scale power, more than chips, is becoming the bottleneck, locking out new entrants without nuclear or gas supply deals.
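To see why power rather than chips becomes the binding constraint, a back-of-envelope conversion from a power target to yearly training FLOP is sketched below; every constant (per-accelerator power including cooling and networking, sustained throughput, utilization) is an assumption for illustration, not a disclosed figure for any project in the table.

```python
# Back-of-envelope: GW of datacenter power -> accelerators -> FLOP per year.
# All constants are illustrative assumptions, not project disclosures.
def yearly_training_flop(power_gw, watts_per_accel=1500.0,
                         peak_flops=1e15, utilization=0.4):
    accels = power_gw * 1e9 / watts_per_accel   # accelerators the site can power
    seconds = 365 * 24 * 3600
    return accels, accels * peak_flops * utilization * seconds

accels, flop = yearly_training_flop(1.2)  # e.g., the ~1.2 GW Abilene phase above
print(f"~{accels:,.0f} accelerators, ~{flop:.1e} FLOP/year (cf. GPT-4 ~2.1e25)")
```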

Benchmark Gains vs. Compute: Partial Saturation, Non-Obvious Shifts

~1-2 OOM of compute (1e25 → 1e26+ FLOPs, GPT-4 to Claude 4/Gemini 3) yields only ~5 MMLU points, but reasoning benchmarks (GPQA 50% → 90%, ARC-AGI-2 30% → 45%) jump via RL and inference-time compute rather than pre-training alone.[18] Saturation is real (MMLU plateaus above 90%), but new evals emerge (Humanity's Last Exam ~30%); gains run at roughly 2x benchmark improvement per OOM of total compute, lagging the earlier ~4x.[11]

Implications: Investors demand ROI proof (e.g., OpenAI's ~$1.7B/month revenue vs. ~$4B inference costs); competitors can differentiate via post-training (cheaper) or efficiency (MoE, 2-4x inference savings).


Recent Findings Supplement (May 2026)

Pre-Training Scaling: Evidence Persists with Overtraining Shift

Frontier models in 2026 continue to validate Kaplan/Chinchilla-style power-law improvements in loss with compute, but optimal token-to-parameter ratios have shifted dramatically toward massive overtraining, often 100-185x Chinchilla's 20:1 recommendation, driven by better optimizers (e.g., Muon), synthetic data, and architectures like MoE that keep yielding gains past the compute-optimal point. Pre-training scaling thus remains predictable, but compute-optimal allocation now favors data-heavy regimes for reasoning-heavy models, reducing effective parameter needs via "densing laws."[1][2][3]
- Epoch AI (Feb 2026): The largest known run is xAI's Grok 4 at ~5e26 FLOP (~24x GPT-4's ~2e25 FLOP), with training costs rising ~3.5x/year; capabilities grow ~15.5 ECI (Epoch Capabilities Index) points/year, outpacing hardware gains alone.[4]
- arXiv (Mar-Apr 2026): Nano-scale experiments (e.g., Karpathy's nanochat) fit steeper curves at 8:1 tokens/param vs. Chinchilla's 20:1; SmolLM3 (3B) uses 11.2T tokens (~3700:1), extrapolating to GPT-3-level at ~91B params/734B tokens (~$1M cost).[5]
- Nature Machine Intelligence (2025, cited 2026): "Densing law" shows max capability density doubles every ~3.5 months; equivalent performance now needs ~half params every 3.5 months (e.g., 2.4B MiniCPM matches 7B Mistral).[3]
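A small extrapolation of the densing-law claim above: if the parameter count needed for fixed capability halves every ~3.5 months, a model matching a 7B baseline today would need well under 1B parameters a year later; the 7B anchor and the assumption of a clean exponential trend are illustrative.

```python
# "Densing law" extrapolation: params needed for fixed capability halve
# every ~3.5 months (illustrative; assumes the trend is a clean exponential).
def params_needed(baseline_b, months, halving_months=3.5):
    return baseline_b * 0.5 ** (months / halving_months)

for m in (0, 3.5, 7, 12):
    print(f"after {m:>4} months: {params_needed(7.0, m):.2f}B params "
          f"for the same capability")
```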

For competitors: Pre-training scaling favors data moats + efficiency; small entrants can match via overtraining/synthetics but lack frontier compute (e.g., 5e26 FLOP needs GW-scale clusters).

Post-Training/RL Scaling: Diminishing Returns but Predictable Power Laws

RL/post-training unlocks "latent skills" via test-time compute (e.g., repeated sampling, chain-of-thought), following power laws but with steeper diminishing returns than pre-training: ~100x more RL compute yields roughly 2x reasoning gains, versus smoother scaling from inference-time compute. 2025-26 papers model RL loss as roughly (model size)^α · (compute)^β · (data)^γ, but the optimum shifts toward inference-heavy regimes (T2T laws recommend smaller, overtrained base models).[6][2][7]
- arXiv (Apr 2026): RL post-training on math shows power-law test-loss scaling; RL compute now roughly matches pre-training cost, while inference compute is cheaper per unit of gain (e.g., the ~100x RL scale-up behind a 20% → 80% benchmark jump costs about as much as a full pre-training run).[6]
- Epoch AI (Feb 2026): Fixed-capability inference costs drop 5-10x/year via distillation (e.g., FrontierMath 27% needs 43M→5M tokens, 3x cheaper in 8 months); RL slopes vary (Scaled RL > GRPO).[8]
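For the repeated-sampling flavor of test-time compute mentioned above, here is a minimal sketch of the idealized pass@k curve, assuming independent samples with a fixed per-problem success probability and a perfect verifier; real curves flatten earlier than this idealization.

```python
# Repeated sampling as test-time compute: pass@k = 1 - (1 - p)^k under the
# idealized assumptions above (independent samples, perfect verifier).
def pass_at_k(p, k):
    return 1.0 - (1.0 - p) ** k

for k in (1, 10, 100, 1000):  # each step is ~1 OOM more inference compute
    print(f"k={k:>4}  pass@k={pass_at_k(0.05, k):.3f}")
```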

Entrants: RL democratizes capability via open base models, but proprietary post-training (e.g., o1-style) creates moats; focus on inference scaling for cost-competitive agents.

Model-Specific Gains per OOM Compute

No direct per-OOM breakdowns exist for all requested models post-Nov 2025, but Epoch tracks aggregates: Grok 4 at ~5e26 FLOP (~1.4 OOM above GPT-4) shows benchmark jumps (e.g., Intelligence Index 53), though open MoEs (DeepSeek V4 Pro, 1.6T total/49B active params) close gaps on non-agentic tasks. Capabilities saturate standard benchmarks (SWE-bench ~80-100%) but expand task horizons (e.g., ~7-hour autonomous agent runs).[9][4]
- Grok 4: ~5e26 FLOP; ~24x GPT-4's compute yields frontier performance (e.g., GDPval ~1500 Elo).[4]
- Gemini 3/Claude 4/Llama 4: shift to MoE (e.g., Llama 4 is sparse); benchmarks such as AIME 2025 at ~90-100% and SWE-bench at 76-81%, but no training FLOP disclosed; open variants match closed models on MMLU/GPQA.[9]
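A quick consistency check on the compute gap quoted above, converting the ~24x Grok 4 vs. GPT-4 FLOP ratio into orders of magnitude.

```python
import math

# Compute gap: Grok 4 (~5e26 FLOP) vs. GPT-4 (~2.1e25 FLOP), expressed in OOM.
ratio = 5e26 / 2.1e25
print(f"{ratio:.0f}x compute = {math.log10(ratio):.2f} OOM")
```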

Competitors: Track the Epoch AI dashboard; gains are slowing on easy benchmarks (diminishing returns) but accelerating on agentic ones (e.g., OSWorld above human baseline).

Hyperscaler Capex Commitments: Explosive GW-Scale Buildout

2026 capex surges to $650-805B (60-83% YoY growth), roughly 75% of it AI infrastructure, powering tens of GW but hitting power walls (e.g., ~7GW of delayed US capacity). Meta leads on transparency: $125-145B (up from $72B in 2025), funding Llama 4 plus 1-5GW sites (Prometheus/Hyperion). Stargate, Colossus, and Rainier disclosures are vague post-Nov 2025 (no new dollar figures, MW targets, or dates), but aggregates imply mid-teens GW online.[10][11][4]
- Meta: $125-145B 2026 capex (Q1 call); Prometheus (1GW, Ohio, 2026), Hyperion (5GW, Louisiana, 2028).[10]
- xAI Colossus 2: 1GW target mid-2026 (Memphis); $20B MS site.[12]
- Stargate: $500B multi-year (phased; Abilene 1.2GW partially online 2026, with delays and expansions); Microsoft capex $120B+ in FY26.[13]

Entrants: Partner for capacity (e.g., residual Oracle/OpenAI commitments); decentralized compute becomes viable amid buildout delays.

Benchmark Gains vs. Compute: Gains Keep Pace on Frontiers

Benchmarks saturate on easy tasks (SWE-bench 60% → ~100% over 2024-25) but expand agentic horizons (OSWorld above human baseline; ~7-hour autonomy), with ~1-1.4 OOM compute jumps yielding 2-5x capability gains via post-training. There is no clear evidence that gains lag investment; Epoch projects the trend continuing to 2030.[14][4]

Competitors: Prioritize unsaturated evals (e.g., FrontierMath, TerminalBench); near-parity of open models eases entry.