Research Question

Research the published academic and industry evidence for and against AI scaling laws as of 2025–2026. Include Chinchilla, Kaplan et al., and subsequent papers that update or challenge them; documented capability gains per order-of-magnitude of compute (OOM) for GPT-4, Claude 3/4, Gemini 1.5/2/3, Llama 3/4, and Grok models; hyperscaler capex commitments (Microsoft/OpenAI Stargate, Google Project Rainier, xAI Colossus, Meta's 2025–2026 infrastructure plans) with specific dollar figures, announced dates, and MW/GW power targets. Distinguish between pre-training scaling evidence and post-training/RL scaling evidence. Quantify where benchmark gains have or have not kept pace with compute investment.

Foundational Pre-Training Scaling Laws: Kaplan vs. Chinchilla and Modern Reconciliations

Kaplan et al. (2020) established early neural scaling laws by fitting power-law relationships between language model test loss and compute (L ∝ C^{-0.05}), dataset size (L ∝ D^{-0.095}), and non-embedding parameters (L ∝ N^{-0.076}), implying optimal models prioritize parameters over data (N_opt ∝ C^{0.73}). DeepMind's Chinchilla (Hoffmann et al., 2022) challenged this by training 400+ transformers of up to 16B parameters on 5-500B tokens, deriving balanced exponents (α ≈ 0.34 for N, β ≈ 0.28 for D) and a ~20 tokens/parameter ratio for compute-optimal training, demonstrating that prior models such as GPT-3 (175B params, ~300B tokens) were undertrained on data by ~10x; Chinchilla's 70B model outperformed Gopher's 280B.[1][2][3]

Pearce & Song (2024, arXiv:2406.12907) reconciled the discrepancy: Kaplan's bias stemmed from excluding embedding parameters (~30% of total N) and from analysis at small scale (<1B params); simulating the Chinchilla setup under those constraints reproduces Kaplan-like exponents (N_opt ∝ C^{0.73}), while using total N at larger scales reaffirms Chinchilla's balanced allocation (N_opt ∝ C^{0.50}). This holds across dense transformers, with R² = 0.997 fits in analyses through 2026.[3]

  • The Chinchilla-optimal ~20 tokens/param ratio describes compute-optimal training only; the Llama series deliberately overtrains far past it (e.g., Llama 3 8B on 15T tokens, a ~1,875:1 ratio) and still outperforms prior dense models of similar size.[4]
  • Inference-aware extensions (Sardana et al., 2023/2025, "Beyond Chinchilla-Optimal") shift the optimum: for deployments serving 1B+ inference requests, smaller models trained longer minimize lifetime cost (training + inference), since inference cost grows linearly with N and with total deployment volume (see the sketch below).[5]
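To make the two allocation rules concrete, here is a minimal numerical sketch using the standard C ≈ 6·N·D approximation for training FLOPs and ~2·N FLOPs per inference token; the compute budget, lifetime inference volume, and the clean 20 tokens/param ratio are illustrative assumptions rather than fitted values from the cited papers.

```python
# Minimal sketch (illustrative constants): Chinchilla-style compute-optimal
# allocation (C ~ 6*N*D FLOPs, D ~ 20*N tokens) plus a Sardana-style lifetime
# cost that adds inference FLOPs (~2*N per token served).

def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Split a training budget into (params, tokens) at ~20 tokens/param."""
    # C = 6*N*D with D = r*N  =>  N = sqrt(C / (6*r))
    n = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    return n, tokens_per_param * n

def lifetime_flops(n_params, train_tokens, inference_tokens):
    """Training (~6*N*D) plus deployment (~2*N per inference token)."""
    return 6.0 * n_params * train_tokens + 2.0 * n_params * inference_tokens

if __name__ == "__main__":
    C = 2.1e25  # roughly the GPT-4-scale budget cited above
    n, d = chinchilla_optimal(C)
    print(f"compute-optimal: {n / 1e9:.0f}B params on {d / 1e12:.1f}T tokens")

    # Hold training compute fixed but shrink the model and train longer:
    # lifetime cost falls because inference gets cheaper. This ignores the
    # modest loss penalty of moving off the compute-optimal point.
    serve = 1e13  # assumed lifetime inference tokens
    for frac in (1.0, 0.5, 0.25):
        cost = lifetime_flops(frac * n, d / frac, serve)
        print(f"{frac:.2f}x params, {1 / frac:.0f}x tokens -> {cost:.2e} lifetime FLOPs")
```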

Implications for competitors: Data moats (e.g., Meta's 15T+ for Llama) and inference optimization favor incumbents; new entrants need proprietary data or inference-efficient architectures (e.g., MoE, where active params < total N) to match without 10x compute.

Evidence For Continued Pre-Training Scaling in 2025-2026

Post-Chinchilla laws predict smooth loss reductions via power laws (L(C) ∝ C^{-0.05} to C^{-0.1}, i.e., roughly a 10-20% loss reduction per OOM of compute), holding across 10^{17}-10^{20} FLOPs in DiT diffusion models and xLSTM (2025 arXiv papers). Epoch AI tracks 30+ models at GPT-4 scale (~2e25 FLOPs): GPT-4 (2.1e25), Claude 3 Opus (1.6e25), Gemini 1.5 Pro (1.6e25); Llama 3.1 405B is inferred at ~5e25 FLOPs (15T tokens on 32k H100s). Gains persist: Llama 4 Maverick (17B active params, MoE) beats GPT-4o/Gemini 2.0 on reasoning/coding via distillation from a ~2T-param teacher.[6]

  • MMLU: GPT-4 ~86% → Claude 4.6/Gemini 3 Pro ~91% (+5-6 points over ~1-2 OOM of compute).[7]
  • No full per-OOM quantifications exist for 2025 models (e.g., Claude 4 and Gemini 3 at ~10^{26} FLOPs, inferred), but Arena Elo rises ~100 points/year despite saturation signals.
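As a quick worked example of what the quoted exponents imply, the following snippet computes the fractional loss reduction from one additional OOM of pre-training compute under L(C) ∝ C^{-α}, ignoring any irreducible-loss term; the α values are the ones cited above, not new fits.

```python
# Loss ratio from one extra OOM of compute under L(C) ∝ C^(-alpha):
# L(10*C) / L(C) = 10^(-alpha). Ignores any irreducible-loss floor.
for alpha in (0.05, 0.10):
    ratio = 10 ** -alpha
    print(f"alpha={alpha:.2f}: loss shrinks to {ratio:.3f}x per OOM "
          f"({(1 - ratio) * 100:.0f}% reduction)")
```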

Implications: Hyperscalers' $700B+ 2026 capex (Google $180-190B, Microsoft $190B, Amazon $200B, Meta $125-145B) bets on ~0.5-1 OOM/year of compute growth; entrants face $100B+ barriers without efficiency gains (e.g., FP8 training in Llama 4, roughly 2x compute savings).[8]

Challenges and Diminishing Returns: Saturation and Data Limits

2025-2026 evidence shows pre-training laws weakening: benchmarks saturate (MMLU-Pro plateaus above 90%), capabilities jump discontinuously rather than following smooth power laws, and perplexity correlates only weakly with reasoning.[9] Data exhaustion looms (public text ~10-15T tokens; recursive training on synthetic data risks model collapse), and sub-scaling is observed: performance improves more slowly than predicted beyond ~10T tokens. Chinchilla ratios stretch to 80,000:1 in tiny models (2026), but dense scaling hits "architecture saturation."[10]

  • Quantitative: ~4x year-over-year compute growth yields only ~2x benchmark gains (e.g., on security tasks), implying <0.3 log-loss improvement per OOM.[11]
  • Inference-heavy deployment favors smaller models (Sardana et al.); the "Chinchilla trap" leaves models overparameterized for deployment.

Implications: Challengers pivot to MoE/synthetic data (DeepSeek V3.1, Qwen3 near-frontier at <1e25 FLOPs); pure scale favors xAI/OpenAI with Colossus/Stargate.

Post-Training/RL Scaling: A New Frontier with Predictable but Diminishing Gains

Pre-training still dominates compute budgets (~90%), but RL/post-training now rivals it (by 2025, RL spend approaches pre-training cost). Tan et al. (2025, arXiv:2509.25300) fit RL power laws on Qwen-2.5 models (0.5-72B): test loss ≈ E + (N · K(N) · C)^{-α}, with α ≈ 0.1-0.2 for math reasoning under GRPO; larger N yields 2-3x better compute efficiency. An S-curve in learning efficiency caps gains; ~100x more RL compute roughly doubles accuracy (33% → 66%).[12]

  • Distinguish the regimes: pre-training gives smooth next-token loss improvements, while RL extrapolates reasoning ability (e.g., o1-style models, where scaling test-time compute is more effective than scaling RL training in the 20-80% accuracy range).[13]
  • Benchmarks: RL boosts MMLU by +5-10 points per OOM of RL compute, but saturates faster than pre-training (see the sketch below).
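A hedged sketch of the saturating power-law shape described above, with reducible loss decaying toward an irreducible floor E as RL compute grows; the constants below are illustrative placeholders, not the fitted values from Tan et al.

```python
# Illustrative saturating power law for RL post-training:
# loss(C) = E + A * C^(-alpha). Constants are assumptions, not paper fits.
def rl_test_loss(rl_compute, E=0.25, A=750.0, alpha=0.15):
    """Irreducible floor E plus a power-law term decaying with RL compute."""
    return E + A * rl_compute ** -alpha

prev = None
for c in (1e20, 1e21, 1e22, 1e23):  # each step is one more OOM of RL compute
    loss = rl_test_loss(c)
    note = "" if prev is None else f"  (improvement vs. last OOM: {prev - loss:.3f})"
    print(f"C={c:.0e}  loss={loss:.3f}{note}")
    prev = loss
```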

Implications: Open-source models (Llama 4) close gaps via RL distillation; closed labs (Anthropic) lead on agentic tasks, but scaling RL by 10x costs roughly a full pre-training rerun.

Hyperscaler Capex Commitments Fueling the Scale Hypothesis

Stargate (OpenAI-led with SoftBank, Oracle, and MGX; Microsoft as technology partner; announced Jan 21, 2025): $500B over 4 years, 10GW target (Abilene 1.2GW online 2026; 7GW pipeline by end of 2025).[14] xAI Colossus (announced Jun 2024): 2GW (555k GPUs, ~$18B in chips), operational 2025.[15] Amazon Project Rainier (announced 2024, operational Oct 2025): $11B, 2.2GW (500k+ Trainium2 chips).[16] A Google "Project Rainier" is unconfirmed (likely a mix-up with Amazon's project); overall 2026 hyperscaler capex is ~$725B (Google $180-190B). Meta: $125-145B 2026 capex (Hyperion 5GW by 2030).[8]

Project                 Announced   Capex (USD)      Power Target
Stargate (MS/OpenAI)    Jan 2025    $500B (4 yrs)    10 GW [14]
Colossus (xAI)          Jun 2024    ~$18B+ (chips)   2 GW [15]
Rainier (Amazon)        2024        $11B             2.2 GW [16]
Meta Hyperion           Jun 2025    $10B+            5 GW [17]

Implications: Aggregate 2025-2029 commitments of $3T+ implicitly bet on ~1-2 OOM of additional compute per year; GW-scale power, more than chips, is becoming the bottleneck, locking out new entrants without nuclear or gas supply deals.
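To see why power rather than chips becomes the binding constraint, a back-of-envelope conversion from a power target to yearly training FLOP is sketched below; every constant (per-accelerator power including cooling and networking, sustained throughput, utilization) is an assumption for illustration, not a disclosed figure for any project in the table.

```python
# Back-of-envelope: GW of datacenter power -> accelerators -> FLOP per year.
# All constants are illustrative assumptions, not project disclosures.
def yearly_training_flop(power_gw, watts_per_accel=1500.0,
                         peak_flops=1e15, utilization=0.4):
    accels = power_gw * 1e9 / watts_per_accel   # accelerators the site can power
    seconds = 365 * 24 * 3600
    return accels, accels * peak_flops * utilization * seconds

accels, flop = yearly_training_flop(1.2)  # e.g., the ~1.2 GW Abilene phase above
print(f"~{accels:,.0f} accelerators, ~{flop:.1e} FLOP/year (cf. GPT-4 ~2.1e25)")
```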

Benchmark Gains vs. Compute: Partial Saturation, Non-Obvious Shifts

~1-2 OOM of compute (1e25 → 1e26+ FLOPs, GPT-4 to Claude 4/Gemini 3) yields only ~5 MMLU points, but reasoning benchmarks (GPQA 50% → 90%, ARC-AGI-2 30% → 45%) jump via RL and inference-time compute rather than pre-training alone.[18] Saturation is real (MMLU plateaus above 90%), but new evals emerge (Humanity's Last Exam ~30%); gains run at roughly 2x benchmark improvement per OOM of total compute, lagging the earlier ~4x.[11]

Implications: Investors demand ROI proof (e.g., OpenAI's ~$1.7B/month revenue vs. ~$4B inference costs); competitors can differentiate via post-training (cheaper) or efficiency (MoE, 2-4x inference savings).


Recent Findings Supplement (May 2026)

Pre-Training Scaling: Evidence Persists with Overtraining Shift

Frontier models in 2026 continue to validate Kaplan/Chinchilla-style power-law improvements in loss with compute, but optimal token-to-parameter ratios have shifted dramatically toward massive overtraining, often 100-185x Chinchilla's 20:1 recommendation, driven by better optimizers (e.g., Muon), synthetic data, and architectures like MoE that keep yielding gains past the compute-optimal point. Pre-training scaling thus remains predictable, but compute-optimal allocation now favors data-heavy regimes for reasoning-heavy models, reducing effective parameter needs via "densing laws."[1][2][3]
- Epoch AI (Feb 2026): The largest known run is xAI's Grok 4 at ~5e26 FLOP (~24x GPT-4's ~2e25 FLOP), with training costs rising ~3.5x/year; capabilities grow ~15.5 ECI (Epoch Capabilities Index) points/year, outpacing hardware gains alone.[4]
- arXiv (Mar-Apr 2026): Nano-scale experiments (e.g., Karpathy's nanochat) fit steeper curves at 8:1 tokens/param vs. Chinchilla's 20:1; SmolLM3 (3B) uses 11.2T tokens (~3700:1), extrapolating to GPT-3-level at ~91B params/734B tokens (~$1M cost).[5]
- Nature Machine Intelligence (2025, cited 2026): "Densing law" shows max capability density doubles every ~3.5 months; equivalent performance now needs ~half params every 3.5 months (e.g., 2.4B MiniCPM matches 7B Mistral).[3]
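A small extrapolation of the densing-law claim above: if the parameter count needed for fixed capability halves every ~3.5 months, a model matching a 7B baseline today would need well under 1B parameters a year later; the 7B anchor and the assumption of a clean exponential trend are illustrative.

```python
# "Densing law" extrapolation: params needed for fixed capability halve
# every ~3.5 months (illustrative; assumes the trend is a clean exponential).
def params_needed(baseline_b, months, halving_months=3.5):
    return baseline_b * 0.5 ** (months / halving_months)

for m in (0, 3.5, 7, 12):
    print(f"after {m:>4} months: {params_needed(7.0, m):.2f}B params "
          f"for the same capability")
```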

For competitors: Pre-training scaling favors data moats + efficiency; small entrants can match via overtraining/synthetics but lack frontier compute (e.g., 5e26 FLOP needs GW-scale clusters).

Post-Training/RL Scaling: Diminishing Returns but Predictable Power Laws

RL/post-training unlocks "latent skills" via test-time compute (e.g., repeated sampling, chain-of-thought), following power laws but with steeper diminishing returns than pre-training: ~100x more RL compute yields roughly 2x reasoning gains, versus smoother scaling from inference-time compute. 2025-26 papers model RL loss as roughly (model size)^α · (compute)^β · (data)^γ, but the optimum shifts toward inference-heavy regimes (T2T laws recommend smaller, overtrained base models).[6][2][7]
- arXiv (Apr 2026): RL post-training on math shows power-law test-loss scaling; RL compute now roughly matches pre-training cost, while inference compute is cheaper per unit of gain (e.g., the ~100x RL scale-up behind a 20% → 80% benchmark jump costs about as much as a full pre-training run).[6]
- Epoch AI (Feb 2026): Fixed-capability inference costs drop 5-10x/year via distillation (e.g., FrontierMath 27% needs 43M→5M tokens, 3x cheaper in 8 months); RL slopes vary (Scaled RL > GRPO).[8]
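For the repeated-sampling flavor of test-time compute mentioned above, here is a minimal sketch of the idealized pass@k curve, assuming independent samples with a fixed per-problem success probability and a perfect verifier; real curves flatten earlier than this idealization.

```python
# Repeated sampling as test-time compute: pass@k = 1 - (1 - p)^k under the
# idealized assumptions above (independent samples, perfect verifier).
def pass_at_k(p, k):
    return 1.0 - (1.0 - p) ** k

for k in (1, 10, 100, 1000):  # each step is ~1 OOM more inference compute
    print(f"k={k:>4}  pass@k={pass_at_k(0.05, k):.3f}")
```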

Entrants: RL democratizes capability via open base models, but proprietary post-training (e.g., o1-style) creates moats; focus on inference scaling for cost-competitive agents.

Model-Specific Gains per OOM Compute

No direct per-OOM breakdowns exist for all requested models post-Nov 2025, but Epoch tracks aggregates: Grok 4 at ~5e26 FLOP (~1.4 OOM above GPT-4) shows benchmark jumps (e.g., Intelligence Index 53), though open MoEs (DeepSeek V4 Pro, 1.6T total/49B active params) close gaps on non-agentic tasks. Capabilities saturate standard benchmarks (SWE-bench ~80-100%) but expand task horizons (e.g., ~7-hour autonomous agent runs).[9][4]
- Grok 4: ~5e26 FLOP; ~24x GPT-4's compute yields frontier performance (e.g., GDPval ~1500 Elo).[4]
- Gemini 3/Claude 4/Llama 4: shift to MoE (e.g., Llama 4 is sparse); benchmarks such as AIME 2025 at ~90-100% and SWE-bench at 76-81%, but no training FLOP disclosed; open variants match closed models on MMLU/GPQA.[9]
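A quick consistency check on the compute gap quoted above, converting the ~24x Grok 4 vs. GPT-4 FLOP ratio into orders of magnitude.

```python
import math

# Compute gap: Grok 4 (~5e26 FLOP) vs. GPT-4 (~2.1e25 FLOP), expressed in OOM.
ratio = 5e26 / 2.1e25
print(f"{ratio:.0f}x compute = {math.log10(ratio):.2f} OOM")
```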

Competitors: Track the Epoch AI dashboard; gains are slowing on easy benchmarks (diminishing returns) but accelerating on agentic ones (e.g., OSWorld above human baseline).

Hyperscaler Capex Commitments: Explosive GW-Scale Buildout

2026 capex surges to $650-805B (60-83% YoY growth), roughly 75% of it AI infrastructure, powering tens of GW but hitting power walls (e.g., ~7GW of delayed US capacity). Meta leads on transparency: $125-145B (up from $72B in 2025), funding Llama 4 plus 1-5GW sites (Prometheus/Hyperion). Stargate, Colossus, and Rainier disclosures are vague post-Nov 2025 (no new dollar figures, MW targets, or dates), but aggregates imply mid-teens GW online.[10][11][4]
- Meta: $125-145B 2026 capex (Q1 call); Prometheus (1GW, Ohio, 2026), Hyperion (5GW, Louisiana, 2028).[10]
- xAI Colossus 2: 1GW target mid-2026 (Memphis); $20B MS site.[12]
- Stargate: $500B multi-year (phased; Abilene 1.2GW partially online 2026, with delays and expansions); Microsoft capex $120B+ in FY26.[13]

Entrants: Partner for capacity (e.g., residual Oracle/OpenAI commitments); decentralized compute becomes viable amid buildout delays.

Benchmark Gains vs. Compute: Gains Keep Pace on Frontiers

Benchmarks saturate on easy tasks (SWE-bench 60% → ~100% over 2024-25) but expand agentic horizons (OSWorld above human baseline; ~7-hour autonomy), with ~1-1.4 OOM compute jumps yielding 2-5x capability gains via post-training. There is no clear evidence that gains lag investment; Epoch projects the trend continuing to 2030.[14][4]

Competitors: Prioritize unsaturated evals (e.g., FrontierMath, TerminalBench); near-parity of open models eases entry.