Source Report
Research Question
Research how the scaling and post-training theses advanced by Douglas and Bricken compare with publicly stated positions from researchers at OpenAI, Google DeepMind, Meta AI, and xAI. Specifically: (1) What is the publicly estimated evidence for pre-training scaling continuing vs. plateauing (compute scaling laws, Chinchilla follow-ups, GPT-4→o-series capability trajectory)? (2) How do OpenAI's o-series (chain-of-thought RL), Google's Gemini post-training approach, and Meta's open-source findings compare with the Anthropic post-training thesis? (3) What do independent researchers (e.g., Epoch AI, Eleuther, academic labs) estimate about the marginal returns to pre-training compute vs. post-training compute in 2024–2026? (4) Is the "post-training is the new frontier lever" view an Anthropic-specific worldview or an emerging consensus? Produce a comparative table of lab positions with supporting citations.
Anthropic's Post-Training Thesis: RL Scaling as the Core Differentiator
Sholto Douglas and Trenton Bricken at Anthropic articulate a thesis in which pre-training builds broad associative capabilities through massive compute-optimal scaling (consistent with Chinchilla-like laws), while post-training, particularly scalable RL, unlocks reliable reasoning by inducing meta-learning, error correction, and verified traces that feed back into data loops. This creates a compounding flywheel: RL generates high-quality reasoning data for future pre-training, turning post-training into the efficiency frontier for agentic reliability. Unlike pure pre-training's pattern-matching, RL refines sparse-reward behaviors (e.g., constitutional RL for helpfulness/harmlessness); evidence from long-context perplexity gains shows no hard plateau, only compute-bounded returns.[1][2]
- Douglas: Pre-training on code/long contexts yields "dramatic" next-token prediction gains equivalent to "huge increments in model scale," with positive transfer to reasoning.[1]
- Bricken: RL enables Pareto optimization (e.g., Anthropic's constitutional RL using LM judges), addressing reliability "nines" for agents.[1]
For competitors, this implies data moats (e.g., OpenAI's o-series RL) are vulnerable if RL scaling generalizes across labs, but Anthropic's interpretability edge (e.g., circuits for addition) accelerates RL stability.
Evidence on Pre-Training Scaling: Diminishing but Persistent (4-5x/Year Growth)
Frontier pre-training compute has grown 4-5x annually since 2020 (e.g., from GPT-3 at ~3e23 FLOPs to Gemini Ultra at ~5e25 FLOPs), following Chinchilla-optimal regimes, but reports from OpenAI, Anthropic, and Google indicate data exhaustion and modest gains beyond GPT-4 scale, shifting focus to post-training and inference. The GPT-4 to o-series trajectory shows pre-training plateauing on raw loss while capabilities jump via RL (e.g., o1's 78% MMMU vs. GPT-4o), and Epoch AI notes a post-2018 slowdown to ~4.2x/year.[3][4]
- Chinchilla follow-ups (e.g., D-CPT laws) confirm optimal N/D ratios, but high-quality data constraints hit ~2024.[5]
- o-series: RL on CoT yields expert-level reasoning (e.g., AIME jumps), scaling log-linearly with train/test compute, distinct from pre-training constraints.[2]
Entrants must prioritize synthetic data/RL loops over raw pre-training FLOPs, as marginal returns favor verification (e.g., math/code) over general tokens.
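The growth-rate and data-optimality figures above can be sanity-checked with the standard Chinchilla rules of thumb (C ≈ 6·N·D and D ≈ 20·N); these are widely used approximations rather than the fitted law itself, and the two FLOP data points are the ones cited in this section. A minimal Python sketch:

```python
import math

def chinchilla_optimal(compute_flops: float) -> tuple[float, float]:
    """Compute-optimal parameter count N and token count D under the
    Chinchilla rules of thumb C ~= 6*N*D and D ~= 20*N (approximations,
    not the fitted law itself)."""
    n = math.sqrt(compute_flops / 120.0)  # from C = 6*N*(20*N) = 120*N^2
    return n, 20.0 * n

# Implied annual growth from the two frontier data points cited above:
# GPT-3 (~3e23 FLOPs, 2020) to Gemini Ultra (~5e25 FLOPs, 2023).
growth = (5e25 / 3e23) ** (1 / 3)
print(f"implied growth: {growth:.1f}x/year")  # ~5.5x/year, near the 4-5x range

# Chinchilla-optimal allocation at Gemini-Ultra-scale compute.
n_opt, d_opt = chinchilla_optimal(5e25)
print(f"optimal params: {n_opt:.2e}, tokens: {d_opt:.2e}")
```

Note that the implied ~13T-token optimum at 5e25 FLOPs is consistent with the ~15T-token data-optimal figure quoted later in this report.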
Lab Post-Training Approaches: RL-Dominated, with Test-Time Emergence
OpenAI's o-series pioneered CoT RL post-training on base models (e.g., GPT-4o), scaling train-time RL for strategy refinement and test-time compute for extended thinking, yielding 2-3x benchmark gains (e.g., on GPQA). Google's Gemini 2.5 allocates more RL compute to multi-step agents and tool use, boosting Elo by +110 via SFT/RM/RLHF. Meta's Llama 3.1 uses iterative SFT, rejection sampling, and DPO on synthetic data, enabling the 405B model to rival GPT-4 through post-training quality. Anthropic mirrors this with constitutional RL; xAI emphasizes pre-training scale but integrates RL (e.g., Grok-4 spends >50% of FLOPs on post-training).[2][6][7]
- Mechanism: RL refines CoT (OpenAI/Google), generates data (Meta), or balances frontiers (Anthropic).
- Vs. Anthropic: All converge on RL for reasoning, but OpenAI leads test-time scaling; Meta democratizes via open-source.
Competitors need RL infrastructure (e.g., verifiable rewards) to keep pace, as post-training now accounts for 25-55% of total compute.[8]
Independent Estimates: Post-Training Marginal Returns Rising (2024-2026)
Epoch AI projects pre-training at 4-5x/year through 2026 but highlights post-training's underreported growth (e.g., o1/Grok RL at 10-50% of total FLOPs), with inference scaling revitalizing gains amid data limits. EleutherAI focuses on pre-training scaling laws (e.g., contamination effects), though 2025 work extends to post-training quantization and inference. No explicit 2024-2026 ratios are published, but trends show a 40-100% uplift from post-training versus pre-training's diminishing returns (e.g., data-optimal at ~15T tokens).[4][9]
- Epoch: R&D spending exceeds training-compute spending; post-training remains a subset of the budget (e.g., ~$500M of OpenAI's ~$5B 2024 compute spend).[10]
Estimates favor post-training for reasoning (high confidence, given lab shifts) and pre-training for generality (medium confidence, data-bound).
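The diminishing-returns side of this estimate can be illustrated with the parametric loss fit from the Chinchilla paper (Hoffmann et al. 2022). The constants below are the published point estimates, and the model/data scales chosen are purely illustrative, not actual lab training runs:

```python
def chinchilla_loss(n_params: float, d_tokens: float) -> float:
    """Chinchilla parametric loss fit (Hoffmann et al. 2022).
    Constants are the published point estimates; used here only to
    illustrate diminishing returns, not as a precise forecast."""
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28
    return E + A / n_params**alpha + B / d_tokens**beta

# Double both N and D (roughly 4x compute) at a small scale and at a
# 10x-larger scale, and compare the loss improvement each step buys.
small = chinchilla_loss(7e10, 1.4e12) - chinchilla_loss(1.4e11, 2.8e12)
large = chinchilla_loss(7e11, 1.4e13) - chinchilla_loss(1.4e12, 2.8e13)
print(small, large)  # same multiplicative step buys less loss at scale
```

The second difference comes out roughly half the first: the power-law form guarantees that each further doubling of pre-training compute buys a smaller absolute loss reduction, which is the mechanism behind the "diminishing returns" language above.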
Consensus Emergence: Post-Training as Shared Frontier
Anthropic's RL thesis is increasingly consensus: labs now allocate 45-55% of compute to post-training (versus roughly 75% going to pre-training before 2024), with o1, Gemini, and Llama validating scalable RL and test-time compute as the "new scaling domain." xAI follows suit (Grok is RL-heavy); no lab disputes the direction, but execution varies (OpenAI: CoT; Google: agents). The view is not Anthropic-specific: the industry pivoted after hitting GPT-4-era data walls.[11][8]
| Lab | Pre-Training View | Post-Training Emphasis | Key Evidence/Quote |
|---|---|---|---|
| Anthropic | Continues (Chinchilla holds); compute-bound | RL scaling (constitutional); meta-learning | "RL improves along Pareto frontier" [Douglas/Bricken][1] |
| OpenAI | Plateauing returns post-GPT-4 | CoT RL; train/test compute scaling | "Improves with more RL... differs from pretraining"[2] |
| Google DeepMind | 4-5x growth; long-context gains | RLHF/SFT for agents/tools | +110 Elo from RL compute [Gemini 2.5][6] |
| Meta AI | Optimal at 15T tokens; synthetic unlock | Iterative DPO/SFT; open distillation | 405B rivals GPT-4 via post-training [Llama 3.1][7] |
| xAI | Aggressive scaling (Grok-5 10T params) | >50% FLOPs RL | "RL relative to pre-compute is better"[12] |
To enter: Build RL pipelines on open models (e.g., Llama); focus on synthetic RL data over raw tokens (high ROI asymmetry). Confidence: High on the shift (lab docs); medium on exact ratios (proprietary).
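Since the Llama 3.1 recipe above leans on DPO, a minimal sketch of the standard DPO objective (Rafailov et al. 2023) on hypothetical log-probabilities shows why it is attractive for entrants: it needs no separate reward model, only a frozen reference policy. The numeric inputs below are made up for illustration:

```python
import math

def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Direct Preference Optimization loss for one preference pair:
    -log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))).
    Inputs are summed token log-probs of the chosen (w) and rejected (l)
    responses under the policy and a frozen reference model."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical log-probs: the policy already prefers the chosen response
# slightly more than the reference does, so the loss dips below log(2).
loss = dpo_loss(logp_w=-12.0, logp_l=-15.0, ref_logp_w=-12.5, ref_logp_l=-14.8)
print(round(loss, 4))  # 0.6588, vs. log(2) ~ 0.6931 at zero margin
```

Gradient descent on this loss pushes the policy's margin between chosen and rejected responses above the reference model's margin, which is the "iterative DPO on synthetic data" lever attributed to Meta above.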
Recent Findings Supplement (May 2026)
Pre-Training Scaling: Evidence Against Plateauing
Google DeepMind's Gemini 3 release in November 2025 provided the strongest recent evidence that pre-training scaling laws remain intact. The model, reportedly at a similar ~1T-parameter scale to Gemini 2.5, achieved large gains (e.g., first to break 1500 Elo on LMSYS Arena, beating GPT-5.1 on 19/20 benchmarks) via improved pre-training data quality, architecture, and compute allocation; this undercut 2025 claims of a "scaling wall" by delivering an estimated 2-3x more effective pre-training compute than GPT-4o equivalents.[1][2] Mechanism: better data curation (e.g., long-context handling) and optimization allowed scaling without parameter explosion, implying labs like xAI (planning 10T-parameter Grok 5 pre-training in ~2 months by early 2026) and Anthropic will see similar jumps on Blackwell hardware.[3] Implication: no Chinchilla-style plateau yet; the GPT-4 to o-series trajectory continues via hybrid pre/post-training, though economic shifts (inference demand diverting GPUs) temporarily slowed pure pre-training at OpenAI.[4]
- Gemini 3 Pre-Training Lead Sebastian Borgeaud: "Significant innovations in long-context processing" enabled continued scaling.[5]
- xAI/Elon Musk: Pre-training for 10T Grok 5 (~2 months) signals aggressive scaling resumption in 2026.[6]
- Epoch AI (Feb 2026): Software progress ~10x/year reduces compute needed for same capability, sustaining pre-training trends despite data walls.[7]
For competitors: Pre-training moat favors compute-rich labs (Google, xAI); others must optimize data/software or risk falling behind on base capabilities.
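Epoch's ~10x/year software-progress figure compounds with hardware growth; a toy calculation, under the simplifying (and debatable) assumption that physical compute growth and software efficiency multiply independently:

```python
def effective_compute_growth(physical_x_per_year: float,
                             software_x_per_year: float,
                             years: float = 1.0) -> float:
    """Effective-compute growth if algorithmic/software efficiency
    multiplies physical FLOP growth. This independence assumption is a
    simplification; in practice the two factors interact."""
    return (physical_x_per_year * software_x_per_year) ** years

# With ~4.5x/year physical compute and ~10x/year software progress,
# effective compute for a fixed capability grows ~45x/year.
g = effective_compute_growth(4.5, 10.0)
print(g)  # 45.0, vs. ~4-5x from hardware scaling alone
```

This is why a sub-5x/year hardware trend need not imply a capability plateau: most of the claimed headroom in the Epoch framing comes from the software term.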
Post-Training as Efficiency Multiplier: OpenAI o-Series vs. Gemini
OpenAI's o-series (o1/o3/o4-mini by early 2026) pioneered RL-heavy post-training on base models like GPT-4o, using chain-of-thought RL to boost reasoning (e.g., o3-mini controllability scales with model size but degrades with extra RL compute), while Google's Gemini post-training (instruction tuning plus RLHF) complemented pre-training for multimodal gains. Both outperform pure scaling, but Anthropic's thesis (per Sholto Douglas) emphasizes RL infrastructure for a complementary pre/post balance.[8][9][10] Mechanism: post-training unlocks latent base-model knowledge via verifiable rewards (e.g., math proofs), with the o-series enabling test-time scaling (longer CoT) that rivals ~10x pre-training compute. Implication: diminishing RL returns (Toby Ord estimates 100x more RL compute yields roughly half the gain of 100x more inference compute) shift focus to inference and test-time scaling, though efficiency holds (e.g., DeepScaleR-1.5B matches o1-preview on AIME2024 with a fraction of the compute).[11]
- OpenAI o3: RL on reasoning traces; controllability drops >10x with heavy training.[12]
- Gemini Deep Think (Feb 2026): Post-training + inference scaling hits 90% on IMO-ProofBench Advanced/PhD math.[13]
- Meta Muse Spark (Apr 2026): >10x pre-training efficiency via post-training scaling laws.[14]
For entrants: Post-training lowers barriers (e.g., open-source RL like DeepSeek-R1-Zero replicates o1 scaling), but proprietary data/RL infra creates gaps.
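The log-linear test-time scaling reported for the o-series can be sketched as an ordinary least-squares fit of accuracy against log-compute. The data points below are hypothetical, not published o-series numbers; the point is the functional form, where each 10x of test-time compute buys a roughly constant number of accuracy points:

```python
import math

def fit_loglinear(compute: list[float], accuracy: list[float]) -> tuple[float, float]:
    """Least-squares fit of accuracy = a + b*log10(compute), the
    log-linear form reported for o-series test-time scaling.
    Returns (intercept a, slope b in points per 10x compute)."""
    xs = [math.log10(c) for c in compute]
    n = len(xs)
    mx, my = sum(xs) / n, sum(accuracy) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, accuracy)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

# Hypothetical benchmark points: ~8 accuracy points per 10x of compute.
a, b = fit_loglinear([1e3, 1e4, 1e5, 1e6], [40.0, 48.0, 56.0, 64.0])
print(a, b)  # intercept ~16.0, slope ~8.0 points per 10x compute
```

On a log-linear curve, matching a fixed capability gain always costs a fixed compute multiplier, which is what makes test-time compute fungible with the "~10x pre-training compute" comparisons quoted above.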
Independent Estimates: Post-Training Edges Pre-Training Margins (2025-2026)
Epoch AI (Feb 2026) estimates software progress at ~10x/year (2-50x CI), making post-training (RL/inference) rival pre-training returns; one order of magnitude (OOM) of extra inference compute often matches 0.5-1 OOM of pre-training compute, with RL saturating faster (e.g., ~10,000x RL compute for a 20-80% reasoning jump).[15] EleutherAI/PleIAs: even small models (350M-1.2B) reach Chinchilla-optimal performance via quality data, implying marginal pre-training gains diminish past ~15T tokens.[16] Mechanism: power-law predictability in RL post-training (Qwen2.5 0.5B-72B: larger models are ~10x more RL-efficient), but data quality > volume. Implication: 2026 favors post-training hybrids amid data walls.
- Epoch: Pre-training growth slowed to <5x/year; post-training/inference fills gap.[17]
- USTC/Oxford/Shanghai AI Lab (Apr 2026): RL efficiency shows latent saturation at scale.[18]
For competitors: Independents validate post-training as high-ROI; pre-training viable only with 10x+ software gains.
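The faster-saturating RL returns described above can be modeled as a logistic curve in log-compute. The curve and all its parameters below are illustrative assumptions, chosen only so that ~10,000x RL compute spans roughly the cited 20→80 point jump; no lab publishes this exact functional form:

```python
import math

def rl_return(compute_mult: float, floor: float = 20.0, ceiling: float = 80.0,
              midpoint_oom: float = 2.0, steepness: float = 1.0) -> float:
    """Toy saturating-returns curve for RL post-training compute: a
    logistic in log10(compute multiplier) rising from `floor` toward
    `ceiling`. All parameters are illustrative assumptions."""
    x = math.log10(compute_mult)
    return floor + (ceiling - floor) / (1.0 + math.exp(-steepness * (x - midpoint_oom)))

# Each further 100x of RL compute buys less than the previous 100x.
for mult in (1, 100, 10_000, 1_000_000):
    print(mult, round(rl_return(mult), 1))
```

Under this toy curve, the step from 10,000x to 1,000,000x buys far less than the step from 100x to 10,000x, matching the qualitative claim that RL saturates faster than pre-training's power-law returns.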
Emerging Consensus: Post-Training as New Frontier
No longer Anthropic-specific: all labs (OpenAI's o-series RL, Google's Gemini RLHF/inference scaling, Meta shifting >10% of pre-training compute to post-training, xAI scaling Grok RL) now allocate 25-55% of compute to post-training (up from the pre-2024 era when ~75% went to pre-training), driven by data limits and inference economics; Sholto Douglas calls pre- and post-training complementary for 2026 acceleration.[19][10] Mechanism: RL unlocks "aha" reasoning (DeepSeek-R1), and test-time compute (o1/o3) amplifies capability without full retraining. Implication: consensus on a "post-training revolution" (Jan 2026 analyses), with pre-training foundational but post-training the efficiency lever.[20]
| Lab | Pre-Training View | Post-Training Emphasis | Key 2025-2026 Evidence/Cite |
|---|---|---|---|
| Anthropic | Continues; complements RL infra[10] | RL lead (Douglas/Bricken); 2026 acceleration | Claude Code RL scaling[10] |
| OpenAI | S-curve top; inference diverted GPUs[4] | o-series CoT RL; test-time frontier | o3-mini controllability; ARGO rubrics[21] |
| Google DeepMind | Intact (Gemini 3 refutes wall)[22] | RLHF + inference scaling | Deep Think 90% PhD math[13] |
| Meta AI | 10x efficiency gains[14] | >10% pre to post; safety RL | Muse Spark scaling laws[23] |
| xAI | Aggressive (Grok 5 10T)[3] | RL/post for reasoning | Colossus 1-2GW clusters[24] |
For entrants: Consensus democratizes via open RL (e.g., Open-Reasoner-Zero), but frontier data moats persist; focus post-training for 2026 parity.