Research Question

Research how the scaling and post-training theses advanced by Douglas and Bricken compare with publicly stated positions from researchers at OpenAI, Google DeepMind, Meta AI, and xAI. Specifically: (1) What is the publicly available evidence for pre-training scaling continuing vs. plateauing (compute scaling laws, Chinchilla follow-ups, the GPT-4→o-series capability trajectory)? (2) How do OpenAI's o-series (chain-of-thought RL), Google's Gemini post-training approach, and Meta's open-source findings compare with the Anthropic post-training thesis? (3) What do independent researchers (e.g., Epoch AI, EleutherAI, academic labs) estimate about the marginal returns to pre-training compute vs. post-training compute in 2024–2026? (4) Is the "post-training is the new frontier lever" view an Anthropic-specific worldview or an emerging consensus? Produce a comparative table of lab positions with supporting citations.

Anthropic's Post-Training Thesis: RL Scaling as the Core Differentiator

Sholto Douglas and Trenton Bricken at Anthropic articulate a thesis in which pre-training builds broad associative capabilities through massive compute-optimal scaling (consistent with Chinchilla-like laws; the canonical loss law is reproduced after this list for reference), while post-training, particularly scalable RL, unlocks reliable reasoning by inducing meta-learning, error correction, and verified traces that feed back into data loops. This creates a compounding flywheel: RL generates high-quality reasoning data for future pre-training, turning post-training into the efficiency frontier for agentic reliability. Unlike pure pre-training's pattern-matching, RL refines sparse-reward behaviors (e.g., constitutional RL for helpfulness/harmlessness), with long-context perplexity gains cited as evidence of compute-bounded, rather than plateauing, returns.[1][2]
- Douglas: Pre-training on code/long contexts yields "dramatic" next-token prediction gains equivalent to "huge increments in model scale," with positive transfer to reasoning.[1]
- Bricken: RL enables Pareto optimization (e.g., Anthropic's constitutional RL using LM judges), addressing reliability "nines" for agents.[1]
For competitors, this implies that data moats (e.g., OpenAI's o-series RL data) are vulnerable if RL scaling generalizes across labs, while Anthropic's interpretability edge (e.g., circuit-level analysis of addition) could accelerate RL stability.
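
As a reference point for the Chinchilla-like laws invoked above, the compute-optimal loss decomposition from Hoffmann et al. (2022) is commonly written as follows; the fitted constants are the published Chinchilla values, not Anthropic-internal numbers:

```latex
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\qquad E \approx 1.69,\; A \approx 406.4,\; B \approx 410.7,\; \alpha \approx 0.34,\; \beta \approx 0.28
```

Here N is parameter count and D is training tokens; minimizing L under a fixed compute budget C ≈ 6ND yields the familiar prescription of growing N and D roughly in proportion (about 20 tokens per parameter).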

Evidence on Pre-Training Scaling: Diminishing but Persistent (4-5x/Year Growth)

Frontier pre-training compute has grown 4-5x annually since 2020 (e.g., from GPT-3 at ~3e23 FLOPs to Gemini Ultra at ~5e25), following Chinchilla-optimal regimes, but reports from OpenAI/Anthropic/Google indicate data exhaustion and modest gains beyond GPT-4 scale, shifting focus to post-training and inference. The GPT-4-to-o-series trajectory shows pre-training plateauing on raw loss while capabilities jump via RL (e.g., o1's 78% MMMU vs. GPT-4o), with Epoch AI noting a post-2018 slowdown to 4.2x/year; a worked growth-rate check closes this subsection.[3][4]
- Chinchilla follow-ups (e.g., D-CPT laws) confirm optimal N/D ratios, but high-quality data constraints hit ~2024.[5]
- o-series: RL on CoT yields expert-level reasoning (e.g., AIME jumps), scaling log-linearly with train/test compute, distinct from pre-training constraints.[2]
Entrants must prioritize synthetic data/RL loops over raw pre-training FLOPs, as marginal returns favor verification (e.g., math/code) over general tokens.
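
As a sanity check on the growth figures above, a minimal sketch using the two public data points quoted in this subsection (the FLOP values are the text's rounded estimates, not exact lab disclosures):

```python
# Annualize frontier pre-training compute growth from two cited data points.
gpt3_flops = 3e23          # GPT-3, mid-2020 (rounded estimate from the text)
gemini_ultra_flops = 5e25  # Gemini Ultra, late 2023 (rounded estimate)
years = 3.5                # approximate gap between the two training runs

total_growth = gemini_ultra_flops / gpt3_flops    # ~167x overall
annual_growth = total_growth ** (1 / years)       # ~4.3x per year
print(f"total: {total_growth:.0f}x, annualized: {annual_growth:.1f}x/year")
```

The ~4.3x/year result is consistent with both the 4-5x/year figure and Epoch AI's 4.2x/year estimate.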

Lab Post-Training Approaches: RL-Dominated, with Test-Time Emergence

OpenAI's o-series pioneered CoT RL post-training on base models (e.g., GPT-4o), scaling train-time RL for strategy refinement and test-time compute for deliberation, yielding 2-3x benchmark gains (e.g., on GPQA). Google's Gemini 2.5 allocates more RL compute to multi-step agents and tool use, with SFT/RM/RLHF contributing roughly +110 Elo. Meta's Llama 3.1 uses iterative SFT, rejection sampling, and DPO over synthetic data, enabling the 405B model to rival GPT-4 via post-training quality (a minimal DPO sketch closes this subsection). Anthropic mirrors this with constitutional RL; xAI emphasizes pre-training scale but integrates RL (e.g., Grok-4 reportedly spends >50% of FLOPs on post-training).[2][6][7]
- Mechanism: RL refines CoT (OpenAI/Google), generates data (Meta), or balances frontiers (Anthropic).
- Vs. Anthropic: All converge on RL for reasoning, but OpenAI leads test-time scaling; Meta democratizes via open-source.
Competitors need RL infrastructure (e.g., verifiable rewards) to match, as post-training now consumes an estimated 25-55% of total compute.[8]
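
To make the Meta-style DPO step concrete, here is a minimal sketch of the Direct Preference Optimization objective (Rafailov et al., 2023) over precomputed sequence log-probs; the tensor layout and beta value are illustrative assumptions, not Llama 3.1's actual configuration:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss over a batch of preference pairs.

    Each input is a (batch,) tensor of summed token log-probs for the
    chosen/rejected completion under the policy or frozen reference model.
    """
    # Implicit reward = beta * log-ratio of policy to reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the chosen-vs-rejected margin via a logistic loss.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

The design choice DPO makes is to optimize preferences directly through the policy/reference log-ratio, avoiding a separately trained reward model and an online RL loop.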

Independent Estimates: Post-Training Marginal Returns Rising (2024-2026)

Epoch AI projects pre-training compute growth of 4-5x/year through 2026 but highlights post-training's underreported growth (e.g., o1/Grok RL at 10-50% of total FLOPs), with inference scaling revitalizing gains amid data limits. EleutherAI's published work centers on pre-training scaling laws (e.g., contamination effects), though 2025 work extends to post-training quantization and inference. No source gives explicit 2024-2026 ratios, but trends suggest a 40-100% capability uplift from post-training against pre-training's diminishing returns (e.g., data-optimal at ~15T tokens).[4][9]
- Epoch: R&D spending exceeds training-compute spending; post-training is a subset of the total (e.g., ~$500M of OpenAI's $5B 2024 compute budget).[10]
The estimate favors post-training for reasoning (high confidence, given observed lab shifts) and pre-training for generality (medium confidence, data-bound).

Consensus Emergence: Post-Training as Shared Frontier

Anthropic's RL thesis is increasingly consensus: labs now allocate an estimated 45-55% of compute to post-training (versus roughly 25% before 2024, when pre-training consumed ~75%), with o1, Gemini, and Llama validating scalable RL and test-time compute as the "new scaling domain." xAI follows suit (Grok is RL-heavy); no lab publicly disputes the thesis, though execution varies (OpenAI: CoT; Google: agents). The view is not Anthropic-specific; it reflects an industry-wide pivot after the post-GPT-4 data walls.[11][8]

| Lab | Pre-Training View | Post-Training Emphasis | Key Evidence/Quote |
|---|---|---|---|
| Anthropic | Continues (Chinchilla holds); compute-bound | RL scaling (constitutional); meta-learning | "RL improves along Pareto frontier" (Douglas/Bricken)[1] |
| OpenAI | Plateauing returns post-GPT-4 | CoT RL; train/test compute scaling | "Improves with more RL... differs from pretraining"[2] |
| Google DeepMind | 4-5x growth; long-context gains | RLHF/SFT for agents/tools | +110 Elo from RL compute (Gemini 2.5)[6] |
| Meta AI | Optimal at ~15T tokens; synthetic unlock | Iterative DPO/SFT; open distillation | 405B rivals GPT-4 via post-training (Llama 3.1)[7] |
| xAI | Aggressive scaling (Grok-5 10T params) | >50% FLOPs on RL | "RL relative to pre-compute is better"[12] |

To enter: build RL pipelines on open models (e.g., Llama); prioritize synthetic RL data over raw tokens (high ROI asymmetry). Confidence: high on the shift itself (lab documentation); medium on exact compute ratios (proprietary).


Recent Findings Supplement (May 2026)

Pre-Training Scaling: Evidence Against Plateauing

Google DeepMind's Gemini 3 release in November 2025 provided the strongest recent evidence that pre-training scaling laws remain intact. The model, reportedly at roughly the same ~1T-parameter scale as Gemini 2.5, achieved large gains (e.g., first to break 1500 Elo on LMSYS Arena; beating GPT-5.1 on 19 of 20 benchmarks) via improved pre-training data quality, architecture, and compute allocation, with an estimated 2-3x more effective pre-training compute than GPT-4o equivalents; this refuted 2025 claims of a "scaling wall."[1][2] Mechanism: better data curation (e.g., long-context handling) and optimization allowed scaling without parameter explosion, implying labs like xAI (planning a ~2-month, 10T-parameter Grok 5 pre-training run by early 2026) and Anthropic will see similar jumps on Blackwell hardware.[3] Implication: no Chinchilla-style plateau yet; the GPT-4-to-o-series trajectory continues via hybrid pre/post-training, though economic shifts (inference demand diverting GPUs) temporarily slowed pure pre-training at OpenAI.[4] A toy effective-compute projection closes this subsection.

  • Gemini 3 Pre-Training Lead Sebastian Borgeaud: "Significant innovations in long-context processing" enabled continued scaling.[5]
  • xAI/Elon Musk: Pre-training for 10T Grok 5 (~2 months) signals aggressive scaling resumption in 2026.[6]
  • Epoch AI (Feb 2026): Software progress ~10x/year reduces compute needed for same capability, sustaining pre-training trends despite data walls.[7]

For competitors: Pre-training moat favors compute-rich labs (Google, xAI); others must optimize data/software or risk falling behind on base capabilities.
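
One way to see why data walls need not imply a capability plateau: physical compute growth and software efficiency gains compound multiplicatively. A toy projection using the two rates cited in this supplement (both are the text's estimates, not measured values):

```python
# Toy effective-compute projection from the two growth rates cited above.
hardware_growth = 4.5   # physical training FLOP growth per year (4-5x/year)
software_growth = 10.0  # algorithmic efficiency gain per year (Epoch, Feb 2026)

effective_growth = hardware_growth * software_growth  # ~45x/year effective
for year in range(1, 4):
    print(f"year {year}: ~{effective_growth ** year:,.0f}x effective compute")
```

Under these assumptions effective compute grows ~45x/year, which is why Epoch's ~10x/year software figure can sustain capability trends even if pure pre-training FLOP growth slows.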

Post-Training as Efficiency Multiplier: OpenAI o-Series vs. Gemini

OpenAI's o-series (o1/o3/o4-mini by early 2026) pioneered RL-heavy post-training on base models such as GPT-4o, using chain-of-thought RL to boost reasoning (e.g., o3-mini controllability scales with model size but degrades with extra RL compute), while Google's Gemini post-training (instruction tuning + RLHF) complemented pre-training for multimodal gains. Both approaches outperform pure scaling, but Anthropic's thesis (per Sholto Douglas) emphasizes RL infrastructure for a complementary pre/post balance.[8][9][10] Mechanism: post-training unlocks latent base-model knowledge via verifiable rewards (e.g., math proofs), and the o-series enables test-time scaling (longer CoT) that can rival ~10x pre-training compute; a self-consistency sketch closes this subsection. Implication: diminishing RL returns (Toby Ord: 100x RL buys roughly half the gain of 100x inference) shift focus to inference/test-time compute, but the efficiency case holds (e.g., DeepScaleR-1.5B matches o1-preview on AIME 2024 with a fraction of the compute).[11]

  • OpenAI o3: RL on reasoning traces; controllability drops >10x with heavy training.[12]
  • Gemini Deep Think (Feb 2026): Post-training + inference scaling hits 90% on IMO-ProofBench Advanced/PhD math.[13]
  • Meta Muse Spark (Apr 2026): >10x pre-training efficiency via post-training scaling laws.[14]

For entrants: Post-training lowers barriers (e.g., open-source RL like DeepSeek-R1-Zero replicates o1 scaling), but proprietary data/RL infra creates gaps.
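
The test-time scaling discussed above can be illustrated with self-consistency sampling (Wang et al., 2022): draw many CoT rollouts and majority-vote the final answers. This is a generic sketch of the technique, not OpenAI's proprietary o-series mechanism; sample_cot_answer is a hypothetical stand-in for a model call:

```python
import random
from collections import Counter

def sample_cot_answer(question: str) -> str:
    """Hypothetical stand-in for one sampled chain-of-thought rollout.

    A real implementation would call a model with temperature > 0 and
    extract the final answer from the generated reasoning trace.
    """
    # Simulate a noisy reasoner that is right 60% of the time.
    return "42" if random.random() < 0.6 else str(random.randint(0, 99))

def self_consistency(question: str, n_samples: int) -> str:
    """Spend more test-time compute: sample n rollouts and majority-vote."""
    votes = Counter(sample_cot_answer(question) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

print(self_consistency("What is 6 * 7?", n_samples=32))
```

Accuracy improves with n_samples at fixed weights, which is the sense in which extra inference compute substitutes for extra pre-training compute.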

Independent Estimates: Post-Training Edges Pre-Training Margins (2025-2026)

Epoch AI (Feb 2026) estimates software progress at ~10x/year (2-50x CI), making post-training (RL/inference) rival pre-training returns; scaling inference compute by 1 OOM often matches 0.5-1 OOM of pre-training compute, with RL saturating faster (e.g., ~10,000x RL compute for a 20-80% reasoning jump).[15] EleutherAI/PleIAs: even small models (350M-1.2B) reach Chinchilla-optimal performance with quality data, implying marginal pre-training gains diminish past ~15T tokens.[16] Mechanism: RL post-training shows power-law predictability (Qwen2.5, 0.5B-72B: larger models ~10x more efficient), but data quality matters more than volume (a fitting sketch closes this subsection). Implication: 2026 favors post-training hybrids amid data walls.

  • Epoch: Pre-training growth slowed to <5x/year; post-training/inference fills gap.[17]
  • USTC/Oxford/Shanghai AI Lab (Apr 2026): RL efficiency shows latent saturation at scale.[18]

For competitors: Independents validate post-training as high-ROI; pre-training viable only with 10x+ software gains.
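
The power-law predictability claim above can be checked with a simple log-log fit; the (compute, error) pairs below are made-up placeholders standing in for real benchmark measurements such as the cited Qwen2.5 sweeps:

```python
import numpy as np

# Illustrative (compute, error) pairs; replace with real benchmark data.
compute = np.array([1e20, 1e21, 1e22, 1e23])  # post-training FLOPs (made up)
error = np.array([0.40, 0.28, 0.20, 0.14])    # task error rate (made up)

# A power law error = a * compute^(-k) is linear in log-log space.
slope, intercept = np.polyfit(np.log10(compute), np.log10(error), deg=1)
print(f"fitted exponent k = {-slope:.2f}")    # ~0.15 for these points

# Extrapolate one more order of magnitude of post-training compute.
pred = 10 ** (intercept + slope * np.log10(1e24))
print(f"predicted error at 1e24 FLOPs: {pred:.3f}")
```

If the fit holds out of sample, each additional OOM of post-training compute buys a predictable fractional error reduction, which is what makes lab compute-allocation decisions tractable.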

Emerging Consensus: Post-Training as New Frontier

No longer Anthropic-specific: all labs (OpenAI's o-series RL, Google's Gemini RLHF/inference scaling, Meta shifting >10% of pre-training compute to post-training, xAI scaling Grok's RL) now allocate 25-55% of compute to post-training (up from the pre-2024 regime in which pre-training consumed ~75%), driven by data limits and inference economics; Sholto Douglas frames pre- and post-training as complementary for 2026 acceleration.[19][10] Mechanism: RL unlocks "aha" reasoning (DeepSeek-R1), and test-time compute (o1/o3) amplifies capability without full retraining. Implication: consensus on a "post-training revolution" (Jan 2026 analyses), with pre-training foundational and post-training the efficiency lever.[20]

| Lab | Pre-Training View | Post-Training Emphasis | Key 2025-2026 Evidence/Cite |
|---|---|---|---|
| Anthropic | Continues; complements RL infra[10] | RL lead (Douglas/Bricken); 2026 acceleration | Claude Code RL scaling[10] |
| OpenAI | Top of S-curve; inference diverted GPUs[4] | o-series CoT RL; test-time frontier | o3-mini controllability; ARGO rubrics[21] |
| Google DeepMind | Intact (Gemini 3 refutes wall)[22] | RLHF + inference scaling | Deep Think 90% on PhD math[13] |
| Meta AI | 10x efficiency gains[14] | >10% pre to post; safety RL | Muse Spark scaling laws[23] |
| xAI | Aggressive (Grok 5 10T)[3] | RL/post-training for reasoning | Colossus 1-2GW clusters[24] |

For entrants: the consensus democratizes via open RL (e.g., Open-Reasoner-Zero), but frontier data moats persist; focus on post-training for 2026 parity.