Source Report
Research Question
Research the current state of evidence (2024–2026) that post-training methods — RLHF, RLAIF, reinforcement learning from outcome feedback, chain-of-thought, test-time compute scaling — are delivering capability gains that rival or exceed pre-training scaling. Pull from published papers (DeepSeek-R1, OpenAI o1/o3, Gemini 2.0 Flash Thinking, Anthropic's Constitutional AI updates), benchmark results on reasoning tasks (AIME, ARC-AGI, SWE-bench, GPQA), and public commentary from researchers. Assess whether the "post-training is the new frontier" thesis changes the compute bottleneck argument or merely shifts where compute matters. Conclude with a structured comparison of pre-training vs. post-training as capability levers.
OpenAI o1/o3: RL Trains Models to Scale Inference Compute Like Pretraining Scales Parameters
OpenAI's o1/o3 series demonstrates how reinforcement learning (RL) on chain-of-thought reasoning creates log-linear "test-time scaling laws," where performance on reasoning benchmarks improves predictably with additional inference compute, mirroring pretraining's parameter-data scaling but shifting compute from training to deployment. The mechanism: RL trains the model to generate longer, self-correcting reasoning traces during inference, effectively turning extra tokens into "thinking time" that boosts accuracy (e.g., o1's AIME score rises from ~50% at low compute to 74% at high compute).[1][2]
- o3 scaled RL training compute 10x over o1, pushing AIME from 83% (o1) to 96%+ while maintaining log-linear gains; both show train-time RL compute yielding similar curves to GPT pretraining.[3]
- Benchmarks: o3 hits 87.5% ARC-AGI (private eval), 87.7% GPQA Diamond (exceeding PhD experts at 65%), 71.7% SWE-bench; o1-mini at 83.3% AIME 2024.[4]
For competitors: OpenAI's RL moat lies in proprietary synthetic reasoning data derived from massive pretraining, enabling efficient RL scaling without running into the exhaustion of public data.
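To make the log-linear claim concrete, the minimal sketch below fits accuracy against log inference compute. The data points and fitted coefficients are illustrative placeholders, not published o1/o3 measurements; only the functional form (accuracy ≈ a + b·log C, up to a ceiling) reflects the reported behavior.
```python
import numpy as np

# Hypothetical (relative compute, accuracy) pairs illustrating a log-linear
# test-time scaling curve; these are NOT published o1/o3 numbers.
compute = np.array([1.0, 4.0, 16.0, 64.0, 256.0])
accuracy = np.array([0.42, 0.51, 0.60, 0.68, 0.75])

# Fit accuracy ~ a + b * log(compute), the "test-time scaling law" form.
b, a = np.polyfit(np.log(compute), accuracy, deg=1)

def predicted_accuracy(c: float) -> float:
    """Predict accuracy at relative inference compute c (clipped to [0, 1])."""
    return float(np.clip(a + b * np.log(c), 0.0, 1.0))

print(f"fit: accuracy ~ {a:.3f} + {b:.3f} * log(compute)")
print(f"extrapolation to 1024x compute: {predicted_accuracy(1024.0):.3f}")
```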
DeepSeek-R1: Pure RL Unlocks Reasoning from Base Pretraining (5% Compute Fraction)
DeepSeek-R1 applies RL from verifiable rewards (RLVR) directly to a V3 base model (2.8M H800-hours pretrain), using just 147K hours (~5%) for multi-stage RL—yet matches o1-preview on AIME (79.8% vs 83.3%), GPQA (71.5%), ARC-AGI (72.6%), and SWE-bench (49.2%). Mechanism: Group Relative Policy Optimization (GRPO) generates response groups per prompt, ranks by verifiable outcomes (e.g., code execution, math solvers), and optimizes without a separate reward model, eliciting emergent chain-of-thought from base capabilities.[5][6]
- RLVR stages: cold-start SFT on synthetic traces, pure RL (R1-Zero: 71% AIME from a 15.6% base), rejection sampling, and a final RL pass; total ~$1M on top of V3's ~$5M pretrain.[7]
- Open-sourced (MIT), beats o1-mini on math/code; distillation to 32B retains most gains, proving RL teaches transferable reasoning patterns.[8]
Implication: Strong pretraining provides "latent reasoning potential"; RL extracts it cheaply, challenging pretrain dominance but requiring verifiable tasks.
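A minimal sketch of the group-relative advantage step at the core of GRPO as described above: sample a group of responses per prompt, score each with a verifiable reward, and normalize within the group so no separate value or reward model is needed. The function names and the simplified, unclipped objective below are illustrative assumptions, not DeepSeek's implementation.
```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """GRPO-style advantages: normalize verifiable rewards within each group.

    rewards: shape (num_prompts, group_size); e.g. 1.0 if the final answer
    passed the math checker or unit tests, else 0.0.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

def grpo_loss(logprobs: torch.Tensor, ref_logprobs: torch.Tensor,
              advantages: torch.Tensor, kl_coef: float = 0.04) -> torch.Tensor:
    """Simplified GRPO objective: policy-gradient term weighted by the
    group-relative advantage plus a KL penalty toward the reference policy.

    logprobs/ref_logprobs: summed log-probability of each sampled response,
    shape (num_prompts, group_size). The published objective also uses
    PPO-style ratio clipping and an unbiased KL estimator, omitted here.
    """
    policy_gradient = -(advantages.detach() * logprobs).mean()
    kl_penalty = (logprobs - ref_logprobs).mean()  # crude KL(policy || ref) estimate
    return policy_gradient + kl_coef * kl_penalty
```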
Gemini 2.0/3 Deep Think & Anthropic Claude: Hybrid RL/CoT for Frontier Benchmarks
Gemini 3 Deep Think uses "thinking levels" (low/medium/high) via internal CoT+search, hitting 84.6% ARC-AGI-2 (verified), 94.3% GPQA, while Claude Opus 4.7 reaches 94.6% GPQA, 87.6% SWE-bench via Constitutional AI (RLAIF+RLHF on 80-page principles).[9][10]
- Gemini: Adaptive test-time compute scales inference like o1 (e.g., 77.1% ARC-AGI-2 for 3.1 Pro); no public RL details, but "Deep Think" implies RL-trained traces.[11]
- Anthropic: 2026 constitution expands RLAIF (AI self-critique+revision per principles), combined with RLHF; Claude leads SWE-bench (76.8%), AIME ~83%.[12]
Non-obvious: these models rival o1/o3 on reasoning, but Gemini lags on coding (63.8% SWE-bench vs o3's 71.7%), showing that RL specialization matters.
Post-Training Scaling Laws: Log-Linear Gains, But Cheaper Than Pretraining
2025 surveys confirm post-training (SFT + RLxF + test-time compute) follows Chinchilla-like laws: reward/accuracy scales roughly with log(RL compute), robust across RLHF/RLAIF/DPO/GRPO, but saturates faster than pretraining (e.g., RL needs ~20x less data for the same capability delta).[13][14]
- Nathan Lambert (2025): Post-training now accounts for 40%+ of total compute (up from <10%), but still delivers capability gains far more cheaply than an equivalent pretraining scale-up.
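As a compact way to state the contrast above (illustrative functional forms only; the coefficients are not fitted values from the cited surveys): pre-training loss falls as a power law in training compute, while post-training accuracy rises roughly log-linearly in RL/test-time compute and then saturates at a ceiling.
```latex
% Pre-training: loss falls as a power law in training compute C_pre.
L(C_{\text{pre}}) \;\approx\; L_{\infty} + \alpha\, C_{\text{pre}}^{-\beta}

% Post-training: accuracy rises ~log-linearly in RL / test-time compute
% C_post, then saturates at a task ceiling A_max.
\mathrm{Acc}(C_{\text{post}}) \;\approx\; \min\!\left(A_{\max},\; a + b \log C_{\text{post}}\right)
```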
Recent Findings Supplement (May 2026)
DeepSeek-R1 Pioneers RLVR as Post-Training Frontier
DeepSeek-R1 deploys Reinforcement Learning with Verifiable Rewards (RLVR) directly on base models without supervised fine-tuning (SFT), using Group Relative Policy Optimization (GRPO) to reward only final-answer correctness on math/code tasks; this elicits emergent chain-of-thought (CoT) reasoning—longer traces with verification/reflection—yielding o1-level performance at ~1/10th compute cost, as the policy explores freely rather than imitating human patterns.[1][2]
- RLVR paper updated Jan 2026 (86 pages): Details R1-Zero (pure RL from V3-Base) hitting 71% AIME 2024 pass@1 (from 15.6%), via self-evolution on MATH levels 4-5 (55%→90%); R1-0528 update May 2025 boosts AIME 2025 to 87.5% (+17.5pts), GPQA 81% (+9.5pts).[3][4]
- Benchmarks: 79.8% AIME 2024 (beats o1-mini), 80.2% HumanEval; distills to 7B/14B models rivaling 235B thinkers on MATH-500 (92.8%).[5][6]
For competitors: RLVR scales reasoning 10-20x cheaper than pre-training equivalents; open-weights democratize it, but lacks o3's multimodal/generalization edge—focus on verifiable domains.
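A minimal sketch of the "verifiable reward" idea behind RLVR: the reward comes from a programmatic checker (exact-match on a boxed answer for math, unit-test execution for code) rather than a learned reward model. The checker functions below are simplified assumptions, not DeepSeek's graders, and a production grader would sandbox code execution.
```python
import re
import subprocess
import tempfile

def math_reward(model_output: str, ground_truth: str) -> float:
    """1.0 iff the \\boxed{...} final answer matches the reference exactly."""
    match = re.search(r"\\boxed\{([^}]*)\}", model_output)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

def code_reward(candidate_code: str, test_code: str, timeout_s: int = 10) -> float:
    """1.0 iff the candidate program passes the appended unit tests."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout_s)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0
```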
OpenAI o3/o3-mini: RL-Enhanced Test-Time Scaling Hits Saturation
OpenAI's o3 (Apr 2025) internalizes RL-trained CoT via "reasoning effort" levels (low/medium/high), auto-allocating test-time compute for ~2x prior accuracy on hard tasks; o3-mini (Jan 2025, $1.1/$4.4/M tokens) optimizes STEM at 10x o1 cost-efficiency, but gains diminish on saturated evals like GPQA (near-PhD ceiling).[7][8]
- Benchmarks: o3 87.7% GPQA Diamond (above PhD human experts), 71.7% SWE-bench Verified (+23 pts over o1), 83.3% AIME 2024, 2727 Codeforces Elo; o3-mini-high 87.3% AIME, 79.7% GPQA, but ARC-AGI-2 ~53% (behind Gemini).[9][10]
- Feb 2026 evals: o3-mini trails GPT-5.2 (100% AIME 2025, 92.4% GPQA) by 5-15pts on reasoning, but 6x cheaper inference.[11]
Implication: Shifts compute from pre-train FLOPs to inference tokens (billed as output), rivaling pre-training yields but exposing users to variable latency/cost; proprietary black-box limits replication.
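Because reasoning tokens are billed as output, per-query cost scales with trace length. A back-of-the-envelope sketch, reading the o3-mini prices quoted above as $1.10 input / $4.40 output per million tokens (an assumption about the quoted figures); the token counts are hypothetical examples.
```python
# Assumed reading of the quoted o3-mini prices (USD per million tokens).
PRICE_IN_PER_MTOK = 1.10
PRICE_OUT_PER_MTOK = 4.40

def query_cost(input_tokens: int, visible_output_tokens: int, reasoning_tokens: int) -> float:
    """Reasoning ("thinking") tokens are billed at the output rate."""
    billed_output = visible_output_tokens + reasoning_tokens
    return (input_tokens * PRICE_IN_PER_MTOK + billed_output * PRICE_OUT_PER_MTOK) / 1_000_000

# Same prompt at low vs. high reasoning effort (hypothetical trace lengths):
print(f"low effort:  ${query_cost(2_000, 500, 1_000):.4f}")   # ~$0.0088
print(f"high effort: ${query_cost(2_000, 500, 20_000):.4f}")  # ~$0.0924
```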
Gemini 2.0 Flash Thinking: Efficient Inference Scaling via CoT Internalization
Gemini 2.0 Flash Thinking (experimental, Feb-Mar 2025) embeds post-training CoT into a fast MoE architecture, boosting reasoning ~20% over base Flash at 10x lower cost than GPT-4o ($0.25/$1.50/M tokens); it exposes thought traces for transparency but trails o3 on pure math (73.3% AIME 2024 vs 83.3%).[12][13]
- Benchmarks: 74.2% GPQA, strong multilingual/zero-shot (0.98 F1 NER); Deep Think mode (Gemini 3 lineage) hits 77.1% ARC-AGI-2 (2x prior), but SWE-bench ~76%.[11][14]
For entrants: multimodal-native models plus a cheap API excel at agentic tasks; post-training amortizes test-time compute into base speed, but open benchmarks lag the US leaders by 5-10 pts.
Anthropic Constitutional AI Evolves to Value-Based RL
Anthropic's Jan 2026 Claude Constitution (79 pages, CC0) shifts RLHF/RLAIF from rule-lists to explained principles (e.g., "prioritize safety over usefulness"), generating synthetic data for RL that internalizes "character" (honest/curious/prosocial); reduces reward hacking vs. prior versions.[15]
- Claude 4.x (2026): Opus 4.6 91.3% GPQA, 80.8% SWE-bench Verified, 68.8% ARC-AGI-2 (leads commercial); Sonnet 4.6 89.9% GPQA w/adaptive thinking.[11][16]
This approach rivals RLVR by scaling AI feedback (RLAIF) on principles rather than outcomes; it yields strong safety (low jailbreak rates), but verbose CoT hurts speed.
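A minimal sketch of the RLAIF critique-and-revision loop described above: the model critiques its own draft against a sampled principle and rewrites it, yielding (draft, revised) pairs that can seed SFT data and AI-labeled preferences for RL. The generate callable and the principle strings are placeholders, not Anthropic's pipeline or the published constitution text.
```python
from typing import Callable, List, Tuple

# Placeholder principles standing in for entries in the published constitution.
PRINCIPLES: List[str] = [
    "Prioritize safety over usefulness when the two conflict.",
    "Be honest about uncertainty rather than guessing confidently.",
]

def constitutional_revision(prompt: str,
                            generate: Callable[[str], str],
                            principles: List[str] = PRINCIPLES) -> Tuple[str, str]:
    """One critique-and-revise pass: draft -> self-critique -> revision.

    `generate` is any text-completion callable (API client or local model).
    The (draft, revised) pair can be used as synthetic training data; an AI
    judge comparing the two provides the preference label for RLAIF.
    """
    draft = generate(prompt)
    revised = draft
    for principle in principles:
        critique = generate(
            f"Principle: {principle}\nResponse: {revised}\n"
            "Critique any way the response violates the principle."
        )
        revised = generate(
            f"Response: {revised}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return draft, revised
```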
Post-Training Scaling Laws: Saturating Pre-Training Returns
Post-training (RLHF/RLAIF/RLVR + test-time compute) follows power-law compute-performance curves like pre-training but with distinct optima: RL post-training hits its knee at ~10-100x fewer FLOPs for reasoning (e.g., GenRMs scale reward quality, but the evaluator gap closes after policy optimization); inference scaling (longer CoT) adds log-linear gains, shifting the bottleneck to runtime energy/memory.[17][18]
- Evidence: ScaleRL (2025) shows sigmoidal RL compute curves (low gain → sharp improvement → saturation); Qwen3 experiments find thinking GenRMs add +1-2% on validation but the gain reverses during policy optimization; DeepSeek R1 shows RL alone matching o1 without an SFT data moat.[19]
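The sigmoidal shape reported for RL compute curves can be written compactly; the form below is a generic saturating curve consistent with that description (slow start, sharp mid-range gains, saturation), with all symbols illustrative rather than fitted ScaleRL parameters.
```latex
% A     = asymptotic pass rate (ceiling as RL compute grows)
% C_mid = RL compute at which half the ceiling is reached
% B     = steepness of the sharp mid-range gains
\mathrm{PassRate}(C_{\mathrm{RL}}) \;\approx\; \frac{A}{1 + \left(C_{\mathrm{mid}} / C_{\mathrm{RL}}\right)^{B}}
```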
This doesn't negate pre-training (the base-knowledge moat persists), but it reallocates ~80% of frontier compute to post-training and inference; entrants prioritize verifiable RLVR for ~10x efficiency over data-hungry pre-training.
| Lever | Pre-Training | Post-Training (RLHF/RLVR + Test-Time) |
|---|---|---|
| Compute Scaling | FLOPs → params/data (diminishing returns beyond ~1e25 FLOPs); knowledge ceiling near the ~65% PhD-expert GPQA baseline.[11] | Inference tokens → CoT depth (log-linear, 10x cheaper); reasoning saturating at AIME 95-100%.[9] |
| Key Gains | Broad capabilities (MMLU 90%+). | Specialized reasoning (SWE 80%, ARC-AGI-2 77%); emergent verification.[20] |
| Bottlenecks | Data/energy (Chinchilla-optimal data budgets already reached). | Runtime latency/cost; unverifiable domains (limits of RLVR).[21] |
| 2026 Frontier | ~1e26 FLOPs base models (GPT-5/Gemini 3). | 50-80% of capability gains from RL/inference (o3/R1); hybrids win (Claude Constitutional AI). |