Research Question

Research the public evidence that Chinese AI labs — specifically DeepSeek, Alibaba Qwen, Moonshot Kimi, and others — are achieving frontier or near-frontier performance at significantly lower compute and cost than US labs. Include DeepSeek-V3 and R1 training cost claims, Qwen 2.5/3 benchmark comparisons, public analysis of their architectural and algorithmic innovations (MoE efficiency, inference optimization, distillation), and what export controls on H100/H800 chips have or have not constrained. Assess how this evidence challenges or qualifies the "more compute wins" thesis and what it implies for the US compute-scaling strategy. Produce a side-by-side capability-vs-compute table where data is publicly available.

DeepSeek-V3: MoE Architecture Delivers Frontier Performance on 10x Less Compute Than Llama 3.1 405B

DeepSeek-V3, a 671B-parameter Mixture-of-Experts (MoE) model that activates only 37B parameters per token, matches or exceeds Llama 3.1 405B (a dense model) on key benchmarks such as MATH-500, AIME 2024, Codeforces, and SWE-bench Verified, while using just 2.8M H800 GPU hours versus Llama's 30.8M H100 hours for a similar token count (~15T). The efficiency stems from multi-head latent attention (MLA) to compress the KV cache, multi-token prediction for a denser learning signal, mixed FP8/BF16 precision (roughly doubling FFN throughput), and custom MoE routing with fine-grained experts (cutting all-to-all communication overhead by 50%+). The result: ~250 GFLOPs per training token versus ~2,448 for Llama 405B, enabling training in under two months on a modest 2,048-GPU cluster.[1][2][3]
- DeepSeek-V3: 14.8T tokens, 2.788M H800 hours (~$5.6M at $2/GPU-hr, final run only; R&D likely adds 2-4x more; see the check after this list).[4]
- Llama 3.1 405B: ~15T tokens, 30.84M H100 hours (~$90M+ at $3/GPU-hr equivalent).[3]
- Benchmarks: DeepSeek-V3 ranks in the Chatbot Arena top 10 and beats GPT-4o/Claude 3.5 Sonnet in pairwise comparisons on hard evals; Llama lags on coding/math.[2]
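
A back-of-envelope check of the figures above, using the standard rule of thumb that training costs about 6 FLOPs per parameter per token (forward plus backward pass). The per-GPU-hour rates are the assumed rates quoted in the bullets; the small gap versus the ~250/2,448 GFLOPs figures comes from attention overhead the rule of thumb ignores:

```python
# Back-of-envelope check of the compute/cost figures cited above.
# Rule of thumb: training FLOPs per token ~= 6 * active parameters
# (forward + backward pass for a decoder-only transformer, attention ignored).

def train_flops_per_token(active_params: float) -> float:
    """Approximate training FLOPs per token for a dense or MoE transformer."""
    return 6 * active_params

def final_run_cost(gpu_hours: float, usd_per_gpu_hour: float) -> float:
    """Final-run cost only; excludes R&D, ablations, and failed runs."""
    return gpu_hours * usd_per_gpu_hour

# DeepSeek-V3: 37B active params, 2.788M H800-hours at ~$2/GPU-hr (assumed rate).
v3_flops = train_flops_per_token(37e9)      # ~2.2e11 FLOPs/token (~222 GFLOPs)
v3_cost  = final_run_cost(2.788e6, 2.0)     # ~$5.6M

# Llama 3.1 405B: dense, all 405B params active, 30.84M H100-hours at ~$3/GPU-hr.
llama_flops = train_flops_per_token(405e9)  # ~2.4e12 FLOPs/token (~2,430 GFLOPs)
llama_cost  = final_run_cost(30.84e6, 3.0)  # ~$92.5M

print(f"V3:    {v3_flops/1e9:,.0f} GFLOPs/token, final run ~${v3_cost/1e6:.1f}M")
print(f"Llama: {llama_flops/1e9:,.0f} GFLOPs/token, final run ~${llama_cost/1e6:.1f}M")
print(f"Compute ratio per token: ~{llama_flops/v3_flops:.0f}x")
```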

Implications for Competitors: US labs cannot replicate this efficiency without MoE retrofits (data moats do not transfer), but owning large H100 fleets still allows roughly 10x more experiments, which matters for discovery. New entrants need $1B+ in CapEx for comparable clusters, or must rent via cloud at 2-3x markups.

| Model | Total Params | Active Params | Tokens (T) | GPU Hours (M) | GPU Type | Est. Cost (Final Run, USD) | MMLU-Pro / MATH |
|---|---|---|---|---|---|---|---|
| DeepSeek-V3 | 671B (MoE) | 37B | 14.8 | 2.79 | H800 | $5.6M | 71% / 83% [2] |
| Llama 3.1 405B | 405B (dense) | 405B | ~15 | 30.8 | H100 | ~$92M | 65% / lower [3] |

Alibaba Qwen 2.5: Dense Models Punch Above Weight via Data-Centric Scaling, No Compute Disclosure

Qwen 2.5-72B (dense) rivals GPT-4o on MMLU (86.1%), HumanEval (86.6%), and MATH (83.1%) through pre-training on 18T tokens weighted toward code/math synthetics and multilingual data (29+ languages), plus post-training for structured outputs, outpacing Llama 3.1-70B by 4-10pp on most evals without MoE. Efficiency comes from density improvements (e.g., the 32B matches the prior 72B), long-context packing (128K), and prompt robustness, but no GPU hours or costs were released (unlike DeepSeek). Smaller variants (e.g., the 7B Coder) beat larger peers via targeted distillation.[5]
- Qwen2.5-72B: Beats Llama-3-405B base on knowledge/coding; Qwen2.5-Math-72B tops GPT-4o on math via CoT/PoT.
- API inference: ~$0.23-0.40 per 1M tokens (roughly 10x cheaper than GPT-4o), 200+ tokens/s on optimized setups (workload cost sketched below).[6]
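
To make the pricing gap concrete, a small workload-cost comparison. The Qwen rate is the upper end of the range quoted above; the GPT-4o rate ($2.50 input / $10 output per 1M tokens) and the traffic mix are assumptions for illustration:

```python
# Illustrative monthly inference-cost comparison (rates are assumptions noted above).
def monthly_cost(m_input_tokens: float, m_output_tokens: float,
                 price_in: float, price_out: float) -> float:
    """Cost in USD for a month of traffic, with prices per 1M tokens."""
    return m_input_tokens * price_in + m_output_tokens * price_out

# Hypothetical workload: 500M input + 100M output tokens per month.
workload = dict(m_input_tokens=500, m_output_tokens=100)

qwen  = monthly_cost(**workload, price_in=0.40, price_out=0.40)   # upper end of the quoted Qwen range
gpt4o = monthly_cost(**workload, price_in=2.50, price_out=10.00)  # assumed GPT-4o list price

print(f"Qwen 2.5-72B API: ${qwen:,.0f}/mo, GPT-4o: ${gpt4o:,.0f}/mo ({gpt4o/qwen:.0f}x)")
```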

Implications for Competitors: Shows data quality can beat raw scale at mid-tier sizes; US firms must match the synthetic-data pipelines. Entrants: leverage Qwen for cheap coding/math agents, but fine-tune on proprietary data.

Moonshot Kimi: Sparse MoE Scales to 1T Params on H800s, Matching DeepSeek Efficiencies

Moonshot's Kimi K2 series (1T-parameter MoE, 32B active) claims ~$4.6M training cost (unverified; comparable to DeepSeek-V3's $5.6M) and rivals GPT-5/Claude on agentic benchmarks (HLE 44.9%, SWE-bench 71%) via hybrid linear attention, INT4 quantization-native training, and 384-expert routing. Like DeepSeek, it trains on H800s; no exact GPU hours are disclosed, but sparse activation puts inference at roughly 12% of the cost of dense peers.[7]
- Kimi K2.5: GSM8K 94.4% (beats GPT-4.1); API ~$0.60 input / $3 output per 1M tokens (5-10x cheaper).
- Vs DeepSeek: similar cost/performance; Kimi edges ahead on agentic/tool-use tasks.

Implications for Competitors: Validates MoE for constrained hardware. US labs: adopt it for inference savings; entrants: run quantized variants on consumer GPUs.

Export Controls: H800 Loophole Closed, But Spurred Efficiency—Now Huawei Ascends

US controls (Oct 2022: A100/H100 blocked; Oct 2023: the compliant A800/H800 variants blocked too) left DeepSeek, Qwen, and Kimi training on H800s (roughly 400 GB/s interconnect, ~44% of the H100's 900 GB/s), yet they innovated around the constraint (e.g., DeepSeek's custom communication protocols largely offset the bandwidth loss). DeepSeek claims V3 was trained purely on H800s; skeptics cite pre-ban A100 stockpiles (10K+) or smuggling (e.g., Malaysian shell companies sourcing H100s). The controls accelerated MoE/FP8 work (China now leads in open MoE), but tightened 2024-26 rules (H20 bans) are reportedly delaying R2/V4; Huawei chips underpin GLM-5 (744B). The compute gap remains: the US holds ~74% of global AI compute vs China's ~14%.[8][9]
- Evidence: DeepSeek V3/R1 were trained on H800s; no proof of H100 use (Nvidia denies supplying them); delays attributed to H20 shortages.
- Distillation: OpenAI/Anthropic accuse DeepSeek/Moonshot/MiniMax of scraping their APIs for training data (16M+ Claude queries alleged).

Implications for Competitors: Controls buy time (China lags ~7 months on ECI), but efficiency gains erode that lead; the US must also target distillation and IP leakage.

Challenging "More Compute Wins": Algorithms + Data Now Paramount

The evidence qualifies the scaling laws: DeepSeek and Qwen reach ~90% of frontier performance on 10-20% of the compute via MoE (3-5% of parameters active), distillation (R1's reasoning distilled into smaller dense models), and synthetic data. "Compute wins" still holds for discovery (US labs can run ~10x more experiments), but inference and deployment favor efficiency, where China holds the edge. Implication: the US strategy shifts toward software moats (e.g., o1-style reasoning), while open-weight diffusion (Qwen/DeepSeek under Apache/MIT licenses) commoditizes capabilities.[10]

For US Compute-Scalers: Double down on 100K+ H100 clusters for AGI-scale breakthroughs, and adopt MoE/distillation to match the efficiency gains. Entrants: build on Chinese open weights (e.g., Qwen at ~$0.2/M tokens) and focus on verticals; risks include security concerns and potential distillation bans. Confidence: high on the benchmark claims (papers are verifiable); medium on full costs (R&D excluded); low on H100-evasion claims (anecdotal). More independent audits are needed.


Recent Findings Supplement (May 2026)

Recent Model Launches and Efficiency Claims (April 2026)

DeepSeek released V4-Pro (1.6T total parameters, 49B active MoE) and V4-Flash (284B total, 13B active) on April 24, 2026, as open-weight models under the MIT license with 1M-token context. They use a hybrid of Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA), reducing single-token inference FLOPs to 27% and the KV cache to 10% of V3.2's at 1M context (V4-Flash: 10% FLOPs, 7% KV). Pre-trained on 32-33T tokens; post-trained via two-stage SFT+GRPO followed by distillation. Benchmarks claim near-parity with GPT-5.4/Claude Opus 4.6/Gemini 3.1 Pro (e.g., 80.6% SWE-bench Verified vs Claude's 80.8%; Codeforces rating 3206). API pricing: V4-Pro $1.74/1M input, $3.48/1M output (~1/6-1/7 of US frontier cost); V4-Flash $0.14/$0.28.[1][2][3]
- V4 validated on Huawei Ascend NPUs (1.5-1.73x speedup via fine-grained Expert Parallelism), signaling reduced Nvidia reliance.[4]
- No training compute, FLOPs, or cost figures are disclosed in the tech report; the inference efficiency stems from MoE sparsity plus attention compression, enabling frontier capability at lower runtime compute (cache-size sketch below).
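
The KV-cache reduction claims above can be grounded with the standard cache-size formula: cache bytes = 2 (K and V) x layers x KV heads x head dim x bytes per element x sequence length. A minimal sketch with hypothetical model dimensions (V4's real layer/head configuration is not given here, so the absolute numbers are illustrative; only the claimed ratios come from the report):

```python
# KV-cache size estimate: why attention compression matters at 1M-token context.
# cache_bytes = 2 (K and V) * layers * kv_heads * head_dim * bytes_per_elem * seq_len

def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 seq_len: int, bytes_per_elem: int = 2) -> float:
    """Per-request KV-cache size in GiB for a vanilla-attention transformer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * seq_len / 2**30

# Hypothetical dimensions for illustration only (not V4's published config).
baseline = kv_cache_gib(layers=64, kv_heads=16, head_dim=128, seq_len=1_000_000)

# Reported ratios: V4-Pro keeps ~10% of V3.2's cache at 1M context, V4-Flash ~7%.
print(f"vanilla cache @1M tokens: ~{baseline:.0f} GiB per request")
print(f"at 10% (V4-Pro claim):   ~{baseline * 0.10:.0f} GiB")
print(f"at  7% (V4-Flash claim): ~{baseline * 0.07:.0f} GiB")
```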

Moonshot AI's Kimi K2.6 (1T MoE, ~32B active) launched ~April 2026, topping open-weight leaderboards (Intelligence Index 54; beats GPT-5.4/Claude 4.6 on reasoning/coding). It supports swarms of 100+ agents and multimodal input; pricing is $0.95/1M input, $4/1M output. The prior K2 was trained on 15.5T tokens; no new compute details, but Moonshot emphasizes low-bit quantization (INT4) for edge deployment.[5]

Alibaba's Qwen3.6 series (e.g., 35B-A3B MoE with 3B active; a 27B dense variant), released in April 2026, beats the prior Qwen3-235B-A22B on agentic coding (SWE-bench Verified 75%; Terminal-Bench 51.5%). The lineup mixes dense and MoE models for balance; no training FLOPs or cost are given, but the stated theme is "more intelligence, less compute" via architecture, RL, and data.[6]

Implications for entrants: open weights plus MoE enable self-hosting frontier-class models on an 8x H100 node (~$300K cluster) at roughly 1/6 the inference cost of US APIs, eroding closed-model moats for high-volume apps (rough break-even sketched below).
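
A rough self-hosting break-even illustrating the entrant economics; every constant below (amortization period, power, throughput, utilization, and the blended US API price) is an assumption for illustration, not a figure from the sources above:

```python
# Rough break-even: self-hosting an open-weight MoE on an 8x H100 node vs. paying a US API.
# Every number below is an assumption for illustration (hardware price, utilization,
# throughput, and API price are not from the cited sources).

CLUSTER_COST_USD   = 300_000   # 8x H100 node, as quoted above
AMORTIZATION_YEARS = 3
POWER_OPEX_PER_MO  = 3_000     # assumed power + hosting
NODE_TOKENS_PER_S  = 2_500     # assumed aggregate decode throughput for a sparse MoE
UTILIZATION        = 0.5       # fraction of the month the node is busy

API_PRICE_PER_M    = 10.0      # assumed blended US frontier API price, $/1M tokens

monthly_hw   = CLUSTER_COST_USD / (AMORTIZATION_YEARS * 12) + POWER_OPEX_PER_MO
monthly_toks = NODE_TOKENS_PER_S * UTILIZATION * 30 * 24 * 3600 / 1e6  # in millions

self_host_per_m = monthly_hw / monthly_toks
print(f"self-host: ~${self_host_per_m:.2f}/1M tokens at {monthly_toks:,.0f}M tokens/mo "
      f"vs API ~${API_PRICE_PER_M:.2f}/1M")
```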

Architectural Innovations Driving Efficiency (Dec 2025-Apr 2026)

DeepSeek-V3.2 (Dec 2025 arXiv) introduced DeepSeek Sparse Attention (DSA): fine-grained tokens are compressed into coarse blocks (16:1 ratio) and combined with a sliding window, cutting 128K-context inference cost by 60%+ (prefill $0.2/M vs $0.7/M tokens; decode $0.8/M vs $2.4/M on H800). Manifold-Constrained Hyper-Connections replace residual connections for stable scaling; scalable RL with >10% of the pre-training compute budget yields GPT-5 parity (e.g., gold-level IMO/IOI 2025). V4 extends this with CSA/HCA, reaching 1M context at 27% of the prior FLOPs.[7][3]
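
To illustrate why a 16:1 compression ratio plus a local window shrinks long-context attention cost so sharply, a toy count of keys scored per query; the window size and the exact split between compressed and local attention are assumptions, not DeepSeek's published formulation:

```python
# Toy cost model: keys attended per query under full vs. DSA-style sparse attention.
# Full attention scores every prior token; the sparse variant scores a 16:1
# coarse-compressed view of the prefix plus a local sliding window.

def full_attention_keys(context_len: int) -> int:
    return context_len

def sparse_attention_keys(context_len: int, compression: int = 16,
                          window: int = 4096) -> int:
    """Keys per query: compressed prefix + uncompressed local window (assumed 4K)."""
    return context_len // compression + window

ctx = 128_000
full = full_attention_keys(ctx)
sparse = sparse_attention_keys(ctx)
print(f"keys scored per query at {ctx:,} tokens: full={full:,}, sparse={sparse:,} "
      f"({sparse / full:.0%} of full)")
```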

Chinese labs favor MoE (it dominates the top-10 open models; e.g., Kimi K2.6 at 1T total/32B active, Qwen3.6-35B-A3B at 35B/3B) for sparse activation (10-30B active versus a dense equivalent), slashing inference FLOPs 70-95% while matching dense knowledge capacity. Distillation transfers capability from large MoE "teachers" to small dense "students" (e.g., DeepSeek-R1-Distill-Qwen-1.5B), and RL post-training is scaled to 10%+ of the pre-training budget.[8]
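
For reference, the classic teacher-student distillation objective is a temperature-softened KL term blended with ordinary cross-entropy; a minimal PyTorch sketch of that generic recipe follows. Note that DeepSeek's released R1 distills were reportedly produced by supervised fine-tuning on teacher-generated traces rather than logit matching, so this is the textbook technique, not their exact pipeline:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Classic knowledge-distillation objective: soft-target KL + hard-label CE.

    student_logits, teacher_logits: (batch, seq, vocab)
    labels: (batch, seq) token ids for the hard cross-entropy term
    """
    # Soft targets: match the teacher's tempered output distribution.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature**2

    # Hard targets: ordinary next-token cross-entropy on the ground-truth labels.
    ce = F.cross_entropy(student_logits.reshape(-1, student_logits.size(-1)),
                         labels.reshape(-1))

    return alpha * kd + (1 - alpha) * ce
```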

Implications: Export constraints forced MoE/attention optimizations; US labs must co-design hardware/software for equivalent efficiency, as raw scaling yields diminishing returns.

Export Controls: Limited Constraint on Progress (Ongoing 2026)

The US tightened the H100/H800 bans (Oct 2023 onward); the H20 (~15% of H100 performance) was licensed in limited volumes until Apr 2025 (~1.5M units shipped to China, i.e., ~224K H100-equivalents at that performance ratio); H200 approvals were paused and then reversed amid smuggling cases ($92M in H100s). Nvidia's China share is effectively zero; labs rely on H800 stockpiles and Huawei Ascend. DeepSeek V4 is optimized for Huawei hardware (fine-grained expert-parallelism speedup), and there is no visible slowdown: V4 rivals GPT-5.4 despite the constraints.[9][4]

Implications: controls spurred efficiency (e.g., DeepSeek's $5.6M V3 run on 2.8M H800-hours versus $100M+ estimates for GPT-4), narrowing the US lead; future US strategy needs an algorithm/data focus rather than hardware denial alone.

Capability-vs-Compute Table (Public Data Only)

| Model | Total Params | Active Params | Pre-train Tokens | Train Compute/Cost (est.) | Key Benchmarks (Recent) | US Frontier Comparison |
|---|---|---|---|---|---|---|
| DeepSeek-V3 (2024 base, referenced 2026) | 671B MoE | 37B | 14.8T | 2.8M H800-hours (~$5.6M)[10] | AIME 2025: 96%; GPQA: 81% | GPT-5 math parity; 10x cheaper[11] |
| DeepSeek-R1 (RL on V3-Base) | 671B MoE | 37B | - | RL stage ~$294K (512 H800s)[12] | SWE-bench: ~71%; AIME: 87.5% | GPT-4o+; 25-30x cheaper tokens[13] |
| DeepSeek-V4-Pro (Apr 2026) | 1.6T MoE | 49B | 33T | Not disclosed | SWE-bench Verified: 80.6%; GPQA Diamond: 90.1%; LiveCodeBench: 93.5%[2] | GPT-5.4/Claude 4.6 (est. 3-6 mo lag)[1] |
| Kimi K2.6 (Apr 2026) | 1T MoE | ~32B | 15.5T (K2 base) | Not disclosed (~$4.6M prior K2 est.)[14] | Intelligence Index: 54; top open model on deep reasoning/coding[5] | GPT-5.4/Claude 4.6; 10x inference leap on GB200[15] |
| Qwen3.6-35B-A3B (Apr 2026) | 35B MoE | 3B | Not disclosed | Not disclosed | SWE-bench Verified: 75%; Terminal-Bench 2.0: 51.5%[6] | Beats Qwen3-235B-A22B; rivals 27B dense[16] |

Notes: H100-equivalent estimates assume H800 ≈ 0.8-1x H100 performance; costs assume ~$2/GPU-hour. No exact V4/Qwen3.6 FLOPs are public; the table uses claims as published. US estimates (e.g., GPT-4 at $100M+) come from prior reporting.

Challenging "More Compute Wins" (2026 Consensus)

Chinese labs qualify the scaling laws: MoE, DSA, and HCA deliver GPT-5 parity at roughly 1/10-1/30 the training and inference cost via sparsity (<5% of total parameters active) plus RL (10%+ of the pre-training budget). Examples: DeepSeek-V3.2 Speciale's gold-level IMO/IOI results from scaled post-training; V4's 1M context at 27% of prior FLOPs. Controls accelerated this trend; US hyperscale CapEx ($600B in 2026) is vulnerable if efficiency matters more than raw FLOPs.[7]

US Strategy Implications: pivot to inference optimization (KV compression, test-time scaling); guard against open-weight distillation risks; compute moats erode, so focus on data moats and agents. Confidence: high on benchmarks/pricing (verified); medium on undisclosed training compute (inferences are marked as such). Further review of the V4 tech report is needed to pin down FLOPs.

Sources:
- [web:60] Fortune on V4 launch/pricing
- [web:64] arXiv DeepSeek-V3.2
- [web:88] HF V4-Pro card
- [web:92] HF V4-Pro
- [web:95] VentureBeat V4 Huawei
- [web:113] IntuitionLabs DeepSeek costs
- [web:116] Stanford DeepSeek report (V3/R1)
- [web:117] Introl V3.2
- [web:143] Qwen3.6-35B-A3B blog