Source Report 1

Full research prompt

Research the most recent benchmark performance of open-source models — specifically Kimi (Moonshot AI), Qwen (Alibaba), and GLM (Zhipu AI) — against frontier closed models like GPT-4o, Claude 3.5/3.7, and Gemini 1.5/2.0 on MMLU, MATH, HumanEval, AIME, LiveCodeBench, and GPQA. Pull from official leaderboards (Hugging Face Open LLM Leaderboard, LMSYS Chatbot Arena), model cards, and recent technical reports published in the last 6-8 weeks. Produce a comparison table showing score deltas between open-source and frontier models over time to quantify gap closure.

From Are Open Source Models like Kimi & Qwen and GLM 5.2 closing the gap on the frontier?

Jon Sinclair using Luminix AI Strategic Research
Key Takeaway

Asking whether open source models are closing the gap on frontier systems rests on a flawed premise. The gap between frontier systems and open models like Kimi, Qwen, and GLM 5.2 has fractured into separate performance dimensions rather than narrowing uniformly. Convergence appears in isolated areas while shortfalls persist or widen in others.

Chinese open-weight models from Moonshot (Kimi K2.x series), Alibaba (Qwen3.5/3.7 series), and Zhipu AI (GLM-4.6/4.7/5.x series) have narrowed the performance gap with frontier closed models (GPT-5.x, Claude Opus/Fable/Sonnet 4.x–5.x, Gemini 3.x) to within a few points on many hard benchmarks as of mid-2026, with particular strength in math, coding, and graduate-level reasoning.[1][2]

This narrowing is visible on non-saturated benchmarks like GPQA Diamond, AIME, LiveCodeBench, HLE (Humanity’s Last Exam), and MMLU-Pro, where the best open models now routinely match or exceed older frontier releases and trail the absolute latest closed models by single-digit margins (or less on select tasks). LMSYS Chatbot Arena Elo ratings reflect the same trend, with top open models clustered within ~20–50 points of leaders.[3][4]

The Open LLM Leaderboard (Hugging Face) was archived in early 2025 and is no longer the primary live source; current comparisons draw from provider technical reports, Vellum’s July 2026 open-source leaderboard, LMSYS Arena, and independent aggregates.[5]

Current Standings on Key Benchmarks (Mid-2026)

Data reflect the strongest reported versions (e.g., GLM-5/5.2, Qwen3.5-397B-A17B or Qwen3.7-Max, Kimi K2.5/K2.6). Scores are from provider reports or aggregated evaluations unless noted. Frontier comparators are approximate latest closed-model figures.[1][6]

  • GPQA Diamond (graduate-level science reasoning): GLM-5.2 ~91.2%, Kimi K2.6 ~90.5%, Qwen3.5 variants ~87–88.4%; closed models (e.g., GPT-5.2/Claude/Gemini variants) often 87–92%. Open models competitive or leading on some aggregates.[1][6]
  • MMLU / MMLU-Pro / MMLU-Redux (knowledge & reasoning): Qwen3.5/3.7 series strong (MMLU-Pro ~87.8%, MMLU-Redux ~94.9%); GLM-5 ~85% MMLU / ~70% MMLU-Pro; Kimi K2.5 ~87.1% MMLU-Pro. Closed models frequently 87–90%+ on Pro variants. Qwen often edges or matches on knowledge subsets.[6][7]
  • MATH / AIME (math & contest problems): Kimi K2.5 ~96.1% on AIME 2025; GLM-4.6/5 ~93–98.6% (with tools) on AIME 2025/2026; Qwen3.5 ~91.3% on AIME 2026. Closed models (GPT-5.2 etc.) often 92–100%. Open models frequently within 2–5 points, or better with tool use.[8][9]
  • LiveCodeBench / HumanEval / SWE-Bench (coding): GLM-4.6 ~82.8% on LiveCodeBench v6 (close to closed models in the mid-80s); GLM-5 ~52–77.8% depending on the benchmark variant, with ~77.8% on SWE-bench Verified; Kimi K2 variants competitive (upper 70s to 80%+ on SWE-bench); Qwen3.5 ~83.6% on LiveCodeBench v6. Gaps are narrowing on execution and debugging benchmarks.[9][7]
  • HLE (Humanity’s Last Exam): GLM-5.2 54.7%, Kimi K2.6 54% (top open); closed models vary but open models now in the same ballpark as some frontier entries.[1]
  • LMSYS Chatbot Arena Elo (crowdsourced preference): GLM-5.2 ~1488, Qwen3.7-Max ~1486, Kimi K2.6 ~1466; top closed (Claude/GPT variants) ~1504–1510+. GLM-5.2 and Qwen3.7-Max are effectively tied, with Kimi K2.6 about 20 points behind.[2][3]

HumanEval is largely saturated and less differentiating today; LiveCodeBench and SWE-Bench are the active coding proxies.

Gap Closure Over Time (Qualitative Trend, 2025–Mid-2026)

Exact month-by-month deltas are sparse in public aggregates, but the trajectory is clear from successive releases and leaderboards:
- Early/mid-2025: Open models (earlier Qwen2.5/GLM-4/Kimi K2) trailed frontier closed models by 5–15+ points on GPQA/MMLU-Pro/AIME and larger margins on Arena Elo or HLE.
- Late 2025–early 2026: Releases like Kimi K2.5, Qwen3.5, GLM-5 narrowed this to 2–8 points on most hard benchmarks, with occasional leads on math/coding subsets (e.g., Kimi AIME scores, GLM coding). Arena Elo gaps for top open labs shrank to single digits among themselves and ~20–50 vs. closed leaders.[8][6]
- Mid-2026 (current): Further compression, with open models within 1–5 points or in statistical ties on GPQA, select AIME/math tasks, and coding; HLE and Arena show open models at or near the frontier pack. Chinese labs’ MoE architectures, data scaling, and post-training (especially code/math synthetics and tool use) drive much of the catch-up.[1][9]

Deltas have compressed most on reasoning/math/coding (often <5 points) and remain larger on some multimodal or long-tail knowledge tasks, though open models continue closing there too.

Implications for Competition and Entry

Open models now deliver near-frontier performance at lower inference cost (cheap APIs, or self-hosting with no per-token fees), making them attractive for cost-sensitive or private deployments. Frontier closed models retain edges in consistency, agentic reliability, and certain creative/multimodal workflows, plus easier scaling via APIs.[1]

To compete or enter this space:
- Focus on non-saturated benchmarks (GPQA, HLE, LiveCodeBench, AIME variants) and real-world agentic/coding evals rather than saturated classics like basic MMLU or HumanEval.
- Leverage MoE efficiency, specialized post-training on math/code, and long context/tool integration—the mechanisms behind recent Chinese open-model gains.
- Monitor LMSYS Arena and provider reports for the fastest-moving signal; Vellum-style aggregates help track open-only progress.[1]
- Gap closure accelerates commoditization pressure on closed APIs for many workloads, favoring hybrid strategies (open base + proprietary fine-tuning or routing).
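
The hybrid routing idea in the last point can be sketched as a small dispatcher. This is only an illustration under stated assumptions: the task categories, the backend labels, and the `choose_backend` function are hypothetical, and the split simply mirrors this report's qualitative claims (open models strong on math, coding, and reasoning; closed models ahead on multimodal, creative, and agentic reliability).

```python
from dataclasses import dataclass

# Hypothetical task categories, chosen to mirror the report's qualitative claims.
OPEN_STRONG = {"math", "coding", "reasoning"}
CLOSED_EDGE = {"multimodal", "creative", "long_horizon_agent"}


@dataclass(frozen=True)
class Route:
    backend: str  # placeholder labels: "open_self_hosted" or "closed_api"
    reason: str


def choose_backend(task_type: str, data_is_private: bool = False) -> Route:
    """Pick a backend for one request; the rules are illustrative, not measured."""
    if data_is_private:
        # Private deployments are one of the report's stated reasons to self-host.
        return Route("open_self_hosted", "private data stays on self-hosted weights")
    if task_type in CLOSED_EDGE:
        return Route("closed_api", "closed models reported to retain an edge here")
    if task_type in OPEN_STRONG:
        return Route("open_self_hosted", "open models reported near parity at lower cost")
    # Assumption: with no known closed-model edge, prefer the cheaper option.
    return Route("open_self_hosted", "no known closed-model edge; prefer lower cost")


for task in ("coding", "multimodal", "summarization"):
    print(task, "->", choose_backend(task))
```

In practice the routing rules would come from an organization's own evaluations on its workloads rather than from public benchmark summaries.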

These trends are based on the latest available July 2026 snapshots; individual model cards and fresh Arena votes provide the most current verification.


Recent Findings Supplement (July 2026)

Alibaba (Qwen), Zhipu AI (GLM), and Moonshot AI (Kimi) released major updates to their open-weight models in Feb–June 2026, placing their flagship variants within ~20–50 Elo points of leading proprietary models on LMSYS/LMArena and within a few points (or occasionally ahead) on key academic/coding benchmarks.[1][2]

This reflects continued rapid gap closure on MMLU-Pro, GPQA, LiveCodeBench, SWE-bench Verified, and AIME-style math, driven by scaled post-training/RL, MoE efficiency, and agentic/tool-use focus—rather than raw parameter count. Proprietary leaders (Claude Opus/Fable variants, GPT-5.x series, Gemini 3.x) still hold the overall Arena crown (~1500+ Elo), but the open Chinese trio now consistently ranks in the global top 20–30 and dominates the open-source category.[3]

LMSYS/LMArena (Human Preference Elo) – July 2026 Snapshot

Qwen3.7-Max, GLM-5.2, and Kimi K2.6/2.7 variants sit at 1466–1488 Elo (open leaders or near-leaders), trailing top proprietary entries by ~20–40 points but far closer than earlier 2025 gaps.[2][1]

  • Qwen3.7-Max: ~1488 Elo (rank ~20 overall; May 2026).
  • GLM-5.2: ~1488 Elo (open leader in some snapshots; June 2026 release).
  • Kimi K2.6: ~1466 Elo (Apr 2026); Kimi K2.7 Code variant also prominent in coding/agent subsets (Jun 2026).
  • Context: Top proprietary (Claude Fable 5 / Opus 4.8 / GPT-5.5-high) at 1506–1510+; open models now routinely place inside the global top tier on blind battles.[2]

Implication for competitors: Arena Elo gains for these models come from strong agentic/coding performance; entrants must match long-horizon tool use and real-user preference signals, not just static benchmarks.
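
To put the Elo gaps quoted in this section in more intuitive terms, they can be converted to an expected head-to-head preference rate. The sketch below assumes the conventional Elo logistic curve (base 10, scale 400) applies to these ratings; the function name `elo_win_probability` is illustrative, and the ratings are the approximate figures quoted in this section.

```python
def elo_win_probability(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Approximate ratings as quoted in this section (July 2026 snapshot).
open_models = {"GLM-5.2": 1488, "Qwen3.7-Max": 1488, "Kimi K2.6": 1466}
closed_low, closed_high = 1506, 1510  # quoted range for top proprietary entries

for name, rating in open_models.items():
    p_vs_low = elo_win_probability(rating, closed_low)
    p_vs_high = elo_win_probability(rating, closed_high)
    print(
        f"{name}: gap {closed_low - rating}-{closed_high - rating} pts, "
        f"expected preference rate {p_vs_high:.1%}-{p_vs_low:.1%}"
    )
```

Under this assumption a 20-point gap corresponds to the lower-rated model being preferred in roughly 47% of head-to-head battles, and a 44-point gap to roughly 44%, so the quoted gaps are real but modest in per-battle terms.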

Qwen Releases (Feb–May 2026)

Qwen3.5 (Feb 15, 2026; open-weight 397B-A17B MoE) and Qwen3.6-Plus (Apr 1, 2026 API) show iterative gains in multimodal/agentic capabilities; Qwen3.7-Max followed in May.[4][5]

  • GPQA: Qwen3.5-397B-A17B at 88.4; Qwen3.6-Plus at 90.4; Qwen3.7-Max at 92.4 on GPQA Diamond (exceeds some Claude 4.6 Opus reports at ~91.3).[6]
  • MMLU-Pro / MMLU-Redux: Qwen3.5 at 87.8 / 94.9; Qwen3.6-Plus at 88.5 / 94.5; Qwen3.7-Max ~89.6 (near parity with top closed models ~89.5–89.8).
  • LiveCodeBench v6: Qwen3.5 at 83.6; Qwen3.6-Plus at 87.1.
  • SWE-bench Verified: Qwen3.5 at 76.4; Qwen3.6-Plus at 78.8 (vs. Claude Opus 4.5 ~80.9).
  • AIME26 / MATH-related: Qwen3.5 at 91.3 on AIME26; strong MATH-500 leadership reported for the series.
  • Additional: Terminal-Bench 2.0 leadership for Qwen3.6-Plus at 61.6.

Implication: Qwen’s MoE + RL scaling on agent environments has closed the coding/agent gap fastest among the trio; Apache-2.0 licensing on many variants lowers barriers for fine-tuning/competition.
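
The release-over-release gains implied by the Qwen figures in this section can be computed directly. The sketch below uses only scores quoted above; the names `QWEN_SCORES`, `RELEASE_ORDER`, and `successive_gains` are illustrative, and the scores come from different reports, so evaluation settings may not match exactly.

```python
RELEASE_ORDER = ["Qwen3.5", "Qwen3.6-Plus", "Qwen3.7-Max"]

# Scores as quoted in this section. The ~89.6 figure for Qwen3.7-Max is left out
# because the report does not say which MMLU variant it refers to.
QWEN_SCORES = {
    "GPQA": {"Qwen3.5": 88.4, "Qwen3.6-Plus": 90.4, "Qwen3.7-Max": 92.4},
    "MMLU-Pro": {"Qwen3.5": 87.8, "Qwen3.6-Plus": 88.5},
    "LiveCodeBench v6": {"Qwen3.5": 83.6, "Qwen3.6-Plus": 87.1},
    "SWE-bench Verified": {"Qwen3.5": 76.4, "Qwen3.6-Plus": 78.8},
}


def successive_gains(scores, order=RELEASE_ORDER):
    """Return (previous, current, gain) for each consecutive pair of scored releases."""
    present = [(name, scores[name]) for name in order if name in scores]
    return [
        (prev, cur, round(cur_score - prev_score, 1))
        for (prev, prev_score), (cur, cur_score) in zip(present, present[1:])
    ]


for benchmark, scores in QWEN_SCORES.items():
    for prev, cur, gain in successive_gains(scores):
        print(f"{benchmark}: {prev} -> {cur}: {gain:+.1f}")
```

On these figures the per-step gains range from 0.7 points (MMLU-Pro) to 3.5 points (LiveCodeBench v6), with the smallest movement on the benchmark closest to saturation.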

GLM Releases (Zhipu/Z.ai, Early–Mid 2026)

GLM-5 (Feb 2026 technical report) and GLM-5.1/5.2 (Mar–Jun 2026) emphasize agentic engineering and coding; GLM-5.2 notably leads certain design/arena subsets.[7][8]

  • SWE-bench Verified: GLM-5 at ~77.8% (approaching Claude Opus 4.5 ~80.9%).
  • GPQA / MMLU-Pro: High 80s on GPQA variants; MMLU-Pro in the mid-80s (e.g., GLM-5.2 ~84–86.7%).
  • Arena: GLM-5.2 at ~1488 Elo; strong on coding/design preference battles.
  • Other: Significant gains on HLE, Terminal-Bench, and cybersecurity coding vs. GLM-4.7 predecessor; open-weights MIT license on recent variants.

Implication: GLM’s strength in practical software engineering and tool-calling makes it a direct competitor for agentic workflows; open weights accelerate community iteration.

Kimi Releases (Moonshot, Apr–Jun 2026)

Kimi K2.6 (Apr) and K2.7 Code (Jun) variants target frontier quality at lower cost with agentic/coding focus.[2]

  • Arena Elo: Kimi K2.6 ~1466; K2.7 Code competitive in coding subsets.
  • Coding/Agentic: Frequently top-2–3 among open models on SWE-bench, LiveCodeBench, and tool-use; often paired with Qwen/GLM as the leading open trio.
  • Other: Strong long-context and agentic workflows noted in comparisons.

Implication: Moonshot’s efficiency focus (lower pricing, strong coding) pressures cost-sensitive segments; open variants enable self-hosting competition.

Recent reports (early–mid 2026) note that the MMLU gap between top open (including these Chinese models) and closed frontier models has effectively reached zero on standard MMLU, with open models leading or tying on select MATH/AIME and competitive on GPQA Diamond.[9] On coding (SWE-bench Verified, LiveCodeBench), the gap has narrowed to single-digit percentages in many cases. No single source provides a full multi-benchmark delta table spanning 2025–2026, but the pattern across Qwen/GLM/Kimi releases shows consistent 2–5+ point gains per iteration on STEM/coding metrics, sufficient to place them inside the global frontier envelope on several axes.
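
Although no single source provides a full delta table, a partial one can be assembled from the pairs where this report quotes both an open score and a named closed reference. The sketch below does only that; the `Comparison` type and the helper names are illustrative, the underlying figures are approximate, and the table is not a substitute for a controlled head-to-head evaluation.

```python
from typing import NamedTuple


class Comparison(NamedTuple):
    benchmark: str
    open_model: str
    open_score: float
    closed_model: str
    closed_score: float


# Only pairs quoted in this report; scores are approximate percentages.
COMPARISONS = [
    Comparison("SWE-bench Verified", "Qwen3.5", 76.4, "Claude Opus 4.5", 80.9),
    Comparison("SWE-bench Verified", "GLM-5", 77.8, "Claude Opus 4.5", 80.9),
    Comparison("SWE-bench Verified", "Qwen3.6-Plus", 78.8, "Claude Opus 4.5", 80.9),
    Comparison("GPQA Diamond", "Qwen3.7-Max", 92.4, "Claude 4.6 Opus", 91.3),
]


def delta(c: Comparison) -> float:
    """Open minus closed, in percentage points; negative means the open model trails."""
    return round(c.open_score - c.closed_score, 1)


def format_table(comparisons) -> str:
    """Render the comparisons as a fixed-width plain-text table."""
    lines = [f"{'Benchmark':<20}{'Open model':<14}{'Open':>6}{'Closed':>8}{'Delta':>7}"]
    for c in comparisons:
        lines.append(
            f"{c.benchmark:<20}{c.open_model:<14}"
            f"{c.open_score:>6.1f}{c.closed_score:>8.1f}{delta(c):>+7.1f}"
        )
    return "\n".join(lines)


print(format_table(COMPARISONS))
```

On these four pairs the deltas run from -4.5 to +1.1 points, which is consistent with the single-digit gaps described above.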

Implication for new entrants or competitors: Static benchmark chasing is insufficient—success now requires matching agentic RL scaling, long-context tool reliability, and real-user preference (Arena) signals. Chinese open models have set a high bar via efficient MoE architectures and open licensing; Western players must differentiate on safety, integration, or specialized domains to compete. Data current as of early July 2026; leaderboards evolve daily with new votes/releases.
