Claude Fable 5 (the generally available “safe” configuration of Anthropic’s new Mythos-class model) launched on June 9, 2026, giving practitioners immediate hands-on access. Early testers—developers, indie creators, evaluators, and analysts—quickly shared detailed accounts via X, Reddit, blogs, and video demos of building complex games, emulators, simulations, websites, and multi-hour autonomous workflows.[1][2]

These firsthand reports focus on capabilities (excluding all mentions of pricing tiers, usage caps, or safety classifiers/guardrails). Dominant sentiment is strongly positive on raw power for ambitious work, with mixed notes on workflow friction from its intensity; outright skepticism about capabilities is rare and limited to edge cases or comparisons showing it is not uniformly superior on every trivial task.

Agentic, Long-Horizon Coding and Autonomous Execution

Practitioners consistently highlight Fable 5’s ability to handle multi-hour, multi-file projects with planning, iteration, self-testing, and bug-fixing with minimal human intervention—often described as shifting from “prompt and tweak” to “assign and return later.”

- One user gave a single detailed prompt (PRD + goal) for the “best game” and, after ~5 hours of autonomous work, received a complete ink-wash calligraphy defense roguelike with skills, boss fights, end-game elements, custom art assets, and guqin music; the model independently launched Playwright, took screenshots for testing, caught bugs, and fixed them.[3]
- A reviewer prompted a full production-quality animated website; after 40 minutes of uninterrupted building, Fable 5 produced results described as a “completely different league” and “next level” versus Opus 4.8, which finished the same prompt in 23 minutes but yielded noticeably inferior output.[4]
- Another built a playable 3D game from one prompt while stepping away for 4 hours, returning to finished work.[5]
- Early GitLab Duo testers reported single-pass implementations of complex systems that previously required days of iteration with prior models.[6]

This implies that, to compete on agentic workflows, competitors must match not just benchmark scores but also reliable long-context reasoning and tool-use loops (code execution, screenshot analysis, self-editing).

Creative Generation, Games, Simulations, and Visuals

Fable 5 excels at producing visually rich, playable experiences and creative assets in one or few shots, often generating its own art, music, physics, and rendering systems.

- One tester built a full 3D Homelander Simulator in 5 prompts and 1 hour, with the model sourcing and integrating 3D models autonomously.[7]
- Another completed a browser-based ray-tracing game with reflections, shaders, camera, and material systems via an agentic Claude Code workflow.[8]
- In repeated tests against other models, Fable 5 delivered the most complete realistic 3D water-globe simulation the tester had seen (lemon trees, gravity, hidden Easter eggs, atmospheric sound, UI).[9]
- It also nailed a tough visual challenge, simulating fluid ink melting with expressive, playable results.[10]

For entrants, this suggests that strong vision and generative capabilities, alongside coding, are now table stakes for game, simulation, and demo use cases.

Benchmark and Large-Scale Project Performance

Testers and evaluators note state-of-the-art or record results on complex, long-running tasks.

- Fable 5 achieved 74.5% on GBA Eval (best to date); it wrote an emulator that played nearly all games in the test set near-perfectly in under 2 hours (versus Opus 4.8’s 24-hour score).[11]
- Stripe reportedly completed a full migration on a 50-million-line codebase in one day (human team estimate: 2+ months).[12]
- Testers cited strong vision-only performance (e.g., playing through Pokémon FireRed with minimal harness).[13]
- Analyst Ethan Mollick found it outperformed every prior public model by a considerable margin across experiments, sustaining work on multi-page specs for up to a dozen hours with “startling results.”[14]

Competitors need transparent long-horizon evals (not just short-context benchmarks) and real-world multi-file/agent demonstrations to match credibility.

Practitioner Sentiment and Workflow Shifts

Overall tone from launch-day users is excited and impressed, with phrases like “beast,” “wild,” “seismic shift,” and “makes GPT 5.5 feel like a toy.” Simon Willison called it a model where “the challenge is finding tasks that it can’t do.”[15] Mixed notes center on it feeling like a different category of tool—more autonomous “artist + engineer” than pure coder—rather than incremental improvement.[3]

Skepticism is minimal on capabilities themselves; some note it shines most on hard, well-specified problems and may feel like overkill for routine tasks.

For new entrants or incumbents: focus messaging and demos on transformative agentic use cases rather than raw chat quality; integration with IDEs, sandboxes, and testing tools will be key differentiators.

Non-Safety Limitations Cited by Testers

Setting aside the excluded topics, users noted high token consumption during long autonomous sessions, along with occasional slowness or higher inference cost for the depth of work. Some observed it defaulting or underperforming on very basic tasks compared to lighter models. These complaints appear secondary to the dominant praise for frontier capabilities.

In summary, early Fable 5 testing reveals a model optimized for ambitious, extended autonomous projects in coding, games, and simulation—delivering results that feel qualitatively different from prior public models. Practitioners are rapidly integrating it into workflows where long-horizon agentic behavior provides the biggest edge. Competitors will need comparable sustained reasoning depth, vision/generation quality, and tool orchestration to keep pace.

Recent Findings Supplement (June 2026)

Claude Fable 5 (Anthropic’s first generally available Mythos-class model) launched on June 9, 2026, with pre-release early access granted to select customers and partners.[1]

This marks the primary recent development in the “Claude with Fable” space. All cited feedback and reactions below derive from testing in the days immediately before, on, or after launch (June 2026 sources only). The search results contain no information on this model dated after 12/11/2025 but before June 2026.

Official Early-Access Tester Feedback (Anthropic Announcement)

Anthropic published direct quotes from customers who tested Fable 5 prior to the June 9 general release. These emphasize capability leaps in specific domains.[1]

- Cursor reported it as state-of-the-art on CursorBench and noted it unlocked long-horizon problems previously out of reach.
- GitHub testers highlighted superior autonomy and reliability on complex, long-horizon coding tasks versus prior benchmarks.
- Another partner called the results the strongest of any Claude model tested, citing clear progress on agentic coding and prototyping.

Insight: These accounts position Fable 5 as qualitatively advancing beyond Opus-tier models in sustained, multi-step workflows rather than incremental benchmark gains. For competitors, this signals that matching “long-horizon autonomy” will require comparable scale or architectural shifts, not just fine-tuning.

Prominent Practitioner Reactions (Karpathy, Willison, Others)

High-profile early or launch-day testers provided qualitative assessments alongside benchmarks.[2][3]

- Andrej Karpathy described a “major-version-bump-deserving step change forward” (comparable to Claude 4.5), especially for ambitious long problem-solving sessions where the model “gets it” and proceeds without excessive hand-holding.
- Simon Willison (after ~5.5 hours of post-launch testing) called it a “beast” that handled every task attempted, with the main challenge being identifying limits; he noted it is slow and expensive.
- Harvey’s internal BigLaw Bench evaluation yielded a new high of 93.4% for the Anthropic family.[4]

Insight: The dominant mechanism cited is improved self-direction and context retention over extended interactions. New entrants or rivals must prioritize reliable multi-turn/tool-use chains to compete on the tasks these users highlight.

Sentiment on Reddit, Blogs, and Tech Commentary

Early threads and reviews (June 9–10, 2026) show overwhelmingly positive sentiment on raw capabilities, tempered by operational notes.[5]

- Coding and agentic users frequently praised better self-verification, tool use, efficient token consumption in some workflows, and willingness to produce complete projects or solutions rather than outlines.
- Common theme: It feels like a tier above prior Claude models for difficult, sustained problems (e.g., large codebase migrations completed in a day per some reports; 80.3% on SWE-Bench Pro).[2]
- Mixed notes center on speed (slower inference) and high resource demands (rapid plan exhaustion in heavy sessions).

Insight: Positive sentiment clusters around frontier-level agentic performance; skepticism is minimal on core intelligence and focuses on usability friction unrelated to gating or guardrails. This creates a narrow window where demonstrated long-horizon wins can drive adoption before alternatives catch up.

Strengths and Weaknesses Most Frequently Cited in Firsthand Accounts

Strengths (recurring across sources):
- Exceptional handling of long-horizon, autonomous coding and knowledge-work tasks.
- Qualitative “gets it” leap enabling more ambitious prompts with less intervention.
- Strong benchmark leadership and real-world project completion (e.g., full migrations, complete prototypes).

Weaknesses (non-excluded categories):
- Noticeably slower response times.
- Higher token usage/cost leading to faster consumption of limits in intensive sessions.

Insight: The model’s edge stems from sustained reasoning depth rather than speed or efficiency. Competitors entering this space should target either faster/cheaper equivalents or specialized optimizations for the same long-running workflows.

Implications for the Broader Landscape

Fable 5’s split release (general safe version + restricted Mythos 5 for vetted partners) and same-day Copilot integration underscore Anthropic’s strategy of tiered access for high-capability models.[6][7] Early feedback indicates rapid community testing and integration interest. For anyone building or competing, the bar for “state-of-the-art agentic coding” has shifted measurably upward as of June 9, 2026; catching up will likely require either similar-scale models or differentiated strengths in speed/cost. All data above is from sources published June 9–11, 2026.

