Source Report 1

Research the Thinking Machines Data Science "Interaction Models" release announced in May 2026.

Full research prompt

Research the Thinking Machines Data Science "Interaction Models" release announced in May 2026. What exactly was released, what technical capabilities does it demonstrate, what modalities does it support (text, voice, vision, etc.), and what are the stated goals of the technology? Summarize the key technical specifications, announced features, and any available demos or benchmarks from official announcements, press releases, and credible tech coverage.

From Thinking Machines Latest Models - May 2026

Jon Sinclair using Luminix AI Strategic Research
Key Takeaway from Thinking Machines Latest Models - May 2026

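Two different companies share the Thinking Machines name, and conflating them obscures what each has actually built. Thinking Machines Lab, founded in February 2025 by former OpenAI CTO Mira Murati, is a separate company with its own development trajectory; it is distinct from Thinking Machines Data Science, the name used in the research prompt. The release described below belongs to Thinking Machines Lab.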

Thinking Machines Lab (founded by former OpenAI CTO Mira Murati) announced its first “interaction model,” TML-Interaction-Small, on May 11, 2026, as a limited research preview.[1][2]

This is not an incremental update to existing turn-based LLMs; it is a model trained from scratch so that real-time, full-duplex, multimodal interaction is a core architectural property rather than something added via external scaffolding such as voice-activity detection or separate dialog managers.

Core Release and Model Details

TML-Interaction-Small is a 276-billion-parameter Mixture-of-Experts model with only 12 billion active parameters at inference time. It processes continuous streams of audio, video, and text in 200 ms micro-turn chunks, enabling near-instantaneous perception and generation while delegating deeper reasoning to a parallel asynchronous “background” model that shares context.[1][3]

  • The model uses encoder-free early fusion: audio via dMel embeddings, video via 40×40 hMLP patches, and an audio decoder with a flow-matching head.
  • Inference runs on optimized streaming sessions with persistent GPU sequences and custom MoE kernels (gather + GEMV), upstreamed to SGLang.
  • A two-model split keeps the interaction model lightweight and always-on while the background model handles long-horizon tasks (tool use, search, generative UI) without breaking real-time flow.
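
The "gather + GEMV" pattern named in the inference bullet can be illustrated with a toy single-token MoE layer. This is a minimal sketch under stated assumptions, not the production kernel: the router, the expert count, the top-k value, and every function name are hypothetical, and plain Python lists stand in for GPU tensors.

```python
import math

def gemv(matrix, vector):
    """Matrix-vector product: one dot product per matrix row."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_decode_step(token, router, experts, top_k=2):
    """Route one token to its top_k experts and mix their outputs.

    Hypothetical toy layer: `router` is a (num_experts x dim) matrix and
    `experts` is a list of (dim x dim) matrices.
    """
    # Router scores are themselves a GEMV over the token vector.
    scores = gemv(router, token)
    chosen = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Renormalise the gate weights over the selected experts only.
    gates = softmax([scores[i] for i in chosen])
    # Gather: touch only the selected experts' weights.
    gathered = [experts[i] for i in chosen]
    # GEMV: with a single token, each expert is a matrix-vector product.
    outputs = [gemv(weights, token) for weights in gathered]
    mixed = [sum(g * out[d] for g, out in zip(gates, outputs)) for d in range(len(token))]
    return chosen, mixed

# Illustrative inputs: three 2x2 experts, of which only two are used.
token = [1.0, 0.0]
router = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
experts = [
    [[1.0, 0.0], [0.0, 1.0]],  # identity
    [[0.0, 0.0], [0.0, 0.0]],  # never selected for this token
    [[2.0, 0.0], [0.0, 2.0]],  # scale by 2
]
chosen, mixed = moe_decode_step(token, router, experts, top_k=2)
print(chosen, mixed)
```

The point of the pattern is that a single decoded token never needs a full matrix-matrix multiply: only the selected experts' weights are read, which is why a model with 276B total parameters can run with 12B active.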

What this means for competitors: Any company still bolting real-time features onto a turn-based model will face a widening capability gap as interaction quality and intelligence scale together in a single native architecture.

Supported Modalities and Real-Time Capabilities

The model natively ingests and generates across audio (primary), video/vision, and text simultaneously. It supports full-duplex operation (overlapping speech), interruptions, backchanneling, visual proactivity (reacting to on-screen changes without audio cues), time-awareness (e.g., correctly timing reminders or language switches), and implicit dialog-state tracking (thinking vs. yielding vs. inviting response).[1]

  • No separate VAD or turn-detection layers are required; the model directly tracks speaker intent across modalities in 200 ms chunks.
  • It can speak and listen at the same time (live translation), react to visual events (counting repetitions in video, describing actions in real time), and maintain context across long streams.
  • Output includes both text and synthesized speech; refusals are generated in the appropriate modality.
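
The full-duplex behavior described above can be pictured as two streams advancing together, one micro-turn at a time. The sketch below is a toy illustration only: the hand-written rules stand in for what the report says the model learns implicitly, and `MicroTurn`, `run_session`, the "wait" cue, and the "mm-hm" backchannel are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass

CHUNK_MS = 200  # micro-turn length stated in the report

@dataclass
class MicroTurn:
    start_ms: int
    heard: str   # user stream content in this window ("" means silence)
    spoken: str  # model stream content in this window ("" means silence)

def run_session(user_chunks, plan):
    """Advance both streams one micro-turn at a time, with no turn detector."""
    pending = list(plan)
    turns = []
    for index, heard in enumerate(user_chunks):
        if heard == "wait":
            pending.clear()          # interruption: drop the rest of the utterance
            spoken = ""
        elif pending:
            spoken = pending.pop(0)  # keep talking, even if the user overlaps
        elif heard:
            spoken = "mm-hm"         # backchannel while the user holds the floor
        else:
            spoken = ""
        turns.append(MicroTurn(index * CHUNK_MS, heard, spoken))
    return turns

turns = run_session(
    user_chunks=["", "so", "wait", "actually", ""],
    plan=["the", "answer", "is", "42"],
)
# Count windows in which both parties were producing sound at once.
overlaps = sum(1 for t in turns if t.heard and t.spoken)
for t in turns:
    print(t)
```

Because every window carries both an input and an output slot, overlap, interruption, and silence are ordinary states of the stream rather than events that an external voice-activity detector has to signal.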

What this means for product builders: Interfaces can finally feel like collaborating with another person instead of issuing prompts and waiting. This unlocks fluid creative workflows, live coaching, simultaneous translation, and proactive assistance that traditional systems cannot match without heavy post-processing.

Benchmarks and Demonstrated Performance

On interaction-specific benchmarks, TML-Interaction-Small leads or matches frontier real-time systems while exceeding them on intelligence metrics.

Key results include:
  • FD-bench v1 turn-taking latency (audio): 0.40 s (vs. 1.18 s for GPT-realtime-2.0 minimal)
  • FD-bench v1.5 average interaction quality: 77.8 (vs. ~46–54 for GPT/Gemini live previews)
  • Audio MultiChallenge APR: 43.4% (highest among instant models)
  • FD-bench v3 response quality / Pass@1 (with background agent): 82.8% / 68.0% (best in class)
  • Harmbench refusal rate: 99.0%
  • Internal “interaction-native” benchmarks (TimeSpeak, CueSpeak, RepCount-A, ProactiveVideoQA, Charades) where competing systems score near zero while TML-Interaction-Small achieves meaningful results (e.g., TimeSpeak 64.7 macro-accuracy vs. 4.3).[1][3]
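
To put two of these comparisons on a common footing, the reported figures can be turned into simple ratios. The snippet is a small worked calculation using only numbers quoted in this list; the dictionary layout and the `advantage` helper are illustrative, and nothing here is newly measured.

```python
# Figures copied from the reported results; "baseline" is the comparison
# system quoted alongside each number.
results = {
    "FD-bench v1 latency (s)": {"tml": 0.40, "baseline": 1.18, "lower_is_better": True},
    "TimeSpeak macro-accuracy": {"tml": 64.7, "baseline": 4.3, "lower_is_better": False},
}

def advantage(tml, baseline, lower_is_better):
    """Return how many times better the TML figure is than the baseline."""
    return baseline / tml if lower_is_better else tml / baseline

ratios = {name: advantage(**row) for name, row in results.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

On these figures the latency gap is a little under 3x and the TimeSpeak gap is roughly 15x, which is the sense in which competing systems "score near zero" on the interaction-native tests.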

Demonstration videos (shared in the announcement) show natural interruptions, visual reactions, simultaneous speech, and fluid multi-turn collaboration.

What this means for evaluation: Existing leaderboards focused on turn-based or single-shot tasks will understate the advantage of native interaction models. New benchmarks that measure timing, proactivity, and full-duplex behavior are now essential.

Stated Goals and Strategic Direction

The explicit goal is to eliminate the “collaboration bottleneck” by making AI a true real-time partner rather than a turn-based tool. The team argues that interactivity should scale with intelligence (invoking the “bitter lesson”), allowing humans to stay in the loop through natural conversation, visual cues, and concurrent actions.[1]

  • Wider public release is planned for later in 2026.
  • A limited research preview is opening in the coming months; researchers can request access via interaction@thinkingmachines.ai.
  • The company is also launching a research fellowship program to develop new evaluation standards for interaction models.

What this means for the field: This release marks the beginning of a shift from “chatbot + voice layer” to genuinely conversational, embodied AI systems. Teams that adopt or replicate native interaction architectures will gain a structural edge in any application where timing, presence, and fluid collaboration matter—creative tools, education, live assistance, and real-world agentic workflows.

All quantitative claims above are drawn directly from the May 11, 2026 official announcement and contemporaneous coverage. No public API or open weights are available yet; access remains gated to the research preview.


Recent Findings Supplement (May 2026)

Thinking Machines Lab (Mira Murati’s startup) announced its first model release on May 11, 2026: a research preview of “Interaction Models,” with the debut implementation TML-Interaction-Small.[1]

This is a 276-billion-parameter mixture-of-experts model (12 billion active parameters) trained from scratch to treat real-time, full-duplex interaction as a core architectural feature rather than an add-on harness.[1]

The mechanism is a multi-stream “micro-turn” design that processes continuous 200 ms chunks of audio, video, and text input/output simultaneously.[1]

This enables the model to listen while speaking, interrupt naturally, backchannel, react to visual cues, and maintain time awareness without freezing perception during generation.[1]

Implication: it fundamentally changes the collaboration loop from sequential turn-taking to concurrent human-AI presence, making AI feel like an always-present partner instead of a reactive tool.

  • Official announcement date: May 11, 2026 (blog post titled “Interaction Models: A Scalable Approach to Human-AI Collaboration”).
  • Model name: TML-Interaction-Small (276B MoE / 12B active parameters).
  • Access status: Limited research preview opening in the coming months; wider release planned for later in 2026; researchers can request access via interaction@thinkingmachines.ai.
  • Company context: This is the company's first public model, following roughly $2 billion raised at a $12 billion valuation.
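
The headline numbers imply two quantities that are easy to check by hand. The following is a back-of-the-envelope calculation using only figures stated in this report; the variable names are illustrative.

```python
TOTAL_PARAMS_B = 276   # total parameters, in billions (from the report)
ACTIVE_PARAMS_B = 12   # active parameters per token, in billions (from the report)
MICRO_TURN_MS = 200    # micro-turn length in milliseconds (from the report)

# Share of the weights that participate in any single forward step.
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B

# How often the model must consume input and produce output.
turns_per_second = 1000 // MICRO_TURN_MS
turns_per_minute = 60 * turns_per_second

print(f"active share of weights: {active_fraction:.1%}")
print(f"micro-turns per second: {turns_per_second}")
print(f"micro-turns per minute: {turns_per_minute}")
```

Roughly 4.3% of the weights are active per step, and the model takes five perceive-and-respond steps every second, which is why the sparse design matters for an always-on interaction loop.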

Core Architecture and Supported Modalities

Thinking Machines built an encoder-free early-fusion system where audio (dMel embeddings), video/images (hMLP patches), and text are co-trained directly into the transformer, with a flow-head audio decoder.[1]

This native multimodal design replaces bolted-on voice-activity detection and turn-taking logic, allowing continuous concurrent streams instead of alternating sequences.
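
As a rough picture of the video path, a frame can be cut into fixed-size patches before any embedding is applied. The sketch below assumes that "40×40" refers to non-overlapping 40-by-40-pixel patches, which the report does not state explicitly; the frame size, the `patchify` name, and the row-major flattening are hypothetical, and the hMLP embedding step itself is not shown.

```python
PATCH = 40  # assumed patch edge length in pixels

def patchify(frame, patch=PATCH):
    """Split a 2-D frame (list of pixel rows) into flattened square patches."""
    height, width = len(frame), len(frame[0])
    if height % patch or width % patch:
        raise ValueError("frame dimensions must be multiples of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            block = [row[left:left + patch] for row in frame[top:top + patch]]
            # Flatten row-major so each patch becomes one input vector.
            patches.append([pixel for row in block for pixel in row])
    return patches

# Illustrative 80x120 single-channel frame whose pixel value encodes its position.
frame = [[y * 120 + x for x in range(120)] for y in range(80)]
patches = patchify(frame)
print(len(patches), len(patches[0]))
```

In an encoder-free design, vectors like these would be projected straight into the transformer's input space rather than passed through a separately trained vision encoder.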

  • Modalities: Continuous audio (input/output), video/images (input), and text; supports simultaneous speech, live translation, and visual cue reaction.
  • Key innovation: 200 ms time-aligned micro-turns streamed as persistent GPU sequences with custom low-latency kernels (gather+gemv for MoE, NVLS for deterministic all-reduce).
  • Dual-model setup: Front-end interaction model handles real-time exchange; delegates deeper reasoning/tool use to an asynchronous background model whose outputs interleave back into the conversation.
  • Training stability: Batch-invariant kernels ensure bitwise alignment between trainer and sampler (<5% overhead).
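
The reason batch-invariant kernels matter is that floating-point addition is not associative, so a reduction whose grouping changes with batch size can change the result at the bit level. The toy below illustrates that general fact; the two "kernels" and their split heuristic are hypothetical and are not the company's implementation.

```python
def ordered_sum(values):
    """Strict left-to-right accumulation (avoids any compensated summation)."""
    total = 0.0
    for value in values:
        total += value
    return total

def split_sum(values, split):
    """Reduce two halves separately, then combine, like a split reduction."""
    return ordered_sum(values[:split]) + ordered_sum(values[split:])

def batch_dependent_kernel(row, batch_size):
    # Hypothetical heuristic: the reduction split changes with batch size.
    split = 1 if batch_size == 1 else 2
    return split_sum(row, split)

def batch_invariant_kernel(row, batch_size):
    # Same reduction order regardless of how many rows share the batch.
    return split_sum(row, 2)

row = [0.1, 0.2, 0.3]
sampler_view = batch_dependent_kernel(row, batch_size=1)   # 0.1 + (0.2 + 0.3)
trainer_view = batch_dependent_kernel(row, batch_size=32)  # (0.1 + 0.2) + 0.3
print(sampler_view == trainer_view)
print(batch_invariant_kernel(row, 1) == batch_invariant_kernel(row, 32))
```

With a fixed reduction order, the same row produces identical bits whether it is processed alone during sampling or inside a large training batch, which is the trainer-sampler alignment the bullet describes.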

For competitors or new entrants, the moat is technical rather than data-driven: it lies in the co-trained fusion and streaming inference stack. Adding real-time capabilities to existing turn-based models will require similar from-scratch training or major re-architecture.

Key Capabilities and Interaction Features

The model demonstrates qualitatively new behaviors such as graceful interruptions, visual proactivity (e.g., counting push-ups from video), simultaneous speech handling, and time-aware responses (e.g., breathing reminders at exact intervals).[1]

Mechanism: Speaker-state tracking and backchanneling emerge implicitly from the micro-turn design rather than from separate dialog-management modules.

  • Supports full-duplex conversation: model can speak while user is still talking and vice versa.
  • Proactive responses triggered by visual or audio changes without explicit prompts.
  • Concurrent operations: tool calls, search, and generative UI run while maintaining live interaction.
  • Demos (public videos on the announcement page): real-time code debugging with visual bug detection, live translation with overlap, and contextually timed interruptions.
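
The concurrency described here, where slow work runs in the background while the live exchange continues, can be sketched with Python's `asyncio`. This is an illustrative toy under stated assumptions: `background_model`, `interaction_loop`, the query string, and the tick count are hypothetical, and `asyncio.sleep(0)` merely yields control in place of real model work or a real 200 ms wait.

```python
import asyncio

async def background_model(query):
    """Hypothetical slow reasoner that needs several scheduler turns to finish."""
    for _ in range(3):
        await asyncio.sleep(0)  # yield control, standing in for long-running work
    return f"result for {query!r}"

async def interaction_loop(ticks=6):
    """Emit one output per micro-turn and splice in the background result."""
    job = asyncio.create_task(background_model("flight search"))
    transcript = []
    delivered = False
    for tick in range(ticks):
        if job.done() and not delivered:
            # The background answer is interleaved into the live stream.
            transcript.append(f"tick {tick}: {job.result()}")
            delivered = True
        else:
            # The front-end keeps responding without waiting on the job.
            transcript.append(f"tick {tick}: live chunk")
        await asyncio.sleep(0)  # a real loop would wait for the next chunk
    return transcript

transcript = asyncio.run(interaction_loop())
for line in transcript:
    print(line)
```

The design choice illustrated is that the front-end loop never awaits the background job directly; it only checks whether a result is ready at each micro-turn, so a slow tool call cannot stall the conversation.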

Implication: these behaviors move AI from “assistant that waits its turn” to “collaborator that shares the same temporal space.”

Benchmarks and Performance Results

On interaction-focused benchmarks, TML-Interaction-Small outperforms or matches competing real-time systems while processing input and output in 200 ms micro-turns.[1]

Key results (all from the official May 2026 announcement):

  • FD-bench v1.5 (Audio): 77.8 average (vs. GPT-realtime-2.0 minimal: 46.8).
  • FD-bench v3 (Audio + Tools): 82.8% response quality / 68.0% Pass@1 (best among instant models).
  • Audio MultiChallenge APR: 43.4% (vs. GPT-realtime-2.0: 37.6%).
  • BigBench Audio Accuracy: 75.7% (text mode: 96.5%).
  • IFEval (VoiceBench): 82.1% (text: 89.7%).
  • Harmbench Refusal Rate: 99.0%.
  • QIVD (video + audio streaming): 54.0% accuracy.
  • Internal metrics (TimeSpeak, CueSpeak, RepCount-A, ProactiveVideoQA, Charades) show strong temporal reasoning and visual tracking.
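
Two entries in this list pair an audio score with a text score, which allows a quick look at how much accuracy is lost when the same task is posed by voice. The calculation below assumes the parenthetical text figures are the same model's text-mode results, which the list implies but does not state outright; the names are illustrative.

```python
# (audio score, text score) in percent, copied from the results list.
paired_scores = {
    "BigBench Audio": (75.7, 96.5),
    "IFEval (VoiceBench)": (82.1, 89.7),
}

# Gap in percentage points between text mode and audio mode.
modality_gap = {
    name: round(text - audio, 1) for name, (audio, text) in paired_scores.items()
}
for name, gap in modality_gap.items():
    print(f"{name}: {gap} points lower in audio")
```

On these numbers the audio penalty is about 21 points on BigBench Audio and under 8 points on IFEval, so the gap between modalities varies considerably by task.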

These scores establish a new combined intelligence + responsiveness frontier for “instant” models.

Stated Goals and Roadmap

The explicit goal is to eliminate the “collaboration bottleneck” so that interactivity scales with model intelligence, enabling natural human-AI collaboration with copresence, contemporality, and simultaneity.[1]

Future plans include releasing larger interaction models, improving long-session context, robustness to connectivity delays, safety/alignment for real-time use, and launching a research grant for new interaction benchmarks.

  • No hands-on public demo or enterprise access yet; the demonstrations available so far are the videos on the announcement page.
  • Emphasis on inviting community contributions to interactivity evaluation frameworks.

For anyone building competing systems, the takeaway is clear: the next competitive edge will come from native real-time architectures, not post-hoc scaffolding on existing models.
