Map the competitive landscape of AI interaction and conversational AI model providers as of mid-2026, including OpenAI, Google…

Thinking Machines Lab (founded by ex-OpenAI CTO Mira Murati) set out its core thesis on May 11, 2026: interaction must be native to the model architecture, not bolted on via voice-activity detection (VAD), turn segmentation, or external harnesses. This directly challenges the pipeline approach used by nearly every other player. The company’s TML-Interaction-Small (a 276B-parameter MoE with 12B active parameters) processes continuous streams of audio, video, and text in 200 ms micro-turns using an encoder-free, early-fusion design. This enables full-duplex behavior: the model can listen while speaking and react to interruptions, tone shifts, or visual cues in real time. A separate asynchronous “background” model handles deep reasoning and tool use, streaming results back into the live interaction context.[1]

This is the first publicly announced architecture that tokenizes time itself as a first-class input, making responsiveness and natural flow scale with intelligence rather than being capped by scaffolding overhead. A research preview is planned for the coming months, with wider release later in 2026.

1. The Broader Conversational & Interaction AI Landscape (Mid-2026)

The market has split into two camps: (1) frontier-scale models with improving real-time voice/multimodal layers, and (2) specialized voice or efficiency players. No other company has yet announced an end-to-end trained “interaction model” that treats micro-turn concurrency as a core architectural primitive.

- OpenAI dominates consumer and developer mindshare with GPT-Realtime-2 / gpt-realtime (Realtime API). It delivers sub-second latency, strong tool calling, interruptions, and expressive speech, but still relies on optimized pipelines rather than native micro-turn streams from a single unified model.[2]
- Google DeepMind is the closest technical peer via Gemini 3.1 Flash Live and the Multimodal Live API (launched March 2026). It supports continuous audio/video/text streams in persistent sessions with low-latency interruptions and vision input.[3]
- Anthropic remains strongest in text/agentic reasoning (Claude Opus 4.7) but offers only limited voice features (dictation and emerging Code Voice) with no leading real-time multimodal interaction product.[4]
- ElevenLabs has pivoted into full conversational agents (ElevenAgents / v3 Conversational, early 2026) with high-fidelity TTS, turn-taking, and speculative generation for perceived low latency, but it layers on top of underlying LLMs rather than replacing the interaction stack.[5]
- Cohere and Mistral focus on efficient enterprise text/conversational models (Command R+, Mistral 3) with strong RAG and open-weight options, but place little emphasis on native real-time multimodality.

Valuations reflect this split: OpenAI (~$850B), Anthropic (~$380B), Google DeepMind (Alphabet scale), with smaller players like Mistral (~$10B+) and ElevenLabs trailing.[6]

2. How Thinking Machines Compares Technically

TM’s approach is distinguished by three mechanisms absent or only approximated elsewhere:

- Native time tokenization: 200 ms micro-turns allow simultaneous input processing and output generation without waiting for turn boundaries.
- Dual-model decoupling: The lightweight interaction model maintains presence and flow; the background model runs heavy reasoning asynchronously and injects results contextually.
- From-scratch training for interaction: Audio enters as dMel features, video as patches; everything is early-fused without separate STT/TTS stages.
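
The first two mechanisms can be illustrated with a toy simulation. This is a minimal sketch, not the company's implementation: the `Session`, `BackgroundJob`, `step`, and `run` names, the string frames, and the three-tick reasoning delay are all hypothetical. Only the 200 ms micro-turn length comes from the description above. Each tick ingests input and emits output in the same step, while slow work is handed off and injected into the live context once it is ready.

```python
from dataclasses import dataclass, field

MICRO_TURN_MS = 200  # micro-turn length stated in the text


@dataclass
class BackgroundJob:
    query: str
    ready_at: int  # tick at which the simulated result becomes available


@dataclass
class Session:
    context: list = field(default_factory=list)  # live interaction context
    jobs: list = field(default_factory=list)     # in-flight background work
    emitted: list = field(default_factory=list)  # (time_ms, output) pairs


def step(session: Session, tick: int, frame: str, reasoning_ticks: int = 3) -> None:
    """Process one micro-turn: ingest input, inject results, emit output."""
    # 1. Ingest this tick's input without waiting for an end-of-turn signal.
    session.context.append(("in", frame))

    # 2. Hand heavy requests to the (simulated) background model.
    #    The "ask:" prefix and the fixed delay are illustrative assumptions.
    if frame.startswith("ask:"):
        session.jobs.append(BackgroundJob(frame[4:], tick + reasoning_ticks))

    # 3. Inject any background results that are ready into the live context.
    done = [j for j in session.jobs if j.ready_at <= tick]
    session.jobs = [j for j in session.jobs if j.ready_at > tick]
    for job in done:
        session.context.append(("bg", job.query))

    # 4. Emit output in the same tick, so input and output overlap in time.
    if done:
        out = "result:" + ",".join(j.query for j in done)
    elif frame == "<interrupt>":
        out = "<yield>"  # react to an interruption immediately
    else:
        out = "<ack>"    # keep presence while nothing else is pending
    session.emitted.append((tick * MICRO_TURN_MS, out))


def run(frames) -> Session:
    """Drive the loop over a sequence of per-tick input frames."""
    session = Session()
    for tick, frame in enumerate(frames):
        step(session, tick, frame)
    return session


if __name__ == "__main__":
    s = run(["hi", "ask:weather", "uh", "<interrupt>", "ok"])
    for time_ms, out in s.emitted:
        print(time_ms, out)
```

In this toy run the loop keeps responding on every tick, yields on the interruption at 600 ms, and delivers the background result at 800 ms without ever pausing the interaction.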

Benchmarks cited by the company show TML-Interaction-Small competitive with or beating larger models on combined intelligence + responsiveness while achieving ~0.40 s response latency.[7]

Implications for competitors: Existing real-time systems (OpenAI Realtime API, Gemini Live) can add incremental improvements (better VAD, speculative decoding, persistent KV cache), but they cannot retroactively achieve the same end-to-end optimization without rebuilding the core training objective.
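
To see why turn-detection scaffolding imposes a latency floor, consider a simple energy-based VAD endpointer. This is a generic textbook-style sketch, not any vendor's pipeline: the function name `endpoint_delay_ms`, the 20 ms frame size, the energy threshold, and the 25-frame hangover are all assumed values chosen for illustration. The point is that the system cannot start responding until it has observed enough trailing silence to declare the turn over.

```python
def endpoint_delay_ms(energies, frame_ms=20, threshold=0.1, hangover_frames=25):
    """Return (speech_end_ms, endpoint_ms) for the first detected turn, or None.

    energies: per-frame energy values (hypothetical, normalized to 0..1).
    A turn ends only after `hangover_frames` consecutive frames fall
    below `threshold` following at least one speech frame.
    """
    last_speech = None  # index of the most recent frame classified as speech
    silent_run = 0      # consecutive silent frames since the last speech frame
    for i, energy in enumerate(energies):
        if energy >= threshold:
            last_speech = i
            silent_run = 0
        elif last_speech is not None:
            silent_run += 1
            if silent_run >= hangover_frames:
                speech_end = (last_speech + 1) * frame_ms
                endpoint = (i + 1) * frame_ms
                return speech_end, endpoint
    return None  # not enough trailing silence to close the turn


if __name__ == "__main__":
    # One second of speech followed by 0.8 s of silence.
    result = endpoint_delay_ms([0.5] * 50 + [0.0] * 40)
    if result is not None:
        speech_end, endpoint = result
        print("added delay (ms):", endpoint - speech_end)
```

With these assumed settings, the turn is declared 500 ms after the speaker stops, before any model inference has begun. Shortening the hangover reduces the delay but risks cutting in on mid-sentence pauses, which is the trade-off a micro-turn design is meant to avoid.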

3. Asian Labs and Regional Dynamics

Chinese labs lead on cost-efficiency and open-source momentum, creating a parallel track to Western closed models:

- DeepSeek (V4 series) and Alibaba Qwen (Qwen 3/3.5/Omni) deliver frontier-level reasoning and multimodal capabilities at dramatically lower inference cost. Qwen-Omni explicitly supports real-time voice/video interactions (“see, hear, talk”).[8]
- Tencent, Baidu, and ByteDance integrate conversational agents deeply into super-apps (WeChat, etc.) and are driving rapid agentic adoption (OpenClaw-style systems).[9]

These labs prioritize open weights (where possible) and extreme efficiency, making them attractive for on-device or high-volume deployments, but they generally follow the same turn-based or lightly scaffolded interaction patterns as Western counterparts.

4. Are Other Companies Pursuing Similar “Interaction Model” Architectures?

Not as far as public sources show. Searches across announcements, papers, and coverage through mid-May 2026 reveal no other lab using native micro-turn time tokenization or an explicit dual interaction/background model trained from scratch for full-duplex multimodality.

Closest approximations:
- Google’s Gemini Live API (continuous streams).
- OpenAI’s Realtime API (optimized speech-to-speech).
- ElevenLabs v3 Conversational (speculative turn-taking).

None treat interaction as a first-class architectural primitive the way TM does.

5. Differentiators, Overlaps, and Thinking Machines’ Position

Overlaps: Every major player now offers voice/multimodal conversation, tool use, and agentic capabilities. Latency, naturalness, and interruption handling are the shared battlegrounds.

TM Differentiators:
- True full-duplex without scaffolding overhead.
- Explicit separation of real-time presence from deep reasoning (scalable to complex agents).
- Early-mover claim on “interaction models” as a new model class.

Commercial Positioning:
- TM is a pure startup with ex-OpenAI talent and a focused thesis; it lacks the distribution, data moats, or infrastructure of OpenAI/Google/Anthropic.
- Target niche: premium natural-interaction experiences (therapy, education, creative collaboration, embodied robotics) where 200 ms responsiveness and visual awareness create defensible UX advantages.
- Risk: execution on scaling the dual-model system and proving real-world benchmarks beyond demos.

Implications for new entrants or incumbents:
- Incumbents must either acquire or replicate native interaction architectures (costly) or continue layering improvements.
- New entrants can target narrow high-value verticals (e.g., real-time coaching, accessibility) where TM’s approach provides immediate differentiation before the giants catch up.
- The next 12–18 months will determine whether “interaction models” become a standard category or remain a TM-specific advantage.

Recent Findings Supplement (May 2026)

Thinking Machines Lab (founded by former OpenAI CTO Mira Murati) announced on May 11, 2026, a research preview of “interaction models”: a native full-duplex architecture that processes audio, video, and text in continuous 200-millisecond micro-turns rather than relying on external turn-detection scaffolding.[1]

This dual-system design pairs a lightweight, always-on Interaction Model (TML-Interaction-Small: 276B-parameter MoE with 12B active parameters) for real-time presence and responsiveness (0.40-second audio latency) with an asynchronous Background Model for deep reasoning and tool use. The result is true simultaneity: the AI can listen, speak, interject, react to visual cues (e.g., whiteboard writing or someone entering frame), and call tools mid-stream without artificial pauses.[2][3]

Company-reported benchmarks show higher scores on interaction-specific metrics (FD-bench v1.5 audio: 77.8; FD-bench v3 Response Quality: 82.8%; ProactiveVideoQA and the new TimeSpeak/CueSpeak tasks, where prior models score near zero). The model remains competitive on intelligence benchmarks against GPT-realtime-2.0 (0.59s latency), Gemini-3.1-flash-live-preview, and Qwen variants.[1]
A limited research preview is available to partners now; wider release and larger models are planned for later in 2026.[4]

This positions Thinking Machines in a distinct technical niche—scaling native interactivity alongside intelligence—while still pre-product commercially (research-stage only, with $2B prior seed funding).

OpenAI advanced its realtime voice capabilities on May 7, 2026, with three new API models (GPT Realtime 2, Realtime Translate, and streaming transcription) that enable reasoning, translation, and transcription as users speak.[5][6]

These build on earlier Realtime API work but still rely on VAD-based partial overlap rather than fully native full-duplex processing.

Latency is reported at ~0.59s in direct comparisons. The models offer strong tool calling and multi-language support but score lower on pure interactivity benchmarks than Thinking Machines’ preview model.[3]
Commercial availability through the API gives OpenAI immediate product reach that Thinking Machines currently lacks.

Google DeepMind’s Gemini realtime/live-preview models (Gemini-3.1-flash-live) similarly offer low-latency multimodal streaming but use harness-style turn management and trail Thinking Machines on new proactivity and simultaneous-speech metrics.[1]

ElevenLabs expanded its voice-first conversational platform throughout early 2026, releasing Eleven v3 (generally available February 2026) with cinematic audio tags and 70+ language support, Eleven Flash v2.5 (75ms latency for real-time agents), Text-to-Dialogue for multi-speaker overlaps, and Scribe v2 Realtime STT.[7]

Voice is positioned as the “next interface,” with Meta partnerships for integration into Instagram, Horizon Worlds, and potential wearables.[8]
ElevenLabs is strong on audio quality and low-latency agent deployment, but its stack is still layered (TTS + LLM scaffolding) rather than an end-to-end native interaction model. Its latency in comparisons (~0.45s) sits between Thinking Machines (0.40s) and OpenAI (0.59s).[3]

Anthropic, Mistral, and Cohere remain focused on frontier reasoning and enterprise text/multimodal models (Claude 4.6, Mistral 3, Command R+), with no public native full-duplex interaction-model announcements in the last six months.[9]

Regional Asian labs emphasize agentic and multimodal deployment at scale: Chinese platforms (Doubao, Kimi, Tencent/Alibaba/Baidu integrations) show mass real-world adoption of agentic workflows, while Japan and South Korea leverage robotics/manufacturing expertise for enterprise multimodal conversational systems.[10][11]

No evidence of direct “interaction model” equivalents from these players.

Prior full-duplex prototypes (Kyutai Moshi, NVIDIA PersonaPlex/Nemotron-VoiceChat) exist but remain smaller-scale and latency-focused without the combined intelligence + native multimodality of Thinking Machines’ dual-model approach.[12]

Overall, Thinking Machines occupies a contested but differentiated position: the only announced architecture that makes full-duplex, time-aware, proactive multimodal interaction a first-class model property rather than an add-on harness. OpenAI and Google lead commercially in realtime voice today, ElevenLabs dominates specialized voice quality and agent tooling, and Asian labs lead in deployed scale. No other frontier lab has yet replicated the dual interaction + background model split or the new proactivity benchmarks introduced in May 2026.
