Deep-dive into how high-speed interconnects…

Interconnect bandwidth is the primary limiter on GPU utilization in frontier AI clusters: a 10-100x gap between intra-rack and inter-node bandwidth forces GPUs to idle during collective operations such as all-reduce.[1][2]

Intra-node and intra-rack NVLink delivers TB/s-scale bandwidth (e.g., 1.8 TB/s bidirectional per GPU, or 130 TB/s aggregate across a 72-GPU GB200 NVL72 rack), while inter-node InfiniBand and Ethernet links commonly run at ~400 Gbps (50 GB/s), with 800 Gbps generations now shipping. This disparity turns communication, especially gradient synchronization for models with hundreds of billions of parameters, into the dominant bottleneck, with all-reduce traffic consuming up to 62% of execution time in production workloads. Clusters are scaling to 100k–300k GPUs today (with projections toward 1M), but legacy fabrics were never designed for the east-west, many-to-many traffic patterns of synchronous training.[3][4]
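To make the gap concrete, the sketch below applies the standard bandwidth-only cost model for ring all-reduce, in which each GPU moves 2(N-1)/N times the payload over its slowest link. The 100B-parameter model, 2-byte gradients, and 72-GPU ring are illustrative assumptions rather than figures from the cited sources, and the model ignores latency, overlap with compute, and gradient sharding.

```python
def ring_allreduce_seconds(payload_bytes: float, n_gpus: int, link_bytes_per_s: float) -> float:
    """Bandwidth-only estimate for ring all-reduce (no latency, no compute overlap)."""
    if n_gpus < 2:
        return 0.0
    # Reduce-scatter plus all-gather: each GPU sends 2*(N-1)/N of the payload.
    return 2 * (n_gpus - 1) / n_gpus * payload_bytes / link_bytes_per_s


# Figures from the text, converted to bytes per second in one direction.
NVLINK_PER_DIRECTION = 1.8e12 / 2  # half of 1.8 TB/s bidirectional
INTER_NODE_LINK = 400e9 / 8        # 400 Gbps -> 50 GB/s

# Hypothetical workload: 100B parameters with 2-byte (bf16) gradients.
GRAD_BYTES = 100e9 * 2
N_GPUS = 72

t_nvlink = ring_allreduce_seconds(GRAD_BYTES, N_GPUS, NVLINK_PER_DIRECTION)
t_fabric = ring_allreduce_seconds(GRAD_BYTES, N_GPUS, INTER_NODE_LINK)

print(f"NVLink-class link: {t_nvlink:.2f} s per all-reduce")
print(f"400 Gbps link:     {t_fabric:.2f} s per all-reduce")
print(f"Slowdown:          {t_fabric / t_nvlink:.0f}x")
```

Under these assumptions the per-link slowdown is 18x, which sits at the low end of the 10-100x range; congestion, oversubscription, and multi-hop paths push real fabrics further toward the high end.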

NVLink and Rack-Scale Scale-Up: Nvidia's Strength and Limit

Nvidia's NVLink (now in its fifth generation) creates coherent, all-to-all domains that treat dozens of GPUs as a single large accelerator, removing most of the intra-rack all-reduce bottleneck. The GB200 NVL72 rack exemplifies this: 72 Blackwell GPUs linked through the NVLink Switch System at 130 TB/s aggregate, with per-GPU NVLink bandwidth reaching 1.8 TB/s (bidirectional) in multi-node configurations, far exceeding PCIe.[5][6]
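As a quick consistency check, the per-GPU and aggregate figures above can be related by simple arithmetic. The sketch uses only values quoted in this section; the 8-bits-per-byte conversion is the sole addition.

```python
GPUS_PER_RACK = 72
PER_GPU_TBYTES_PER_S = 1.8  # bidirectional NVLink bandwidth per GPU, from the text

# Aggregate rack bandwidth is the per-GPU figure summed over all GPUs.
aggregate_tbytes_per_s = GPUS_PER_RACK * PER_GPU_TBYTES_PER_S
# Networking figures are usually quoted in bits, so convert bytes to bits.
per_gpu_tbits_per_s = PER_GPU_TBYTES_PER_S * 8

print(f"Rack aggregate: {aggregate_tbytes_per_s:.1f} TB/s (quoted as ~130 TB/s)")
print(f"Per GPU:        {per_gpu_tbits_per_s:.1f} Tbps bidirectional")
```

The 130 TB/s headline is therefore a rounded sum of per-GPU bidirectional bandwidth, not a separate measurement of bisection bandwidth.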

This enables efficient training of 200B+ parameter models within a rack (or larger via extensions) while slashing latency and power overhead compared to network hops. However, NVLink remains proprietary, Nvidia-ecosystem locked, and currently capped at rack scale (tens to low hundreds of GPUs). Beyond that, clusters fall back to slower inter-node links, reintroducing the hierarchy gap.[7]

Implication for competitors: Any non-Nvidia accelerator ecosystem must either replicate NVLink-like density (difficult without similar vertical integration) or optimize software around heterogeneous fabrics. Hyperscalers are pushing for open alternatives to reduce lock-in.

InfiniBand vs. Ethernet: Performance vs. Scalability Trade-offs

InfiniBand (Nvidia/Mellanox) has dominated AI fabrics with RDMA-enabled lossless transport, ultra-low latency, and high effective bandwidth, historically holding ~90% share in high-end clusters. It sustains high GPU utilization in 10k–100k GPU setups but faces criticism for high cost (3–4x Ethernet), vendor lock-in, and complexity at extreme scale.[1][8]

Ethernet is closing the gap rapidly via RoCEv2, Nvidia Spectrum-X (purpose-built AI Ethernet with congestion management and performance isolation), and the Ultra Ethernet Consortium (UEC) Specification 1.0 (released June 2025). UEC adds AI/HPC optimizations including advanced congestion control, packet spraying, enhanced RDMA, in-network collectives, and better small-message performance—enabling Ethernet to approach or match InfiniBand in effective throughput for many workloads while offering lower cost, broader interoperability, and easier multi-vendor scaling.[9][10]

Broadcom's Tomahawk 6 (TH6) delivers up to 102.4 Tbps of switching capacity with 1.6 Tbps ports, supporting dense AI fabrics. Arista and others provide high-radix Ethernet switches optimized for these traffic patterns. Ethernet now powers major clusters (e.g., xAI's Colossus) because of its simplicity and economics.[11][12]

Shortfall: Even optimized Ethernet/InfiniBand at 400–800 Gbps per link requires massive parallelism and careful topology (fat-tree, dragonfly) to avoid congestion hotspots. At 100k+ GPUs, tail latency, failure domains, and power draw (networking ~5–10% of IT load) become acute.[4]
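The topology constraint can be illustrated with the textbook sizing formulas for non-blocking folded-Clos (fat-tree) networks built from identical k-port switches: k²/2 endpoints for two tiers and k³/4 for three. The sketch below plugs in the 102.4 Tbps switch capacity mentioned above; treating each GPU as one endpoint and using the full ASIC capacity for ports are simplifying assumptions.

```python
def switch_radix(capacity_gbps: int, port_gbps: int) -> int:
    """Number of ports a switch ASIC exposes at a given port speed."""
    return capacity_gbps // port_gbps


def two_tier_hosts(radix: int) -> int:
    """Non-blocking leaf-spine: each leaf uses half its ports for hosts."""
    return radix * radix // 2


def three_tier_hosts(radix: int) -> int:
    """Classic non-blocking three-tier fat-tree built from k-port switches."""
    return radix ** 3 // 4


TH6_CAPACITY_GBPS = 102_400  # 102.4 Tbps, from the text

for port_gbps in (800, 1600):
    k = switch_radix(TH6_CAPACITY_GBPS, port_gbps)
    print(f"{port_gbps} Gbps ports: radix {k}, "
          f"2-tier {two_tier_hosts(k):,} endpoints, "
          f"3-tier {three_tier_hosts(k):,} endpoints")
```

Even at radix 128, a non-blocking two-tier design tops out at 8,192 endpoints, so a 100k+ GPU cluster needs at least three switching tiers (or oversubscription), with the extra hops, optics, and failure domains that implies.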

Optical Interconnects and Startups: Addressing Power and Density

Copper/electrical links hit physical limits on reach, power per bit, and density for next-gen clusters. Optical solutions target this via silicon photonics and co-packaged optics (CPO).

Ayar Labs leads with in-package optical I/O (TeraPHY chiplets plus SuperNova remote laser sources), delivering multi-Tbps per engine (2 Tbps or more, full-duplex) at lower power and latency than electrical alternatives. It supports UCIe standards and is part of the Open Compute Interconnect (OCI) MSA (backed by AMD, Broadcom, Meta, Microsoft, Nvidia, and OpenAI). This enables optical scale-up within racks or for disaggregated memory and compute, narrowing the intra-/inter-node bandwidth gap and easing the power wall.[13][14]

Enfabrica developed Accelerated Compute Fabric (ACF) SuperNICs that combine high-bandwidth Ethernet/RDMA with CXL memory pooling, targeting 100k+ GPU clusters with flat topologies and fault tolerance. In 2025, Nvidia licensed the technology and hired the CEO and key staff in a deal reported at more than $900M, with the aim of integrating it into its ecosystem for more efficient large-scale fabrics.[15][16]

Broadcom complements these efforts with high-speed SerDes, DSP, and optics roadmaps, while pushing Ethernet for scale-up.[12]

Shortfall: Optical is still early (deployments ramping 2026+); electrical remains dominant for cost/reliability today. Integration challenges (e.g., fiber management, laser reliability) persist.

The Gap: Required vs. Available

Needed for frontier training (trillion-parameter models on 100k–1M GPU clusters): Aggregate cluster bandwidth in the PB/s range with sub-microsecond latency at scale, <1 pJ/bit power efficiency, lossless many-to-many collectives, multi-DC spanning, and open/multi-vendor support. This would keep GPUs >90% utilized without custom per-workload tuning.
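The pJ/bit target translates directly into watts, since power equals energy per bit times bit rate (1 pJ/bit at 1 Tbps is 1 W). The sketch below applies this to the 14.4 Tbps (1.8 TB/s) per-GPU NVLink figure; the 5 pJ/bit comparison point and the assumption that links run at full rate continuously are hypothetical, chosen only to show the scaling.

```python
def link_power_watts(energy_pj_per_bit: float, bandwidth_tbps: float) -> float:
    """Power = energy per bit x bit rate; pJ/bit times Tbit/s gives watts directly."""
    return energy_pj_per_bit * bandwidth_tbps


PER_GPU_TBPS = 14.4  # 1.8 TB/s x 8 bits, from the NVLink figure in the text
N_GPUS = 100_000     # lower end of the cluster sizes discussed above

# 1 pJ/bit is the ceiling named in the text; 5 pJ/bit is a hypothetical comparison.
for pj_per_bit in (1.0, 5.0):
    per_gpu_w = link_power_watts(pj_per_bit, PER_GPU_TBPS)
    cluster_mw = per_gpu_w * N_GPUS / 1e6
    print(f"{pj_per_bit:.0f} pJ/bit: {per_gpu_w:.1f} W per GPU, "
          f"{cluster_mw:.2f} MW across {N_GPUS:,} GPUs")
```

Each additional pJ/bit costs about 1.44 MW at this scale under these assumptions, which is why energy per bit appears alongside bandwidth and latency in the requirements.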

Available (mid-2026):
- Rack-scale NVLink excels locally but hits a wall beyond ~72 GPUs.
- Inter-node at 100–800 Gbps/link (InfiniBand/Ethernet/UEC) requires heavy software optimization (NCCL, topology-aware all-reduce) and still leaves 10–100x hierarchy gaps.
- Power and cost scale poorly; optical/CPO and advanced fabrics (Enfabrica-derived) are promising but not yet ubiquitous.
- UEC and OCI standards are closing interoperability gaps, but real-world perf at extreme scale remains vendor-optimized (Nvidia-heavy).[17]
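A topology-aware all-reduce of the kind mentioned in the list can be sketched with a two-level cost model: reduce-scatter inside each rack over the fast domain, all-reduce each shard across racks over the slow fabric, then all-gather inside the rack. The payload size, the 16-rack cluster, and the assumption that every GPU has its own 400 Gbps inter-rack link are hypothetical; this is a bandwidth-only estimate, not a description of how NCCL schedules collectives.

```python
def ring_allreduce_seconds(payload_bytes, n, bytes_per_s):
    """Bandwidth-only ring all-reduce: each GPU sends 2*(n-1)/n of the payload."""
    return 0.0 if n < 2 else 2 * (n - 1) / n * payload_bytes / bytes_per_s


def hierarchical_allreduce_seconds(payload_bytes, gpus_per_rack, n_racks,
                                   intra_bytes_per_s, inter_bytes_per_s):
    """Reduce-scatter in the rack, all-reduce shards across racks, all-gather in the rack."""
    g = gpus_per_rack
    # Intra-rack reduce-scatter and all-gather each move (g-1)/g of the payload.
    intra = 2 * (g - 1) / g * payload_bytes / intra_bytes_per_s
    # Each GPU then all-reduces only its 1/g shard with its peers in other racks.
    inter = ring_allreduce_seconds(payload_bytes / g, n_racks, inter_bytes_per_s)
    return intra + inter


PAYLOAD = 200e9       # hypothetical: 100B parameters x 2-byte gradients
GPUS_PER_RACK = 72    # NVL72 domain size, from the text
N_RACKS = 16          # hypothetical cluster of 1,152 GPUs
INTRA = 1.8e12 / 2    # NVLink, bytes/s per direction
INTER = 400e9 / 8     # 400 Gbps link, bytes/s

flat = ring_allreduce_seconds(PAYLOAD, GPUS_PER_RACK * N_RACKS, INTER)
hier = hierarchical_allreduce_seconds(PAYLOAD, GPUS_PER_RACK, N_RACKS, INTRA, INTER)

print(f"Flat ring over the slow fabric: {flat:.2f} s")
print(f"Two-level (rack, then fabric):  {hier:.2f} s")
print(f"Speedup:                        {flat / hier:.1f}x")
```

The two-level schedule helps because only 1/72 of the payload per GPU crosses the slow fabric. The benefit shrinks as the scale-up domain gets smaller relative to the cluster, which is the practical cost of the ~72-GPU limit.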

Implications: The networking layer now rivals (or exceeds) raw GPU FLOPS as the competitive moat. Nvidia's vertical stack (NVLink + InfiniBand/Spectrum-X + acquisitions) maintains leadership, but open Ethernet/optical pushes from Broadcom, Arista, hyperscalers, and startups like Ayar Labs are eroding it on cost and flexibility. New entrants must solve disaggregation, power efficiency, or software-defined fabrics to compete; otherwise, they face idle GPUs and multi-year delays in effective cluster performance. Continued progress in UEC, silicon photonics, and CXL-over-fabric will be decisive for 2027+ exascale AI.

Recent Findings Supplement (June 2026)

Recent developments (post-December 2025) highlight accelerating adoption of co-packaged optics (CPO) and higher-speed Ethernet/InfiniBand fabrics to address interconnect bottlenecks in frontier AI training clusters, while exposing persistent gaps in scale-up domain size and per-GPU bandwidth relative to compute growth.[1][2]

Nvidia continues to lead with proprietary solutions (NVLink and InfiniBand), but Ethernet players (Arista, Broadcom) and optical startups (Ayar Labs, Lightmatter) are gaining traction with open or hybrid approaches for larger, more efficient clusters. Bandwidth demands for models requiring synchronization across tens of thousands of GPUs—hundreds of terabits of collective bisection bandwidth and sub-microsecond latency—continue to outpace electrical interconnect reach, power efficiency, and density beyond rack scale.[3]

Optical Interconnects Advancing for Multi-Rack Scale-Up

In a March 2026 milestone, Lightmatter’s Passage CPO chiplet delivered a record 1.6 Tbps per fiber using 16-wavelength DWDM at 112G per SerDes lane, up to 8x more bandwidth per fiber than prior NPO/CPO solutions. This silicon-proven technology targets hyperscaler deployment to ease fiber cabling, space, and power constraints in growing AI clusters, with pathways to 100+ Tbps per package.[1]
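The per-fiber figure follows from multiplying wavelengths by per-wavelength rate, and it determines how many fibers a package needs. The sketch below assumes each wavelength is driven by one SerDes lane carrying roughly 100 Gbps of payload at a 112G line rate, and that fiber counts are per direction; both are assumptions, not vendor specifications.

```python
import math

WAVELENGTHS_PER_FIBER = 16         # from the text
PAYLOAD_GBPS_PER_WAVELENGTH = 100  # assumption: a 112G lane carries ~100 Gbps of payload

fiber_gbps = WAVELENGTHS_PER_FIBER * PAYLOAD_GBPS_PER_WAVELENGTH
# "Up to 8x more bandwidth per fiber" implies a prior per-fiber rate of one eighth.
prior_fiber_gbps = fiber_gbps // 8

PACKAGE_TARGET_GBPS = 100_000      # the "100+ Tbps per package" pathway

fibers_needed = math.ceil(PACKAGE_TARGET_GBPS / fiber_gbps)
fibers_needed_prior = math.ceil(PACKAGE_TARGET_GBPS / prior_fiber_gbps)

print(f"Per fiber: {fiber_gbps / 1000:.1f} Tbps (prior: {prior_fiber_gbps} Gbps)")
print(f"Fibers for 100 Tbps: {fibers_needed} vs {fibers_needed_prior} at the prior rate")
```

On these numbers a 100 Tbps package needs 63 fibers per direction instead of 500, which is the cabling relief the paragraph refers to.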

Ayar Labs raised $500M in Series E funding (March 2026, ~$3.8B valuation) to scale production of its TeraPHY optical engines and SuperNova light sources. It joined Nvidia’s NVLink Fusion ecosystem to enable CPO-based rack-scale optical fabrics that connect thousands of GPUs across racks with higher bandwidth density, lower power (4-20x better throughput per watt than copper, by some claims), and reduced latency. Partnerships (e.g., with Alchip) focus on UCIe-compatible chiplets for AI accelerators.[4][5]

Implications for competitors: Optical CPO is shifting from niche to critical for scale-up beyond ~72 GPUs (current NVLink electrical limits). Entrants must prioritize power efficiency and integration with existing ecosystems (e.g., UCIe, NVLink) or risk being sidelined in hyperscale bids.

Ethernet Gaining Ground in Scale-Out and Emerging Scale-Up

Broadcom’s Tomahawk 6 delivers 102.4 Tbps switching capacity (world’s first at this level in a single chip); Jericho4 fabric routers support multi-data-center scale with congestion-free RoCE and 3.2 Tbps HyperPort. Scale-Up Ethernet (SUE) is positioned as an open alternative for intra-rack or pod-level connectivity, with deployments ramping for 2026 rack-level products.[6][7]

Arista reported strong Q4 FY2025 results (announced Feb 2026) driven by AI Ethernet momentum across back-end, front-end, and scale-across fabrics, with Etherlink switches emphasizing RDMA-aware features, load balancing, and Ultra Ethernet Consortium (UEC) compatibility.[8]

DriveNets (Jan 2026) reported its fabric-scheduled Ethernet delivering up to 18% better NCCL performance than InfiniBand in a live 512-GPU production cluster, highlighting Ethernet’s improving viability for large-scale training.[9]

Market analyses (May 2026) note Broadcom’s third-generation CPO supporting 200 Gbit/s per lane (unveiled May 2025) and project silicon photonics optical interconnects growing rapidly, with AI data center fabrics as the fastest segment.[10]

Implications: Ethernet is closing the performance gap with InfiniBand (via RoCE enhancements and scheduling) while offering cost, openness, and ecosystem advantages. Competitors should target hybrid fabrics or UEC-compliant solutions for broader adoption in non-Nvidia ecosystems (e.g., AMD XPUs).

Nvidia’s Continued Dominance in Proprietary High-Bandwidth Fabrics

Nvidia’s Quantum-X800 InfiniBand and Spectrum-X Ethernet platforms (800G end-to-end) are shipping in volume, supporting trillion-parameter models with low latency and high consistency. NVLink 5 (Blackwell-era, ~224 Gbit/s per lane) enables NVL72 systems (72 fully connected GPUs, ~14.4 Tbps per GPU bidirectional, i.e., 1.8 TB/s), which have shipped since 2025, with roadmaps toward larger domains (e.g., NVL288 or beyond) via Vera Rubin.[11][12]

Implications: Nvidia’s vertical integration (GPUs + NVLink + InfiniBand/Spectrum-X) maintains a performance edge for tightly coupled training, but openness pressures from Ethernet/optical alternatives create opportunities for multi-vendor clusters.

Persistent Gaps: Bandwidth Scaling Lags Compute; Scale-Up Domains Remain Small

A March 2026 analysis shows scale-out network bandwidth per GPU/XPU rising only 4-5x since 2022, while compute per GPU surged 10x+, making networks the “hidden limiter.” Scale-up domains advanced slowly to 72 GPUs (GB200 NVL72, shipping 2025) but frontier MoE models need hundreds to thousands of GPUs per pod for larger expert counts and active experts per token. Electrical solutions hit fundamental limits in reach, power, and radix beyond rack scale.[2]
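The effect of uneven scaling can be sketched with an Amdahl-style model: if compute gets faster by one factor and the network by a smaller one, the share of each step spent communicating grows. The 20% starting communication share is a hypothetical value, and the model assumes a fixed workload with no overlap between compute and communication.

```python
def comm_fraction_after_scaling(base_comm_fraction: float,
                                compute_speedup: float,
                                network_speedup: float) -> float:
    """Share of step time spent communicating after both resources are sped up."""
    compute_time = (1 - base_comm_fraction) / compute_speedup
    comm_time = base_comm_fraction / network_speedup
    return comm_time / (compute_time + comm_time)


BASE_COMM_FRACTION = 0.20  # hypothetical 2022 baseline
COMPUTE_SPEEDUP = 10.0     # "10x+" compute growth per GPU, from the text

for network_speedup in (4.0, 5.0):  # "4-5x" scale-out bandwidth growth, from the text
    frac = comm_fraction_after_scaling(BASE_COMM_FRACTION, COMPUTE_SPEEDUP, network_speedup)
    print(f"Network {network_speedup:.0f}x, compute {COMPUTE_SPEEDUP:.0f}x: "
          f"communication share rises from {BASE_COMM_FRACTION:.0%} to {frac:.0%}")
```

With these inputs the communication share rises from 20% to roughly 33-38%, even though the network itself became several times faster.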

As one example, a 16k-GPU cluster is cited as needing ~32 Tbps of bisection bandwidth, and each new model generation demands ~10x more interconnect bandwidth. Optical and higher-radix solutions are essential; pluggable optics and copper fall short on reliability and density for multi-rack scale-up.[3][13]

Implications: The gap favors companies delivering optical CPO or advanced Ethernet for larger pods and better bandwidth-per-watt. Pure electrical or legacy pluggable approaches risk being constrained to smaller clusters; new entrants should focus on co-packaged or in-package optics integrated with accelerators.

Market and Technology Trajectory Through Mid-2026

Silicon photonics optical interconnects are projected for strong growth (e.g., AI fabric segment at 38.7% CAGR in one 2026 report), driven by the shift to CPO for chip-to-chip and board-level AI connectivity. 800G dominates 2025-2026 shipments, with 1.6T on the horizon as a requirement for frontier models.[3][10]

Overall gap summary: Needed—multi-rack scale-up with 10s of Tbps per GPU/pod, petabit-scale fabrics, pJ/bit efficiency, and sub-μs latency at 100k+ GPU scales. Available—mature 800G fabrics and early CPO sampling (1.6 Tbps/fiber demos), but scale-up domains and per-GPU bandwidth still lag compute by a widening margin, with full commercial multi-rack optical deployments ramping rather than widespread.

These developments signal a rapid pivot toward optical and Ethernet-hybrid solutions in the first half of 2026, with concrete product milestones and funding validating the shift.
