Industry Analysis

Are networking and memory the two biggest constraints on the AI buildout...

Jon Sinclair using Luminix AI Strategic Research
Key Takeaway

The framing that pairs networking and memory as the two biggest constraints on the AI buildout contains a category error. These factors are moving in opposite directions. Memory ranks near the top of any honest assessment of limitations.

In this report 7 sections
  1. The Question Contains a Category Error
  2. The Real Ranking Depends on Who You Are
  3. The Most Underappreciated Bottleneck Is Squandered Capacity
  4. Where Power and Fab Capacity Genuinely Outrank Memory
  5. Where the Strategic Opportunities Concentrate
  6. Where the Two-Constraint Thesis Is Weakest — and What Would Flip It
  7. Questions the Research Leaves Open

The Question Contains a Category Error

The "networking and memory" framing pairs two constraints that are actually moving in opposite directions. Memory belongs near the top of any honest ranking — HBM from all three suppliers (SK Hynix, Samsung, Micron) is sold out through 2026 under multi-year fixed-price contracts, with AI consuming up to 70% of global memory production (Reports 1, 3). But networking is the most addressable of the named bottlenecks, not a co-equal apex constraint. It is being actively solved in real time: DriveNets reported fabric-scheduled Ethernet beating InfiniBand by 18% in a live 512-GPU cluster, the Ultra Ethernet Consortium shipped Spec 1.0, and co-packaged optics startups hit milestones like Lightmatter's 1.6 Tbps/fiber (Report 2).

The more complete picture is a stack of physical constraints — power, HBM, and fab/packaging capacity — sitting upstream of networking. Across five of six reports, power/grid access is the single most-cited binding constraint (Reports 1, 4, 6), and by mid-2026 a credible analyst camp argues chip manufacturing capacity at TSMC has overtaken even power as the rate-limiter (Report 6). Networking matters enormously at frontier scale, but it's the constraint with the clearest engineering path out.

The Real Ranking Depends on Who You Are

The binding constraint is different for each player, and conflating them is why "networking and memory" feels both true and incomplete.

Frontier training labs: interconnect bandwidth genuinely rivals or exceeds raw GPU FLOPS as the competitive moat (Report 2). Scale-out bandwidth per GPU rose only 4–5x since 2022 while compute per GPU surged 10x+, making the network "the hidden limiter," with all-reduce traffic consuming up to 62% of execution time (Report 2). For these players, networking and HBM genuinely are top constraints — but so is securing gigawatts of power and CoWoS allocation.

Inference (now ~2/3 of AI compute): the binding constraint is the memory wall, specifically memory bandwidth and latency in the decode phase, not FLOPS. A January 2026 Google DeepMind paper by Patterson and Ma declared "LLM inference is a crisis" driven precisely by memory bandwidth, not compute (Report 6). Here, memory dominates and networking is largely irrelevant.

Enterprise deployments: the binding constraint is neither networking nor memory — it's software and data movement. Enterprise GPU utilization averages roughly 5% (CAST AI, cited in Report 5), with accelerators idle ~95% of the time due to data staging and movement. The constraint here is self-inflicted waste, not hardware scarcity.

So the honest answer: memory is a top-two constraint nearly everywhere; networking is top-two only at frontier training scale; power is the binding constraint for getting anything energized at all; and software is the silent constraint for everyone outside the hyperscalers.
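To make the decode-phase memory wall concrete, here is a minimal back-of-envelope sketch. It assumes the simplest possible model: every generated token requires streaming all model weights from HBM once, so bandwidth divided by weight size caps tokens per second for a single sequence. The 70B-parameter model, 16-bit weights, and 3.35 TB/s bandwidth figure are illustrative assumptions, not numbers from the reports, and the sketch ignores KV-cache traffic and batching.

```python
def decode_tokens_per_second_bound(params_billion, bytes_per_param, hbm_tb_per_s):
    """Upper bound on single-sequence decode speed when each token
    requires reading every weight from HBM exactly once."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    bandwidth_bytes_per_s = hbm_tb_per_s * 1e12
    return bandwidth_bytes_per_s / weight_bytes


# Hypothetical configuration (assumed, not from the reports):
# 70B parameters, 2 bytes per weight, 3.35 TB/s of HBM bandwidth.
bound = decode_tokens_per_second_bound(70, 2, 3.35)
print(f"Bandwidth-bound ceiling: {bound:.1f} tokens/s per sequence")
```

Under these assumptions the ceiling is set entirely by memory bandwidth. Adding FLOPS would not move it, which is the sense in which decode is memory-bound rather than compute-bound.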

The Most Underappreciated Bottleneck Is Squandered Capacity

The 5% enterprise utilization figure (Report 5) reframes the entire debate. If realized utilization is that low, the "shortage" of accelerators is partly a software problem masquerading as a hardware problem. More striking: Model FLOPS Utilization has declined on faster hardware — roughly 40.8% on H100 versus 59.7% on A100 — because communication and data overheads grow relatively larger as chips get faster (Report 5). Every hardware generation buys less effective compute per dollar than the marketing suggests, because the software stack can't feed the silicon.

This is the non-obvious twist: the industry is pouring $630B+ in 2026 hyperscaler capex (Reports 1, 4) into raw silicon while leaving the majority of existing capacity stranded by data-movement inefficiency. Solutions targeting data liquidity and orchestration could unlock more effective capacity than additional fab output.
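The arithmetic behind both claims can be sketched in a few lines. The MFU values and the 5% utilization figure come from the text; the peak dense BF16 throughput numbers (312 TFLOPS for A100, 989 TFLOPS for H100) and the 1,000-accelerator fleet are assumptions added purely for illustration.

```python
def effective_tflops(peak_tflops, mfu):
    """Throughput actually delivered to the model at a given MFU."""
    return peak_tflops * mfu


def fleet_effective_units(accelerators, utilization):
    """Accelerator-equivalents doing useful work at a given utilization."""
    return accelerators * utilization


# Assumed peak dense BF16 throughput per chip (not from the reports).
A100_PEAK, H100_PEAK = 312.0, 989.0

a100 = effective_tflops(A100_PEAK, 0.597)  # MFU from the text
h100 = effective_tflops(H100_PEAK, 0.408)  # MFU from the text
peak_ratio = H100_PEAK / A100_PEAK
effective_ratio = h100 / a100
print(f"Peak speedup {peak_ratio:.2f}x, effective speedup {effective_ratio:.2f}x")

# Hypothetical 1,000-accelerator enterprise fleet at 5% utilization.
baseline = fleet_effective_units(1000, 0.05)
more_hardware = fleet_effective_units(1500, 0.05)    # buy 50% more chips
better_software = fleet_effective_units(1000, 0.10)  # double utilization
print(baseline, more_hardware, better_software)
```

With these inputs the newer chip's peak advantage of roughly 3.2x shrinks to roughly 2.2x in delivered throughput, and doubling utilization from 5% to 10% yields more useful capacity than buying 50% more hardware.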

The second underappreciated bottleneck is the humble high-voltage transformer. Transformer lead times have stretched to 2.5–5 years (up from 24–30 months pre-2020), and Sightline Climate projects 30–50% of the planned 2026 capacity pipeline will face delay or cancellation (Reports 1, 4). The AI buildout is, at the margin, gated by a piece of century-old electrical equipment — not by anything in the chip stack.

Where Power and Fab Capacity Genuinely Outrank Memory

The strongest disconfirmation of a chip-centric view: data center build cycles run 12–36 months, but grid interconnection queues run 4–10 years in key markets like PJM and ERCOT (Reports 4, 6). ERCOT alone was tracking ~410 GW of large-load interconnection requests by early 2026 (Report 4), several times its all-time peak load of roughly 85 GW. You can procure GPUs and HBM in quarters; you cannot energize a site in under half a decade without behind-the-meter generation. Power is upstream of memory and networking in a way the two-constraint thesis ignores.
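The timing argument is a critical-path calculation: construction and power procurement run in parallel, so the slower of the two sets the go-live date. A minimal sketch follows, using the build and queue ranges from the text; the 24-month behind-the-meter lead time is a hypothetical value chosen for illustration.

```python
def months_to_energized(build_months, grid_queue_months, behind_the_meter_months=None):
    """A site goes live when both the building and a power source are ready.
    If on-site generation is an option, the faster power path is used."""
    power_months = grid_queue_months
    if behind_the_meter_months is not None:
        power_months = min(grid_queue_months, behind_the_meter_months)
    return max(build_months, power_months)


# Ranges from the text: build 12-36 months, grid queue 4-10 years.
best_case = months_to_energized(12, 4 * 12)
worst_case = months_to_energized(36, 10 * 12)
# Hypothetical: on-site generation available in 24 months (assumed).
with_onsite = months_to_energized(36, 10 * 12, behind_the_meter_months=24)
print(best_case, worst_case, with_onsite)
```

In the grid-dependent cases the queue alone determines the date, so a faster build buys nothing. Only the on-site generation case puts construction back on the critical path.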

Then comes the genuine conflict in the research. Report 6 cites a May 2026 CNAS report and Bernstein commentary arguing that chip manufacturing (TSMC wafers and CoWoS packaging) has become the tightest constraint, displacing the power constraint that dominated in 2024–2025. In that account, TSMC's CEO named wafers (not power) as the bottleneck, and Google reportedly missed 2026 targets over insufficient foundry slots. Report 4, however, maintains that power/grid remains the dominant deployment gate, while noting "some 2026 analyses note a shift toward chips." Both can be true: fab capacity caps what gets manufactured; power caps what gets energized. Neither is "networking."
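One way to see why both camps can be right is to treat deployed capacity as the minimum of two independent caps. The sketch below is a toy model under that assumption, and the gigawatt figures in it are hypothetical, not drawn from either report.

```python
def deployable_gw(manufactured_gw, energized_gw):
    """Capacity that actually comes online is capped by the tighter limit."""
    return min(manufactured_gw, energized_gw)


def binding_constraint(manufactured_gw, energized_gw):
    """Name whichever cap is currently the tighter one."""
    if manufactured_gw < energized_gw:
        return "fab"
    if energized_gw < manufactured_gw:
        return "power"
    return "both"


# Hypothetical scenarios (GW of accelerators buildable vs. GW of sites energized).
scenario_2025 = (12.0, 8.0)   # assumed: chips plentiful, power scarce
scenario_2026 = (9.0, 10.0)   # assumed: power catches up, wafers fall short
for chips, power in (scenario_2025, scenario_2026):
    print(deployable_gw(chips, power), binding_constraint(chips, power))
```

A small shift in either input flips the label, which is consistent with reports written months apart naming different bottlenecks.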

Where the Strategic Opportunities Concentrate

The constraint landscape points to four high-leverage plays, none of which is "build a faster switch":

Power as product, not utility. Behind-the-meter generation, nuclear/SMR offtakes, and pre-secured grid headroom now function as a moat and barrier to entry (Reports 1, 4). On-site power expectations among hyperscalers rose 22% in just six months prior to January 2026 (Report 4). Capital deployed fastest on power infrastructure captures disproportionate value (Report 6).

The utilization arbitrage. With enterprise utilization near 5% (Report 5), software that improves data liquidity, orchestration, and compiler efficiency can deliver more effective compute than new silicon — and faces none of the multi-year physical lead times. This is the rare opportunity that's capital-light and unconstrained by the physical bottlenecks gating everyone else (Reports 5, 6).

Optical interconnect at the inflection point. Co-packaged optics is shifting from niche to critical precisely because electrical links hit physical limits beyond ~72 GPUs (Report 2). Ayar Labs raised $500M at a ~$3.8B valuation and joined NVIDIA's NVLink Fusion ecosystem; Nvidia paid $900M+ to license Enfabrica's fabric technology and acqui-hire its team (Report 2). The smart money is already moving here.

Memory architecture, not just memory supply. Since HBM allocation favors incumbents with NVIDIA-tied contracts (Report 3), differentiation lies in reducing memory pressure: CXL pooling (Marvell's Structera reporting 4.8x inference throughput and 82.7% reduction in time-to-first-token), processing-in-memory, and SRAM-centric inference designs (Report 3). Quantization to 4–8 bit retains ~99% accuracy while cutting memory needs and energy 70–80% (Report 6).
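The memory side of the quantization claim is straightforward to check for weights alone. The sketch below assumes a hypothetical 70B-parameter model with a 16-bit baseline, and it ignores quantization metadata (scales, zero points), activations, and KV cache, so real savings will differ somewhat.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def reduction_vs_16bit(bits_per_weight):
    """Fractional saving in weight memory relative to 16-bit weights."""
    return 1 - bits_per_weight / 16


PARAMS_BILLION = 70  # hypothetical model size, not from the reports
for bits in (16, 8, 4):
    gb = weight_memory_gb(PARAMS_BILLION, bits)
    saving = reduction_vs_16bit(bits)
    print(f"{bits:>2}-bit: {gb:6.1f} GB of weights, {saving:.0%} smaller than 16-bit")
```

Under these assumptions 4-bit weights land at a 75% reduction, inside the 70–80% range cited above, while 8-bit weights save 50%.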

Where the Two-Constraint Thesis Is Weakest — and What Would Flip It

The networking-and-memory thesis is weakest on networking, for three reasons. First, networking is being solved on a 12–24 month timeline (UEC, Spectrum-X, Tomahawk 6 at 102.4 Tbps, production Ethernet beating InfiniBand) while power and fab constraints carry multi-year lead times that no engineering sprint can compress (Reports 2, 4). Second, software efficiency gains — MoE activating only 10–30% of parameters, FP8 training, speculative decoding — are decoupling model capability growth from raw hardware scaling, which relaxes both networking and memory pressure simultaneously (Report 6). Third, both networking and memory sit downstream of power and fab capacity: optimizing interconnects is worthless if you can't energize the rack or buy the wafer (Reports 4, 6).

Memory holds up far better. HBM's sold-out status, three-supplier oligopoly, and 12–18 month expansion lead times make it a structural rather than cyclical constraint (Report 3), and the inference shift makes memory bandwidth the binding limit for the fastest-growing workload (Report 6).

What would change the answer: If the software efficiency curve (quantization, MoE, KV-cache compression) keeps outpacing model size growth, aggregate hardware demand growth slows and memory becomes the clear apex constraint while networking recedes (Report 6). Conversely, if power and fab capacity get resolved — through behind-the-meter generation at scale and TSMC capacity coming online in 2027+ — then memory and networking snap back to the top of the binding-constraint list precisely as the current thesis predicts. The thesis isn't wrong; it's describing the world that exists after the power and fab bottlenecks are cleared.

Questions the Research Leaves Open

Three gaps matter for anyone betting on this landscape. First, the reports don't reconcile whether power or fab capacity is the binding upstream constraint in 2026 — Report 6's CNAS/Bernstein "fab capacity won" claim and Report 4's "power still leads" position are presented as a live disagreement, and the answer determines whether you invest in foundry-adjacent or power-adjacent assets. Second, the 5% enterprise utilization figure (Report 5) is dramatic but comes from a single widely-cited source; if it's even directionally right, it implies the entire shortage narrative is partly a software failure — this deserves independent verification before betting on it. Third, none of the reports quantifies how much aggregate hardware demand the software efficiency gains actually absorb — the difference between "software buys 18 months" and "software fundamentally bends the demand curve" changes which constraints bind in 2027–2028.

Latest from the conversation on X
Jun 4, 2026
  • Hardware memory forms the primary "memory wall" constraining GenAI, as model sizes have grown far faster than memory capacity and bandwidth per accelerator, leaving compute idle while bandwidth gaps dominate runtime and costs.
  • AI inference is fundamentally a memory trade where KV cache size becomes the limiting factor beyond certain batch sizes, shifting the dominant bottleneck to memory bandwidth rather than FLOPS as inference scales 100x.
  • Networking is emerging as the next AI infrastructure narrative after memory, with hyperscalers and NVIDIA flagging interconnects as a top bottleneck and driving opportunities across Ethernet switches, optics, and related chips.
  • The AI bottleneck is rotating through phases (compute in 2023, networking in 2024, memory in 2025), with power, cooling, and energy expected to dominate later as demand outpaces supply in each layer.
  • Rack power density is becoming the binding constraint in 2026 after prior HBM and CoWoS limits, necessitating liquid cooling, higher voltage architectures, and optical fabrics as cluster scales increase.

Source Research Reports

The full underlying research reports cited throughout this analysis.

Report 1: Research the current publicly identified constraints on AI training and inference infrastructure as of 2025-2026. Survey analyst reports, hyperscaler earnings calls, and industry publications to catalog the full spectrum of bottlenecks — including networking, memory, compute, power/energy, cooling, and software stack limitations. Produce a ranked summary of which constraints are most cited and by whom.

Power and energy supply is the most widely cited and binding constraint on AI infrastructure scaling in 2025-2026. Hyperscalers (Microsoft, Google/Alphabet, Amazon, Meta) repeatedly flag it in earnings calls as the factor limiting capacity fulfillment despite massive capex ramps ($600B+ combined for 2026 in some projections). Utilities face strained grids, multi-year interconnection queues, “ghost capacity” reservations, and requirements for high utilization guarantees. This stems from AI workloads demanding 10x or more the rack power density of traditional IT, with data center power demand projected to rise 160% from 2023 levels by 2030.[1][2][3]

  • Microsoft noted in 2026 calls that it expects to “remain constrained at least through 2026” due to GPU/CPU/storage capacity limits tied to power availability; similar commentary from Amazon (“growing faster if not for capacity constraints” in chips/power) and others.[4][5]
  • Analyst reports (Goldman Sachs, McKinsey, Deloitte) rank grid/power as the top challenge: 72% of power/data center executives in a Deloitte 2025 survey called it “very or extremely challenging,” with 7-year waits for some grid connections and PJM capacity prices spiking >10x in affected markets.[6][7]
  • SemiAnalysis and others project AI driving ~40 GW of the ~96 GW total datacenter IT power demand by 2026, with new generation (gas plants 5-7 years) unable to keep pace.[8]

Implication for competitors/entrants: Securing power-secured sites or behind-the-meter generation (natural gas, on-site renewables) now creates durable advantages; late entrants without grid access or utility partnerships will face multi-year delays or higher costs.

High-bandwidth memory (HBM) and related storage represent the next-most-cited hardware bottleneck after power. SemiAnalysis (Dylan Patel) emphasizes memory as a primary scaling limit alongside logic and power, with HBM sold out through 2026, prices spiking, and ~30% of hyperscaler AI capex flowing to memory. AI inference is also driving storage constraints (HDD lead times from weeks to >1 year; tight enterprise SSD supply into 2026).[9][10]

  • HBM demand crowds out commodity DRAM; transitions to HBM4/HBM4E will intensify wafer capacity pressure (HBM uses ~3x more wafers per bit than standard DRAM).[9]
  • Yole Group notes memory architecture evolution (DDR5, HBM, CXL) as essential to address bandwidth/capacity bottlenecks in AI servers.[11]
  • Inference workloads amplify real-time data access needs, shifting bottlenecks from pure compute to storage I/O and dense flash footprints where power/space/latency are constrained.[10]

Implication: Memory supply chain access (via NVIDIA ecosystem, SK Hynix, etc.) or alternatives (CXL disaggregation, custom silicon with optimized memory) differentiates winners; price volatility will favor those with long-term contracts.

Networking fabrics for east-west GPU-to-GPU communication are a critical but somewhat addressable constraint in large training clusters. Traditional north-south optimized networks fail under AllReduce-style synchronization across thousands of GPUs, leaving expensive accelerators idle. InfiniBand (NVIDIA-dominated, low-latency RDMA) has led but Ethernet (with RoCEv2 or Spectrum-X) is gaining share for cost/scalability as clusters exceed single-site limits.[12][13]

  • Communication time increasingly dominates training (vs. pure compute); clusters of 100k+ GPUs amplify fragility from congestion, misconfigurations, or link issues.[14]
  • By mid-2025, Ethernet surpassed InfiniBand in some AI back-end shipments; NVIDIA’s Spectrum-XGS targets distributed “giga-scale” factories amid power-driven site dispersion.[15]

Implication: Operators building multi-site or heterogeneous clusters need flexible, high-scale fabrics (Ethernet advantages in cost and vendor diversity); pure InfiniBand lock-in risks higher TCO at extreme scale.

Cooling (thermal management) is tightly coupled to power density and increasingly requires liquid solutions. AI racks (often 30-100 kW vs. traditional <10-25 kW) overwhelm air cooling; cooling can consume up to 40% of data center power. Water usage for evaporative systems adds scarcity and regulatory risks.[6][16]

  • Shift to direct liquid cooling (DLC), rear-door heat exchangers (RDHx), or advanced methods (spray, immersion) is accelerating; geothermal or heat-recovery integration emerging for efficiency.[17][18]
  • High-density AI workloads make power and cooling co-equal constraints in site selection.

Implication: New builds or retrofits must prioritize liquid-ready designs and water-efficient or closed-loop systems; operators in water-stressed regions face added hurdles.

Compute (accelerators, packaging like CoWoS) and software/orchestration layers complete the picture but are less universally binding than power/memory in 2025-2026. Early GPU shortages have eased somewhat via custom silicon (Trainium, TPUs) and efficiency gains, though supply chain limits (TSMC wafers, advanced packaging) persist. Software issues—synchronization stalls, stragglers, orchestration overhead—cause real throughput loss even when hardware is available.[19][20]

Ranked summary of most-cited constraints (by prevalence across hyperscaler commentary, Goldman Sachs/McKinsey/Deloitte/SemiAnalysis reports, and industry pubs as of mid-2026):
1. Power/energy/grid access (hyperscalers earnings, Goldman Sachs, McKinsey, Deloitte — dominant theme).
2. Memory (HBM/DRAM supply & cost) (SemiAnalysis/Dylan Patel primary emphasis; echoed in capex commentary).
3. Cooling/thermal + water (tied to power density in data center outlooks).
4. Networking fabrics (critical for training scale; transitioning dynamics noted in 2025-2026 analyses).
5. Compute/packaging & software stack (still relevant but secondary to physical infrastructure).

These bottlenecks interact: power limits site/build speed, which compounds memory/networking allocation challenges. Inference growth (projected to overtake training load share) shifts emphasis toward storage, latency-optimized networks, and distributed/edge deployments, while training remains cluster-scale power/network intensive.[21][22]

For new entrants or competitors, the durable moats lie in power-secured locations, memory supply contracts, and liquid-cooling expertise rather than raw GPU counts. Additional verification on exact 2026 GW deployments or specific earnings transcripts would further refine quantitative projections.


Recent Findings Supplement (June 2026)

Power and grid infrastructure constraints have emerged as the leading publicly cited bottleneck for AI data center expansion in 2026, shifting the gating factor from chips or capital to electrical equipment and interconnection timelines. Sightline Climate’s February 2026 outlook projects that 30–50% of the planned 2026 global pipeline (roughly 16 GW across ~140 projects, with ~12 GW in the US) will face delays or cancellations, primarily because only ~5 GW is under active construction and high-voltage transformers now carry 2.5–5 year lead times (up from 24–30 months pre-2020).[1][2]

This mechanism works through multi-year utility queues and supply-chain rigidity: data center build cycles (12–18 months) cannot outpace transformer/switchgear/battery procurement or grid upgrades, forcing hyperscalers to deprioritize grid-dependent training clusters in favor of sites with pre-secured or on-site power.[3]

  • Sightline Climate (Feb 2026) and Bloomberg-linked reporting highlight 11 GW of announced US capacity with no visible construction progress and 25% of projects lacking disclosed power strategies.[2]
  • Transformer and grid equipment shortages are repeatedly named as the binding constraint across analyst notes and industry coverage through May 2026.[2]
  • Hyperscalers continue guiding record 2026 CapEx ($630B+ combined across Amazon, Google, Meta, Microsoft), yet commentary and project tracking underscore power availability as the limiter on realized capacity.[4]
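The Sightline Climate figures above imply a fairly wide band of outcomes, which the following sketch works through. All inputs are taken from the text; the only assumption is that the 30–50% delay range applies uniformly to the 16 GW global pipeline.

```python
PIPELINE_GW = 16.0            # planned 2026 global pipeline (from the text)
UNDER_CONSTRUCTION_GW = 5.0   # actively being built (from the text)
DELAY_SHARE_LOW, DELAY_SHARE_HIGH = 0.30, 0.50

at_risk_low = PIPELINE_GW * DELAY_SHARE_LOW
at_risk_high = PIPELINE_GW * DELAY_SHARE_HIGH
on_time_high = PIPELINE_GW - at_risk_low
on_time_low = PIPELINE_GW - at_risk_high
share_under_construction = UNDER_CONSTRUCTION_GW / PIPELINE_GW

print(f"At risk: {at_risk_low:.1f}-{at_risk_high:.1f} GW")
print(f"Delivered on time: {on_time_low:.1f}-{on_time_high:.1f} GW")
print(f"Under construction today: {share_under_construction:.0%} of pipeline")
```

Even the optimistic end of the range requires more capacity to be delivered on time than is currently under construction, which is why equipment lead times dominate the outlook.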

For competitors or new entrants, this means prioritizing locations with existing grid headroom, behind-the-meter generation, or long-lead equipment reservations now; late movers without power contracts will see projects slip into 2027–2028 regardless of GPU or capital access.

High-bandwidth memory (HBM) supply has become a tightly constrained second-tier bottleneck, with Micron confirming its entire 2026 HBM capacity sold out under multi-year fixed-price contracts, amplifying the “memory wall” where data movement between memory and processors consumes disproportionate time and energy.[5][6]

AI data centers are projected to consume up to 70% of global memory production in 2026, with HBM taking ~23% of DRAM wafer capacity (up sharply from prior years) as three vendors (Micron, Samsung, SK Hynix) reallocate cleanroom output.[7] This drives pricing power for memory suppliers and forces system-level optimizations (near-memory compute, KV-cache compression, sparsity) that deliver outsized ROI.

  • Micron’s 2026 HBM supply is fully booked with volume and pricing locked; new capacity additions (e.g., Idaho fab) do not contribute meaningfully until 2027.[6]
  • Reports from May 2026 note HBM3E as the baseline for current workloads, with thermal/warpage challenges in taller stacks (12–16 high) requiring co-designed cooling and power management.[8]
  • Broader DRAM/enterprise SSD supply is tightening into 2026–2027 due to AI inference demand, with some forecasts of 50%+ price spikes.[9]
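Wafer share and bit share are not the same thing, and the difference explains why HBM crowds out commodity DRAM so sharply. The sketch below combines two figures from this report, the ~23% wafer share and the ~3x wafers-per-bit ratio; treating them as directly combinable, and treating all non-HBM wafers as equally productive, are simplifying assumptions.

```python
def hbm_bit_share(hbm_wafer_share, wafers_per_bit_ratio):
    """Share of total DRAM bits that end up as HBM, given that HBM
    needs `wafers_per_bit_ratio` times more wafer area per bit."""
    hbm_bits = hbm_wafer_share / wafers_per_bit_ratio
    commodity_bits = 1.0 - hbm_wafer_share
    return hbm_bits / (hbm_bits + commodity_bits)


def total_bits_vs_all_commodity(hbm_wafer_share, wafers_per_bit_ratio):
    """Total bit output relative to a world where every wafer made commodity DRAM."""
    return hbm_wafer_share / wafers_per_bit_ratio + (1.0 - hbm_wafer_share)


share = hbm_bit_share(0.23, 3.0)
relative_output = total_bits_vs_all_commodity(0.23, 3.0)
print(f"HBM is {share:.1%} of bits while using 23% of wafers")
print(f"Industry bit output is {1 - relative_output:.1%} lower than all-commodity")
```

Under these assumptions HBM consumes about 23% of wafers but yields only about 9% of bits, and total bit output falls by roughly 15% compared with an all-commodity mix.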

Competitors must secure multi-year HBM allocations or differentiate via architectures that reduce memory pressure (e.g., compute-in-memory or sparsity techniques); those without locked supply face higher costs and delayed deployments.

Networking and chip-to-chip/server interconnect speed is increasingly cited as a performance limiter, prompting interest in photonics to replace copper for lower latency and energy use in high-density racks.[10]

Optical solutions are already used for longer links, but intra-rack copper remains a speed/energy tax; analysts note communication between chips and servers as a primary model-performance bottleneck.

  • CNBC (May 29, 2026) highlights photonics as an emerging route to ease data-transfer constraints alongside energy and memory issues.[10]
  • Optical transceivers consume 2–3× more power per port than copper at equivalent bandwidth, creating an energy-budget tradeoff at 100–200+ kW/rack densities.[11]

Entrants should evaluate photonics or advanced optical interconnect suppliers early, as copper-limited designs will hit scaling ceilings sooner in next-generation (Rubin-era) clusters.

Cooling and thermal management challenges are intensifying with rack power densities rising from 10–20 kW (CPU era) to 100–130 kW (Blackwell GB200) and 200+ kW projected for Rubin, directly linking to power and memory-stack thermal issues.[11]

  • High-density HBM stacks create hotspots and mechanical stress, necessitating integrated liquid cooling and system-level thermal co-design rather than bolt-on solutions.[8]
  • Power density increases redraw every layer of the stack (power delivery, interconnect, cabling).[11]

New deployments must budget for advanced cooling from day one; retrofits or air-cooled designs will be non-competitive at frontier densities.

Ranked summary of most-cited constraints (post-Dec 2025 sources): Power/grid/electrical equipment leads (Sightline Climate Feb 2026 report and multiple follow-on analyses through May 2026); memory/HBM supply is a close second (Micron earnings confirmations and industry reports); interconnect/photonics and cooling/thermal are frequently paired with power-density discussions; storage and packaging appear as secondary or enabling constraints. No major new regulatory or policy updates were prominently featured; focus remains on supply-chain and infrastructure realities. Hyperscalers (via CapEx guidance) acknowledge constraints but continue aggressive spending, underscoring that power and memory access will determine who captures value in the 2026–2027 buildout.

Report 2: Deep-dive into how high-speed interconnects (InfiniBand, Ethernet, NVLink, optical interconnects) are constraining large-scale AI cluster performance. Research bandwidth requirements for frontier model training, the role of companies like Nvidia, Arista, Broadcom, and startups like Enfabrica or Ayar Labs, and where current networking technology falls short. Summarize the gap between what's needed and what's available.

High-speed interconnects are the primary limiter on GPU utilization in frontier AI clusters, creating a 10-100x bandwidth hierarchy gap that forces GPUs to idle during collective operations like all-reduce.[1][2]

Intra-node/rack NVLink delivers TB/s-scale bandwidth (e.g., 1.8 TB/s bidirectional per GPU or 130 TB/s aggregate across a 72-GPU GB200 NVL72 rack), while inter-node fabrics top out at ~400 Gbps (50 GB/s) per link on current InfiniBand or Ethernet. This disparity turns communication—especially gradient synchronization for models with hundreds of billions of parameters—into the dominant bottleneck, with all-reduce traffic consuming up to 62% of execution time in production workloads. Clusters are scaling to 100k–300k GPUs today (with projections toward 1M), but legacy fabrics were never designed for the east-west, many-to-many traffic patterns of synchronous training.[3][4]
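The size of the hierarchy gap, and what it does to gradient synchronization, can be estimated with the standard bandwidth term for a ring all-reduce, in which each participant moves 2(N-1)/N of the payload. The bandwidth figures come from the text; the 200 GB gradient payload (100B parameters at 2 bytes each) is a hypothetical, and the sketch ignores latency, compute overlap, sharding, and hierarchical collectives.

```python
NVLINK_GB_S = 1800.0     # 1.8 TB/s per GPU (from the text)
INTER_NODE_GB_S = 50.0   # ~400 Gbps per link (from the text)
RACK_GPUS = 72


def ring_allreduce_seconds(payload_gb, n_gpus, link_gb_s):
    """Bandwidth-only cost of a ring all-reduce: each GPU sends and
    receives 2 * (N - 1) / N of the payload over its link."""
    return 2 * (n_gpus - 1) / n_gpus * payload_gb / link_gb_s


hierarchy_gap = NVLINK_GB_S / INTER_NODE_GB_S
rack_aggregate_tb_s = RACK_GPUS * NVLINK_GB_S / 1000

GRADIENT_GB = 200.0  # hypothetical: 100B parameters x 2 bytes
fast = ring_allreduce_seconds(GRADIENT_GB, RACK_GPUS, NVLINK_GB_S)
slow = ring_allreduce_seconds(GRADIENT_GB, RACK_GPUS, INTER_NODE_GB_S)

print(f"Per-link gap: {hierarchy_gap:.0f}x, rack aggregate: {rack_aggregate_tb_s:.1f} TB/s")
print(f"All-reduce over NVLink: {fast:.2f} s, over inter-node link: {slow:.2f} s")
```

The 36x per-link gap sits inside the 10–100x range quoted above, and the rack aggregate reproduces the ~130 TB/s figure. The same synchronization step that takes a fraction of a second inside the rack takes several seconds across a single inter-node link under these assumptions.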

Nvidia's NVLink (now 5th generation) creates coherent, all-to-all domains that treat dozens of GPUs as a single massive accelerator, eliminating intra-rack all-reduce bottlenecks. The GB200 NVL72 rack exemplifies this: 72 Blackwell GPUs linked at 130 TB/s aggregate via NVLink Switch System, with per-GPU NVLink bandwidth reaching 1.8 TB/s in multi-node configurations—far exceeding PCIe.[5][6]

This enables efficient training of 200B+ parameter models within a rack (or larger via extensions) while slashing latency and power overhead compared to network hops. However, NVLink remains proprietary, Nvidia-ecosystem locked, and currently capped at rack scale (tens to low hundreds of GPUs). Beyond that, clusters fall back to slower inter-node links, reintroducing the hierarchy gap.[7]

Implication for competitors: Any non-Nvidia accelerator ecosystem must either replicate NVLink-like density (difficult without similar vertical integration) or optimize software around heterogeneous fabrics. Hyperscalers are pushing for open alternatives to reduce lock-in.

InfiniBand vs. Ethernet: Performance vs. Scalability Trade-offs

InfiniBand (Nvidia/Mellanox) has dominated AI fabrics with RDMA-enabled lossless transport, ultra-low latency, and high effective bandwidth, historically holding ~90% share in high-end clusters. It sustains high GPU utilization in 10k–100k GPU setups but faces criticism for high cost (3–4x Ethernet), vendor lock-in, and complexity at extreme scale.[1][8]

Ethernet is closing the gap rapidly via RoCEv2, Nvidia Spectrum-X (purpose-built AI Ethernet with congestion management and performance isolation), and the Ultra Ethernet Consortium (UEC) Specification 1.0 (released June 2025). UEC adds AI/HPC optimizations including advanced congestion control, packet spraying, enhanced RDMA, in-network collectives, and better small-message performance—enabling Ethernet to approach or match InfiniBand in effective throughput for many workloads while offering lower cost, broader interoperability, and easier multi-vendor scaling.[9][10]

Broadcom's Tomahawk 6 (TH6) delivers up to 102.4 Tbps switching capacity with 1.6T ports, supporting dense AI fabrics. Arista and others provide high-radix Ethernet switches optimized for these patterns. Ethernet now powers major clusters (e.g., xAI Colossus references) due to simplicity and economics.[11][12]

Shortfall: Even optimized Ethernet/InfiniBand at 400–800 Gbps per link requires massive parallelism and careful topology (fat-tree, dragonfly) to avoid congestion hotspots. At 100k+ GPUs, tail latency, failure domains, and power draw (networking ~5–10% of IT load) become acute.[4]
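Switch capacity translates into cluster size through port count and topology depth. The sketch below derives the radix of a 102.4 Tbps switch at two port speeds and applies the standard capacity formula for a non-blocking fat-tree of identical switches, 2(k/2)^tiers endpoints. One network port per GPU is an assumption made for illustration; real clusters often use several.

```python
def switch_radix(capacity_tbps, port_tbps):
    """Number of ports a switch exposes at a given port speed."""
    return round(capacity_tbps / port_tbps)


def fat_tree_endpoints(radix, tiers):
    """Maximum endpoints in a non-blocking fat-tree (folded Clos)
    built from identical radix-k switches: 2 * (k / 2) ** tiers."""
    return 2 * (radix // 2) ** tiers


CAPACITY_TBPS = 102.4  # Tomahawk 6 switching capacity (from the text)
for port_tbps in (1.6, 0.8):
    k = switch_radix(CAPACITY_TBPS, port_tbps)
    two_tier = fat_tree_endpoints(k, 2)
    three_tier = fat_tree_endpoints(k, 3)
    print(f"{port_tbps} Tbps ports: radix {k}, "
          f"2-tier {two_tier:,} endpoints, 3-tier {three_tier:,} endpoints")
```

With one port per GPU, neither port speed reaches 100k endpoints in two tiers, so clusters of the size discussed above need a third switching tier, with the extra hops, optics, and failure domains that implies.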

Optical Interconnects and Startups: Addressing Power and Density

Copper/electrical links hit physical limits on reach, power per bit, and density for next-gen clusters. Optical solutions target this via silicon photonics and co-packaged optics (CPO).

Ayar Labs leads with in-package optical I/O (TeraPHY chiplets + SuperNova remote laser sources), delivering multi-Tbps per engine (e.g., up to 2+ Tbps full-duplex) at lower power/latency than electrical alternatives. It supports UCIe standards and is part of the Open Compute Interconnect (OCI) MSA (backed by AMD, Broadcom, Meta, Microsoft, Nvidia, OpenAI). This enables optical scale-up within racks or disaggregated memory/compute, reducing the intra/inter bandwidth gap and power walls.[13][14]

Enfabrica developed Accelerated Compute Fabric (ACF) SuperNICs for high-bandwidth Ethernet/RDMA + CXL memory pooling, targeting 100k+ GPU clusters with flat topologies and fault tolerance. Nvidia licensed the technology and acqui-hired key talent (CEO and team) in a ~$900M+ deal in 2025 to integrate into its ecosystem for more efficient large-scale fabrics.[15][16]

Broadcom complements with high-speed SerDes, DSPs, and optics roadmaps, while pushing Ethernet for scale-up.[12]

Shortfall: Optical is still early (deployments ramping 2026+); electrical remains dominant for cost/reliability today. Integration challenges (e.g., fiber management, laser reliability) persist.

The Gap: Required vs. Available

Needed for frontier training (trillion-parameter models on 100k–1M GPU clusters): Aggregate cluster bandwidth in the PB/s range with sub-microsecond latency at scale, <1 pJ/bit power efficiency, lossless many-to-many collectives, multi-DC spanning, and open/multi-vendor support. This would keep GPUs >90% utilized without custom per-workload tuning.

Available (mid-2026):
- Rack-scale NVLink excels locally but hits a wall beyond ~72 GPUs.
- Inter-node at 100–800 Gbps/link (InfiniBand/Ethernet/UEC) requires heavy software optimization (NCCL, topology-aware all-reduce) and still leaves 10–100x hierarchy gaps.
- Power and cost scale poorly; optical/CPO and advanced fabrics (Enfabrica-derived) are promising but not yet ubiquitous.
- UEC and OCI standards are closing interoperability gaps, but real-world perf at extreme scale remains vendor-optimized (Nvidia-heavy).[17]

Implications: The networking layer now rivals (or exceeds) raw GPU FLOPS as the competitive moat. Nvidia's vertical stack (NVLink + InfiniBand/Spectrum-X + acquisitions) maintains leadership, but open Ethernet/optical pushes from Broadcom, Arista, hyperscalers, and startups like Ayar Labs are eroding it on cost and flexibility. New entrants must solve disaggregation, power efficiency, or software-defined fabrics to compete; otherwise, they face idle GPUs and multi-year delays in effective cluster performance. Continued progress in UEC, silicon photonics, and CXL-over-fabric will be decisive for 2027+ exascale AI.


Recent Findings Supplement (June 2026)

Recent developments (post-December 2025) highlight accelerating adoption of co-packaged optics (CPO) and higher-speed Ethernet/InfiniBand fabrics to address interconnect bottlenecks in frontier AI training clusters, while exposing persistent gaps in scale-up domain size and per-GPU bandwidth relative to compute growth.[1][2]

Nvidia continues to lead with proprietary solutions (NVLink and InfiniBand), but Ethernet players (Arista, Broadcom) and optical startups (Ayar Labs, Lightmatter) are gaining traction with open or hybrid approaches for larger, more efficient clusters. Bandwidth demands for models requiring synchronization across tens of thousands of GPUs—hundreds of terabits of collective bisection bandwidth and sub-microsecond latency—continue to outpace electrical interconnect reach, power efficiency, and density beyond rack scale.[3]

Optical Interconnects Advancing for Multi-Rack Scale-Up

Lightmatter’s March 2026 milestone with its Passage CPO chiplet delivers a record 1.6 Tbps per fiber using 16-wavelength DWDM at 112G per SerDes lane, up to 8x more bandwidth per fiber than prior NPO/CPO solutions. This silicon-proven technology targets hyperscaler deployment to ease fiber cabling, space, and power constraints in growing AI clusters, with a path to 100+ Tbps per package.[1]

Ayar Labs raised $500M in Series E funding (March 2026, ~$3.8B valuation) to scale production of its TeraPHY optical engines and SuperNova light sources. It joined NVIDIA’s NVLink Fusion ecosystem to enable CPO-based rack-scale optical fabrics, connecting thousands of GPUs across racks with higher bandwidth density, lower power (4-20x better throughput/watt vs. copper in some claims), and reduced latency. Partnerships (e.g., with Alchip) focus on UCIe-compatible chiplets for AI accelerators.[4][5]

Implications for competitors: Optical CPO is shifting from niche to critical for scale-up beyond ~72 GPUs (current NVLink electrical limits). Entrants must prioritize power efficiency and integration with existing ecosystems (e.g., UCIe, NVLink) or risk being sidelined in hyperscale bids.

Ethernet Gaining Ground in Scale-Out and Emerging Scale-Up

Broadcom’s Tomahawk 6 delivers 102.4 Tbps switching capacity (world’s first at this level in a single chip); Jericho4 fabric routers support multi-data-center scale with congestion-free RoCE and 3.2 Tbps HyperPort. Scale-Up Ethernet (SUE) is positioned as an open alternative for intra-rack or pod-level connectivity, with deployments ramping for 2026 rack-level products.[6][7]
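As a rough illustration of what 102.4 Tbps of switching capacity means for cluster size, the sketch below derives port counts at a few port speeds and the endpoint count of a non-blocking two-tier leaf-spine fabric built from such switches. Only the 102.4 Tbps figure comes from the text; the port speeds, the two-tier topology, and the function names are assumptions chosen for illustration, not a description of any shipping product configuration.

```python
# Sketch: switch radix and non-blocking two-tier (leaf-spine) scale.
# 102.4 Tbps is from the text; port speeds and topology are assumptions.

def port_count(capacity_gbps: int, port_speed_gbps: int) -> int:
    """Number of full-rate ports a switch of this capacity can expose."""
    return capacity_gbps // port_speed_gbps

def two_tier_endpoints(radix: int) -> int:
    """Non-blocking leaf-spine: each leaf splits its ports evenly between
    hosts and uplinks, and each spine port connects to one leaf, so there
    are at most `radix` leaves with `radix // 2` hosts each."""
    return radix * (radix // 2)

CAPACITY_GBPS = 102_400

for speed in (1600, 800, 400):
    radix = port_count(CAPACITY_GBPS, speed)
    print(f"{speed}G ports: radix {radix}, "
          f"two-tier endpoints {two_tier_endpoints(radix)}")
```

Under these assumptions the chip exposes 64, 128, or 256 ports at 1.6T, 800G, or 400G, giving 2,048, 8,192, or 32,768 endpoints in two tiers. Higher radix at lower port speed is the usual trade: more endpoints per tier, less bandwidth per endpoint.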

Arista reported strong Q4 FY2025 results (announced Feb 2026) driven by AI Ethernet momentum across back-end, front-end, and scale-across fabrics, with Etherlink switches emphasizing RDMA-aware features, load balancing, and Ultra Ethernet Consortium (UEC) compatibility.[8]

DriveNets (Jan 2026) reported its fabric-scheduled Ethernet delivering up to 18% better NCCL performance than InfiniBand in a live 512-GPU production cluster, highlighting Ethernet’s improving viability for large-scale training.[9]

Market analyses (May 2026) note Broadcom’s third-generation CPO supporting 200 Gbit/s per lane (unveiled May 2025) and project silicon photonics optical interconnects growing rapidly, with AI data center fabrics as the fastest segment.[10]

Implications: Ethernet is closing the performance gap with InfiniBand (via RoCE enhancements and scheduling) while offering cost, openness, and ecosystem advantages. Competitors should target hybrid fabrics or UEC-compliant solutions for broader adoption in non-Nvidia ecosystems (e.g., AMD XPUs).

Nvidia’s Continued Dominance in Proprietary High-Bandwidth Fabrics

Nvidia’s Quantum-X800 InfiniBand and Spectrum-X Ethernet platforms (800G end-to-end) are shipping in volume, supporting trillion-parameter models with low latency and high consistency. NVLink 5 (Blackwell-era, ~224 Gbit/s per lane) enables NVL72 systems (72 fully connected GPUs, ~14.4 Tbps, i.e. 1.8 TB/s, of bidirectional NVLink bandwidth per GPU) shipping since 2025, with roadmaps toward larger domains (e.g., NVL288 or beyond) via Vera Rubin.[11][12]

Implications: Nvidia’s vertical integration (GPUs + NVLink + InfiniBand/Spectrum-X) maintains a performance edge for tightly coupled training, but openness pressures from Ethernet/optical alternatives create opportunities for multi-vendor clusters.

Persistent Gaps: Bandwidth Scaling Lags Compute; Scale-Up Domains Remain Small

A March 2026 analysis shows scale-out network bandwidth per GPU/XPU rising only 4-5x since 2022, while compute per GPU surged 10x+, making networks the “hidden limiter.” Scale-up domains advanced slowly to 72 GPUs (GB200 NVL72, shipping 2025) but frontier MoE models need hundreds to thousands of GPUs per pod for larger expert counts and active experts per token. Electrical solutions hit fundamental limits in reach, power, and radix beyond rack scale.[2]
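The widening gap can be expressed as a single ratio: network bandwidth available per unit of compute, relative to 2022. The sketch below uses the 4-5x and 10x growth figures quoted above; because the text gives compute growth as "10x+", the result should be read as an upper bound. The function name is illustrative.

```python
# Sketch: scale-out bandwidth per unit of compute, relative to a 2022 baseline.
# Growth multiples are from the text; compute growth of 10x is a lower bound there.

def bandwidth_per_compute(bandwidth_growth: float, compute_growth: float) -> float:
    """Ratio of bandwidth growth to compute growth (1.0 means balanced scaling)."""
    return bandwidth_growth / compute_growth

COMPUTE_GROWTH = 10.0

low = bandwidth_per_compute(4.0, COMPUTE_GROWTH)
high = bandwidth_per_compute(5.0, COMPUTE_GROWTH)

print(f"Bandwidth per unit of compute vs. 2022: {low:.0%} to {high:.0%}")
```

On these numbers each unit of compute has at most 40-50% of the scale-out bandwidth it had in 2022, which is the quantitative content of the "hidden limiter" claim.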

Frontier training requires ~32 Tbps bisection for 16k-GPU clusters (example), with each new model generation demanding ~10x more interconnect bandwidth. Optical and higher-radix solutions are essential; pluggable optics and copper fall short for multi-rack scale-up reliability and density.[3][13]

Implications: The gap favors companies delivering optical CPO or advanced Ethernet for larger pods and better bandwidth-per-watt. Pure electrical or legacy pluggable approaches risk being constrained to smaller clusters; new entrants should focus on co-packaged or in-package optics integrated with accelerators.

Market and Technology Trajectory Through Mid-2026

Silicon photonics optical interconnects are projected for strong growth (e.g., AI fabric segment at 38.7% CAGR in one 2026 report), driven by the shift to CPO for chip-to-chip and board-level AI connectivity. 800G dominates 2025-2026 shipments, with 1.6T on the horizon as a requirement for frontier models.[3][10]

Overall gap summary. Needed: multi-rack scale-up with tens of Tbps per GPU/pod, petabit-scale fabrics, pJ/bit efficiency, and sub-μs latency at 100k+ GPU scales. Available: mature 800G fabrics and early CPO sampling (1.6 Tbps/fiber demos), but scale-up domains and per-GPU bandwidth still lag compute by a widening margin, with full commercial multi-rack optical deployments ramping rather than widespread.

These developments signal a rapid pivot toward optical and Ethernet-hybrid solutions in the first half of 2026, with concrete product milestones and funding validating the shift.

Report 3 Investigate the role of HBM (High Bandwidth Memory), DRAM, and on-chip SRAM as limiting factors in AI accelerator performance. Research HBM supply concentration (SK Hynix, Samsung, Micron), publicly estimated production capacity versus AI demand, the memory wall problem, and emerging alternatives like CXL memory pooling or processing-in-memory. Conclude with which memory bottlenecks are most acute.

HBM has become the primary performance limiter for AI accelerators because its high bandwidth (often 1+ TB/s per stack) is essential for feeding massive parallel compute units in GPUs/TPUs during training and inference, yet production is concentrated in three suppliers whose 2026 output is fully sold out.[1]

SK Hynix leads with 57-62% HBM market share (Q2/Q3 2025 data), driven by early exclusive NVIDIA HBM3E deals for H100/H200 and Blackwell platforms; Micron holds ~21% and Samsung ~17%. All three report 2026 capacity sold out via long-term hyperscaler contracts, with HBM demand growing 70% YoY in 2026 after 130% in 2025. This creates a structural shortfall (one estimate: ~8% deficit) as HBM now consumes 23-25% of total DRAM wafer capacity—up from single digits recently—while AI data centers absorb up to 70% of global memory production.[2]

  • HBM stacks multiple DRAM dies vertically with through-silicon vias for extreme bandwidth, but each bit displaces several bits of standard DRAM output due to lower density and complex manufacturing.
  • Expansions require 12-18 months lead time; new capacity (e.g., SK Hynix Cheongju fab, Micron Singapore/Taiwan facilities) arrives late 2026 or 2027+.
  • Resulting market: ~$35B in 2025, ~$58B in 2026, heading toward $100B by 2028.[3]
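The growth figures above compound quickly, and a short calculation makes the scale explicit. The sketch below indexes HBM demand to 2024, applies the quoted 130% and 70% year-over-year growth rates, and derives the growth rates implied by the market-size estimates. All inputs are the figures cited above; the helper names are illustrative.

```python
# Sketch: compounding the HBM demand and market figures quoted in the text.

def compound(base: float, yoy_growth_rates: list[float]) -> float:
    """Apply successive year-over-year growth rates to a base level."""
    level = base
    for rate in yoy_growth_rates:
        level *= 1 + rate
    return level

def implied_cagr(start: float, end: float, years: int) -> float:
    """Constant annual growth rate that takes `start` to `end` in `years`."""
    return (end / start) ** (1 / years) - 1

demand_index_2026 = compound(1.0, [1.30, 0.70])    # 2024 = 1.0
market_growth_2026 = 58 / 35 - 1                   # $35B -> $58B
market_cagr_2026_2028 = implied_cagr(58, 100, 2)   # $58B -> $100B

print(f"2026 demand vs. 2024: {demand_index_2026:.2f}x")
print(f"2026 market growth: {market_growth_2026:.1%}")
print(f"Implied 2026-2028 market CAGR: {market_cagr_2026_2028:.1%}")
```

Demand in 2026 comes out at roughly 3.9x the 2024 level, while the market estimates imply about 66% growth in 2026 and about 31% per year from 2026 to 2028.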

For competitors or new entrants, securing HBM allocation is now a prerequisite for scaling AI hardware; without it, even advanced silicon (e.g., custom ASICs) cannot reach full utilization, favoring established players with supply deals like NVIDIA.

The memory wall—where processor speed vastly outpaces memory bandwidth and capacity—remains the fundamental architectural constraint, with on-chip SRAM providing the fastest but scarcest tier, HBM the high-bandwidth bridge, and off-chip DRAM the capacity fallback.[4]

Modern accelerators idle waiting for data: weights, activations, and especially KV caches in LLM inference/decode must shuttle through the hierarchy. SRAM (e.g., 10-40 MB shared per GPU block or 384 MB on-chip in Google TPU v8i, a 3x increase) offers ~10x HBM bandwidth but limited capacity and poor scaling (more die area/power per bit in advanced nodes). HBM mitigates bandwidth starvation (NVIDIA stacks deliver TB/s-scale) but models grow to consume available capacity, and latency gaps persist. Standard DRAM supplements system memory but faces capacity cannibalization.[5]

  • SRAM-centric designs (Groq LPUs, Cerebras, d-Matrix) maximize near-compute memory for inference latency wins but trade off model size flexibility.
  • HBM solves part of the wall for training-scale workloads but hits physical limits (shoreline area for connections, power).
  • Implication: even with more HBM, efficiency techniques (quantization, checkpointing) remain necessary, because models grow to fill whatever capacity is available.

This means pure compute scaling (more FLOPS) yields diminishing returns without memory hierarchy advances; entrants must co-design silicon with memory or accept lower utilization.

CXL-enabled memory pooling and processing-in-memory (PIM) represent the most promising near-term relief valves by disaggregating capacity and moving compute closer to data, though both remain early-stage relative to the 2026 HBM crunch.[6]

CXL (especially 2.0/3.0 with switching and fabric support) allows multiple hosts (CPUs, GPUs, XPUs) to share pooled DRAM, boosting utilization from a typical 40-60% to 80%+ and enabling terabytes of expandable memory (e.g., Marvell Structera switches targeting 48 TB shared). PIM (Samsung HBM-PIM, SK Hynix AiM/Accelerator-in-Memory) embeds logic in or near DRAM/HBM banks for in-situ matrix ops, claiming 2x+ performance and 70%+ energy reduction by slashing data movement. Both are already seeing real-world deployment and commercial pushes.[7]

  • CXL market: ~$2.8B in 2025, projected strong growth (CAGR ~25-29% to 2034); hyperscalers integrating into new servers.
  • PIM: Samsung and SK Hynix actively commercializing HBM variants; prototypes show bandwidth-proportional efficiency gains in search/vector workloads.
  • Limitations: CXL adds latency vs. local HBM; PIM requires software/ISA changes and is not yet ubiquitous in flagship accelerators.
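The utilization claim for pooling translates directly into how much memory must be provisioned for a fixed working set. The sketch below is a simplified model that treats utilization as the only variable and ignores the added latency noted above; the 40-60% and 80% figures are from the text, and the function names are illustrative.

```python
# Sketch: provisioned memory needed for a fixed working set at a given
# utilization, and the saving from raising utilization through pooling.
# Simplified model: ignores latency, placement, and failure-domain effects.

def provisioned_tb(working_set_tb: float, utilization: float) -> float:
    """Memory that must be installed so that `working_set_tb` is actually used."""
    return working_set_tb / utilization

def provisioning_saving(util_before: float, util_after: float) -> float:
    """Fractional reduction in installed memory for the same working set."""
    return 1 - util_before / util_after

for before in (0.40, 0.50, 0.60):
    saving = provisioning_saving(before, 0.80)
    print(f"{before:.0%} -> 80% utilization: "
          f"{provisioned_tb(1.0, before):.2f} TB -> {provisioned_tb(1.0, 0.80):.2f} TB "
          f"per TB used ({saving:.1%} less installed memory)")
```

Under this model, moving from 40-60% to 80% utilization cuts installed memory for the same working set by 25-50%. That is a capacity effect only and does not relieve the bandwidth constraints HBM addresses.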

Competitors can differentiate by adopting CXL pooling for cost-efficient scaling or PIM for inference efficiency, but these will not fully displace HBM shortages before late 2026 or 2027; early movers in hybrid HBM+CXL or PIM-augmented designs gain advantage.

HBM supply shortage is the most acute near-term bottleneck (sold-out 2026 capacity directly caps accelerator deployments), followed by the persistent memory wall in inference workloads; on-chip SRAM scaling limits and standard DRAM reallocation are secondary but compounding constraints, while CXL/PIM offer longer-term architectural escape hatches.[1]

The oligopoly and physics of 3D stacking make HBM the immediate chokepoint, but the memory wall ensures that even abundant HBM will eventually require PIM-style or pooled innovations. Hyperscalers and chip designers prioritizing multi-year HBM contracts today, while piloting CXL/PIM, will best navigate the constraints.


Recent Findings Supplement (June 2026)

HBM supply remains the dominant acute bottleneck for AI accelerators in 2026, with the entire production capacity of the three major suppliers (SK Hynix, Samsung, Micron) sold out for the year amid 70% projected YoY demand growth.[1]

This concentration in a three-supplier oligopoly (controlling ~90-95% of advanced DRAM/HBM) has turned HBM into a structural constraint rather than a cyclical one, directly limiting deployment of next-gen accelerators like NVIDIA’s Rubin platform.

  • SK Hynix holds the largest share (~50-62% as of recent 2025-2026 data), followed by Samsung (~17-35%) and Micron (~5-21%, with reports of overtaking Samsung in some segments); all have allocated most or all 2026 HBM3E/HBM4 output to key customers like NVIDIA.[1]
  • Micron and SK Hynix explicitly confirmed sold-out 2026 HBM capacity, with multi-year agreements extending visibility; Samsung issued similar warnings of shortages persisting through at least 2027.[2]
  • HBM is expected to consume ~25% of total DRAM wafer production by 2026 as suppliers reallocate capacity from conventional DRAM.[3]

This sold-out status and oligopoly create multi-year lead times and elevated pricing, forcing hyperscalers to lock in supply years ahead while slowing overall AI infrastructure scaling.[4]

Suppliers are aggressively expanding capacity, but 12-18+ month lead times mean relief is limited until late 2026 or 2027, even as demand surges.[1]

Recent announcements highlight the scale of the response:

  • Micron raised 2026 capex to $20 billion (focused on Idaho mega-fabs for HBM and DRAM).[5]
  • SK Hynix announced a $15 billion U.S. advanced HBM packaging plant (Feb 2026) and, in June 2026, stated plans to double overall wafer capacity over the next five years.[6]
  • HBM demand grew 130% YoY in 2025 and is projected at +70% YoY in 2026; HBM revenue run rates for leaders are already in the billions annually.[1]

These investments signal recognition of structural AI-driven demand but underscore that near-term supply will remain tight, favoring early qualifiers (especially SK Hynix with NVIDIA) and disadvantaging new entrants or smaller players.[7]

The memory wall persists as a fundamental limiter, with HBM mitigating but not eliminating bandwidth and capacity constraints as models and parallelism scale.[8]

HBM4 (mass production starting 2026 at Samsung/SK Hynix) doubles interface width to 2048-bit and targets >2 TB/s per stack bandwidth with ~40% better energy efficiency, enabling platforms like NVIDIA Rubin (8 stacks, 288 GB, 22 TB/s total).[9] However, the wall shifts to inter-chip interconnects and larger models rather than disappearing.

  • AI accelerators spend increasing time on data movement; on-chip SRAM offers ultra-low latency but is die-area constrained (zero-sum with compute).[10]
  • Microsoft’s Maia 200 (announced Jan 2026) exemplifies hybrid approaches: 216 GB HBM3e at 7 TB/s plus 272 MB on-chip SRAM with specialized DMA/NoC for inference.[11]
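The HBM4 figures above can be related through the standard interface arithmetic: per-stack bandwidth is interface width times per-pin data rate. In the sketch below the 2048-bit width and the Rubin totals (8 stacks, 22 TB/s) come from the text; the 8 Gbps per-pin rate is an assumed baseline that the text does not state, and the function names are illustrative.

```python
# Sketch: HBM per-stack bandwidth from interface width and per-pin rate.
# 2048-bit width and Rubin totals are from the text; 8 Gbps/pin is an
# assumed baseline rate, not a figure given in the source.

def stack_bandwidth_tbps(width_bits: int, pin_rate_gbps: float) -> float:
    """Per-stack bandwidth in TB/s: width * pin rate, bits to bytes, GB to TB."""
    return width_bits * pin_rate_gbps / 8 / 1000

def implied_pin_rate_gbps(total_tbps: float, stacks: int, width_bits: int) -> float:
    """Per-pin rate (Gbps) needed to reach a platform's total bandwidth."""
    per_stack_tbps = total_tbps / stacks
    return per_stack_tbps * 1000 * 8 / width_bits

baseline = stack_bandwidth_tbps(2048, 8.0)
rubin_pin_rate = implied_pin_rate_gbps(22.0, 8, 2048)

print(f"2048-bit at 8 Gbps/pin: {baseline:.3f} TB/s per stack")
print(f"Pin rate implied by 22 TB/s over 8 stacks: {rubin_pin_rate:.2f} Gbps")
```

At the assumed 8 Gbps per pin a 2048-bit stack delivers about 2.05 TB/s, consistent with the ">2 TB/s" target. The quoted Rubin total implies about 2.75 TB/s per stack, or roughly 10.7 Gbps per pin, so the platform figure depends on running pins well above that baseline.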

HBM remains the primary near-term solution, but its supply constraints amplify the wall’s impact on overall system performance.[12]

CXL memory pooling and processing-in-memory (PIM) are advancing as disaggregated alternatives to ease capacity pressure, though they are not yet mature replacements for HBM in high-performance AI training.[13]

Recent developments include:

  • Marvell’s Structera S (Mar 2026) demonstrates CXL switching as a way around the memory wall, with reported 4.8x inference throughput gains and an 82.7% reduction in time-to-first-token via pooling.[14]
  • Samsung integrates PIM with CXL modules for near-memory neural operations; Intel advances coherent CXL 2.0/3.0 pooling across CPU/GPU/accelerators with heterogeneous memory support.[13]
  • Academic/industry work (e.g., DATE 2026 paper on Sage architecture) shows CXL pooled systems with predictive caching and selective NDP achieving 2.84x throughput on embedding-heavy workloads versus conventional management.[15]

These technologies improve utilization (potentially from 40-60% to >80%) and reduce stranded memory but add latency (tens of ns) and are most relevant for inference, recommendation models, or capacity expansion rather than peak training bandwidth.[16]

On-chip SRAM serves as a high-bandwidth complement or alternative in inference-focused designs but faces hard physical limits that make it unsuitable as a broad HBM replacement.[10]

SRAM-centric accelerators (e.g., Groq, Cerebras, d-Matrix) prioritize on-die memory for low-latency access, trading off compute area; bandwidth scales favorably with die size but remains capped.[17] Compute-in-memory (CIM) variants using SRAM arrays perform MAC operations in-place, reducing off-chip movement, but density and area trade-offs persist.[18]

SRAM excels in specific niches (e.g., decode-phase inference) but cannot scale to the capacities required for frontier training models.[10]

The most acute bottlenecks are HBM supply concentration and availability (sold out through 2026+), followed by the memory wall’s data-movement overhead; CXL/PIM and SRAM offer partial mitigations but do not yet alleviate the primary constraints.[5]

DRAM capacity is indirectly squeezed as wafers shift to HBM, while on-chip SRAM limits are architecture-specific rather than systemic. New research and announcements reinforce HBM as the near-term gating factor for AI accelerator performance and deployment scale.

Report 4 Research the degree to which power availability, grid constraints, water cooling capacity, and data center construction timelines are acting as AI buildout constraints in 2025-2026. Pull from utility filings, hyperscaler announcements, and energy analyst reports. Assess whether physical infrastructure may rival or exceed networking and memory as a binding constraint.

Power and grid constraints have become the dominant physical bottleneck for AI data center expansion in 2025-2026, often binding more tightly than chip or networking hardware availability. Hyperscalers can procure GPUs and high-bandwidth memory (HBM) with lead times measured in quarters, but energizing facilities requires grid interconnections that routinely take 5–7 years due to queues, transmission upgrades, and permitting. The result is a mismatch in which announced capacity far exceeds operational megawatts.[1][2]

  • Deloitte’s April 2025 survey of 120 US power and data center executives found grid stress as the leading challenge, with 72% rating power/grid capacity as “very or extremely challenging.”[2]
  • As of early 2026, ~190 GW of hyperscale capacity had been announced across 777 projects (~148 GW planned), yet only ~12 GW was operational and ~21 GW under construction.[1]
  • US data center power demand is projected to rise from ~80 GW in 2025 to 150 GW by 2028 (Bloom Energy report).[3]
  • In Virginia alone, Dominion Energy reported ~70,000 MW of data center requests (roughly triple its system peak load), with 25,000 MW having projected connection dates through 2031.[4]
  • Sightline Climate analysis (widely cited in 2026 reporting) projected 30–50% of planned 2026 US data center capacity delayed or canceled due to power, permitting, and related constraints.[5]
  • Interconnection queues in key regions (PJM, ERCOT) stretch years; high-power transformer lead times have extended to as long as five years (from 24–30 months pre-2020).[5]
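The mismatch between announced and energized capacity is easiest to see as shares of the announced pipeline. The sketch below uses only the early-2026 figures quoted above (190 GW announced, 12 GW operational, 21 GW under construction); the variable and function names are illustrative.

```python
# Sketch: how much of the announced hyperscale pipeline is actually
# operational or under construction, using the figures in the text (GW).

ANNOUNCED_GW = 190
OPERATIONAL_GW = 12
UNDER_CONSTRUCTION_GW = 21

def share(part: float, whole: float) -> float:
    """Fraction of the whole represented by the part."""
    return part / whole

operational_share = share(OPERATIONAL_GW, ANNOUNCED_GW)
construction_share = share(UNDER_CONSTRUCTION_GW, ANNOUNCED_GW)
not_started_gw = ANNOUNCED_GW - OPERATIONAL_GW - UNDER_CONSTRUCTION_GW

print(f"Operational: {operational_share:.1%}")
print(f"Under construction: {construction_share:.1%}")
print(f"Neither: {not_started_gw} GW ({share(not_started_gw, ANNOUNCED_GW):.1%})")
```

On these figures about 6% of announced capacity is operational and about 11% is under construction, leaving roughly 157 GW (about 83%) that has not yet broken ground.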

This forces hyperscalers and developers to prioritize “speed to power” in site selection, pursue behind-the-meter generation (e.g., natural gas peakers or small modular reactors), or shift to less-constrained but higher-latency or costlier locations. For new entrants or competitors, power access now functions as a de facto moat or barrier to entry in prime markets.

Water cooling capacity is a growing but generally secondary constraint, most acute in water-stressed regions and driving rapid adoption of liquid cooling systems. AI rack densities (projected to reach 50x traditional levels) overwhelm air cooling, increasing reliance on evaporative or hybrid systems that consume significant water—though closed-loop and direct-to-chip liquid cooling mitigate usage while introducing new infrastructure requirements.[1]

  • A 100 MW AI data center can use ~2 million liters of water daily (equivalent to ~6,500 households); projections show cooling water demand potentially rising sharply.[6]
  • Goldman Sachs forecasts the liquid-cooled share of AI servers rising from 15% in 2024 to 54% in 2025 and 76% in 2026.[7]
  • Liquid cooling can reduce site energy use by 25–30% and improve PUE (near 1.1 in best cases), with some closed-loop designs cutting freshwater consumption materially.[8]
  • Concerns are heightened in the US West/Southwest and areas with municipal competition; however, power and grid issues are cited far more frequently as the primary limiter in utility filings and analyst reports.[9]
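The water figures above can be normalized to make sites of different sizes comparable. The sketch below derives a per-MW intensity, the per-household consumption implied by the quoted equivalence, and an annual total. The inputs are the 2 million liters per day, 100 MW, and 6,500 households cited above; linear scaling with facility size is a simplifying assumption.

```python
# Sketch: normalizing the quoted water figures. Assumes water use scales
# linearly with facility power, which is a simplification.

DAILY_LITERS = 2_000_000
FACILITY_MW = 100
EQUIVALENT_HOUSEHOLDS = 6_500

liters_per_mw_day = DAILY_LITERS / FACILITY_MW
liters_per_household_day = DAILY_LITERS / EQUIVALENT_HOUSEHOLDS
annual_million_liters = DAILY_LITERS * 365 / 1e6

print(f"Intensity: {liters_per_mw_day:,.0f} L per MW per day")
print(f"Implied household use: {liters_per_household_day:.0f} L per day")
print(f"Annual total: {annual_million_liters:.0f} million L")
```

This gives about 20,000 liters per MW per day, an implied household consumption of roughly 308 liters per day, and about 730 million liters per year for a 100 MW site.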

Implication: Cooling adds cost and complexity (especially retrofits), but does not appear to rival power as a nationwide deployment gate. Operators investing early in liquid-ready designs or alternative cooling gain an edge in high-density AI clusters.

Data center construction timelines are extended well beyond the 12–18 month build window by power infrastructure, equipment lead times, and permitting, amplifying the physical constraint. Shell construction is relatively fast, but full energization and fit-out for AI workloads face multi-year delays from transformers, substations, and local approvals.[1]

  • Over 25% of projects slated for 2025 online dates were delayed; 30–50% of 2026 pipeline capacity faces similar slippage.[1][10]
  • Average global construction costs rose to ~$10.7M per MW by 2025 and are forecast at $11.3M per MW in 2026 (6% increase).[11]
  • Between March 2024 and 2025, at least 16 developments were delayed or denied, often due to community opposition or permitting.[1]
  • In PJM territory, projects entering service around 2025 averaged more than seven years from initiation to operation (including ~3+ years in queue and ~4 years post-approval).[12]
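The per-MW cost figures above scale into facility-level capital requirements with simple multiplication. The sketch below checks the quoted year-over-year increase and prices two facility sizes at the 2026 rate. The $10.7M and $11.3M per MW figures are from the text; the 100 MW and 1 GW facility sizes are illustrative assumptions, and real costs vary widely by region and design.

```python
# Sketch: construction cost growth and facility-level cost at the quoted
# per-MW rates. Facility sizes are hypothetical examples.

COST_2025_MUSD_PER_MW = 10.7
COST_2026_MUSD_PER_MW = 11.3

def facility_cost_busd(capacity_mw: float, cost_musd_per_mw: float) -> float:
    """Total construction cost in billions of USD."""
    return capacity_mw * cost_musd_per_mw / 1000

yoy_increase = COST_2026_MUSD_PER_MW / COST_2025_MUSD_PER_MW - 1

print(f"2025 to 2026 increase: {yoy_increase:.1%}")
for mw in (100, 1000):
    print(f"{mw} MW at 2026 rates: ${facility_cost_busd(mw, COST_2026_MUSD_PER_MW):.2f}B")
```

The increase works out to about 5.6%, consistent with the rounded 6% quoted, and a 1 GW campus at 2026 rates costs on the order of $11.3B before power infrastructure delays are considered.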

For competitors: Speed advantages now hinge on pre-positioned power contracts, modular/on-site generation, or greenfield sites with faster utility processes rather than pure construction efficiency.

Physical infrastructure constraints (power/grid primary, water and timelines secondary) rival or exceed networking/memory as binding limits on the pace of energized AI capacity in 2025-2026, though the picture is nuanced by timeframe and specific bottleneck. Hardware supply (HBM, CoWoS advanced packaging, networking interconnects) constrains what can be manufactured and shipped, but without grid power, racks remain unenergized. Some 2026 analyses note a shift toward chips as the tighter near-term limit after earlier power dominance, yet deployment delays remain heavily infrastructure-driven.[13][14]

  • Multiple sources (Deloitte, Sightline, utility filings) consistently rank power/grid as the top or among the top challenges for actual buildout.[2][15]
  • Chip/memory constraints (e.g., HBM allocation prioritizing AI) affect production velocity, but power queues directly gate when capacity becomes usable.[16]
  • Hyperscaler capex remains high (~$650B+ projected for 2026 across major players) despite delays, indicating capital is not the limiter—conversion to operational MW is.[17]

Hyperscalers and utilities are adapting through rate reforms, on-site generation, and new market strategies, signaling that physical constraints will shape AI scaling trajectories through at least 2027–2028. Examples include Dominion’s GS-5 rate class (data centers commit to long-term payments for requested power) and broader shifts toward behind-the-meter solutions.[18][19]

Implications for entrants or competitors: Success requires integrated strategies combining power procurement (PPAs, co-generation, nuclear/SMR partnerships), liquid-cooling-ready designs, and geographic diversification beyond saturated markets like Northern Virginia. Those who treat infrastructure as a core product moat—rather than a commodity—will capture disproportionate share of deployable AI capacity. Additional research into specific 2026 utility integrated resource plans or hyperscaler earnings commentary would further refine regional variances.


Recent Findings Supplement (June 2026)

Power availability and grid constraints have emerged as the dominant physical bottlenecks for AI data center buildout in 2025–2026, often rivaling or exceeding networking and memory hardware constraints. Hyperscalers and developers face multi-year delays in securing reliable power, driving shifts to on-site generation, co-location models, and alternative siting—even as physical construction timelines remain comparatively short (12–36 months). Water cooling capacity adds localized pressure, prompting regulatory pushes for closed-loop systems and liquid cooling transitions. These infrastructure realities are reshaping project viability, with 30–50% of planned 2026 U.S. capacity at risk of delay or cancellation.[1][2]

1. Power Demand Surge Creating Acute Grid Strain

U.S. data center power demand is projected to more than double from 31 GW in 2025 to 66 GW by 2027 (Goldman Sachs, May 2026), driven by AI workloads that concentrate massive, constant loads. This pushes data centers' share of national peak summer demand from 4.1% to 8.5%, tightening markets and forcing utilities and developers into reactive planning.[3]

  • IEA analysis indicates ~20% of planned data center projects risk delays without grid fixes; transmission build times of 4–8 years and component lead times (e.g., transformer lead times have doubled in recent years) compound the issue.[4]
  • ERCOT is tracking ~410 GW of large loads seeking interconnection as of March 2026 (~87% data centers), with 198 GW applied in Q1 2026 alone—roughly matching current peak load.[5]
  • PJM's 2026 long-term load forecast shows accelerated growth from AI data centers, contributing to capacity market price spikes (e.g., 2026–2027 delivery year clearing at $329/MW-day vs. prior lows).[6]
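The Goldman Sachs figures in this section imply both a growth rate and a national peak against which the data center share is measured. The sketch below backs out both from the quoted numbers (31 GW at 4.1% in 2025, 66 GW at 8.5% in 2027). The inputs are from the text; the derived values are implications of those inputs, not figures stated by the source.

```python
# Sketch: growth rate and national peak implied by the quoted demand and
# share figures. Derived values are arithmetic implications only.

def implied_cagr(start: float, end: float, years: int) -> float:
    """Constant annual growth rate that takes `start` to `end` in `years`."""
    return (end / start) ** (1 / years) - 1

def implied_peak_gw(data_center_gw: float, share_of_peak: float) -> float:
    """National peak demand implied by a data center load and its share."""
    return data_center_gw / share_of_peak

demand_cagr = implied_cagr(31, 66, 2)
peak_2025 = implied_peak_gw(31, 0.041)
peak_2027 = implied_peak_gw(66, 0.085)

print(f"Implied data center demand CAGR: {demand_cagr:.1%}")
print(f"Implied national peak: {peak_2025:.0f} GW (2025), {peak_2027:.0f} GW (2027)")
```

The figures imply data center demand growing at roughly 46% per year while the national peak rises only from about 756 GW to about 776 GW, so nearly all of the share increase comes from data center load rather than a shrinking denominator.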

Implications: New entrants or competitors must prioritize "speed to power" locations or hybrid models over traditional grid-dependent sites; failure to secure firm power early can strand projects despite available land or capital.

2. Interconnection Delays and Project Slippage

Grid connection queues now routinely exceed physical build times, with average waits of 5–7 years (or more) versus 12–18 months for data center construction itself. Over 25% of 110 projects slated to come online in 2025 were delayed due to power, permitting, and related issues; similar patterns persist into 2026.[2]

  • PJM data shows projects averaging >7 years total to operation, with more time post-approval than in queue; as of early 2026, >21 GW in engineering/procurement and 8.2 GW under construction.[7]
  • Sightline Climate tracking (early 2026) of 777 hyperscale projects (>50 MW, announced since 2024) projects 30–50% of 2026 pipeline capacity delayed or canceled.[1]
  • FERC-directed reforms (e.g., PJM compliance filings in Feb 2026 for co-located loads) and DOE emergency curtailment orders (May 2026) highlight ongoing strain, including rules for backup generation and expedited tracks.[8][9]

Implications: Developers able to navigate or bypass queues (via policy advocacy, alternative generation, or less-constrained regions like parts of Texas) gain decisive timing advantages; pure reliance on traditional interconnection is increasingly uncompetitive.

3. Shift to On-Site, Co-Located, and Alternative Power

Grid constraints are accelerating behind-the-meter generation, PPAs, and co-location, with hyperscalers accepting added complexity for timeline certainty. On-site power expectations among hyperscalers/colos rose 22% in the six months prior to Bloom Energy’s January 2026 report.[10]

  • Hyperscalers (Microsoft, Google, Amazon, others) are pursuing nuclear offtakes and large PPAs; Alphabet announced energy innovation acquisitions to support on-site management.[11][2]
  • JLL (Jan 2026) notes average grid waits >4 years in primary markets, spurring colocated battery storage and natural gas as bridges (despite sustainability concerns for some tenants).[12]
  • PJM proposals (2026) include backstop generation procurement, connect-and-manage frameworks with earlier curtailment options, and expedited tracks for state-sponsored generation.[13]

Implications: Competitors with expertise in on-site generation, financing hybrids, or regulatory navigation can accelerate deployments where grid-dependent players stall; this favors well-capitalized or vertically integrated players.

4. Water Cooling Capacity Adding Localized Pressure

Direct water use for evaporative cooling, combined with indirect use via power generation, is straining resources in high-growth areas, prompting efficiency shifts and new regulations. A 100 MW AI data center can use ~2 million liters daily (on-site portion ~725,000 liters).[14]

  • March 2026 state legislation in South Carolina (HB 4583) and Kansas (SB 400) mandates closed-loop systems with zero net water withdrawal/discharge for data centers.[15]
  • UT Austin white paper (May 2026) projects Texas data centers could account for 3–9% of state water use by 2040 (cooling + power generation).[16]
  • Industry shift toward liquid cooling and no-water systems (e.g., new facilities in Arizona/Wisconsin saving ~125 million liters/year each starting 2026); AWS expanding reclaimed water use.[17]

Implications: Site selection must now incorporate water stress mapping and cooling tech roadmaps; regions or operators adopting closed-loop/liquid cooling early avoid regulatory or community pushback.

5. Construction Timelines and Supply Chain Realities

While physical builds are faster than grid connections, rising costs (global average to $11.3M/MW in 2026, +6% YoY), labor shortages (a gap of ~439k workers industry-wide), and equipment lead times (transformers, generators) extend effective timelines.[12]

  • Data center construction costs rose at 7% CAGR 2020–2025; skilled labor shortages and peak crew sizes (now 4,000–5,000) add friction.[2]
  • Supply chain issues (e.g., transformers) cited as key 2026 delay factors alongside power.[18]

Implications: Modular or standardized designs, automation, and early procurement lock-ins help mitigate, but these are secondary to power access for overall speed-to-market.

Overall, physical infrastructure—especially power and grid access—has become a more binding constraint than networking or memory hardware for AI scale-up in the near term. Entities that treat power strategy as a core development input (rather than a downstream utility task) will capture disproportionate share of the buildout. Additional research into specific utility IRPs or hyperscaler earnings calls could further quantify regional variances.

Report 5 Analyze whether GPU/accelerator supply (Nvidia H100/H200/B200, AMD MI300, custom ASICs from Google, Amazon, Microsoft) remains a binding constraint, and whether software inefficiencies — compiler stacks, parallelism strategies, orchestration — are emerging as underappreciated bottlenecks. Include publicly estimated utilization rates and any expert commentary on software-hardware co-design gaps.

Nvidia's H100/H200 supply has eased substantially by mid-2026, with prices dropping sharply and wider availability, while Blackwell (B200/B300/GB200) remains tighter with lead times stretching into mid-2026 in places. This reflects ramped production, shifts away from Hopper, and hyperscaler buildouts, though CoWoS advanced packaging and HBM memory supply chains continue to constrain the absolute latest silicon.[1][2][3]

  • H100 cloud rental rates fell from peaks of ~$8/hour in early 2025 to $1.90–$3.50/hour (or lower in some cases) by early 2026, with further stabilization or declines reported later.[1][2]
  • H200 is now widely available; B200/Blackwell shows limited early availability with premiums and stretched lead times (e.g., into June/July 2026 for some deployments), though supply is increasing.[1][3][4]
  • Nvidia has phased down Hopper production to prioritize Blackwell; on-demand capacity for various SKUs has been sold out in periods, with 1-year contract prices rising ~40% in some windows due to sustained demand.[3][4]
  • Packaging (CoWoS) and HBM remain chokepoints, but capacity expansions (e.g., TSMC ramps) are helping; 2025–2026 production estimates show millions of H100-equivalent chips annually from Nvidia alone.[5]

This means Nvidia GPU supply is no longer the acute binding constraint it was in 2023–early 2025 for most workloads, though frontier-scale or latest-gen deployments still face friction. New entrants or smaller players benefit from falling H100/H200 prices and secondary markets, but must navigate power and networking alongside hardware.

Custom ASICs from Google (TPU v7 Ironwood), Amazon (Trainium2/3), and others are scaling rapidly and capturing meaningful share, especially among hyperscalers and partners like Anthropic, reducing overall dependence on Nvidia GPUs. Microsoft’s Maia lags by comparison. These chips offer cost/performance advantages (e.g., claims of 30–40% better price/performance for Trainium) and are deployed in production at massive scale.[6][7][8]

  • AWS Trainium2 capacity is fully subscribed with multi-billion-dollar run rates; Project Rainier (with Anthropic) involves hundreds of thousands of chips, expanding further in 2026. Trainium3 previewed late 2025 with fuller volumes in 2026.[6][9]
  • Google TPU v7 (Ironwood) ships externally (including Anthropic's commitment of up to 1M chips) and is routinely deployed in clusters of 10k+ chips, with strong perf/watt and price/perf vs. high-end Nvidia GPUs.[7][10]
  • Custom ASIC shipments projected to grow ~44.6% in 2026 vs. ~16% for GPUs; hyperscalers like Meta exploring TPUs alongside Nvidia.[11]
  • AMD MI300 series is available and ramping (MI325X/MI350 in 2025, MI400 expected 2026), but remains secondary to Nvidia’s ecosystem dominance.[12][13]

Implication: Hyperscalers and large labs have viable alternatives that ease Nvidia-specific supply pressure and improve economics for targeted workloads (especially inference). Smaller players or those needing broad ecosystem support (CUDA) still face Nvidia-centric constraints. Multi-vendor strategies are increasingly practical.

Public estimates of GPU/accelerator utilization remain low on average (roughly 30–60%, and below 30% in many organizations) but can reach 80–98% with optimized storage, orchestration, and workload matching, which points to software and systems as underappreciated levers. Inference now dominates compute share (~2/3 of AI compute in 2026).[14][15]

  • Typical reported figures: <30% utilization across ML workloads in many orgs; targets of >80% compute utilization for training and ~60% for inference.[14]
  • Optimized teams (e.g., with storage/architecture matching) sustain up to 98% GPU utilization.[15]
  • Hyperscalers and large clusters likely achieve higher effective utilization through scale, but public granular data is limited; custom ASICs (TPUs, Trainium) often emphasize efficiency metrics like perf/watt or price/perf.[10]
  • Shift to inference (memory-bandwidth bound at low batch sizes) changes optimization priorities vs. training (compute-bound).[16]

This suggests hardware supply improvements are outpacing realized efficiency; software, data movement, and orchestration gaps prevent full utilization even when accelerators are available.
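
The training-versus-inference distinction noted above (compute-bound versus memory-bandwidth-bound at low batch sizes) can be illustrated with a simple roofline-style bound. The sketch below is not drawn from the cited reports. The function name, the 70B-parameter model, and the round-number accelerator figures are all hypothetical, and KV-cache traffic and inter-chip communication are ignored.

```python
def decode_throughput_bound(param_count, bytes_per_param, mem_bandwidth,
                            peak_flops, batch_size=1):
    """Roofline-style upper bound on decode throughput (tokens/s over the batch).

    Assumptions: each decode step streams all weights from memory once,
    costs about 2 FLOPs per parameter per token, and ignores KV-cache
    traffic and communication.
    """
    weight_bytes = param_count * bytes_per_param
    memory_time = weight_bytes / mem_bandwidth                    # seconds to read weights
    compute_time = 2.0 * param_count * batch_size / peak_flops    # seconds of arithmetic
    step_time = max(memory_time, compute_time)                    # slower resource gates the step
    return batch_size / step_time


# Hypothetical round numbers, for illustration only.
PARAMS = 70e9          # 70B-parameter dense model
BYTES_PER_PARAM = 2    # 16-bit weights
BANDWIDTH = 3e12       # 3 TB/s memory bandwidth
PEAK_FLOPS = 1e15      # 1 PFLOP/s peak compute

for batch in (1, 32, 512):
    bound = decode_throughput_bound(PARAMS, BYTES_PER_PARAM, BANDWIDTH,
                                    PEAK_FLOPS, batch)
    print(f"batch={batch:4d}  upper bound ~ {bound:,.0f} tokens/s")
```

Under these assumed figures, the batch-1 bound is set entirely by the time needed to stream the weights, so additional FLOPS would not help. Only at a batch size of several hundred does compute become the limiting resource.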

Software inefficiencies—particularly compiler maturity, parallelism for distributed/MoE workloads, and orchestration—are emerging as meaningful bottlenecks, especially in heterogeneous or rapidly evolving hardware environments. CUDA remains dominant and mature for Nvidia; alternatives (ROCm, XLA, Triton) lag in ecosystem breadth or optimization for new silicon.[17]

  • Compiler/runtime fragmentation hinders performance portability and auto-tuning across accelerators; hardware-aware compilers and co-optimized kernels are needed but not fully mature.[18]
  • Parallelism challenges include high communication overhead (e.g., all-to-all in MoE models) and KV-cache/memory management for inference; deterministic execution (some TPUs, Groq LPUs) offers latency consistency but requires recompilation.[17]
  • Orchestration and CPU-side work (agents, tool use) create bottlenecks beyond GPUs; storage pipelines and interconnects often limit sustained throughput.[19]
  • New hardware (Blackwell, new TPUs/Trainium) sees delayed library optimizations, widening the gap between peak specs and achieved performance.[20]

Expert and research commentary highlights software-hardware co-design gaps as a core challenge: algorithmic/hardware iteration outpaces full-stack optimization, leading to underutilization in heterogeneous systems and the need for tighter model-compiler-architecture loops.[18][21]

  • Workshops and papers emphasize embedding AI into system design, cross-layer optimization, and unified frameworks to close efficiency gaps.[18]
  • X/Twitter commentary notes shifting bottlenecks toward coordination of distributed workloads, power, and verifiable orchestration rather than raw chip supply.[22]

Implication for competitors/entrants: Software moats (CUDA ecosystem, optimized stacks) remain powerful even as hardware diversifies. Investing in portable compilers, high-utilization orchestration, or co-design tools offers differentiation. Power/grid constraints are increasingly cited as the next hard limit after chips.[23]

Overall, GPU/accelerator supply constraints have moderated significantly for established Nvidia SKUs and are being offset by custom ASICs, but software stacks, utilization gaps, and emerging power issues represent the more dynamic frontiers. Full-stack co-design and systems-level optimization are where substantial gains remain available.


Recent Findings Supplement (June 2026)

GPU/accelerator supply remains a binding constraint into mid-2026, particularly for leading-edge Nvidia hardware, while software issues around data movement, compilers, and orchestration are surfacing as significant secondary bottlenecks with very low realized utilization in many deployments.[1][2]

Nvidia GPU Supply and Pricing Dynamics (Post-Dec 2025)

Nvidia has curtailed Hopper (H100/H200) production to prioritize Blackwell ramp-up, leaving net-new Hopper supply extremely limited. New cluster deployments remain booked through at least August 2026, with Blackwell lead times extending into June–July 2026. On-demand capacity is largely sold out, and one-year contract rental prices for H100 rose nearly 40% from October 2025 to March 2026 (e.g., from ~$1.70/hr to $2.35/hr in some indices). B200 availability is improving gradually via hyperscalers but remains tight overall.[1][2]

Persistent upstream constraints include advanced packaging (CoWoS) and especially HBM memory. HBM3E prices rose ~20% for 2026 deliveries amid demand from Nvidia H200 and custom ASICs; AMD cited memory costs as a driver for planned ~10% GPU price increases in 2026. These factors have sustained or increased effective pricing pressure even as some older GPU spot rates showed volatility or modest declines in certain reports.[3]

  • Implication for competitors/entrants: Hyperscalers and specialized clouds are prioritizing reserved or long-term capacity; spot or flexible access favors those with existing relationships or willingness to pay premiums. Alternatives (see below) are gaining traction precisely because Nvidia supply lags demand.

Custom ASIC Supply and Hyperscaler Alternatives

Hyperscalers are actively scaling custom silicon to bypass Nvidia bottlenecks:
- Microsoft announced and began deploying Maia 200 (TSMC 3nm, HBM3E, strong FP8/FP4 inference performance) in January 2026, with initial racks in Iowa (US Central) and Phoenix (US West 3) regions; it claims advantages over Trainium3 and TPU v7 in key metrics and faster internal validation-to-deployment cycles.[4][5]
- Google’s TPU v7 (Ironwood, announced ~Nov 2025) supports large clusters (10k+ chips); Anthropic committed to up to 1M TPUs (multi-gigawatt scale) in 2026 for price-performance reasons.[6]
- AWS has deployed 500k+ Trainium2 units, with ongoing Inferentia/Trainium scaling for inference-heavy workloads.

These ASICs are seeing meaningful internal and (selectively) external use, reducing pressure on Nvidia GPUs for inference and certain training tasks.[6]

  • Implication: Custom ASICs are no longer marginal; they provide a structural alternative for large operators, though availability remains hyperscaler-controlled and ecosystem maturity varies (e.g., software support lags Nvidia CUDA in many cases).

Utilization Rates: Low Realized Efficiency Highlights Software Limits

Public estimates show stark underutilization outside highly optimized hyperscale training clusters:
- Enterprise GPU utilization averages ~5% according to CAST AI analysis (widely cited in May 2026 reporting), primarily due to data staging, replication, and movement delays that leave GPUs idle ~95% of the time.[7][8]
- Even in production settings, Model FLOPS Utilization has declined with faster hardware (e.g., ~40.8% on H100 vs. ~59.7% on A100), as communication and data overheads become relatively larger.[9]

High utilization (~93% average in one controlled 8-GPU training study) is achievable in tightly managed, data-local environments, but this is not representative of broader deployments.[10]
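
For readers unfamiliar with the metric, Model FLOPS Utilization (MFU) is the ratio of the FLOPs a training run actually delivers to the hardware's theoretical peak. A minimal sketch follows, using the common approximation of about 6 FLOPs per parameter per training token. The model size, token rate, and peak figure are hypothetical and are not the configurations behind the H100 and A100 numbers cited above.

```python
def model_flops_utilization(param_count, tokens_per_second, peak_flops):
    """Approximate MFU for dense-model training.

    Assumption: forward plus backward pass costs about 6 FLOPs per
    parameter per token (a widely used rule of thumb).
    """
    achieved_flops = 6.0 * param_count * tokens_per_second
    return achieved_flops / peak_flops


# Hypothetical example: a 7B-parameter model processing 10,000 tokens/s
# per accelerator on a chip with 1 PFLOP/s peak throughput.
mfu = model_flops_utilization(param_count=7e9, tokens_per_second=10_000,
                              peak_flops=1e15)
print(f"MFU ~ {mfu:.1%}")
```

Because peak FLOPS sits in the denominator, a faster chip that is fed tokens at the same rate reports a lower MFU, which is consistent with the decline described above.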

  • Implication: Hardware supply constraints are compounded by massive effective waste; buyers over-provision dramatically, amplifying the economic impact of shortages. Solutions targeting data liquidity or orchestration can unlock far more capacity than additional silicon alone.
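
The economic effect of underutilization can be made concrete by dividing the rental rate by the fraction of time the accelerator does useful work. The sketch below pairs the ~$2.35/hr H100 contract rate quoted earlier in this supplement with the utilization figures cited above. That pairing is an assumption made for illustration, since the sources report these numbers separately, and the function name is hypothetical.

```python
def cost_per_utilized_gpu_hour(hourly_rate_usd, utilization):
    """Effective cost of one hour of useful accelerator work."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return hourly_rate_usd / utilization


RATE = 2.35  # USD/hr, the contract rate quoted earlier (assumed to apply here)

for label, util in [("enterprise average (~5%)", 0.05),
                    ("H100 MFU (~40.8%)", 0.408),
                    ("controlled 8-GPU study (~93%)", 0.93)]:
    cost = cost_per_utilized_gpu_hour(RATE, util)
    print(f"{label:32s} ${cost:6.2f} per useful GPU-hour")
```

At 5% utilization the effective price is twenty times the list rate, which is why improvements in data staging and orchestration can free more usable capacity than buying additional silicon.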

Software Inefficiencies Emerging as Bottlenecks

As hardware scales, software layers are increasingly exposed:
- Data movement and staging directly cause the low utilization cited above; inefficient access patterns idle expensive accelerators regardless of raw FLOPS.[7][9]
- Compilers and parallelism: Large models demand sophisticated operator fusion, memory layout optimization, tensor/pipeline parallelism, and kernel generation. Conferences and frameworks in 2026 (e.g., CGO discussions, new IRs like KernelTile) highlight ongoing challenges fitting complex kernels into accelerator constraints and maximizing utilization.[11][12][13]
- Orchestration: Agentic/multi-agent workflows (prominent in 2026 discussions) shift bottlenecks to coordination, tool use, error recovery, and scheduling across heterogeneous resources (CPU + accelerators). New frameworks and “orchestrator” paradigms are proliferating, but reliable production-scale execution remains a pain point.[14][15]

Expert and industry commentary implicitly points to hardware-software co-design gaps: faster chips (e.g., H100 vs. A100) expose communication/scheduling overheads that prior software stacks were not optimized for, necessitating autotuning, learned schedulers, and tighter integration.[9][16]

  • Implication for entrants: Pure hardware plays face diminishing returns without accompanying software (compilers, runtimes, data platforms, orchestration layers). Co-design opportunities—especially around data movement and agentic orchestration—represent high-leverage areas where software can materially improve effective utilization and reduce the need for raw accelerator volume.

Overall, supply constraints on Nvidia GPUs persist and are structurally reinforced by memory/packaging limits, though custom ASICs from hyperscalers provide meaningful relief for inference and select workloads. The ~5% enterprise utilization figure and declining FLOPS utilization on newer hardware indicate that software inefficiencies in data handling, compilation, parallelism, and orchestration are transitioning from underappreciated to binding in many real-world settings. This shifts competitive advantage toward integrated hardware-software solutions and better data/orchestration infrastructure.

Report 6 Research the strongest counterarguments to the thesis that networking and memory are the two dominant AI buildout constraints. Look for analyst views suggesting the bottlenecks have already shifted, that software/algorithmic improvements (e.g., quantization, mixture-of-experts, inference optimization) are rapidly relaxing hardware constraints, or that capital and geopolitical factors (export controls, supply chain risk) are actually the binding limits. Summarize credible evidence that the networking-and-memory framing may be incomplete or outdated.

Chip manufacturing capacity and export controls have emerged as binding constraints on AI scaling, superseding or compounding networking and memory limits. In 2026, analysts note that AI chip production itself, not just downstream components like HBM or interconnects, now limits the pace of compute buildout, as fab capacity for advanced nodes and specialized packaging (e.g., CoWoS) cannot ramp fast enough to meet demand.[1]

  • The Center for a New American Security (CNAS) May 2026 report states that chip production became the tightest constraint for U.S. AI companies, shifting from power shortages in 2024–2025. New manufacturing capacity takes years to build, making it the rate-limiting factor for at least the next year. Chips exported to competitors such as China (including older-generation parts such as the H200) reduce availability for U.S. and allied firms and raise prices.[1]
  • HBM and advanced packaging shortages persist as part of this, with memory vendors pre-allocating 2026 capacity and global DRAM/HBM demand from AI consuming a large share of production, but the upstream logic fab and packaging throughput set the overall ceiling.[2][3]
  • Geopolitical factors amplify this: U.S. controls on advanced semiconductors and equipment to China have prompted parallel Chinese supply chains (e.g., SMIC, YMTC expansions), while creating scarcity that treats AI chips as a strategic resource.[4]

This means competitors cannot simply buy more GPUs or optimize interconnects; access to allocated fab output and navigating export regimes becomes the decisive moat. Firms with priority allocation (e.g., via long-term deals or domestic policy) or alternative architectures less reliant on restricted nodes gain an edge.

Power grid connectivity and energy infrastructure have become the leading physical bottleneck, outpacing networking and memory in many markets. Data center build times (18–36 months) clash with grid interconnection queues (often 4–10 years), making “speed to power” the gating factor for 2026+ deployments.[5][6]

  • Deloitte’s 2025 survey of power and data center executives found grid stress as the top challenge (72% rating it very/extremely challenging).[6]
  • Reports from Uptime Institute, JLL, and others project power as the defining constraint for 2026, with 30–50% of planned 2026 AI data center capacity slipping to 2028 due to interconnection, permitting, and equipment shortages (e.g., transformers).[7][8]
  • ERCOT alone tracked ~410 GW of large-load requests (mostly data centers/AI) by early 2026; similar backlogs exist elsewhere. Behind-the-meter generation, co-location with renewables/storage, or “bring your own power” strategies are emerging responses.[9]

For entrants, this shifts competition from silicon procurement to site selection, power purchase agreements, and grid modernization partnerships. Capital deployed fastest on power infrastructure captures disproportionate value.
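
The "speed to power" point reduces to a simple comparison: if construction and the interconnection request start together and run in parallel, the slower of the two determines when a site can be energized. A minimal sketch using the ranges quoted above follows. The parallel-start assumption and the function name are illustrative, not taken from the cited reports.

```python
def months_until_energized(build_months, interconnect_queue_months):
    """Months until a site can run, assuming both tracks start together."""
    # Construction and the grid queue proceed in parallel; the slower one gates.
    return max(build_months, interconnect_queue_months)


# Ranges quoted above: builds take 18-36 months, queues often 4-10 years.
scenarios = [
    ("fast build, short queue", 18, 4 * 12),
    ("slow build, short queue", 36, 4 * 12),
    ("slow build, long queue", 36, 10 * 12),
]
for label, build, queue in scenarios:
    total = months_until_energized(build, queue)
    gate = "grid queue" if queue >= build else "construction"
    print(f"{label:26s} {total:3d} months (gated by {gate})")
```

Even the slowest build paired with the shortest typical queue is still gated by the grid, which is why behind-the-meter generation and other ways of shortening the queue matter more than faster construction.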

Algorithmic and software improvements (quantization, MoE/sparsity, inference optimizations) are measurably relaxing per-token hardware demands, enabling larger effective scale with existing or less hardware. These techniques reduce memory footprint, compute intensity, and interconnect pressure, challenging the assumption that hardware constraints dominate unchecked.[10]

  • Quantization to 8-bit or 4-bit often retains ~99% accuracy while cutting inference costs/energy by 70–80% and memory needs substantially.[11]
  • MoE architectures activate only a subset of parameters (e.g., ~10–30% per token in some models), yielding 2–3x or greater efficiency gains versus dense equivalents; combined with optimizations like DeepSeek-V3’s Multi-head Latent Attention (MLA), FP8 training, and multi-plane topologies, they directly target memory bandwidth, compute-communication trade-offs, and interconnect overhead.[12]
  • Inference engines (vLLM, TensorRT-LLM, etc.) plus techniques like KV cache compression, paged attention, and speculative decoding further amplify this, allowing efficient serving of massive models on fewer accelerators.[13]

The implication is that software co-design and model architecture choices can extend the usable life of current hardware generations and slow the required ramp in networking/memory capacity. Pure hardware-centric forecasts understate this mitigation; leaders in efficient training/inference stacks (or open-source ecosystems enabling them) reduce their exposure to physical bottlenecks.
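
The memory effect of quantization and the compute effect of MoE sparsity follow from simple arithmetic. The sketch below uses hypothetical model sizes (a 70B-parameter dense model and a 600B-parameter MoE model with 10% of parameters active per token). It counts weights only and ignores KV cache, activations, and quantization metadata such as scales.

```python
def weight_memory_gb(param_count, bits_per_param):
    """Memory needed to hold the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return param_count * bits_per_param / 8 / 1e9


def active_params(total_params, active_fraction):
    """Parameters actually used per token in a mixture-of-experts model."""
    return total_params * active_fraction


DENSE_PARAMS = 70e9  # hypothetical dense model
for bits in (16, 8, 4):
    gb = weight_memory_gb(DENSE_PARAMS, bits)
    print(f"{bits:2d}-bit weights: {gb:6.1f} GB")

MOE_TOTAL = 600e9    # hypothetical MoE model
used = active_params(MOE_TOTAL, 0.10)
print(f"MoE: {used / 1e9:.0f}B of {MOE_TOTAL / 1e9:.0f}B parameters active per token")
```

The two techniques relax different constraints. Lower precision shrinks the memory footprint and the bytes moved per token, while MoE routing cuts compute per token but still requires every expert's weights to be resident somewhere in the cluster.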

Capital intensity, deployment speed, and financing frictions represent underappreciated economic limits on the buildout. Hyperscalers are committing hundreds of billions in capex ($700B combined 2025–2026 cited in one analysis), but realizing returns depends on timely grid/power infrastructure and supply chain execution amid rising costs.[14]

  • Reports highlight that electricity infrastructure (substations, transmission) often accounts for a growing share of total project costs and timelines, with financing risks tied to uncertain monetization of AI workloads.[15]
  • Broader supply chain issues (e.g., helium for fabs, specialized materials) compound capital lockup.[2]

Competitors succeed by securing low-cost capital or offtake agreements early, or by focusing on capital-light software/services layers that leverage existing infrastructure more efficiently. Those reliant on spot hardware markets face higher effective costs.

Overall, the networking-and-memory framing, while still relevant (optics and HBM remain acute in scaling clusters), is incomplete for 2026 because multiple independent constraints—fab output, power grids, geopolitics, and software efficiency—now interact as co-equal or higher-order limits. Analyst views from CNAS, Deloitte, Uptime Institute, and infrastructure reports converge on this multi-factor reality, with software providing a countervailing force that decouples model capability growth from raw hardware scaling to some degree.[16]

For those entering or competing in AI infrastructure, the winning strategies involve vertical integration around power/fab access, heavy investment in efficiency software, or positioning in adjacent layers (e.g., orchestration, storage optimization) that benefit from—but are not gated by—the hardware bottlenecks. Additional research into 2026 earnings calls or updated Epoch AI-style scaling analyses would further quantify software’s aggregate impact.


Recent Findings Supplement (June 2026)

Chip manufacturing capacity (not intra-cluster networking or memory) has emerged as the binding constraint on AI compute buildout in 2026.[1]

In a May 2026 Center for a New American Security (CNAS) report, analysts state that AI chip production at foundries like TSMC has become the rate-limiting factor, shifting from power constraints dominant in 2024–2025. Hyperscalers and AI firms report being unable to secure enough wafers despite demand, with new fab capacity requiring years to bring online. TSMC’s CEO noted wafer supply—not power—as the bottleneck, and executives from Broadcom and others confirmed capacity limits extending into 2027. This upstream supply ceiling caps overall scaling regardless of networking or HBM improvements within deployed clusters.[1]

  • Bernstein analyst commentary in May 2026 reinforced that the constraint has moved below NVIDIA to TSMC and equipment suppliers (ASML, Lam, KLA), all running at maximum capacity.[2]
  • NVIDIA and others requested additional TSMC capacity but were turned down; Google reportedly missed 2026 targets due to insufficient manufacturing slots.[1]

For competitors or new entrants: Securing or influencing advanced-node foundry access (or alternative processes) is now more strategic than optimizing cluster interconnects. Those without priority allocations face allocation rationing and higher effective costs.

U.S. export controls and the AI Diffusion Framework have made geopolitical allocation of scarce chips a primary limiter on global buildout.[3][4]

The January 2025 Framework (with ongoing enforcement) imposes compute caps on Tier 2 countries (e.g., ~270,000 H100-equivalents per company per country by end-2026) and requires 75%+ of compute for Tier 1 firms to remain in approved jurisdictions. Chinese firms like DeepSeek have publicly noted needing 2–4× more power for comparable results due to restricted access to frontier chips. Anthropic and others argue these controls are the “single biggest differentiator” preserving U.S. advantage while slowing rivals.[4]

  • Every chip exported to competitors reduces availability for U.S. firms, raising prices and slowing domestic progress (per CNAS).[1]
  • Tariffs and controls have already shifted server assembly supply chains away from China toward Taiwan, Mexico, and Vietnam.[5]

Implication: Capital and policy access to controlled supply chains can outweigh technical networking/memory solutions. New entrants or non-aligned players face structural limits on scale that software tweaks cannot fully bypass.

Software and algorithmic optimizations are delivering measurable efficiency gains that relax raw hardware demands, particularly for inference.[6]

A January 2026 arXiv paper by Google DeepMind’s Xiaoyu Ma and David Patterson (“Challenges and Research Directions for Large Language Model Inference Hardware”) declared “LLM inference is a crisis,” driven by memory bandwidth and latency in the decode phase rather than FLOPS. They highlight mismatches in current hardware but note industry responses via co-design.[7]

  • April 2026 reports detail concrete wins: Alibaba’s FlashQLA kernels achieved 2–3× forward and 2× backward speedups for long-context workloads; vLLM on Blackwell with NVFP4 quantization, EAGLE3 + MTP speculative decoding, and kernel fusion delivered top throughput (e.g., 230 tok/s on DeepSeek V3.2).[6]
  • Broader trends include mixed-precision quantization for MoE models and speculative techniques reducing effective memory pressure.

Implication: Inference-focused players can achieve higher effective utilization or lower hardware requirements through software, making pure hardware scaling less dominant than previously assumed. Edge or cost-sensitive deployments benefit most.
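
To see why speculative decoding eases memory pressure, consider the standard idealized estimate of how many tokens each pass of the large model yields. If a draft model proposes k tokens and each is accepted independently with probability a, the expected yield per verification pass is (1 - a^(k+1)) / (1 - a). The sketch below implements that textbook formula. The independence assumption and the example values are illustrative and do not describe the EAGLE3 or MTP results cited above.

```python
def expected_tokens_per_pass(acceptance_rate, draft_length):
    """Expected tokens produced per large-model pass in speculative decoding.

    Assumes each drafted token is accepted independently with the same
    probability, and that the verifying pass always contributes one token.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if acceptance_rate == 1.0:
        return float(draft_length + 1)  # every draft accepted, plus the bonus token
    return (1.0 - acceptance_rate ** (draft_length + 1)) / (1.0 - acceptance_rate)


# Hypothetical acceptance rates with a draft length of 4 tokens.
for rate in (0.5, 0.8, 0.9):
    tokens = expected_tokens_per_pass(rate, draft_length=4)
    print(f"acceptance={rate:.1f}: ~{tokens:.2f} tokens per large-model pass")
```

Since each large-model pass streams the full weights from memory once, producing several tokens per pass divides the memory traffic per token by roughly the same factor, before accounting for the cost of running the draft model.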

Storage, advanced packaging (CoWoS), and power infrastructure are rising as co-equal or primary bottlenecks alongside or instead of pure networking/memory.[8]

May 2026 analyses note global AI infrastructure spending exceeded $250B in 2025, with >50% of organizations citing data/storage bottlenecks; storage throughput and bandwidth are now “hard ceilings comparable to power and cooling.” HBM/DRAM shortages are forecast through 2027, but packaging capacity (TSMC CoWoS oversubscribed into 2026–27) and grid/transformer lead times (2+ years) constrain deployment more broadly.[8][9]

Overall, the networking-and-memory framing appears incomplete for 2026 realities: upstream fab capacity, export policy, and inference software co-design are shifting the binding constraints, while storage/packaging add new pressure points. Evidence from CNAS, analyst reports, and technical papers (post-Dec 2025) shows algorithmic progress and supply/policy limits relaxing or redefining hardware bottlenecks faster than cluster-level interconnect/memory alone would predict. New research or policy updates in this period directly support these shifts.

Report