Investigate the role of HBM…

High-bandwidth memory (HBM) has become the primary performance limiter for AI accelerators: its bandwidth (often 1+ TB/s per stack) is essential for feeding the massively parallel compute units in GPUs/TPUs during training and inference, yet production is concentrated in three suppliers whose 2026 output is fully sold out.[1]

SK Hynix leads with 57-62% HBM market share (Q2/Q3 2025 data), driven by early exclusive NVIDIA deals covering HBM3 for H100 and HBM3E for H200 and Blackwell platforms; Micron holds ~21% and Samsung ~17%. All three report 2026 capacity sold out via long-term hyperscaler contracts, with HBM demand growing 70% YoY in 2026 after 130% in 2025. This creates a structural shortfall (one estimate: ~8% deficit), as HBM now consumes 23-25% of total DRAM wafer capacity (up from single digits recently) while AI data centers absorb up to 70% of global memory production.[2]

HBM stacks multiple DRAM dies vertically with through-silicon vias for extreme bandwidth, but each HBM bit displaces several bits of standard DRAM output because of lower density and more complex manufacturing.
Capacity expansions require 12-18 months of lead time; new capacity (e.g., SK Hynix Cheongju fab, Micron Singapore/Taiwan facilities) arrives in late 2026 or 2027+.
Resulting market: ~$35B in 2025, ~$58B in 2026, heading toward $100B by 2028.[3]
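
The growth implied by these market figures can be made explicit with a short calculation. This is only a sketch over the three numbers quoted above; the helper name `growth_rate` is introduced here for illustration.

```python
def growth_rate(start, end, years=1):
    """Compound annual growth rate between two values `years` apart."""
    return (end / start) ** (1 / years) - 1

# Market size in billions of USD, as quoted in the text above.
market_usd_b = {2025: 35.0, 2026: 58.0, 2028: 100.0}

yoy_2026 = growth_rate(market_usd_b[2025], market_usd_b[2026])
cagr_2026_2028 = growth_rate(market_usd_b[2026], market_usd_b[2028], years=2)

print(f"2025 -> 2026 growth: {yoy_2026:.1%}")
print(f"2026 -> 2028 implied CAGR: {cagr_2026_2028:.1%}")
```

On these figures, revenue grows by about 66% in 2026 and then needs roughly 31% per year to reach the 2028 estimate, so the projection already assumes growth slows after the current crunch.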

For competitors or new entrants, securing HBM allocation is now a prerequisite for scaling AI hardware; without it, even advanced silicon (e.g., custom ASICs) cannot reach full utilization, favoring established players with supply deals like NVIDIA.

The memory wall—where processor speed vastly outpaces memory bandwidth and capacity—remains the fundamental architectural constraint, with on-chip SRAM providing the fastest but scarcest tier, HBM the high-bandwidth bridge, and off-chip DRAM the capacity fallback.[4]

Modern accelerators idle waiting for data: weights, activations, and especially KV caches in LLM inference/decode must shuttle through the hierarchy. SRAM (e.g., 10-40 MB shared per GPU block or 384 MB on-chip in Google TPU v8i, a 3x increase) offers ~10x HBM bandwidth but limited capacity and poor scaling (more die area/power per bit in advanced nodes). HBM mitigates bandwidth starvation (NVIDIA stacks deliver TB/s-scale) but models grow to consume available capacity, and latency gaps persist. Standard DRAM supplements system memory but faces capacity cannibalization.[5]

SRAM-centric designs (Groq LPUs, Cerebras, d-Matrix) maximize near-compute memory for inference latency wins but trade off model size flexibility.
HBM solves part of the wall for training-scale workloads but hits physical limits (shoreline area for connections, power).
Implication: even with more HBM, efficiency tricks (quantization, checkpointing) reappear as models scale.
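
A back-of-the-envelope model shows why decode-phase inference is bandwidth-bound and why quantization keeps reappearing. The sketch below assumes every generated token streams all weights plus the KV cache from memory exactly once, and ignores batching, on-chip caching, and compute time. The model size, KV cache size, and bandwidth figure are hypothetical values chosen for illustration, not measurements.

```python
def decode_tokens_per_s(params_billion, bytes_per_param, kv_cache_gb, bandwidth_tb_s):
    """Upper bound on single-stream decode rate, assuming every generated
    token streams all weights plus the KV cache from memory exactly once."""
    bytes_per_token = params_billion * 1e9 * bytes_per_param + kv_cache_gb * 1e9
    return bandwidth_tb_s * 1e12 / bytes_per_token

# Hypothetical setup: 70B parameters, 10 GB KV cache, 3 TB/s memory bandwidth.
fp16_rate = decode_tokens_per_s(70, 2, 10, 3.0)  # 2 bytes per parameter
int8_rate = decode_tokens_per_s(70, 1, 10, 3.0)  # 1 byte per parameter

print(f"16-bit weights: {fp16_rate:.1f} tokens/s upper bound")
print(f"8-bit weights:  {int8_rate:.1f} tokens/s upper bound")
```

Under these assumptions the bound is about 20 tokens/s with 16-bit weights and about 37.5 tokens/s with 8-bit weights, regardless of how many FLOPS the chip offers. Halving the bytes per parameter helps almost as much as doubling memory bandwidth.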

This means pure compute scaling (more FLOPS) yields diminishing returns without memory hierarchy advances; entrants must co-design silicon with memory or accept lower utilization.

CXL-enabled memory pooling and processing-in-memory (PIM) represent the most promising near-term relief valves by disaggregating capacity and moving compute closer to data, though both remain early-stage relative to the 2026 HBM crunch.[6]

CXL (especially 2.0/3.0 with switching and fabric support) allows multiple hosts (CPUs, GPUs, XPUs) to share pooled DRAM, boosting utilization from a typical 40-60% to 80%+ and enabling terabytes of expandable memory (e.g., Marvell Structera switches targeting 48 TB shared). PIM (Samsung HBM-PIM, SK Hynix AiM/Accelerator-in-Memory) embeds logic in or near DRAM/HBM banks for in-situ matrix ops, claiming 2x+ performance and 70%+ energy reduction by slashing data movement; both vendors report real-world deployments and commercial pushes.[7]
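
The utilization gain from pooling follows from a simple provisioning argument: dedicated memory must cover each host's own peak, while a pool only has to cover the peak of the combined demand. The sketch below illustrates this with a made-up demand trace; the numbers and the `provisioning` helper are hypothetical, and a real deployment would add headroom and account for CXL latency.

```python
# Hypothetical memory demand in GB: one row per host, one column per time step.
demand_gb = [
    [400, 100, 100, 200],
    [100, 400, 100, 100],
    [100, 100, 400, 100],
    [200, 100, 100, 300],
]

def provisioning(demand):
    """Compare per-host provisioning with a shared pool for one demand trace."""
    total_per_step = [sum(step) for step in zip(*demand)]
    mean_used = sum(total_per_step) / len(total_per_step)
    local_capacity = sum(max(host) for host in demand)  # sum of per-host peaks
    pooled_capacity = max(total_per_step)               # peak of summed demand
    return {
        "local_capacity_gb": local_capacity,
        "pooled_capacity_gb": pooled_capacity,
        "local_utilization": mean_used / local_capacity,
        "pooled_utilization": mean_used / pooled_capacity,
    }

result = provisioning(demand_gb)
for key, value in result.items():
    print(f"{key}: {value:.3f}" if value < 1 else f"{key}: {value}")
```

With this trace, dedicated provisioning needs 1500 GB at about 48% average utilization, while a pool needs 800 GB at about 91%. The gain depends entirely on host peaks not coinciding.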

CXL market: ~$2.8B in 2025, projected strong growth (CAGR ~25-29% to 2034); hyperscalers integrating into new servers.
PIM: Samsung and SK Hynix actively commercializing HBM variants; prototypes show bandwidth-proportional efficiency gains in search/vector workloads.
Limitations: CXL adds latency vs. local HBM; PIM requires software/ISA changes and is not yet ubiquitous in flagship accelerators.

Competitors can differentiate by adopting CXL pooling for cost-efficient scaling or PIM for inference efficiency, but neither will fully offset HBM shortages before late 2026 or 2027; early movers in hybrid HBM+CXL or PIM-augmented designs gain an advantage.

HBM supply shortage is the most acute near-term bottleneck (sold-out 2026 capacity directly caps accelerator deployments), followed by the persistent memory wall in inference workloads; on-chip SRAM scaling limits and standard DRAM reallocation are secondary but compounding constraints, while CXL/PIM offer longer-term architectural escape hatches.[1]

The oligopoly and physics of 3D stacking make HBM the immediate chokepoint, but the memory wall ensures that even abundant HBM will eventually require PIM-style or pooled innovations. Hyperscalers and chip designers prioritizing multi-year HBM contracts today, while piloting CXL/PIM, will best navigate the constraints.

Recent Findings Supplement (June 2026)

HBM supply remains the dominant acute bottleneck for AI accelerators in 2026, with the entire production capacity of the three major suppliers (SK Hynix, Samsung, Micron) sold out for the year amid 70% projected YoY demand growth.[1]

This concentration in a three-supplier oligopoly (controlling ~90-95% of advanced DRAM/HBM) has turned HBM into a structural constraint rather than a cyclical one, directly limiting deployment of next-gen accelerators like NVIDIA’s Rubin platform.

SK Hynix holds the largest share (~50-62% as of recent 2025-2026 data), followed by Samsung (~17-35%) and Micron (~5-21%, with reports of overtaking Samsung in some segments); all have allocated most or all 2026 HBM3E/HBM4 output to key customers like NVIDIA.[1]
Micron and SK Hynix explicitly confirmed sold-out 2026 HBM capacity, with multi-year agreements extending visibility; Samsung issued similar warnings of shortages persisting through at least 2027.[2]
HBM is expected to consume ~25% of total DRAM wafer production by 2026 as suppliers reallocate capacity from conventional DRAM.[3]

This sold-out status and oligopoly create multi-year lead times and elevated pricing, forcing hyperscalers to lock in supply years ahead while slowing overall AI infrastructure scaling.[4]

Suppliers are aggressively expanding capacity, but 12-18+ month lead times mean relief is limited until late 2026 or 2027, even as demand surges.[1]

Recent announcements highlight the scale of the response:

Micron raised 2026 capex to $20 billion (focused on Idaho mega-fabs for HBM and DRAM).[5]
SK Hynix announced a $15 billion U.S. advanced HBM packaging plant (Feb 2026) and, in June 2026, stated plans to double overall wafer capacity over the next five years.[6]
HBM demand grew 130% YoY in 2025 and is projected at +70% YoY in 2026; HBM revenue run rates for leaders are already in the billions annually.[1]

These investments signal recognition of structural AI-driven demand but underscore that near-term supply will remain tight, favoring early qualifiers (especially SK Hynix with NVIDIA) and disadvantaging new entrants or smaller players.[7]

The memory wall persists as a fundamental limiter, with HBM mitigating but not eliminating bandwidth and capacity constraints as models and parallelism scale.[8]

HBM4 (mass production starting 2026 at Samsung/SK Hynix) doubles interface width to 2048-bit and targets >2 TB/s per stack bandwidth with ~40% better energy efficiency, enabling platforms like NVIDIA Rubin (8 stacks, 288 GB, 22 TB/s total).[9] However, the wall shifts to inter-chip interconnects and larger models rather than disappearing.
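
The per-stack figures can be cross-checked with the usual peak-bandwidth relation: interface width times per-pin data rate, divided by 8 to convert bits to bytes. In the sketch below the 8 Gb/s pin rate is an assumption chosen for illustration; the stack count, total capacity, and total bandwidth are the Rubin figures quoted above.

```python
def stack_bandwidth_tb_s(interface_bits, pin_rate_gbps):
    """Peak per-stack bandwidth in TB/s: pins x Gb/s per pin, bits -> bytes."""
    return interface_bits * pin_rate_gbps / 8 / 1000

ASSUMED_PIN_RATE_GBPS = 8.0  # assumption for illustration, not from the text

wide = stack_bandwidth_tb_s(2048, ASSUMED_PIN_RATE_GBPS)    # HBM4-width interface
narrow = stack_bandwidth_tb_s(1024, ASSUMED_PIN_RATE_GBPS)  # half-width interface

# Platform figures quoted above: 8 stacks, 288 GB, 22 TB/s in total.
stacks, total_gb, total_tb_s = 8, 288, 22.0
per_stack_tb_s = total_tb_s / stacks
per_stack_gb = total_gb / stacks
implied_pin_rate_gbps = per_stack_tb_s * 1000 * 8 / 2048

print(f"2048-bit at {ASSUMED_PIN_RATE_GBPS} Gb/s: {wide:.3f} TB/s per stack")
print(f"1024-bit at {ASSUMED_PIN_RATE_GBPS} Gb/s: {narrow:.3f} TB/s per stack")
print(f"Quoted platform: {per_stack_tb_s:.2f} TB/s and {per_stack_gb:.0f} GB per stack")
print(f"Implied pin rate: {implied_pin_rate_gbps:.2f} Gb/s")
```

At the assumed pin rate a 2048-bit stack delivers about 2.05 TB/s, in line with the >2 TB/s target. The quoted 22 TB/s across 8 stacks works out to 2.75 TB/s and 36 GB per stack, which implies a per-pin rate of roughly 10.7 Gb/s.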

AI accelerators spend increasing time on data movement; on-chip SRAM offers ultra-low latency but is die-area constrained (zero-sum with compute).[10]
Microsoft’s Maia 200 (announced Jan 2026) exemplifies hybrid approaches: 216 GB HBM3e at 7 TB/s plus 272 MB on-chip SRAM with specialized DMA/NoC for inference.[11]

HBM remains the primary near-term solution, but its supply constraints amplify the wall’s impact on overall system performance.[12]

CXL memory pooling and processing-in-memory (PIM) are advancing as disaggregated alternatives to ease capacity pressure, though they are not yet mature replacements for HBM in high-performance AI training.[13]

Recent developments include:

Marvell’s Structera S (Mar 2026) demonstrates CXL switching as a way to push past the memory wall, with reported 4.8x inference throughput gains and an 82.7% reduction in time-to-first-token via pooling.[14]
Samsung integrates PIM with CXL modules for near-memory neural operations; Intel advances coherent CXL 2.0/3.0 pooling across CPU/GPU/accelerators with heterogeneous memory support.[13]
Academic/industry work (e.g., DATE 2026 paper on Sage architecture) shows CXL pooled systems with predictive caching and selective NDP achieving 2.84x throughput on embedding-heavy workloads versus conventional management.[15]

These technologies improve utilization (potentially from 40-60% to >80%) and reduce stranded memory but add latency (tens of ns) and are most relevant for inference, recommendation models, or capacity expansion rather than peak training bandwidth.[16]
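
The latency cost of a pooled tier can be reasoned about as a weighted average over where accesses land. The sketch below is a simplification that ignores bandwidth limits and queuing; the function name and the example latencies are hypothetical placeholders, with the extra CXL hop set in the "tens of ns" range mentioned above.

```python
def mean_access_latency_ns(local_ns, cxl_extra_ns, local_fraction):
    """Average latency when `local_fraction` of accesses hit local memory
    and the rest pay an extra CXL hop on top of the local latency."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    cxl_ns = local_ns + cxl_extra_ns
    return local_fraction * local_ns + (1.0 - local_fraction) * cxl_ns

# Hypothetical values: 100 ns local access, 50 ns added by the CXL hop.
mostly_local = mean_access_latency_ns(100.0, 50.0, local_fraction=0.9)
half_remote = mean_access_latency_ns(100.0, 50.0, local_fraction=0.5)

print(f"90% local: {mostly_local:.1f} ns average")
print(f"50% local: {half_remote:.1f} ns average")
```

With these placeholder numbers the average moves from 105 ns to 125 ns as the remote share grows from 10% to 50%. This is why pooling suits capacity expansion with good locality better than workloads that stream everything at peak bandwidth.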

On-chip SRAM serves as a high-bandwidth complement or alternative in inference-focused designs but faces hard physical limits that make it unsuitable as a broad HBM replacement.[10]

SRAM-centric accelerators (e.g., Groq, Cerebras, d-Matrix) prioritize on-die memory for low-latency access, trading off compute area; bandwidth scales favorably with die size but remains capped.[17] Compute-in-memory (CIM) variants using SRAM arrays perform MAC operations in-place, reducing off-chip movement, but density and area trade-offs persist.[18]

SRAM excels in specific niches (e.g., decode-phase inference) but cannot scale to the capacities required for frontier training models.[10]

The most acute bottlenecks are HBM supply concentration and availability (sold out through 2026+), followed by the memory wall’s data-movement overhead; CXL/PIM and SRAM offer partial mitigations but do not yet alleviate the primary constraints.[5]

DRAM capacity is indirectly squeezed as wafers shift to HBM, while on-chip SRAM limits are architecture-specific rather than systemic. New research and announcements reinforce HBM as the near-term gating factor for AI accelerator performance and deployment scale.
