AI Infrastructure Primer: GPUs, Data Centers, and Who Captures the $500B Buildout
Executive Summary
The buildout of AI infrastructure is arguably the largest capital investment cycle in technology history. Hyperscalers (AWS, Azure, GCP, Meta) and specialized AI clouds (CoreWeave, Lambda Labs) are collectively spending upwards of $250B a year in capital expenditure across 2024-2025, primarily on NVIDIA GPUs, custom silicon, data center construction, and electrical infrastructure. NVIDIA captures the most obvious value — H100/H200/Blackwell GPU clusters are sold at ~70% gross margins. But the full value chain extends from TSMC's leading-edge nodes and SK Hynix's HBM memory to the concrete and copper of data center construction and the electrical utilities supplying power. This primer maps the entire stack and identifies where pricing power, margin concentration, and investable opportunity cluster.
The Stack: From Silicon to Application
Understanding AI infrastructure requires mapping the full stack from raw silicon to deployed application:
Layer 6: Application (ChatGPT, Copilot, Gemini, Claude)
Layer 5: Model (GPT-4o, Gemini 1.5, LLaMA 3, Mistral)
Layer 4: Cloud (AWS Bedrock, Azure AI Foundry, GCP Vertex)
Layer 3: Orchestration (Kubernetes, Ray, SLURM)
Layer 2: Server (DGX H100, HGX B100, custom OEM)
Layer 1: Chip (NVIDIA H100/H200/Blackwell, AMD MI300X, Google TPU)
Layer 0: Foundry + Memory (TSMC N4/N3, SK Hynix HBM3e, CoWoS packaging)
Value is not uniformly distributed across this stack. The highest margins concentrate at Layer 0 (TSMC, SK Hynix — structural bottlenecks), Layer 1 (NVIDIA — near-monopoly on AI training silicon), and Layer 4 (hyperscalers — cloud inference markup on raw GPU costs). Layers 2, 3, and 5 are more competitive, with lower margin structures.
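For readers who think in code, the same stack as a minimal Python sketch; the margin notes are the approximate characterizations used later in this primer, not precise figures:

```python
# Illustrative only: the stack above as a simple Python mapping.
AI_STACK = {
    0: ("Foundry + Memory", "TSMC, SK Hynix", "high margin -- structural bottleneck"),
    1: ("Chip", "NVIDIA, AMD, Google TPU", "high margin -- ~73% for NVIDIA"),
    2: ("Server", "Dell, Supermicro, OEMs", "thin margin -- assembly"),
    3: ("Orchestration", "Kubernetes, Ray, SLURM", "largely open source"),
    4: ("Cloud", "AWS, Azure, GCP", "high margin -- inference markup"),
    5: ("Model", "OpenAI, Google, Meta", "commoditization pressure"),
    6: ("Application", "ChatGPT, Copilot", "competitive"),
}

for layer in sorted(AI_STACK, reverse=True):
    name, players, note = AI_STACK[layer]
    print(f"Layer {layer}: {name:<17} {players:<26} {note}")
```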
NVIDIA's Position: Why H100/H200/Blackwell Became the Standard
NVIDIA's data center revenue grew from $15B in FY2023 to approximately $115B in FY2025 — one of the fastest revenue ramps in corporate history. The H100 GPU, launched in 2022, became the de facto standard for AI training through a combination of performance leadership, software ecosystem depth, and supply scarcity that created its own demand momentum.
Why CUDA is the real moat: NVIDIA's competitive advantage is not the H100 chip itself — the hardware could theoretically be replicated. The moat is CUDA (Compute Unified Device Architecture), NVIDIA's proprietary parallel computing platform and API, which has accumulated nearly two decades of optimization, including deep integration with the major machine learning frameworks (PyTorch, TensorFlow, JAX). The CUDA ecosystem represents hundreds of thousands of developer-hours of optimization work, thousands of pre-trained model checkpoints tuned for NVIDIA hardware, and an extensive library stack (NCCL, cuDNN, TensorRT, Triton Inference Server). Competitors must match CUDA's software stack to be competitive — hardware parity is insufficient.
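What that lock-in looks like in practice: in PyTorch, NVIDIA's stack is the default GPU path a developer reaches for. A minimal sketch, assuming PyTorch is installed (it falls back to CPU on non-NVIDIA machines):

```python
# Minimal PyTorch sketch of CUDA lock-in in practice.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1024, 1024, device=device)
w = torch.randn(1024, 1024, device=device)

# On an NVIDIA GPU this one line silently dispatches to cuBLAS-class kernels.
# A rival accelerator must reroute the same call through its own backend
# (ROCm, XLA, Neuron) at matching performance to be a drop-in substitute.
y = x @ w
print(y.device, y.shape)
```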
H100 economics: The H100 SXM5 (server form factor) has a list price of $25,000-$30,000, with compute cluster packages (DGX H100 with 8 GPUs) listing at $200,000+. NVIDIA's data center gross margins are approximately 70-73%. At $115B in data center revenue, NVIDIA generates $80B+ in gross profit from this segment alone.
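As a quick back-of-envelope check on those figures (a sketch using the approximations above, not reported financials):

```python
# Implied gross profit from the segment figures cited in this primer.
dc_revenue = 115e9    # ~FY2025 data center revenue, USD
gross_margin = 0.71   # midpoint of the ~70-73% range
print(f"Implied gross profit: ~${dc_revenue * gross_margin / 1e9:.0f}B")  # ~$82B -> "$80B+"
```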
Blackwell (B100/B200): NVIDIA's Blackwell architecture (announced March 2024, volume production in late 2024) delivers, by NVIDIA's own figures, roughly 2.5x training performance and up to 5x inference performance per GPU versus H100. The B200 GPU is priced at $30,000-$40,000+ — NVIDIA is using generational architecture improvements to hold average selling prices (ASPs) steady while delivering more value, avoiding the price competition that would erode its margin structure. GB200 NVL72 racks (72 B200 GPUs + 36 Grace CPUs in a single rack) are priced at $3M+ per rack.
The supply constraint as a strategic moat: NVIDIA's production is constrained by TSMC's N4 capacity (the advanced node used for H100) and CoWoS packaging capacity (more on this later). Through 2024, demand for H100s exceeded supply by 3-5x. Customers placed multi-year purchase commitments. This scarcity created pricing power that NVIDIA has used to maintain ASPs while transitioning to Blackwell.
The Challengers: AMD MI300X, Google TPUs, AWS Trainium, Microsoft Maia
The challengers to NVIDIA's dominance have made real progress but remain structurally behind on software ecosystem:
AMD MI300X: AMD's most competitive AI GPU, the MI300X, offers superior HBM3 memory capacity (192GB vs. H100's 80GB) at comparable performance, priced at $10,000-$15,000 — a meaningful discount to H100. AMD's ROCm software stack (its CUDA analogue) has improved dramatically, and PyTorch/TensorFlow support is now functional. Microsoft Azure and Meta have deployed MI300X at scale for inference (where the memory capacity advantage matters most). AMD's data center GPU revenue grew from essentially zero in 2022 to approximately $5B in 2024 — real but modest next to NVIDIA's $115B.
Google TPU v5: Google's Tensor Processing Units are optimized for the dense matrix multiplication workloads common in transformer models and are programmed through the XLA compiler (JAX and TensorFlow natively; PyTorch via the less mature PyTorch/XLA). TPUs are not sold as standalone chips; customers access TPU capacity via Google Cloud TPU VMs. The advantage: Google claims 3-4x better price/performance than H100 for well-suited, XLA-compiled training workloads. The limitation: models written for CUDA-based PyTorch generally need porting to JAX or TensorFlow, a real migration barrier.
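For a sense of what that migration looks like, here is a minimal, illustrative JAX snippet (assumes the jax package is installed; toy shapes and values). The jit/grad idioms below are what a PyTorch codebase must be rewritten into:

```python
# TPU-friendly code is written against jax.numpy and XLA-compiled with jit.
import jax
import jax.numpy as jnp

@jax.jit  # XLA compiles this for the available backend: TPU, GPU, or CPU
def mse_loss(w, x, y):
    pred = x @ w
    return jnp.mean((pred - y) ** 2)

grad_fn = jax.grad(mse_loss)  # gradient w.r.t. w (the first argument)

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (128, 16))
w = jnp.zeros(16)
y = jnp.ones(128)
print(mse_loss(w, x, y), grad_fn(w, x, y).shape)
```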
AWS Trainium2: AWS's custom training chip, Trainium2 (announced late 2023), is designed for transformer model training; AWS claims 30-40% better price/performance than comparable GPU-based EC2 instances. AWS uses Trainium internally for its own model training and offers it to customers via EC2 Trn2 instances. The key challenge: like TPUs, Trainium requires AWS's Neuron SDK rather than CUDA, creating migration friction.
Microsoft Maia 100: Microsoft's first AI chip, the Maia 100 (announced November 2023, deployed in Azure in 2024), is used for inference of OpenAI models within Azure data centers. Microsoft is not selling Maia as a standalone product — it's cost reduction for Azure's AI inference business. The goal: reduce dependency on NVIDIA GPUs for inference-heavy workloads where CUDA's training optimizations are less critical.
Data Center Economics: Power, Cooling, and Why Location Matters
A modern AI data center has fundamentally different economics than a 2015 hyperscale facility:
Power density: A 2015 hyperscale rack consumed 3-5 kW of power. A rack housing a single 8-GPU H100 server draws 10-15 kW, and denser multi-server configurations exceed 40 kW. A Blackwell GB200 NVL72 rack consumes around 120 kW — roughly 25x the power density of a 2015 rack. This transforms location selection: proximity to cheap, reliable, large-scale power sources (hydroelectric dams, nuclear plants) now matters more than proximity to fiber networks.
Cooling: Traditional air cooling fails at 30+ kW/rack densities. Liquid cooling — direct liquid cooling (DLC) that pipes coolant directly to chip heat spreaders, or immersion cooling (servers submerged in dielectric fluid) — is required for AI clusters. The liquid cooling infrastructure represents 15-20% of data center construction cost.
Power cost as operating expense: At $0.04/kWh (cheap industrial electricity) and 100% utilization, a 1,000-rack H100 cluster (10 MW of IT load, 14 MW total at a PUE of 1.4) incurs roughly $5M per year in power costs alone. At $0.08/kWh (closer to the US industrial average), power runs about $10M per year. Location selection with access to cheap hydro or nuclear power directly impacts unit economics for inference services.
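The arithmetic behind those figures, as a sketch; the PUE of 1.4 (facility overhead for cooling and power conversion) and 24/7 full utilization are the assumptions matching the 14 MW figure above:

```python
# Power-cost sketch for the cluster described above.
racks = 1_000
it_load_kw = racks * 10          # 10 kW per rack -> 10 MW of IT load
total_kw = it_load_kw * 1.4      # PUE of 1.4 -> 14 MW at the meter
hours_per_year = 8_760

for price in (0.04, 0.08):       # cheap industrial vs. ~US industrial average
    annual = total_kw * hours_per_year * price
    print(f"${price}/kWh -> ${annual / 1e6:.1f}M per year")
# $0.04/kWh -> ~$4.9M/year; $0.08/kWh -> ~$9.8M/year
```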
Geographic considerations: Data center construction is accelerating in: Northern Virginia (ample power, existing fiber, AWS/Azure/Google already present), Iowa and Texas (cheap wind power, low land cost), Dubai and Singapore (hub for MENA/APAC AI demand), and Scandinavia (cheap hydro, cool climate reducing cooling costs). Supply chain constraints on electrical equipment (transformers, switchgear) are creating 2-3 year lead times for new data center power infrastructure.
Hyperscaler Capex: $250B+ in 2024 — Who's Spending What and Why
| Company | 2024 Capex (approx.) | 2025 Capex Guidance | Primary AI Allocation |
|---|---|---|---|
| Microsoft | ~$55B | $80B+ | Azure GPU clusters, OpenAI infrastructure |
| Meta | ~$39B | $60-65B | LLaMA training, AI recommendation systems |
| Alphabet | ~$52B | $75B | GCP TPU clusters, Gemini training |
| Amazon | ~$78B | $100B+ | AWS GPU clusters, Trainium/Inferentia (total also covers the fulfillment network) |
Total four-company capex in 2025: $315B+. Approximately 40-50% of this is AI-related (GPU procurement, data center construction for AI workloads). The non-AI portion (logistics, traditional cloud compute, office buildings) is relatively stable; AI is the marginal driver of capex growth.
The NVIDIA supply chain impact: hyperscaler GPU procurement commitments are multi-year. Microsoft has committed to multi-billion dollar GPU purchases through 2026-2027. This creates revenue visibility for NVIDIA that is unusual for hardware vendors and supports analysts' revenue estimates 18-24 months out.
The Supply Chain: TSMC, SK Hynix HBM, CoWoS Packaging as Bottlenecks
Three components are the critical constraints on AI GPU supply:
TSMC N4/N3 advanced nodes: NVIDIA's H100 uses TSMC's N4 process; Blackwell uses N4 (some components) and N3. TSMC's N3 capacity is also demanded by Apple (A18 chips), Qualcomm, AMD, and Intel. TSMC is expanding N3 capacity at fabs in Taiwan, Arizona (TSMC Arizona Phase 1 operational, Phase 2 under construction), and Japan (Kumamoto). Capital intensity of leading-edge fab expansion: $20-25B per facility. No competitor can realistically build a TSMC-equivalent fab — Samsung's 3nm yield issues and Intel Foundry's 18A delays underscore TSMC's manufacturing supremacy.
SK Hynix HBM3/HBM3e: High Bandwidth Memory (HBM) is 3D-stacked DRAM placed alongside the GPU die on a shared package. The H100 SXM carries 80GB of HBM3 (3.35 TB/s of bandwidth); the H200 carries 141GB of HBM3e (4.8 TB/s); the B200 carries 192GB of HBM3e. That bandwidth is critical for the memory-bandwidth-limited operations that dominate large model inference. SK Hynix has approximately 50% market share in HBM, with Samsung and Micron competing for the remainder. SK Hynix's HBM yield advantage gives it pricing power — NVIDIA has reportedly prioritized SK Hynix HBM for H200/B200 production.
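Why bandwidth, not FLOPs, bounds inference: each generated token requires streaming the model's weights through the GPU once, so peak tokens/second per GPU is roughly bandwidth divided by weight bytes. A simplified roofline sketch, using an illustrative 70B-parameter FP16 model (ignores KV-cache traffic, batching, and multi-GPU sharding):

```python
# Simplified bandwidth roofline for single-stream (batch-1) decoding.
def tokens_per_sec_ceiling(params_b: float, bytes_per_param: int, bw_tb_s: float) -> float:
    weight_bytes = params_b * 1e9 * bytes_per_param  # all weights read per token
    return bw_tb_s * 1e12 / weight_bytes

for gpu, bw in [("H100 (HBM3, 3.35 TB/s)", 3.35), ("H200 (HBM3e, 4.8 TB/s)", 4.8)]:
    print(f"{gpu}: ~{tokens_per_sec_ceiling(70, 2, bw):.0f} tokens/s ceiling")

# ~24 tokens/s on H100 vs ~34 on H200: the ceiling scales directly with
# bandwidth. (140 GB of FP16 weights also exceeds H100's 80GB, so this model
# must be sharded or quantized on H100 -- capacity and bandwidth both bind.)
```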
CoWoS packaging: Chip-on-Wafer-on-Substrate (CoWoS) is TSMC's advanced packaging technology that enables the side-by-side integration of GPU dies and HBM stacks. CoWoS is the physical manufacturing step that makes HBM integration possible. TSMC CoWoS capacity is the most acute bottleneck in the entire AI supply chain — demand has exceeded capacity by 2-3x through 2024-2025, directly limiting H100/H200 unit volumes. TSMC is adding CoWoS capacity aggressively (investing $3B+ in packaging expansion), but lead times for new packaging equipment are 12-18 months.
Networking: InfiniBand vs. Ethernet, Spectrum-X, and Why 400G/800G Matters
AI training clusters require ultra-low-latency, high-bandwidth interconnects between GPUs. A 1,000-GPU training cluster performing all-reduce communication (synchronizing gradients across all GPUs after every training step) moves hundreds of gigabytes per synchronization, under latency budgets measured in microseconds.
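To size that traffic: a ring all-reduce moves about 2(N-1)/N times the gradient payload through each GPU per step. A rough sketch, assuming FP16 gradients for a 70B-parameter model (an illustrative choice, not a figure from this section):

```python
# Ring all-reduce traffic sketch: why 400G/800G links matter for training.
def allreduce_seconds(grad_bytes: float, n_gpus: int, link_gbps: float) -> float:
    per_gpu_bytes = 2 * (n_gpus - 1) / n_gpus * grad_bytes  # ring algorithm traffic
    return per_gpu_bytes / (link_gbps / 8 * 1e9)            # Gbps -> bytes/sec

grad_bytes = 70e9 * 2   # 70B params, FP16 gradients = 140 GB
for gbps in (400, 800):
    t = allreduce_seconds(grad_bytes, n_gpus=1024, link_gbps=gbps)
    print(f"{gbps}G link: ~{t:.1f} s per full gradient sync")
# ~5.6 s at 400G vs ~2.8 s at 800G. In practice the sync is overlapped with
# compute, but link speed directly bounds how well overlap can hide it.
```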
InfiniBand (NVIDIA/Mellanox): InfiniBand HDR/NDR (400Gbps/800Gbps) is the dominant AI cluster networking standard. NVIDIA acquired Mellanox (the InfiniBand leader) for $6.9B in 2020 — a prescient acquisition. Mellanox InfiniBand is purpose-built for AI cluster communication with features like congestion control, adaptive routing, and in-network computing. NVIDIA's networking revenue (InfiniBand, Ethernet, and NVLink products) reached roughly $13B in FY2025.
Ethernet alternatives: Ethernet-based AI fabrics, built on Broadcom's Tomahawk/Jericho switch silicon or on NVIDIA's own Ethernet platform, Spectrum-X, are positioning as alternatives to InfiniBand, pairing standard 400G/800G links with AI-optimized congestion control. The advantage: Ethernet is the standard for all non-AI networking, so converging AI and general compute on one fabric reduces infrastructure complexity. Broadcom has won significant design wins with hyperscalers (Meta, Google) preferring Ethernet over InfiniBand for AI clusters.
NVLink (intra-node): For GPU-to-GPU communication within a single server or rack, NVIDIA's NVLink (1.8 TB/s of bidirectional bandwidth per GPU in Blackwell, extended across the NVL72 rack by NVLink Switch) is roughly an order of magnitude faster than PCIe Gen5. NVLink is NVIDIA-proprietary and only available between NVIDIA GPUs — another element of vendor lock-in.
Who Actually Captures Value in the Stack?
Value capture in AI infrastructure follows a predictable pattern — it concentrates at structural bottlenecks:
| Layer | Key Players | Gross Margin | Why They Capture Value |
|---|---|---|---|
| Advanced node fab | TSMC | ~55% | Irreplaceable manufacturing capability; 2-3 year lead over Samsung |
| HBM memory | SK Hynix, Samsung | ~45-50% for HBM | Oligopoly on specialized memory; CoWoS integration complexity |
| GPU | NVIDIA (70%+ market share) | ~73% | CUDA software lock-in + hardware performance + brand |
| Networking | NVIDIA/Mellanox, Broadcom | ~60-65% | AI cluster specialization, protocol dominance |
| Power/Cooling | Vertiv, Eaton, Schneider, CoolIT | ~30-40% | Supply constrained; engineering specialization for liquid cooling |
| Data center REIT | Equinix, Digital Realty, Iron Mountain | ~60% EBITDA margin | Location monopolies, power access rights |
| Cloud hyperscaler | AWS, Azure, GCP | ~37-43% op margin | Scale, existing customer relationships, model access |
The weakest value capture sits at Layer 2 (server assembly — Dell, HPE, Supermicro), Layer 3 (orchestration — largely open source), and Layer 5 (model providers, which face open-source commoditization pressure from LLaMA/Mistral).
Investment Implications: Pure Plays vs. Diversified Exposure
Highest-conviction pure plays:
- NVIDIA: The primary beneficiary of AI infrastructure buildout. 70%+ gross margins on $115B+ revenue is unprecedented. Risk: AMD/Google/custom silicon eroding market share; open-source alternatives to CUDA (e.g., OpenAI's Triton); China export controls limiting ~$10B in annual revenue.
- TSMC: Manufacturing monopoly for leading-edge AI chips. Every NVIDIA, AMD, Apple, Qualcomm AI chip requires TSMC. Geopolitical risk (Taiwan) is the primary overhang.
- SK Hynix: HBM memory pricing power, 50% market share in the critical component for H200/B200 GPUs.
Data center infrastructure:
- Vertiv: Liquid cooling systems, power distribution. Revenue growing 25-30% as AI data center density increases.
- Eaton/Schneider Electric: Electrical infrastructure for data centers. Less AI-specific but benefits from overall capex cycle.
- Equinix / Digital Realty: Data center REITs with AI colocation demand driving pricing power.
Diversified hyperscaler exposure:
- Amazon / Microsoft / Alphabet: Capex is a cost, but the AI cloud services revenue (AI inference, managed model access) will ultimately generate superior returns on that capex. Long-term, these are the best risk-adjusted AI infrastructure investments.
Takeaways for Investors
- CUDA is NVIDIA's real moat, not the GPU: Competing with NVIDIA requires not just better hardware (AMD and Google have made progress) but replicating nearly two decades of software ecosystem. The moat is software, not silicon.
- Supply chain bottlenecks are investable: TSMC and SK Hynix sit at structural chokepoints with pricing power. Vertiv and data center REIT names benefit from capex cycle tailwinds that are 3-5 year themes.
- The capex cycle is real but finite: $250B+ in annual AI capex cannot continue indefinitely. When AI infrastructure buildout normalizes (likely 2027-2028), demand for H-series GPU successors must be driven by inference scaling rather than new model training — a smaller market than training at peak cycle.
- Power is the underrated constraint: The physical limitation on AI infrastructure buildout is not silicon or capital — it's electricity. Companies with access to cheap, abundant, grid-stable power (in proximity to data center demand) have an asymmetric advantage.
- Watch for margin compression in model providers: The Layer 5 (foundation model) market is facing open-source commoditization (LLaMA 3, Mistral Large 2). Value will flow down the stack to infrastructure, not up to models. Invest in picks-and-shovels, not gold-rushers.