Energy Is All You Need
There are two crucial currencies for the future:
- FLOPS (computational power)
- Watts (energy efficiency)
We need more intelligence and more energy.
Computation power
- Training OpenAI’s GPT-4 required an estimated 21 billion petaFLOP (a petaFLOP is 10¹⁵ floating point operations).
- The iPhone 12 is capable of roughly 0.01 petaFLOP. It would take around 60,000 years to train GPT-4 on an iPhone 12.
- The NVIDIA GB200 NVL72 is capable of 1,440 petaFLOP in a single rack.
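A quick back-of-the-envelope check of these figures (all inputs are the estimates quoted above, not measurements):

```python
# Rough check of the claims above. Inputs are the article's estimates:
# GPT-4 training compute ~21e9 petaFLOP, iPhone 12 ~0.01 petaFLOP/s,
# GB200 NVL72 rack ~1,440 petaFLOP/s (assuming ideal utilization).
SECONDS_PER_YEAR = 365 * 24 * 3600

gpt4_pflop = 21e9       # total training compute, petaFLOP
iphone_pflops = 0.01    # iPhone 12 throughput, petaFLOP/s
rack_pflops = 1440      # GB200 NVL72 throughput, petaFLOP/s

iphone_years = gpt4_pflop / iphone_pflops / SECONDS_PER_YEAR
rack_days = gpt4_pflop / rack_pflops / (24 * 3600)

print(f"iPhone 12: ~{iphone_years:,.0f} years")    # roughly 66,000 years
print(f"One NVL72 rack: ~{rack_days:,.0f} days")   # under half a year
```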
Training compute of frontier AI models grows by 4-5x per year
Notable models include:
- AlexNet (c. 2012, ~10¹⁸ FLOP)
- AlphaGo Master (c. 2016, ~10²² FLOP)
- AlphaGo Zero (c. 2017, ~10²³ FLOP)
- GPT-3 (c. 2020, just below 10²⁴ FLOP)
- PaLM (c. 2022, ~10²⁴ FLOP)
- GPT-4 (c. 2023, above 10²⁴ FLOP)
- Gemini Ultra (c. 2023, ~10²⁵ FLOP)
The overall trend shows a massive and consistent increase in the computational power required to train frontier AI models over the last decade.
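The 4-5x/year figure can be sanity-checked against the milestones above; assuming AlexNet at ~10¹⁸ FLOP in 2012 and GPT-4 at ~2.1×10²⁵ FLOP (the 21 billion petaFLOP quoted earlier) in 2023:

```python
# Implied annual growth in frontier training compute between two of the
# milestones listed above (both figures are public estimates, not exact).
alexnet_flop, alexnet_year = 1e18, 2012
gpt4_flop, gpt4_year = 2.1e25, 2023

years = gpt4_year - alexnet_year
growth = (gpt4_flop / alexnet_flop) ** (1 / years)
print(f"Implied growth: ~{growth:.1f}x per year")  # lands inside the 4-5x band
```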
Key needs of training and inference
- AI training workloads have unique requirements that are very dissimilar to those of typical hardware deployed in existing data centers.
- First, models train for weeks or months, with network connectivity requirements largely limited to training data ingress. Training is latency insensitive and does not need to be near any major population centers.
- AI training clusters can be deployed essentially anywhere in the world that makes economic sense, subject to data residency and compliance regulations.
- Second, AI training workloads are extremely power hungry and tend to run AI hardware at power levels closer to its Thermal Design Power (TDP) than a traditional non-accelerated hyperscale or enterprise workload would.
- Additionally, while CPU and storage servers consume on the order of 1kW, each AI server now eclipses 10kW. Coupled with the insensitivity to latency and the decreased importance of proximity to population centers, this means that the availability of abundant, inexpensive electricity (and, in the future, access to any grid supply at all) is of much higher relative importance for AI training workloads than for traditional workloads.
- Inference, on the other hand, will eventually be a larger workload than training, but it can also be quite distributed. The chips don’t need to be centrally located, but the sheer volume will be staggering.
AI infrastructure demand
Estimates of power
- The IEA’s recent Electricity 2024 report suggests 90 terawatt-hours (TWh) of annual power demand from AI data centers by 2026, which is equivalent to about 10 gigawatts (GW) of data center Critical IT Power capacity, or the equivalent of 7.3M H100s.
- It’s estimated that Nvidia alone will have shipped accelerators with the power needs of 5M+ H100s (mostly shipments of H100s, in fact) from 2021 through the end of 2024.
- Data center power capacity growth will accelerate from a 12-15% CAGR to a 25% CAGR over the next few years.
- Power demand will surge from 49 gigawatts (GW) in 2023 to 96 GW by 2026, of which AI will consume ~40 GW.
- The need for abundant, inexpensive power, and to quickly add electrical grid capacity while still meeting hyperscalers’ carbon emissions commitments, coupled with chip export restrictions, will limit the regions and countries that can meet the surge in demand from AI data centers. It will also push carbon emission commitments down the priority list.
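The TWh-to-GW and GW-to-H100 conversions behind these estimates, sketched in Python (the ~1.39kW all-in per H100 is an assumption consistent with the DGX server figures later in this piece):

```python
# Converting the IEA's annual-energy estimate into average power capacity,
# then into H100-equivalents. Both conversion factors are assumptions
# taken from figures elsewhere in this article.
HOURS_PER_YEAR = 8760
WATTS_PER_H100 = 1389   # all-in server power attributed to one H100

twh_per_year = 90
avg_gw = twh_per_year * 1000 / HOURS_PER_YEAR       # TWh/year -> average GW
h100_equiv = avg_gw * 1e9 / WATTS_PER_H100          # W -> H100-equivalents

print(f"Average power: ~{avg_gw:.1f} GW")           # ~10 GW
print(f"H100 equivalents: ~{h100_equiv / 1e6:.1f}M")
```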
Global data center critical IT power (Megawatts - MW)
| Year | Non-AI Data Center Critical IT Power (MW) | AI Data Center Critical IT Power (MW) | Total Critical IT Power (MW) |
|---|---|---|---|
| 2022 | 40,000 | 3,000 | 43,000 |
| 2023 | 42,000 | 7,000 | 49,000 |
| 2024 | 45,000 | 14,000 | 59,000 |
| 2025 | 50,000 | 20,000 | 70,000 |
| 2026 | 56,000 | 40,000 | 96,000 |
| 2027 | 60,000 | 60,000 | 120,000 |
| 2028 | 62,000 | 85,000 | 147,000 |
Estimated hyperscaler data center capacity (MW)
| Company | 2022 Capacity (MW) | Projected future capacity (MW) | Total (MW) |
|---|---|---|---|
| Google | 3024 | 2905 | 5929 |
| Microsoft | 2176 | 3344 | 5520 |
| Amazon | 2480 | 2533 | 5013 |
| Meta | 1790 | 2595 | 4385 |
| Apple | 600 | 1403 | 2003 |
| Alibaba | 1350 | 487 | 1837 |
| Huawei | 494 | 192 | 686 |
| Baidu | 608 | 36 | 644 |
| Tencent | 487 | 152 | 639 |
Estimated data center construction by region (MW)
| Region | 2023 Inventory (MW) | Under construction (MW) | Total (MW) |
|---|---|---|---|
| Northern Virginia | 2499 | 1237 | 3736 |
| Atlanta | 310 | 733 | 1043 |
| Dallas-Ft. Worth | 565 | 287 | 852 |
| Chicago | 560 | 118 | 678 |
| Hillsboro | 262 | 281 | 543 |
| Silicon Valley | 428 | 113 | 541 |
| Phoenix | 360 | 164 | 524 |
| New York Tri-state | 190 | 126 | 316 |
Current overview of hyperscalers
- OpenAI has plans to deploy hundreds of thousands of GPUs in their largest multi-site training cluster, which requires hundreds of megawatts of Critical IT Power.
- Meta discusses an installed base of 650,000 H100 equivalents by the end of the year. They currently have 1M H100s on order.
- GPU cloud provider CoreWeave plans to invest $1.6B in a Plano, Texas facility, implying construction spending for up to 50MW of Critical IT Power and 30,000-40,000 GPUs in that facility alone, with a clear pathway to a company-wide 250MW data center footprint (equivalent to ~180k H100s).
- Microsoft has the largest pipeline of data center buildouts from the pre-AI era. They have been grabbing any and all colocation space they can, as well as aggressively increasing their data center buildouts.
- Amazon has made press releases about nuclear-powered data centers totaling 1,000MW. They were the last of the hyperscalers to wake up to AI.
- Google and Microsoft/OpenAI both have plans for larger-than-gigawatt-class training clusters in the works.
- From a supply perspective, sell-side consensus estimates of 3M+ GPUs shipped by Nvidia in calendar year 2024 would correspond to over 4,200 MW of data center needs, nearly 10% of current global data center capacity, just for one year’s GPU shipments.
- AI is only going to grow in subsequent years, and Nvidia’s GPUs are slated to get even more power hungry, with 1,000W, 1,200W, and 1,500W GPUs on the roadmap.
- Nvidia is not the only company producing accelerators, with Google ramping custom inference accelerator production rapidly.
- Going forward, Meta and Amazon will also ramp their in-house inference accelerators.
NVIDIA systems
DGX H100 SYSTEM
| Specification | Details |
|---|---|
| GPUs | 8x NVIDIA H100 Tensor Core GPUs |
| GPU Memory | 640GB total |
| Performance | 32 petaFLOPS |
| System Power Usage | ~11.3kW max |
| Rack Units | 8U |
| CPU | Dual 56-core 4th Gen Intel® Xeon® Scalable processors |
DGX B200 SYSTEM
| Specification | Details |
|---|---|
| GPUs | 8x NVIDIA Blackwell GPUs |
| GPU Memory | 1,440GB total GPU memory |
| Performance | 72 petaFLOPS training and 144 petaFLOPS inference |
| System Power Usage | ~14.3kW max |
| Rack Units | 10U |
| CPU | 2 Intel® Xeon® Platinum 8570 Processors 112 Cores total, 2.1 GHz (Base), 4 GHz (Max Boost) |
NVIDIA GB200 NVL72
| Specification | Details |
|---|---|
| Configuration | 36 Grace CPUs : 72 Blackwell GPUs |
| GPU Memory | Up to 13.5TB HBM3e total GPU memory |
| Performance | 1,440 petaFLOPS |
| System Power Usage | ~120.0kW max |
| Rack Units | Full rack |
| CPU | 2,592 Arm® Neoverse V2 cores |
Data centers
Global data centers market
| 74 Countries | 1004 Cities | 3362 Data Centers | 304 Providers |
Top 10 Colocation Providers
| Provider | Locations |
|---|---|
| China Telecom | 382 locations |
| Equinix | 246 locations |
| Digital Realty | 231 locations |
| Amazon AWS | 165 locations |
| Lumen | 105 locations |
| Zenlayer | 97 locations |
| MOD Mission Critical | 94 locations |
| DataBank | 69 locations |
| Cogent Communications | 51 locations |
| Hivelocity | 45 locations |
North America data center trends H2 2023
- In 2023, primary market supply grew 26% year-over-year to 5,174.1 MW.
- An all-time high of 3,077.8 MW was under construction in primary markets, a 46% year-over-year increase. Construction increased most in Atlanta, growing by 211% to 732.6 MW under construction.
- Northern Virginia had a 42% year-over-year price increase, the largest among primary markets.
- Preleasing activity in primary markets is strengthening, with 2,553.1 MW (83%) of the 3,077.8 MW under construction preleased.
- The overall vacancy rate for primary markets remains near a record low, at 3.7%. With few relocation options, most tenants are renewing existing leases rather than seeking new facilities.
- Power availability continued to influence data center operators’ location decisions more than geography did.
H2 2023 wholesale primary market fundamentals
| Market | Inventory (MW) | Y-o-Y Change (MW) | Available MW / Vacancy Rate | 2023 Net Absorption (MW) | Rental Rates ($/kW/mo) |
|---|---|---|---|---|---|
| Northern Virginia | 2,499.1 | ▲ 439.0 | 34.7 / 1.4% | 424.4 | $150-$190 |
| Dallas-Ft. Worth | 565.3 | ▲ 173.1 | 41.6 / 7.4% | 155.2 | $135-$170 |
| Chicago | 559.6 | ▲ 217.4 | 11.7 / 2.1% | 226.8 | $145-$155 |
| Silicon Valley | 427.7 | ▲ 48.1 | 31.0 / 7.3% | 25.7 | $155-$250 |
| Phoenix | 360.0 | ▲ 35.5 | 14.2 / 3.9% | 48.8 | $170-$200 |
| Atlanta | 310.0 | ▲ 57.5 | 41.1 / 13.3% | 18.0 | $120-$130 |
| Hillsboro | 262.4 | ▲ 94.0 | 6.5 / 2.5% | 93.3 | $125-$170 |
| New York Tri-State | 190.0 | ▲ 12.5 | 12.3 / 6.5% | 14.1 | $170-$180 |
H2 2023 wholesale secondary market fundamentals
| Market | Inventory (MW) | Available MW / Vacancy Rate | 2023 Net Absorption (MW) | Rental Rates ($/kW/mo) |
|---|---|---|---|---|
| Central Washington | 186.4 | 0.4 / 0.2% | 20.9 | $135-$175 |
| Austin/San Antonio | 162.2 | 2.8 / 1.8% | 6.7 | $140-$175 |
| Southern California | 160.5 | 34.6 / 21.6% | 6.8 | $135-$160 |
| Seattle | 138.9 | 10.0 / 7.2% | 10.6 | $135-$175 |
| Houston | 134.1 | 26.5 / 19.7% | 19.2 | $140-$175 |
| Denver | 92.9 | 17.2 / 18.5% | 8.2 | $135-$145 |
| Minneapolis | 59.6 | 14.7 / 24.7% | -3.2 | $120-$175 |
| Charlotte/Raleigh | 52.1 | 13.6 / 26.0% | -1.2 | $115-$130 |
Average asking rental rate with Y-o-Y % change for primary markets
- 2014 to 2017: A gradual decline from ~$145 to ~$130.
- 2017 to 2022: Rates hovered between ~$125 and ~$130.
- 2022 to 2023: A significant jump in price, with a Y-o-Y change of +18.6%. The average rate rose to over $150/kW/month.
Data center layout and constraints
- While the DGX H100 server requires 10.2 kilowatts (kW) of IT power, most colocation data centers can still only support a power capacity of ~12 kW per rack. A typical hyperscale data center can deliver higher power capacity.
- Server deployments will therefore vary depending on the power supply and cooling capacity available, with only 2-3 DGX H100 servers deployed where power/cooling constrained, and entire rows of rack space sitting empty to double the power delivery density from 12 kW to 24 kW in colocation data centers.
- As data centers are increasingly designed with AI workloads in mind, racks will be able to achieve power densities of 30-40kW+ using air cooling, with specialized equipment to increase airflow.
- The future use of direct-to-chip liquid cooling opens the door to even higher power density, potentially reducing per-rack power usage by 10% by eliminating fan power, and lowering PUE by 0.2-0.3 by reducing or eliminating the need for ambient air cooling.
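The rack-fit constraint described above can be sketched as a simple power-budget division (the 10.2kW per DGX H100 and the rack budgets are the figures from the text; this ignores networking gear and cooling overhead):

```python
# How many DGX H100 servers fit in a rack at a given power budget.
# Figures are from the article; real deployments also budget for
# switches, PDUs, and cooling headroom, which this sketch ignores.
DGX_H100_KW = 10.2  # IT power per server (max draw is ~11.3 kW)

for rack_kw in (12, 24, 40):
    servers = int(rack_kw // DGX_H100_KW)
    print(f"{rack_kw} kW rack: {servers} DGX H100 server(s)")
```

This is why the text describes rows of empty rack space in 12kW colocation halls: doubling delivered power per occupied rack is the only way to land a second server.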
Density
- The trend towards higher power density per rack is driven by networking, compute efficiency, and cost per compute. Roughly 90% of colocation data center costs are from power and 10% from physical space.
- The data hall where IT equipment is installed is typically only about 30-40% of a data center’s total gross floor area, so designing a data hall that is 30% larger will only require 10% more gross floor area for the entire data center.
- Considering that 80% of GPU cost of ownership is from capital costs, with 20% related to hosting (which bakes in the colocation data center costs), the cost of additional space is 2-3% of the total cost of ownership for an AI cluster.
- Most existing colocation data centers are not ready for rack densities above 20kW per rack.
- Chip production constraints will meaningfully improve in 2024, but certain hyperscalers and colocation data centers will run straight into a data center capacity bottleneck.
- Limits of 12-15kW of power per rack in traditional colocation will be an obstacle to achieving the ideal physical density of AI super clusters.
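The space-cost arithmetic above, made explicit (shares are the rounded figures from the text):

```python
# Space as a share of AI cluster total cost of ownership (TCO),
# following the article's rounded shares.
hosting_share = 0.20            # hosting/colocation share of TCO (capex is ~80%)
space_share_of_hosting = 0.10   # ~90% of colo cost is power, ~10% is space

space_share_of_tco = hosting_share * space_share_of_hosting
print(f"Space: ~{space_share_of_tco:.0%} of TCO")  # at the bottom of the 2-3% range
```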
Data center constraints list
Some of the constraints were mentioned before, but there is more to consider:
- Energy (3-5+ years to build capacity)
- Energy density per rack (designing and cabling for 1MW+ of energy per 42U size rack)
- Capabilities to build new data centers and execution (18-24 months for delivery)
- Grid network connection and power distribution (transformers, switchgears, substation, transmission lines)
- Thermal dissipation/cooling of server racks (the challenge of cooling over 120kW, and before long 1MW per rack, which opens the door to two-phase cooling)
- Networking (proximity between GPUs and positioning of racks, close is better and easier but not always possible)
- Load/weight per square meter/square foot (static weight and point load)
Current pain points
Data center construction delays
| Delay Duration | Percentage of Respondents |
|---|---|
| 3-6 months | 38.80% |
| 6-9 months | 16.30% |
| 9-12 months | 15.50% |
| 12-18 months | 14.70% |
| 0-3 months | 9.30% |
| 18+ months | 5.40% |
Insufficient power planning
| Factor | Importance |
|---|---|
| Availability of secure grid power | 16.40% |
| Cost of power | 13.30% |
| Access to renewable energy | 11.50% |
| Availability of workforce and skills | 11.20% |
| Capital expenditure | 10.40% |
| Network proximity | 10.20% |
| Tax breaks/financial incentives | 8.10% |
| Favorable regulation | 6.80% |
| Proximity to client populations | 4.70% |
| Community acceptance | 4.20% |
| Availability of water | 3.40% |
Data center math
Simple calculation based on 20,480 GPUs (H100)
- 1x H100 GPU = ~$30,000; 20,480 x H100 = ~$614.4M capex for GPUs.
- 1,389W per GPU → 20,480 GPUs = 28.4 MW of power. 1 MW can power ~1,000 homes, so 28.4 MW can power ~28,400 homes.
- 28.4 MW × $0.083/kWh = ~$20.7M in electricity per year.
- Applying the same formula, a 100k-GPU cluster consumes ~140-150MW and will cost ~$100-130M in electricity per year.
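The cluster math above, end to end (prices and rates are the assumptions stated in the bullets):

```python
# Capex and annual electricity cost for a 20,480-GPU H100 cluster,
# using the article's assumed per-GPU price, power draw, and tariff.
HOURS_PER_YEAR = 8760

gpus = 20_480
gpu_price = 30_000       # USD per H100
watts_per_gpu = 1_389    # all-in server power attributed to each GPU
price_per_kwh = 0.083    # USD per kWh

capex = gpus * gpu_price
power_mw = gpus * watts_per_gpu / 1e6
energy_cost = power_mw * 1000 * HOURS_PER_YEAR * price_per_kwh

print(f"GPU capex: ${capex / 1e6:.1f}M")            # ~$614.4M
print(f"Power: {power_mw:.1f} MW")                   # ~28.4 MW
print(f"Electricity: ${energy_cost / 1e6:.1f}M/yr")  # ~$20.7M
```

Scaling the same formula to 100k GPUs gives ~139 MW, matching the ~140-150MW figure once facility overhead (PUE) is layered on top.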
Data center power usage in the US
- Data center Critical IT Capacity in the US will need to triple from 2023 to 2027.
| Metric | Units | 2020 | 2021 | 2022 | 2023 | 2024 | 2025 | 2026 | 2027 | 2028 |
|---|---|---|---|---|---|---|---|---|---|---|
| AI Data Center Critical IT Power | MW | 318 | 640 | 1,102 | 3,332 | 8,499 | 16,356 | 28,140 | 41,337 | 56,280 |
| Non-AI Data Center Critical IT Power | MW | 14,231 | 16,395 | 18,376 | 19,221 | 19,798 | 21,382 | 23,520 | 25,637 | 27,175 |
| Critical IT Power | MW | 14,550 | 17,035 | 19,478 | 22,553 | 28,297 | 37,738 | 51,660 | 66,974 | 83,455 |
| Utilization Rate | % | 65% | 66% | 66% | 67% | 70% | 72% | 73% | 74% | 75% |
| Critical IT Power Consumed | MW | 9,505 | 11,169 | 12,826 | 15,159 | 19,668 | 26,983 | 37,800 | 49,733 | 62,688 |
| Power Usage Effectiveness (PUE) | Ratio | 1.59 | 1.56 | 1.53 | 1.47 | 1.4 | 1.34 | 1.3 | 1.26 | 1.22 |
| Data Center Utility Power Consumed | MW | 15,142 | 17,407 | 19,660 | 22,323 | 27,538 | 36,263 | 48,957 | 62,521 | 76,684 |
| Data Center Actual Power Usage, per year | TWh | 133 | 152 | 172 | 196 | 241 | 318 | 429 | 548 | 672 |
| As % of United States Power Generation | % | 3.30% | 3.70% | 4.00% | 4.50% | 5.50% | 7.10% | 9.50% | 12.00% | 14.60% |
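The rows of this table relate to each other as capacity × utilization × PUE; checking the 2023 column, the figures are consistent within rounding:

```python
# Verifying the 2023 column of the table above:
# consumed = critical IT capacity x utilization
# utility draw = consumed x PUE
# annual energy = utility draw x hours per year
HOURS_PER_YEAR = 8760

critical_it_mw = 22_553
utilization = 0.67
pue = 1.47

consumed_mw = critical_it_mw * utilization   # table: 15,159 MW
utility_mw = 15_159 * pue                    # table: 22,323 MW
twh = 22_323 * HOURS_PER_YEAR / 1e6          # table: 196 TWh

print(f"{consumed_mw:,.0f} MW consumed, {utility_mw:,.0f} MW from grid, {twh:.0f} TWh/yr")
```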
Infrastructure and energy
US, East Asia, Western Europe and Middle East - overview
- The energy supply situation in the US stands in stark contrast to East Asia and Western Europe, which host about 15% and 18% of global data center capacity, respectively. While the US is self-sufficient in natural gas, countries such as Japan, Taiwan, Singapore, and Korea import well over 90% of their gas and coal needs.
- In Western Europe, electricity generation has been slowly declining, with a 5% cumulative drop over the past five years. One reason is that nuclear power has become a political non-starter, causing nuclear generation to decline massively.
- A strong focus on the “environment” has led dirty fuel sources such as coal to decline dramatically over the same period, although nuclear, the cleanest power in the world, has been replaced with coal and natural gas in some instances.
- Renewable energy is increasing within Europe’s power mix, but not fast enough, leaving many European countries scrambling to pivot towards natural gas, which now stands at 35-45% of the power generation mix for major Western European countries.
- Europe’s buildout is slowed by the many regulations and restrictions already in place on the data center and manufacturing industries. While small projects and data center pipelines are in progress, especially in France, which has at least somewhat realized the geopolitical necessity, no one is planning to build gigawatt-class clusters in Europe.
- Europe has less than 4% of globally deployed AI accelerator FLOPs, based on estimates.
- The Middle East is opening up as a new strategic hub in its geopolitical relationship with the US. ASIC companies like Groq and Cerebras have tight relationships with the region, with most of their revenue coming from it.
Electricity generation - selected regions
- China: Shows dramatic and steep growth, starting below the US in 2000 and rising sharply to become the world’s largest electricity generator, exceeding 8,000 TWh by 2023.
- United States: Generation has remained relatively flat, hovering around 4,000 TWh for the entire period.
- European Union (27): Shows a slight decline over the period, starting above 2,500 TWh and ending slightly below it.
- Japan, South Korea, France, Germany, United Kingdom, Taiwan: All show relatively flat or slightly declining generation, clustered much lower on the chart (all below 1,000 TWh).
Building out AI infrastructure at scale
The AI data center industry is going to need the following:
- Inexpensive electricity, given the immense amount of power to be consumed on an ongoing basis, particularly since inference needs will only compound over time.
- Stability and robustness of the energy supply chain against geopolitical and weather disturbances, to decrease the likelihood of energy price volatility, as well as the ability to quickly ramp up fuel production and thus rapidly provision power generation at great scale.
- Power generation with a low carbon intensity mix overall, and geography suitable for standing up massive quantities of renewable energy at reasonable economics.
Countries that can provide this are contenders to be Real AI Superpowers.
This underscores a strong case for nuclear energy as a future source, capable of providing hundreds of megawatts to over a gigawatt of electrical power.
No energy source surpasses nuclear in terms of energy density, stability, predictability over extended periods, and safety.
Conclusion
- Europe has less than 4% of global AI accelerator FLOPs and is falling behind. Political constraints limit the energy supply needed to support new builds, undermining its prospects of becoming an AI superpower.
- Most people knew about the chip and GPU shortages of the last few years. That segment is slowly recovering and may recover first. Chip production is concentrated in Taiwan at TSMC, but this is being addressed by building chip fabs on US soil.
- The biggest problem is energy, then energy density. The US and Europe have seen little growth in energy production over the last decade. Adding new energy sources and increasing production is hard. The energy grid is the next problem: it wasn’t designed for these loads and lacks the investment to upgrade (it is on average 40-50 years old).
- After solving the problem of delivering over 1MW per rack, data centers will need additional power to cool the equipment with two-phase cooling. Gigawatt data centers will be a huge challenge. Swings in power draw can destabilize the grid and will need stabilization.
- Networking in data centers is the next step. If GPU servers are far apart, more fiber/cabling is needed, increasing costs. Short reach enables lower-cost multimode optical transceivers as opposed to expensive single-mode ones.
- Parts of these problems can be solved using nuclear energy and building data centers near nuclear power plants. Solar and BESS (Battery Energy Storage Systems) are much faster to deploy and more scalable in the current environment.
- Most data centers currently being built are powered by natural gas, and the gas turbine supply chain can’t keep up with demand.
- We have not resolved any part of the supply chain to meet AI demand for the next decade, with energy being the most challenging issue.