Data Centres: We Have a Cooling Problem

An Amazon data centre in Northern Virginia got too hot on 7 May and took Coinbase, FanDuel, and CME Group down with it. Trading on Coinbase was dark for nearly seven hours, and full recovery took the better part of a day. The real problem started long before the temperature spiked.

Joseph Adebayo

Data centre cooling failures are no longer theoretical risk scenarios. On 7 May 2026, Amazon Web Services confirmed an overheating event at one of its facilities in Northern Virginia — specifically in Availability Zone use1-az4 of the US-East-1 region, the company’s oldest, largest, and most heavily used data centre hub. The thermal failure knocked out EC2 instances and EBS volumes across affected hardware racks, and full recovery took roughly 18 hours. Coinbase lost core trading services for nearly seven hours. FanDuel customers could not access their accounts or cash out bets. CME Group’s CME Direct trading platform also reported disruptions. The incident is not just a data centre story. It is a warning about what happens when AI-driven infrastructure growth outpaces the physical engineering needed to support it.

What’s Happening & Why It Matters

What Actually Failed — and When

AWS engineers detected thermal problems in Availability Zone use1-az4 at approximately 5:25 p.m. Eastern Time on Thursday, 7 May. The overheating triggered a power disruption, damaging server hardware across the affected racks. AWS posted its first status update at 8:25 p.m. Eastern Time, roughly three hours after the problem was first detected. By that point, every service dependent on the affected EC2 instances and EBS volumes in that zone was already experiencing impairment.

At 9:00 p.m. Eastern, FanDuel posted on X that it was “aware and investigating” technical difficulties. Two hours later, the company confirmed the issue was traced to the AWS outage. Coinbase posted on X Friday that failures in multiple AWS zones had caused an extended outage of core trading services. CME Group reported problems with its CME Direct platform. By Friday morning, AWS confirmed: “Full recovery is still expected to take several hours — efforts are slower than we had previously anticipated.”

AWS’s Statement and the Cooling Problem

AWS described the failure plainly. “We are actively working to bring additional cooling system capacity online, which will enable us to recover the remaining affected hardware in the impacted zone,” the company stated in its Friday update. “Some customers will continue to see their affected EC2 instances and EBS volumes as impaired until we achieve full recovery.” That statement reveals the core of the problem. AWS needed to bring additional cooling capacity online — meaning the existing cooling infrastructure could not handle the heat load. That is not a one-off equipment failure. It is a capacity mismatch between heat generation and heat removal.

Once service was fully restored, AWS confirmed the issue had been confined to a single Availability Zone. That containment is significant: it means AWS’s multi-AZ architecture performed as designed, limiting the blast radius. At the same time, the fact that a single-zone failure took down Coinbase, FanDuel, and CME Group for hours reveals how many mission-critical services still depend on a single zone without adequate failover.
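
For teams wondering whether they carry the same exposure, the first question is simply where a fleet actually runs. The snippet below is a minimal sketch, not tooling from AWS or any of the affected companies: it uses boto3's standard describe_instances call to count running EC2 instances per Availability Zone, assuming credentials for the account being audited are already configured.

```python
# A minimal sketch (assumed helper, not AWS's or any affected company's tooling):
# audit how many running EC2 instances sit in each Availability Zone, using
# boto3's standard describe_instances API. Assumes credentials are configured.
from collections import Counter

import boto3


def az_spread(region: str = "us-east-1") -> Counter:
    """Count running EC2 instances per Availability Zone in one region."""
    ec2 = boto3.client("ec2", region_name=region)
    counts: Counter = Counter()
    paginator = ec2.get_paginator("describe_instances")
    pages = paginator.paginate(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )
    for page in pages:
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                counts[instance["Placement"]["AvailabilityZone"]] += 1
    return counts


if __name__ == "__main__":
    spread = az_spread()
    total = sum(spread.values()) or 1
    for az, n in spread.most_common():
        print(f"{az}: {n} instances ({n / total:.0%} of fleet)")
    # If one zone holds most of the fleet, a repeat of the use1-az4 failure
    # becomes a full outage rather than a degraded one.
```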

The Engineering Problem Behind the Thermal Event

The AWS cooling failure fits a pattern the data centre industry has been tracking for years. AI hardware generates heat at densities that older facilities were never designed to manage. A standard compute rack from a decade ago drew between 5 and 10 kilowatts of power. A single AI server rack today draws between 30 and 100 kilowatts. That power converts almost entirely to heat, and traditional air cooling cannot carry it away fast enough at those densities: air holds too little thermal energy per unit volume to do the job at manageable airflow rates.
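
Some rough arithmetic makes the point. Using the standard sensible-heat relation Q = m·cp·ΔT and an assumed (not AWS-reported) 12°C temperature rise across the rack, the airflow required grows directly with rack power:

```python
# Back-of-the-envelope airflow arithmetic using the sensible-heat relation
# Q = m_dot * cp * dT. The 12 degC inlet-to-exhaust temperature rise is an
# assumed, typical figure, not a measured AWS value.
AIR_CP = 1005.0        # J/(kg*K), specific heat of air
AIR_DENSITY = 1.2      # kg/m^3, near room conditions
DELTA_T = 12.0         # K, assumed temperature rise across the rack
M3S_TO_CFM = 2118.88   # cubic metres per second -> cubic feet per minute


def airflow_needed_cfm(heat_watts: float) -> float:
    """Airflow (CFM) required to carry away a rack's heat load."""
    mass_flow = heat_watts / (AIR_CP * DELTA_T)   # kg/s of air
    volume_flow = mass_flow / AIR_DENSITY         # m^3/s of air
    return volume_flow * M3S_TO_CFM


for rack_kw in (5, 10, 30, 100):
    print(f"{rack_kw:>4} kW rack -> ~{airflow_needed_cfm(rack_kw * 1000):,.0f} CFM")
# A legacy 10 kW rack needs on the order of 1,500 CFM; a 100 kW AI rack needs
# roughly 15,000 CFM, beyond what conventional raised-floor air systems can
# deliver to a single rack.
```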

Precision cooling systems — chilled water loops, computer room air handlers (CRAHs), and direct liquid cooling (DLC) — maintain server inlet temperatures within a narrow band. The ASHRAE recommended range is between 64°F and 80°F (18°C and 27°C). When ambient or equipment temperatures exceed those limits, servers throttle performance first. If that fails to resolve the condition, they shut down entirely to prevent hardware damage. In AWS’s case, the cooling system could not keep pace with the heat load — and the shutdown followed.
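
In simplified terms, that protection sequence looks something like the sketch below. The ASHRAE band is real; the throttle and shutdown trip points are illustrative assumptions, since vendors set their own firmware thresholds and AWS has not published the values involved on 7 May.

```python
# A simplified sketch of the protection sequence described above. The ASHRAE
# recommended inlet band (18-27 degC) is real; the throttle and shutdown trip
# points are illustrative assumptions, not AWS's actual firmware thresholds.
ASHRAE_LOW_C = 18.0      # recommended minimum server inlet temperature
ASHRAE_HIGH_C = 27.0     # recommended maximum server inlet temperature
THROTTLE_TRIP_C = 32.0   # assumed point where performance throttling begins
SHUTDOWN_TRIP_C = 40.0   # assumed point where the server powers off


def thermal_action(inlet_temp_c: float) -> str:
    """Map a server inlet temperature to the coarse protective action taken."""
    if inlet_temp_c >= SHUTDOWN_TRIP_C:
        return "shutdown"      # hardware protection: the failure mode seen on 7 May
    if inlet_temp_c >= THROTTLE_TRIP_C:
        return "throttle"      # clocks drop, latency rises, capacity shrinks
    if ASHRAE_LOW_C <= inlet_temp_c <= ASHRAE_HIGH_C:
        return "nominal"
    return "out-of-band"       # allowable, but outside the recommended envelope


for temp in (21.0, 29.0, 34.0, 42.0):
    print(f"{temp:.0f} degC inlet -> {thermal_action(temp)}")
```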

AI Infrastructure Is Making the Problem Worse

The data centre cooling problem is not static. It is accelerating. Amazon, Microsoft, and Google are under intense pressure to bring new AI-capable capacity online. That pressure means older facilities — designed for the thermal profiles of conventional cloud workloads — are being repurposed or overloaded to meet AI demand. At the same time, the industry is deploying newer AI hardware into those same facilities. The thermal gap between what the buildings can handle and what the equipment demands widens with each GPU generation.

Solutions do exist. Direct liquid cooling (DLC) circulates coolant through cold plates mounted directly on processors. Immersion cooling submerges entire servers in dielectric fluid. Both approaches move heat far more efficiently than air, and both require significant retrofitting of existing facilities: expensive, operationally complex, and time-consuming work. The economics of AI infrastructure expansion cut against the time investment required for proper thermal redesign.
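
The efficiency gap is easy to put numbers on. Water holds roughly 3,500 times more heat per unit volume than air, so the volume of coolant that has to move for the same rack is dramatically smaller. The comparison below is illustrative only, assuming an 80 kW rack and a 10°C coolant temperature rise:

```python
# A rough comparison, using textbook property values, of how much fluid must
# move to carry 80 kW of rack heat. The 80 kW load and 10 degC coolant
# temperature rise are illustrative assumptions.
HEAT_W = 80_000.0                       # watts of rack heat to remove
DELTA_T = 10.0                          # K, assumed coolant temperature rise
AIR_VOL_HEAT_CAP = 1.2 * 1005.0         # J/(m^3*K): density * specific heat
WATER_VOL_HEAT_CAP = 998.0 * 4186.0     # J/(m^3*K): density * specific heat

air_flow = HEAT_W / (AIR_VOL_HEAT_CAP * DELTA_T)      # m^3/s of air
water_flow = HEAT_W / (WATER_VOL_HEAT_CAP * DELTA_T)  # m^3/s of water

print(f"Air:   {air_flow:.2f} m^3/s (~{air_flow * 2118.88:,.0f} CFM)")
print(f"Water: {water_flow * 60_000:.0f} litres per minute")
print(f"Water moves the same heat in roughly 1/{air_flow / water_flow:,.0f} of the volume")
```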

What Fintech and Crypto Learned the Hard Way

The Coinbase disruption carries a pointed irony. Coinbase is a cryptocurrency exchange, part of an industry that markets itself on decentralisation. In practice, the exchange’s core trading services ran in a single AWS Availability Zone in Northern Virginia. A cooling failure in that zone disabled crypto trading for nearly seven hours. That dependency is not unique to Coinbase. Most major fintech and crypto platforms use AWS, Google Cloud, or Microsoft Azure for their operational infrastructure. The abstraction of “cloud” obscures the physical reality: somewhere, a building in Virginia got too hot.

CME Group’s involvement adds further urgency. CME operates markets for derivatives, futures, and options — financial instruments where timing is not merely commercial but legally and contractually significant. A seven-hour disruption to trading infrastructure at a regulated exchange has regulatory implications that persist well after the servers come back online. CME Group confirmed users could log in again after what it called “essential maintenance” — without addressing the root cause.

The Climate Irony That Nobody Missed

The overheating story carries an irony that several media outlets noted immediately. AWS data centres collectively consume gigawatts of power globally. That consumption converts to heat. That heat contributes — at a large scale — to the warming ambient conditions that make cooling harder. Data centres already account for approximately 0.5% of global carbon emissions. Cornell University researchers found that at current AI growth rates, data centres could emit between 24 and 44 million metric tons of CO₂ by 2030 — equivalent to adding between 5 and 10 million cars to US roads. One study found that data centres raise ambient temperatures for miles around their perimeters. A facility that contributes to regional warming faces higher cooling loads as a direct result. The cycle is self-reinforcing.

External climate matters operationally, too. AWS has not confirmed whether outside temperatures contributed to the 7 May event. Air-side economisers — systems that use outside air for free cooling — operate with reduced headroom when ambient temperatures rise. Northern Virginia’s climate is temperate but increasingly variable. As heat events become more frequent, the assumptions baked into legacy cooling designs become less reliable.

TF Summary: What’s Next

AWS confirmed full service restoration for the US-East-1 region by 11:30 a.m. Eastern Time on Friday, 8 May — approximately 18 hours after the thermal event began. A root cause analysis had not been published at the time of writing. Coinbase confirmed customer funds were not at risk and that the primary issue was fully resolved. AWS accounts for approximately one-third of the global cloud infrastructure market, meaning the event affected millions of downstream services simultaneously. The US-East-1 region in Northern Virginia is the single largest concentration of AWS infrastructure on the planet.

MY FORECAST: The incident accelerates two outcomes. First, enterprise customers running mission-critical financial services on single-zone AWS configurations will face regulatory and internal pressure to implement genuine multi-AZ and multi-region redundancy — not just in architecture diagrams but in tested, exercised failover procedures. Second, AWS, Microsoft Azure, and Google Cloud will accelerate their liquid cooling deployment programmes for AI-density workloads. The 7 May event is not the last data centre cooling failure the industry will experience at scale. It is the first one loud enough to reach mainstream financial media — and that attention will produce operational and design changes that a quieter technical failure would not.


