Date: 2023-07-31

When it comes to high-performance computing (HPC), liquid-cooled systems are often used in facilities specifically designed to meet their power and thermal requirements. However, when deploying artificial intelligence (AI), the situation becomes more complex. While the AI systems themselves use similar CPUs, GPUs, and NICs, the environments in which they are deployed can vary significantly.

In some co-location facilities, direct-liquid cooling and large power racks may be the norm. However, many customers do not have access to these luxuries. The choice of where to deploy AI compute often depends on where the data is already stored, which poses a challenge for co-location providers. They must support not only conventional deployments but also denser configurations that require more advanced thermal management and power delivery.

The rise of generative AI, in particular, has raised the stakes. Large language models with billions of parameters require significantly more compute power than traditional workloads like image classification. Despite those requirements, the prospect of new revenue and a competitive edge is motivating many enterprises to invest in AI infrastructure.

Finding the right balance between evolving compute technology and customer demands is challenging for real estate investment trusts like Digital Realty. Overbuilding for power and cooling may make the facility uncompetitive, while under-building could mean turning away demanding customers. Flexibility is key, and Digital Realty has been approaching facility construction in a modular fashion to accommodate changing needs.

Co-location facilities typically plan for the lowest common denominator in terms of power consumption. Most co-location cabinets have a capacity of 7-10 kilowatts, which is inadequate for modern AI systems. For example, a single Nvidia DGX H100 can draw up to 10.2 kilowatts, more than an entire typical cabinet is provisioned to supply. To train large language models on a reasonable timescale, companies often require multiple systems, which multiplies that power demand.
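The mismatch is easy to check with a little arithmetic. The following is a minimal sketch that uses only the figures quoted above (7-10 kilowatt cabinets, roughly 10.2 kilowatts per DGX H100); the function names are hypothetical, and the totals cover compute nodes only, ignoring networking, storage, and cooling overhead.

```python
import math

# Per-system draw quoted in the text for a DGX H100, in kilowatts.
DGX_H100_KW = 10.2


def systems_per_cabinet(cabinet_kw: float, system_kw: float = DGX_H100_KW) -> int:
    """Whole systems that fit within a cabinet's power budget."""
    return math.floor(cabinet_kw / system_kw)


def cluster_power_kw(n_systems: int, system_kw: float = DGX_H100_KW) -> float:
    """Aggregate draw of the compute nodes alone (no networking or storage)."""
    return n_systems * system_kw


if __name__ == "__main__":
    # Typical co-location cabinets from the text: 7 to 10 kW.
    for cabinet_kw in (7.0, 10.0):
        fits = systems_per_cabinet(cabinet_kw)
        print(f"{cabinet_kw:.0f} kW cabinet holds {fits} DGX H100 system(s)")

    # Compute-only power for a hypothetical 32-system training cluster.
    print(f"32 systems draw about {cluster_power_kw(32):.1f} kW")
```

Under these assumptions, neither a 7 kilowatt nor a 10 kilowatt cabinet can power even one system, while a 32-system cluster needs on the order of 326 kilowatts for compute alone.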

Digital Realty addressed this challenge at its KIX13 site in Osaka, where it designed its facility to handle Nvidia’s DGX H100 systems and the 32-node SuperPOD configuration. Each compute rack demands over 42 kilowatts of power, challenging the cooling capabilities of the facility. Power delivery can also be a challenge, as the DGX H100 requires three discrete power sources, unlike previous models.
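To give a rough sense of what a 42 kilowatt rack means for cooling, the sketch below converts rack power into a heat load. It assumes that essentially all electrical power drawn by the rack ends up as heat, uses the standard conversions of about 3,412 BTU/hr per kilowatt and 12,000 BTU/hr per ton of refrigeration, and takes four systems per rack as an illustrative assumption that the text does not state. The function names are hypothetical.

```python
import math

KW_TO_BTU_PER_HR = 3412.14     # 1 kW of heat is about 3,412 BTU/hr
BTU_PER_HR_PER_TON = 12_000.0  # one ton of refrigeration


def heat_load_btu_per_hr(rack_kw: float) -> float:
    """Heat output of a rack, assuming all electrical power becomes heat."""
    return rack_kw * KW_TO_BTU_PER_HR


def cooling_tons(rack_kw: float) -> float:
    """Cooling capacity, in tons of refrigeration, needed to remove that heat."""
    return heat_load_btu_per_hr(rack_kw) / BTU_PER_HR_PER_TON


def racks_needed(n_nodes: int, nodes_per_rack: int) -> int:
    """Racks required to house n_nodes at a given packing density."""
    return math.ceil(n_nodes / nodes_per_rack)


if __name__ == "__main__":
    rack_kw = 42.0       # per-rack figure from the text
    nodes_per_rack = 4   # assumption for illustration, not stated in the text

    racks = racks_needed(32, nodes_per_rack)
    print(f"Racks for 32 nodes: {racks}")
    print(f"Heat per rack: {heat_load_btu_per_hr(rack_kw):,.0f} BTU/hr")
    print(f"Cooling per rack: {cooling_tons(rack_kw):.1f} tons")
    print(f"Total compute-rack power: {racks * rack_kw:.0f} kW")
```

With these assumptions, each rack rejects roughly as much heat as four or more conventional 10 kilowatt cabinets, which is why per-rack cooling capacity, not floor space, tends to be the limiting factor.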

Not every AI system deployed will be as power-hungry as the DGX H100. The design of Nvidia’s DGX GH200 cluster, for instance, appears to account for the power and cooling limitations of enterprise and co-location data centers. The cluster is spread out over a larger area, which lowers the density of each rack and makes it easier to deploy in older co-location facilities with limited per-rack power and cooling.

In conclusion, deploying AI compute in co-location facilities presents unique challenges. The increasing power and thermal demands of AI systems require careful planning and flexibility in facility design to meet customer needs while remaining competitive in the market.