Nvidia’s GH200 is a power-efficiency beast, Green500 shows

Analysis Despite growing alarm over spiraling datacenter power consumption, this spring’s Green500 ranking of the world’s most sustainable publicly known supercomputers shows that the same energy-hungry server accelerators behind the AI boom are also driving sizable improvements in efficiency.

Seven of the ten most power-efficient systems in the ranking employed Nvidia GPUs, while half, including the top three, were powered by the Silicon Valley giant’s Grace Hopper Superchip (GH200).

We’ve covered that processor in depth in the past. At a high level, these franken-chips combine a 72-core Grace CPU, based on Arm’s Neoverse V2 design, and 480 GB of LPDDR5x memory with an H100 GPU carrying between 96 GB and 144 GB of HBM3 or HBM3e memory, all linked by Nvidia’s high-speed NVLink chip-to-chip interconnect.

According to Nvidia, the GH200 now makes up the bulk of its HPC-centric deployments, and with the debut of the Alps machine it even claims a spot among the Top500’s ten most powerful publicly known systems. In fact, the preAlps testbed claimed the number five spot in the Green500 ranking with an efficiency of 64.4 gigaFLOPS/watt. But, as we alluded to earlier, this is far from the best the GH200 was able to manage.

At the top of the heap was Germany’s Jedi. The system is a precursor to the GH200 and SiPearl Rhea-powered Jupiter machine, which is slated to be Europe’s first exascale system when it’s completed later this year. The smaller Jedi system was able to eke out a whopping 72.7 GFLOPS/watt.

The similarly configured Isambard-AI Phase 1 and Helios GPU systems, meanwhile, followed close behind at 68.8 and 66.9 GFLOPS/watt, respectively. The key difference, aside from the two systems being larger (substantially so in the case of the 19.14 petaFLOPS Helios GPU machine), is that they use HPE’s Slingshot-11 interconnect, while Jedi relies on Nvidia’s own InfiniBand NDR200 networking.

Regardless, these figures represent a sizable improvement in system efficiency compared to just two years ago.

In early 2022, the most efficient system on the Green500 was the Frontier TDS testbed, a smaller version of the Top500’s number-one-ranked 1.2 exaFLOPS Frontier machine, which made its debut that same year. In fact, the full Frontier system ranked number two at 52.2 GFLOPS/watt, behind its testbed at 62.7 GFLOPS/watt.
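For context, the Green500 figure is simply Linpack Rmax divided by average power draw during the run, which means you can back out a system’s rough power consumption from the two published numbers. Here’s a quick back-of-the-envelope sketch in Python using the article’s round figures, purely for illustration; official submissions use measured values:

```python
def implied_power_mw(rmax_gflops: float, gflops_per_watt: float) -> float:
    """Back out average power draw in megawatts from Rmax and efficiency."""
    return rmax_gflops / gflops_per_watt / 1e6

# Frontier: ~1.2 exaFLOPS (1.2e9 GFLOPS) at 52.2 GFLOPS/watt
print(f"Frontier: ~{implied_power_mw(1.2e9, 52.2):.0f} MW")  # ~23 MW

# The same Rmax at Jedi's 72.7 GFLOPS/watt needs roughly a quarter less power
print(f"At Jedi's efficiency: ~{implied_power_mw(1.2e9, 72.7):.0f} MW")  # ~17 MW
```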

To be clear, scaling compute is a tricky endeavor and prone to inefficiencies as a system grows larger. This is one of the reasons that the most efficient systems also tend to be the smallest.

This was true for Frontier and Frontier TDS in ’22, and it’s also the case with the GH200-based systems on this spring’s Green500. While preAlps managed 64.4 GFLOPS/watt, the full Alps system achieved only about 52, putting it on par with Frontier in terms of efficiency.

This spring’s Green500 also sees AMD’s Instinct-based systems slip further down the list. Just three of the ten most efficient supers are powered by AMD accelerators, including Frontier TDS in seventh, Adastra in ninth, and Setonix GPU in tenth.

All of these systems share a common architecture and are powered by AMD’s third-gen Epyc Milan processors, MI250X GPUs, and HPE’s Slingshot-11 interconnect. Two years ago, AMD’s parts were the state of the art when it came to FLOPS/watt, and while they’re still potent, Nvidia has clearly pulled ahead.

Busting bottlenecks

The GH200’s strong performance demonstrates the advantage of a highly integrated package design. The superchip’s NVLink-C2C interconnect bridging the CPU and GPU allows for highly efficient communication between the two at up to 900 GB/s. That’s a substantial uplift over the 128 GB/s a PCIe 5.0 x16 interface is capable of today, and certainly much better than the 200 Gb/s interconnect fabrics commonly seen in these clusters.
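To put those bandwidth figures in perspective, here’s a rough sketch of how long it would take to drain a GH200’s 96 GB of HBM3 over each link at the peak rates quoted above. Real-world throughput will be lower, so treat the numbers as purely illustrative:

```python
# Peak bandwidth of each link in GB/s; the 200 Gb/s fabric figure is
# converted at eight bits per byte (~25 GB/s).
links_gb_per_s = {
    "NVLink-C2C (900 GB/s)": 900,
    "PCIe 5.0 x16 (128 GB/s)": 128,
    "200 Gb/s fabric (~25 GB/s)": 25,
}

payload_gb = 96  # the GH200's smaller HBM3 capacity

for name, bandwidth in links_gb_per_s.items():
    print(f"{name}: {payload_gb / bandwidth:.2f} s to move {payload_gb} GB")
# NVLink-C2C: ~0.11 s | PCIe 5.0 x16: 0.75 s | 200 Gb/s fabric: ~3.84 s
```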

When it comes to high performance computing, and AI for that matter, bandwidth, whether at the network, socket, or compute-and-memory level, is an ever-present bottleneck that has to be reassessed each generation. So it shouldn’t come as a surprise that Nvidia’s GH200 is so efficient.

This suggests we may see further efficiency gains from a move to faster networking gear. Until recently, PCIe 4.0-based CPUs limited interconnect bandwidth to 200 Gb/s per port. However, with PCIe 5.0-based CPUs from AMD, Intel, and others now widely available, 400 Gb/s networking is possible, and with the right tricks you can even get around that limit to support 800 Gb/s per port, as we’ve seen Nvidia do with its latest ConnectX-8 and BlueField SuperNICs.
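The arithmetic behind the per-port ceiling is straightforward: a NIC can’t push more traffic than its host’s PCIe slot can feed it. A minimal sketch, using approximate per-lane rates after encoding overhead (actual usable throughput is a touch lower once protocol overheads are accounted for):

```python
# Approximate bandwidth per PCIe lane, per direction, in GB/s
# (after 128b/130b encoding; real usable figures are slightly lower).
gb_per_lane = {"PCIe 4.0": 2.0, "PCIe 5.0": 4.0}

for gen, per_lane in gb_per_lane.items():
    slot_gbps = per_lane * 16 * 8  # x16 slot, GB/s converted to Gb/s
    print(f"{gen} x16: ~{slot_gbps:.0f} Gb/s of host bandwidth")
# PCIe 4.0 x16: ~256 Gb/s -- headroom for one 200 Gb/s port
# PCIe 5.0 x16: ~512 Gb/s -- headroom for one 400 Gb/s port
```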

Of course, Nvidia isn’t the only one getting creative with how CPUs and GPUs are meshed. AMD’s MI300A is arguably more sophisticated, co-packaging three CPU dies totaling 24 cores alongside six CDNA 3 accelerator dies and up to 128 GB of HBM3 memory.

These chips are slated to power Lawrence Livermore National Lab’s El Capitan system in the US, which could make its debut as early as this fall. The lab did put forward some early results; however, at 32 petaFLOPS of peak theoretical performance, the “early delivery” system represents a fraction of the full machine, and the submission didn’t include power consumption, making it difficult to judge how efficient the chips actually are in HPC workloads.

Cloud systems leave much to be desired

While we fully expect Livermore to disclose power consumption when it submits performance figures for the full El Capitan system, not everyone is so forthcoming.

Since last November, we’ve started to see more cloud-based compute clusters, most notably Microsoft’s number-three-ranked Eagle system, make their way onto the Top500. Unfortunately, these submissions either don’t track or don’t release power consumption data for the benchmark runs, making it difficult to determine just how efficient these machines actually are.

However, because these systems are deployed in more traditional datacenters, we suspect these clusters probably aren’t particularly efficient. Unlike the highly tuned, liquid-cooled chassis you typically see from Lenovo, HPE, and Eviden in supercomputer deployments, the GPU nodes powering these cloud systems are often air cooled, meaning a sizable chunk of their power ends up being consumed by high-RPM fans.

We’re told that on these high-density systems, between 15 and 20 percent of power consumption can be directly attributed to fans, which works out to 1,500 to 2,000 watts, roughly equivalent to three H100 or MI250X accelerators at the high end. Compared to purpose-built HPC clusters, these cloud clusters probably look quite wasteful.
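Those figures hang together on a quick sanity check: if 1,500 to 2,000 watts represents 15 to 20 percent of the total, the nodes in question draw roughly 10 kW apiece. A short sketch (the accelerator TDPs are published specs, 700 W for an SXM H100 and 560 W for an MI250X; the node power is inferred, not reported):

```python
# If fans account for 15-20 percent of node power and that works out
# to 1,500-2,000 W, the total node draw must be around 10 kW.
for fan_watts, share in [(1_500, 0.15), (2_000, 0.20)]:
    print(f"{fan_watts} W at {share:.0%} implies a ~{fan_watts / share / 1e3:.0f} kW node")

# Stack 2,000 W of fan draw against published accelerator TDPs
print(f"2,000 W ~= {2_000 / 700:.1f} H100s (700 W) or {2_000 / 560:.1f} MI250Xs (560 W)")
```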

The fact that Microsoft and others are so hesitant to reveal power data for their runs isn’t all that surprising. Cloud providers and hyperscalers have historically been quite secretive about the power requirements of their systems, racks, and datacenters, as their entire business model revolves around narrow margins.

However, for those trying to build more efficient clusters, the lack of data here leaves much to be desired. ®
