Multiple processors become one: Revolutionary technology from Nvidia

At the recent Computex exhibition held in Taipei on May 29th, Nvidia unveiled an impressive new technology that has the potential to redefine the concept of an AI supercomputer: the DGX GH200 AI supercomputer.

"We [designed] DGX GH200 as a new tool for the development of the next generation generative AI models and capabilities", - said Ian Buck, VP and general manager of hyperscale and HPC at Nvidia, in a press briefing held earlier.

The DGX GH200 has the potential to revolutionize the GPU server market. Designed to accelerate the development of next-generation generative AI models, it promises to reshape the landscape of AI computing with unprecedented performance and efficiency.

Who will be the first to try the new technology

Not surprisingly, tech giants Google, Meta, and Microsoft have already expressed their interest in exploring the new technology, while AWS is not among the early access participants.

"[These hyperscalers] will be the first to get access to the DGX GH200 to understand the new capabilities of Grace Hopper, and the multi-node NVLink that allows all those GPUs to work together as one," - said Buck.

The DGX GH200 in simple terms

The NVIDIA DGX GH200 is the company's latest AI supercomputer, built to dramatically speed up and improve work with artificial intelligence.

The DGX GH200 uses NVIDIA's NVLink Switch System so that all of its GPUs work seamlessly together as a single unit rather than individually. This approach significantly increases the supercomputer's speed and efficiency.

As a result, scientists and developers will be able to significantly accelerate innovation built on artificial intelligence.

Nvidia showed graphs comparing the performance of a DGX H100 cluster connected over InfiniBand (the previous-generation approach) with a DGX GH200 system with full NVLink support. The earlier Nvidia DGX H100, equipped with two Intel CPUs and eight H100 GPUs, offered only 640 gigabytes of GPU memory. In each workload shown, the number of GPUs is the same for both systems.
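
For context on that 640 GB figure, a quick back-of-the-envelope comparison is possible; the per-node Grace Hopper memory figures below are taken from Nvidia's published specifications and should be treated as approximate:

```python
# Back-of-the-envelope memory comparison between the two systems.
# Per-node GH200 figures are Nvidia's published specs (approximate).
HBM3_PER_NODE_GB = 96       # Hopper GPU memory on each Grace Hopper node
LPDDR5X_PER_NODE_GB = 480   # Grace CPU memory, also GPU-addressable
NODES = 256                 # fully configured DGX GH200

dgx_h100_gb = 8 * 80                                    # 8 H100s x 80 GB = 640 GB
dgx_gh200_gb = NODES * (HBM3_PER_NODE_GB + LPDDR5X_PER_NODE_GB)

print(f"DGX H100:  {dgx_h100_gb} GB")                   # 640 GB
print(f"DGX GH200: {dgx_gh200_gb / 1024:.0f} TB")       # ~144 TB of shared memory
print(f"Ratio: ~{dgx_gh200_gb / dgx_h100_gb:.0f}x")     # ~230x
```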



"The foundation of the DGX GH200 lies in the hardware connection through our NVLink Switch system and software integration with CUDA primitives and communication libraries. This seamless combination allows customers to effectively operate the system, despite the presence of 256 discrete computers and operating systems. Our software enables users to launch a single job that utilizes the entire memory space, leveraging the full capacity of the GPUs. This empowers customers to accomplish previously unattainable tasks or significantly accelerate existing ones", - said Charlie Boyle, the President of DGX.

The DGX GH200 in technical detail

Nvidia is presenting the DGX GH200 as a 256-GPU system, which represents the fully configured version. However, customers have the flexibility to start with 32, 64, or 128 nodes and upgrade as needed.

"If someone begins with 32 nodes, they can purchase an additional 32 nodes; all the switching is already in place. Simply connect a few cables, and you have 64 nodes and beyond", - Boyle explained.

The DGX GH200 is Nvidia's first multi-rack DGX system. Each rack accommodates 16 Grace Hopper GH200 nodes, and a total of 16 racks house 256 nodes. It is also possible to create larger systems by connecting multiple DGX GH200s using InfiniBand.

Helios: Nvidia's Mega-System based on DGX GH200

Nvidia is also constructing its own mega-system, named Helios, based on the DGX GH200 AI supercomputer. Helios aims to advance research and development and facilitate the training of large-scale AI models. It will connect four DGX GH200 systems, totaling 1,024 Grace Hopper Superchips, using Nvidia's Quantum-2 InfiniBand networking. The system is expected to go online by the end of the year, delivering approximately 4 exaflops of AI performance (FP8) and around 30 petaflops of traditional FP64 performance. If Nvidia chooses to submit Helios to the Top500 list, it could potentially rank among the top ten percent of the list.
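
Those headline numbers are roughly consistent with simple per-GPU arithmetic; the per-H100 peak figures below are taken from Nvidia's public datasheet and are assumptions for the purpose of this estimate:

```python
# Rough sanity check of Helios' quoted figures from per-GPU peaks
# (H100 datasheet values; treat as approximate assumptions).
SUPERCHIPS = 4 * 256                 # four DGX GH200 systems of 256 nodes each
FP8_TFLOPS_PER_H100 = 3958           # FP8 Tensor Core peak, with sparsity
FP64_TFLOPS_PER_H100 = 34            # FP64 vector peak

fp8_exaflops = SUPERCHIPS * FP8_TFLOPS_PER_H100 / 1e6
fp64_petaflops = SUPERCHIPS * FP64_TFLOPS_PER_H100 / 1e3

print(f"FP8:  ~{fp8_exaflops:.1f} EFLOPS")   # ~4.1 EFLOPS
print(f"FP64: ~{fp64_petaflops:.0f} PFLOPS") # ~35 PFLOPS
```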

While no HGX version of the new DGX has been officially announced, one appears to be under development. Just as the HGX H100 design serves as the foundation for Nvidia's DGX H100 and can be customized by hyperscalers and other system partners, the DGX GH200 is expected to become available in an "HGX" form. However, Nvidia has made no specific announcement at this time.

Regarding hyperscalers and their custom designs, Nvidia plans to make the components, building blocks, and pieces inside the DGX available to them so they can further optimize and expand on the design for their own datacenters and server configurations. This offering is referred to as HGX.

Introduction of MGX Server Specification

In addition, Nvidia is introducing the MGX server specification with a focus on modularity. MGX is an open, flexible, and forward-compatible system reference architecture for accelerated computing. It standardizes server designs in terms of mechanical, thermal, and power aspects and allows for the incorporation of GPUs, CPUs, and DPUs from Nvidia and other vendors, including x86 and Arm architectures. According to Buck, the adoption of the new MGX reference architecture can lead to the creation of a new design within two months and at a reduced cost compared to the current design process, which can take up to 18 months.

MGX will support the following form factors:

  • Chassis: 1U, 2U, 4U (air or liquid cooled);
  • GPUs: Full Nvidia GPU portfolio including the latest H100, L40, L4;
  • CPUs: Nvidia Grace CPU Superchip, GH200 Grace Hopper Superchip, x86 CPUs;
  • Networking: Nvidia BlueField-3 DPU, ConnectX-7 network adapters.
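
To make the modularity concrete, here is a small, purely hypothetical sketch of how such a design space might be modeled in code; the component names come from the list above, but the compatibility rule is illustrative, not Nvidia's actual validation logic:

```python
# Hypothetical model of MGX-style modular server configurations.
# Component names follow the list above; the compatibility check is
# illustrative only, not Nvidia's actual validation logic.
from dataclasses import dataclass

CHASSIS = {"1U", "2U", "4U"}
GPUS = {"H100", "L40", "L4"}
CPUS = {"Grace CPU Superchip", "GH200 Grace Hopper Superchip", "x86"}
NICS = {"BlueField-3 DPU", "ConnectX-7"}

@dataclass(frozen=True)
class MgxConfig:
    chassis: str
    gpu: str
    cpu: str
    nic: str
    liquid_cooled: bool = False

    def validate(self) -> None:
        for value, allowed in [(self.chassis, CHASSIS), (self.gpu, GPUS),
                               (self.cpu, CPUS), (self.nic, NICS)]:
            if value not in allowed:
                raise ValueError(f"unsupported component: {value}")
        # Illustrative rule: a GH200 Superchip already integrates its GPU.
        if self.cpu == "GH200 Grace Hopper Superchip" and self.gpu != "H100":
            raise ValueError("GH200 Superchip pairs with its integrated H100")

config = MgxConfig(chassis="2U", gpu="H100",
                   cpu="GH200 Grace Hopper Superchip", nic="BlueField-3 DPU")
config.validate()  # raises if the combination is outside the modeled space
print(config)
```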

“MGX differs from Nvidia HGX in that it offers flexible, multi-generational compatibility with Nvidia products to ensure that system builders can reuse existing designs and easily adopt next-generation products without expensive redesigns. In contrast, HGX is based on an NVLink-connected multi-GPU baseboard tailored to scale to create the ultimate in AI and HPC systems,” Nvidia explained.

In simpler terms, HGX serves as a base platform, while MGX is a comprehensive reference architecture.

Compatibility and Support for MGX Architecture

MGX is designed to be compatible with the Open Compute Project and Electronic Industries Alliance server racks and is fully supported by Nvidia's software suite, including Nvidia AI Enterprise.

Leading companies such as ASRock Rack, ASUS, GIGABYTE, Pegatron, QCT, and Supermicro have started incorporating MGX into their product design process. Two specific products were recently announced and are expected to be available in August: QCT's S74G-2U system, built around the Nvidia GH200 Grace Hopper Superchip, and Supermicro's ARS-221GL-NR system, which utilizes the Nvidia Grace CPU Superchip.

SoftBank, another launch partner, plans to use MGX to develop customized servers for deployment in its hyperscale datacenters across Japan. SoftBank's design, based on the provided blueprints, will enable dynamic allocation of GPU resources in multi-use environments, supporting tasks such as generative AI and 5G workloads.

Production and System Configurations of the GH200 Grace Hopper Superchip

As Nvidia confirms full production of the GH200 Grace Hopper Superchip, more than 400 system configurations based on Nvidia's latest CPU and GPU architectures are on the way, targeting the surging demand for generative AI. These systems align with Nvidia's software stack, including Nvidia AI Enterprise, Omniverse, and the RTX platform, offering a comprehensive solution for the most demanding AI and HPC applications.

While no external system wins have been announced for the DGX GH200 yet, several systems built around the same GH200 Grace Hopper Superchips have previously been revealed. For example, the upcoming Alps supercomputing infrastructure at CSCS in Switzerland will debut a hybrid Arm-GPU architecture, and the first Grace Hopper system, named "Venado," is set to arrive at Los Alamos National Laboratory in the United States. On each Superchip, the NVLink-C2C interconnect delivers up to 900 GB/s of total bandwidth, 7x the bandwidth of the standard PCIe Gen 5 lanes found in traditional accelerated systems, providing the compute capability to address the most demanding generative AI and HPC applications.
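
The 7x figure can be sanity-checked against standard PCIe numbers; the link rates below are ordinary published PCIe Gen 5 values, used here as assumptions for a rough comparison:

```python
# Rough sanity check of the "7x PCIe Gen 5" bandwidth claim.
# PCIe link figures are standard published values, used as assumptions.
PCIE_GEN5_X16_GBS_PER_DIRECTION = 64                   # ~64 GB/s each way, x16 link
pcie_total_gbs = 2 * PCIE_GEN5_X16_GBS_PER_DIRECTION   # ~128 GB/s bidirectional
NVLINK_C2C_TOTAL_GBS = 900                             # Nvidia's published GH200 figure

print(f"NVLink-C2C vs PCIe Gen5 x16: ~{NVLINK_C2C_TOTAL_GBS / pcie_total_gbs:.1f}x")  # ~7.0x
```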

Additionally, Grace Hopper Superchips will power the new Shaheen III supercomputer at KAUST. All three supercomputers are being constructed by HPE and are expected to be fully operational and available to researchers next year.

“Generative AI is rapidly transforming businesses, unlocking new opportunities and accelerating discovery in healthcare, finance, business services and many more industries,” said Ian Buck, vice president of accelerated computing at NVIDIA. “With Grace Hopper Superchips in full production, manufacturers worldwide will soon provide the accelerated infrastructure enterprises need to build and deploy generative AI applications that leverage their unique proprietary data.”

Support for other Nvidia technologies

The upcoming range of systems powered by NVIDIA's Grace, Hopper, and Ada Lovelace architectures provides extensive support for the NVIDIA software stack, which encompasses NVIDIA AI, the NVIDIA Omniverse™ platform, and NVIDIA RTX™ technology.

NVIDIA AI Enterprise, the software layer of the NVIDIA AI platform, offers a vast array of over 100 frameworks, pre-trained models, and development tools. These resources streamline the development and deployment of AI applications, including generative AI, computer vision, and speech AI, enabling efficient production-level AI implementation.

The NVIDIA Omniverse development platform revolutionizes the creation and operation of metaverse applications, allowing individuals and teams to collaborate in real time across multiple software suites. Built on the Universal Scene Description framework, an open and flexible 3D language for virtual worlds, the platform provides a shared environment for seamless collaboration.

The NVIDIA RTX platform combines ray tracing, deep learning, and rasterization techniques to revolutionize the creative process for content creators and developers. With support for leading industry tools and APIs, applications built on the RTX platform empower designers and artists by delivering real-time photorealistic rendering and AI-enhanced graphics, video, and image processing. This technology enables millions of creative professionals to unleash their full creative potential.

Price, Power Consumption, and Conclusion

When asked about the power consumption of the DGX GH200 system, Nvidia declined to provide an answer. Similarly, pricing details were not disclosed, as DGX products are sold through partners who determine the final customer prices. However, using the DGX H100 as a reference, with its maximum power consumption of 10.2 kW, multiplying that figure by 32 (the number of eight-GPU DGX H100 systems it would take to reach 256 GPUs) yields 326.4 kW. We will provide an update once the actual power specifications are available.
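
That estimate reduces to a couple of lines of arithmetic; note that treating every eight GPUs as equivalent to one DGX H100 is an assumption, since Grace Hopper nodes have a different power profile:

```python
# The article's back-of-the-envelope power estimate, made explicit.
# The DGX H100 figure is Nvidia's published maximum; applying it to
# Grace Hopper nodes is an assumption, not an official specification.
DGX_H100_MAX_KW = 10.2
GPUS_PER_DGX_H100 = 8
DGX_GH200_GPUS = 256

equivalent_dgx_h100 = DGX_GH200_GPUS // GPUS_PER_DGX_H100  # 32 systems
estimate_kw = equivalent_dgx_h100 * DGX_H100_MAX_KW
print(f"~{estimate_kw:.1f} kW")  # ~326.4 kW
```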

The world's IT giants will now put Nvidia's new technology to the test. AI continues to evolve at a breakneck pace, and Nvidia is clearly helping to drive it.