Wednesday, October 19th 2022

Meta's Grand Teton Brings NVIDIA Hopper to Its Data Centers

Meta today announced its next-generation AI platform, Grand Teton, including NVIDIA's collaboration on design. Compared to the company's previous generation Zion EX platform, the Grand Teton system packs in more memory, network bandwidth and compute capacity, said Alexis Bjorlin, vice president of Meta Infrastructure Hardware, at the 2022 OCP Global Summit, an Open Compute Project conference.

AI models are used extensively across Facebook for services such as news feed, content recommendations and hate-speech identification, among many other applications. "We're excited to showcase this newest family member here at the summit," Bjorlin said in prepared remarks for the conference, adding her thanks to NVIDIA for its deep collaboration on Grand Teton's design and continued support of OCP.
Designed for Data Center Scale
Named after the 13,000-foot mountain that crowns one of Wyoming's two national parks, Grand Teton uses NVIDIA H100 Tensor Core GPUs to train and run AI models that are rapidly growing in their size and capabilities, requiring greater compute.

The NVIDIA Hopper architecture, on which the H100 is based, includes a Transformer Engine to accelerate work on these neural networks, which are often called foundation models because they can address an expanding set of applications from natural language processing to healthcare, robotics and more.

The NVIDIA H100 is designed for performance as well as energy efficiency. H100-accelerated servers, when connected with NVIDIA networking across thousands of servers in hyperscale data centers, can be 300x more energy efficient than CPU-only servers.

"NVIDIA Hopper GPUs are built for solving the world's tough challenges, delivering accelerated computing with greater energy efficiency and improved performance, while adding scale and lowering costs," said Ian Buck, vice president of hyperscale and high performance computing at NVIDIA. "With Meta sharing the H100-powered Grand Teton platform, system builders around the world will soon have access to an open design for hyperscale data center compute infrastructure to supercharge AI across industries."

Mountain of a Machine
Grand Teton sports 2x the network bandwidth and 4x the bandwidth between host processors and GPU accelerators compared to Meta's prior Zion system, Meta said.

The added network bandwidth enables Meta to create larger clusters of systems for training AI models, Bjorlin said. It also packs more memory than Zion to store and run larger AI models.

Simplified Deployment, Increased Reliability
Packing all these capabilities into one integrated server "dramatically simplifies deployment of systems, allowing us to install and provision our fleet much more rapidly, and increase reliability," said Bjorlin.
Add your own comment

6 Comments on Meta's Grand Teton Brings NVIDIA Hopper to Its Data Centers

#1
CGLBESE
I just registered for this comment (but I've read TPU for ages)

Do they even check their projects naming?!? :eek:
Posted on Reply
#2
caroline!
#3 was hidden but that's unironically what it means in Spanish.
Posted on Reply
#3
kilo
CGLBESEI just registered for this comment (but I've read TPU for ages)

Do they even check their projects naming?!? :eek:
It appears their projects are after national parks in the west. There's Zion national park and the Grand Teton national park. And yes, Grand Teton in its native language is kind of hilarious for the younger folks.
Posted on Reply
#4
CGLBESE
kiloIt appears their projects are after national parks in the west. There's Zion national park and the Grand Teton national park. And yes, Grand Teton in its native language is kind of hilarious for the younger folks.
I didn't notice that it was mentioned in the article, shame on me... :shadedshu:

So I went reading about it, very interesting.

This said, back to the news in itself, I hope they won't meet the same difficulties as AMD's exascale. Increased complexity often leads to unexpected stability and poor reliability. Any idead on how many H100 will be used?
Posted on Reply
#5
Patriot
CGLBESEI didn't notice that it was mentioned in the article, shame on me... :shadedshu:

So I went reading about it, very interesting.

This said, back to the news in itself, I hope they won't meet the same difficulties as AMD's exascale. Increased complexity often leads to unexpected stability and poor reliability. Any idead on how many H100 will be used?
They aren't deploying them in a supercomputer fashion but rather spread across their datacenters as pools of gpu compute.
A lot of AMD's problem is software ecosystem as well as using Slingshot rather than Mellanox.... Mellanox is super great at tuning scale out networks...
Posted on Reply
May 1st, 2024 05:03 EDT change timezone

New Forum Posts

Popular Reviews

Controversial News Posts