
NVIDIA "Blackwell" GB200 Server Dedicates Two-Thirds of Space to Cooling at Microsoft Azure

AleksandarK

News Editor
Staff member
Joined
Aug 19, 2017
Messages
3,211 (1.11/day)
Late Tuesday, Microsoft Azure shared an interesting picture on its social media platform X, showcasing the pinnacle of GPU-accelerated servers—NVIDIA "Blackwell" GB200-powered AI systems. Microsoft is one of NVIDIA's largest customers, and the company often receives products first to integrate into its cloud and company infrastructure. NVIDIA even listens to feedback from companies like Microsoft when designing future products, especially ones like the now-canceled NVL36x2 system. The picture below shows a massive cluster in which roughly one-third of the space houses compute, while the remaining two-thirds is dedicated to closed-loop liquid cooling.

The entire system is connected using InfiniBand networking, a standard for GPU-accelerated systems thanks to its low-latency packet transfer. While details of the system are scarce, we can see that the integrated closed-loop liquid cooling allows the GPU racks to use a 1U form factor for increased density. Given that these systems will go into the wider Microsoft Azure data centers, a system needs to be easy to maintain and cool. There are real limits to the power and heat output that Microsoft's data centers can handle, so systems like these typically must fit internal specifications that Microsoft designs. There are more compute-dense systems, of course, like NVIDIA's NVL72, but hyperscalers usually opt for custom solutions that fit their data center specifications. Finally, Microsoft noted that we can expect more details about its GB200-powered AI systems at the upcoming Microsoft Ignite conference in November.



View at TechPowerUp Main Site | Source
 
This is what happens when you take the easy option and don't make architectural changes and smarter designs, and just overclock and over-volt for a "free upgrade". NV haven't made any major architectural updates to their GPUs for many years now - they just bolt on more of the same, max it out at the reticle limit, then OC it to meet the performance goal. Very cheap and fast to do, but we end up with this monstrosity.

NV will need to actually come up with a new architecture to move the needle on the next chip, as TSMC is at its limits now, and nothing that can manufacture a GPU of this size for NV is coming for at least another two years.

NV really need to separate their AI and GPU businesses and make optimized versions of each.
 
So how long before the waste heat from our AI datacenters can drive steam turbines to power our industry, which in turn provides more AI power to power our AI overlords?
 
Excuse me, what purpose do these chips serve other than generating heat? Well, if they power Microsoft's Copilot-like stuff, LLMs and generative AI, then at least the heat serves a better purpose. As they say in GoT: "Winter is coming".
 
Anyone else notice the towel at the bottom of the radiator?
 
This is what happens when you take the easy option and do not make architectural changes and smarter designs, and just overclock and over volt for a "free upgrade".
What smart "architectural changes" would you make? Be specific, with calculated details on their effects on manufacturing costs, yield rates, and power:performance ratios.
 
This is what happens when you take the easy option and don't make architectural changes and smarter designs, and just overclock and over-volt for a "free upgrade". NV haven't made any major architectural updates to their GPUs for many years now - they just bolt on more of the same, max it out at the reticle limit, then OC it to meet the performance goal. Very cheap and fast to do, but we end up with this monstrosity.

NV will need to actually come up with a new architecture to move the needle on the next chip, as TSMC is at its limits now, and nothing that can manufacture a GPU of this size for NV is coming for at least another two years.

NV really need to separate their AI and GPU businesses and make optimized versions of each.
No arch changes? Really? You're saying that Ampere, Ada, and Pascal are all the same now?

:laugh::roll::laugh::banghead::laugh::roll::laugh:

So how long before the waste heat from our AI datacenters can drive steam turbines to power our industry, which in turn provides more AI power to power our AI overlords?
Sadly never, because these chips don't run anywhere near the temperature needed to make high-pressure steam.
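To put rough numbers on that: the Carnot limit caps how much work any heat engine can extract between two temperatures, and a warm coolant loop just doesn't leave much headroom over ambient. A quick sketch (the ~60 °C coolant and ~25 °C ambient figures below are assumed ballpark values, not specs from this system):

```python
# Carnot-efficiency sanity check: why low-grade datacenter heat
# can't usefully drive a steam turbine.
# Assumed (hypothetical) temperatures: coolant loop ~60 C, ambient ~25 C.

def carnot_efficiency(t_hot_c: float, t_cold_c: float) -> float:
    """Maximum theoretical heat-engine efficiency between two reservoirs."""
    t_hot_k = t_hot_c + 273.15   # convert Celsius to Kelvin
    t_cold_k = t_cold_c + 273.15
    return 1.0 - t_cold_k / t_hot_k

datacenter = carnot_efficiency(60.0, 25.0)    # warm GPU coolant loop
steam_plant = carnot_efficiency(550.0, 25.0)  # superheated steam plant

print(f"datacenter coolant loop: {datacenter:.1%}")   # ~10% theoretical ceiling
print(f"superheated steam plant: {steam_plant:.1%}")  # ~64% theoretical ceiling
```

And real turbines only reach a fraction of the Carnot ceiling, so the practical yield from ~60 °C water is close to nothing - which is why this heat goes to district heating or gets rejected outside rather than into a turbine.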
 
Anyone else notice the towel at the bottom of the radiator?
I'm afraid this is not even a radiator, just a water-to-water heat exchanger. The thick pipes at the top connect to the really big radiator outside the building.
 
I'm afraid this is not even a radiator, just a water-to-water heat exchanger. The thick pipes at the top connect to the really big radiator outside the building.
AFAIK the Cornell datacenter uses one of the Finger Lakes as a reservoir for the second half of that loop -- I'm sure there are others that do this.
 
Get a diploma in AI refrigeration mechanical engineering maintenance for new multi-point-failure water-cooling server farms. AI. It's hip!
 