
Huawei CloudMatrix 384 System Outperforms NVIDIA GB200 NVL72

AleksandarK

News Editor
Huawei announced its CloudMatrix 384 super node, which the company touts as its domestic alternative to NVIDIA's GB200 NVL72 system, offering more overall system performance but worse per-chip performance and higher power consumption. While NVIDIA's GB200 NVL72 pairs 36 Grace CPUs with 72 "Blackwell" GB200 GPUs, the Huawei CloudMatrix 384 employs 384 Huawei Ascend 910C accelerators. It takes roughly five times as many Ascend 910C accelerators to deliver nearly twice the GB200 NVL72's system performance, which is unimpressive on a per-accelerator basis but excellent at the per-system level of deployment. SemiAnalysis argues that Huawei is a generation behind in chip performance but ahead of NVIDIA in scale-up system design and deployment.

When you look at individual chips, NVIDIA's GB200 clearly outshines Huawei's Ascend 910C, delivering over three times the BF16 performance (2,500 TeraFLOPS vs. 780 TeraFLOPS), more memory per chip (192 GB vs. 128 GB), and higher memory bandwidth (8 TB/s vs. 3.2 TB/s). In other words, NVIDIA has the raw power and efficiency advantage at the chip level. But move to the system level, and Huawei's CloudMatrix 384 takes the lead. It delivers 1.7× the overall PetaFLOPS, packs in 3.6× more total HBM capacity, and supports over five times the number of accelerators, along with the associated scale-up bandwidth, of NVIDIA's NVL72 cluster. However, that scalability does come with a trade‑off, as Huawei's setup draws nearly four times more total power. A single GB200 NVL72 draws 145 kW of power, while a single Huawei CloudMatrix 384 draws ~560 kW. So, NVIDIA is your go-to if you need peak efficiency in a single GPU. If you're building a massive AI supercluster where total throughput and interconnect speed matter most, Huawei's solution actually makes a lot of sense. Thanks to its all-to-all topology, Huawei has delivered an AI training and inference system worth purchasing. When SMIC, which fabricates Huawei's chips, gets to a more advanced manufacturing node, the efficiency of these systems will also increase.
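
As a rough sanity check on the numbers above, here is a quick back-of-envelope comparison in Python. The per-chip figures, chip counts, and power draws are simply the ones quoted in this article and the SemiAnalysis report; treat the result as an illustration of the per-chip vs. per-system trade-off, not as vendor-verified benchmarks.

```python
# Back-of-envelope comparison using the figures quoted above.
# Per-chip BF16 throughput, HBM capacity, chip count, and system power
# are the numbers cited in the article/SemiAnalysis report.

nvl72 = {"chips": 72,  "tflops_per_chip": 2500, "hbm_gb_per_chip": 192, "power_kw": 145}
cm384 = {"chips": 384, "tflops_per_chip": 780,  "hbm_gb_per_chip": 128, "power_kw": 560}

def system_totals(s):
    """Aggregate per-chip figures into per-system totals."""
    pflops = s["chips"] * s["tflops_per_chip"] / 1000   # dense BF16 PetaFLOPS
    hbm_tb = s["chips"] * s["hbm_gb_per_chip"] / 1024   # total HBM in TB
    return pflops, hbm_tb, pflops / s["power_kw"]        # last value: PFLOPS per kW

nv_pf, nv_hbm, nv_eff = system_totals(nvl72)
hw_pf, hw_hbm, hw_eff = system_totals(cm384)

print(f"GB200 NVL72:     {nv_pf:6.1f} PFLOPS, {nv_hbm:5.1f} TB HBM, {nv_eff:.2f} PFLOPS/kW")
print(f"CloudMatrix 384: {hw_pf:6.1f} PFLOPS, {hw_hbm:5.1f} TB HBM, {hw_eff:.2f} PFLOPS/kW")
print(f"Chip-count ratio:         {cm384['chips'] / nvl72['chips']:.1f}x")
print(f"System performance ratio: {hw_pf / nv_pf:.2f}x")
print(f"Total HBM ratio:          {hw_hbm / nv_hbm:.2f}x")
print(f"Perf-per-watt ratio:      {hw_eff / nv_eff:.2f}x")
```

Run as-is, this reproduces the headline claims: roughly 1.7× the system-level BF16 throughput and about 3.6× the total HBM from about 5.3× the accelerators, at a bit under half the performance per watt.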



View at TechPowerUp Main Site | Source
 
Now I'm wondering how those things would do in terms of folding and other workloads that actually mean something.
 
For the layperson, this makes little sense (and all sense). If system A uses more parts and more energy but produces more output than a smaller, more efficient system B, you'd ask: can't system B just be doubled up? I'm guessing the system itself is a holistic unit, and it can't be made to interconnect (without penalty) with another 'sister' unit?
 
It takes roughly five times as many Ascend 910C accelerators to deliver nearly twice the GB200 NVL72's system performance, which is unimpressive on a per-accelerator basis but excellent at the per-system level of deployment.
That says nothing without power consumption and price per chip.
However, that scalability does come with a trade‑off, as Huawei's setup draws nearly four times more total power. A single GB200 NVL72 draws 145 kW of power, while a single Huawei CloudMatrix 384 draws ~560 kW.
And here is where Nvidia is ahead. But is this efficiency advantage because of architecture or because of node advantage?
So, NVIDIA is your go-to if you need peak efficiency in a single GPU.
So, only Nvidia and Huawei exist in the AI market. No AMD, no Intel, no Broadcom, no Google, nobody else.
 
So it doesn't really outperform it, but the CCP needs something to brag about, so they spin it: if you just use more Huawei chips, it will eventually perform better. Amazing. Now add more Nvidia chips and see what happens.
 
For the layperson, this makes little sense (and all sense). If system A uses more parts and more energy but produces more output than a smaller, more efficient system B, you'd ask: can't system B just be doubled up? I'm guessing the system itself is a holistic unit, and it can't be made to interconnect (without penalty) with another 'sister' unit?
We are past scaling individual nodes; we are at the point where systems are the unit of computing. So delivering a better system == better solution. If you have lots of bandwidth and enough compute, a higher power consumption is nothing to worry about. China can absorb electricity requirements far better than the US can. Read the SemiAnalysis source, it's very interesting.
No AMD, no Intel, no Broadcom, no Google, nobody else.
Google's TPU is in a league of its own. AMD doesn't have an equivalent to the NVL72 system yet, IIRC.
 
So it doesn't really outperform it, but the CCP needs something to brag about, so they spin it: if you just use more Huawei chips, it will eventually perform better. Amazing. Now add more Nvidia chips and see what happens.

Clearly you missed the point (like me). See reply #6, especially:

we are at the point where systems are the unit of computing. So delivering a better system == better solution
 
Clearly you missed the point (like me). See reply #6, especially:
Yes, I know what AleksandarK wrote. Thanks to China not really caring about the current eco craze, they can just burn more coal (which they actively do) to get the electricity needed for the power-hungry Huawei system and scale it up as needed. I still wouldn't call that a Huawei win over Nvidia, or call it a better system.
 