
Huawei CloudMatrix 384 System Outperforms NVIDIA GB200 NVL72

AleksandarK

News Editor
Staff member
Joined
Aug 19, 2017
Messages
2,981 (1.06/day)
Huawei announced its CloudMatrix 384 system supernode, which the company touts as its domestic alternative to NVIDIA's GB200 NVL72 system, with more overall system performance but worse per-chip performance and higher power consumption. While NVIDIA's GB200 NVL72 uses 36 Grace CPUs paired with 72 "Blackwell" GPUs, the Huawei CloudMatrix 384 system employs 384 Huawei Ascend 910C accelerators to beat it. It takes roughly five times as many Ascend 910C accelerators to deliver nearly twice the GB200 NVL72 system performance, which is not good on a per-accelerator basis but excellent at the per-system level of deployment. SemiAnalysis argues that Huawei is a generation behind in chip performance but ahead of NVIDIA in scale-up system design and deployment.

When you look at individual chips, NVIDIA's GB200 clearly outshines Huawei's Ascend 910C, delivering over three times the BF16 performance (2,500 TeraFLOPS vs. 780 TeraFLOPS), more HBM capacity (192 GB vs. 128 GB), and higher memory bandwidth (8 TB/s vs. 3.2 TB/s). In other words, NVIDIA has the raw power and efficiency advantage at the chip level. But flip the switch to the system level, and Huawei's CloudMatrix 384 takes the lead. It cranks out 1.7× the overall PetaFLOPS, packs in 3.6× more total HBM capacity, and supports over five times the number of accelerators (384 vs. 72) and the associated bandwidth of NVIDIA's NVL72 cluster. However, that scalability does come with a trade‑off, as Huawei's setup draws nearly four times more total power. A single GB200 NVL72 draws 145 kW of power, while a single Huawei CloudMatrix 384 draws ~560 kW. So, NVIDIA is your go-to if you need peak efficiency in a single GPU. If you're building a massive AI supercluster where total throughput and interconnect speed matter most, Huawei's solution actually makes a lot of sense. Thanks to its all-to-all topology, Huawei has delivered an AI training and inference system worth purchasing. When SMIC, the maker of Huawei's chips, gets to a more advanced manufacturing node, the efficiency of these systems should also improve.
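For readers who want to check the system-level ratios, here is a minimal sketch that scales the per-chip figures quoted above up to a full system. It assumes ideal linear scaling (no interconnect or utilization losses), so the totals are peak paper numbers rather than measured results; the dictionary layout and function name are made up for illustration.

```python
# Per-chip figures and system power as quoted in the article.
# Totals below are derived by simple multiplication, not measured.
SYSTEMS = {
    "GB200 NVL72": {"chips": 72, "bf16_tflops": 2500, "hbm_gb": 192, "power_kw": 145},
    "CloudMatrix 384": {"chips": 384, "bf16_tflops": 780, "hbm_gb": 128, "power_kw": 560},
}


def system_totals(spec):
    """Scale per-chip numbers to the whole system (assumes perfect scaling)."""
    return {
        "chips": spec["chips"],
        "pflops": spec["chips"] * spec["bf16_tflops"] / 1000,  # TFLOPS -> PFLOPS
        "hbm_tb": spec["chips"] * spec["hbm_gb"] / 1000,       # GB -> TB (decimal)
        "power_kw": spec["power_kw"],
    }


nvl72 = system_totals(SYSTEMS["GB200 NVL72"])
cm384 = system_totals(SYSTEMS["CloudMatrix 384"])

# Ratio of CloudMatrix 384 to GB200 NVL72 for each metric.
ratios = {key: cm384[key] / nvl72[key] for key in nvl72}

for key, value in ratios.items():
    print(f"{key:>8}: {value:.2f}x")
```

The ratios come out at roughly 5.3× the chips, 1.7× the PetaFLOPS, 3.6× the HBM capacity, and 3.9× the power, which lines up with the figures in the article.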



 
Joined
Jan 11, 2022
Messages
1,235 (1.03/day)
Now I'm wondering how those things would do in folding and other workloads that actually mean something.
 

the54thvoid

Super Intoxicated Moderator
Staff member
Joined
Dec 14, 2009
Messages
13,586 (2.42/day)
Location
Glasgow - home of formal profanity
Processor Ryzen 7800X3D
Motherboard MSI MAG Mortar B650 (wifi)
Cooling be quiet! Dark Rock Pro 4
Memory 32GB Kingston Fury
Video Card(s) MSI RTX 5080 Vanguard SOC
Storage Seagate FireCuda 530 M.2 1TB / Samsung 960 Pro M.2 512GB
Display(s) LG 32" 165Hz 1440p GSYNC
Case Asus Prime AP201
Audio Device(s) On Board
Power Supply be quiet! Pure Power M12 850W Gold (ATX3.0)
Software W10
For the layperson, this makes little sense (and all sense). If system A uses more parts and more energy but produces more output than a smaller more efficient system B, you'd ask: can't system B just be doubled up? I'm guessing the system itself is a holistic unit, and it can't be made to interconnect (without penalty) to another 'sister' unit?
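The "just double up system B" idea can be put into numbers with a rough sketch. It uses only the figures quoted in the article (2,500 and 780 BF16 TFLOPS per chip, 145 kW and ~560 kW per system) and deliberately ignores the interconnect question: two NVL72 racks would be two separate 72-GPU scale-up domains joined by slower scale-out networking, not one 144-GPU unit, so the combined number is an optimistic upper bound. The variable names are illustrative.

```python
import math

# Peak BF16 throughput per system, derived from the quoted per-chip figures.
NVL72_PFLOPS = 72 * 2500 / 1000   # one GB200 NVL72 rack
CM384_PFLOPS = 384 * 780 / 1000   # one CloudMatrix 384

NVL72_KW = 145   # quoted power draw per NVL72 rack
CM384_KW = 560   # quoted (approximate) power draw per CloudMatrix 384

# How many NVL72 racks would it take to match one CloudMatrix 384 on paper?
racks_needed = math.ceil(CM384_PFLOPS / NVL72_PFLOPS)

# Naive totals: assumes the racks add up perfectly, which real
# cross-rack communication overhead would not allow.
combined_pflops = racks_needed * NVL72_PFLOPS
combined_kw = racks_needed * NVL72_KW

print(f"NVL72 racks needed: {racks_needed}")
print(f"Combined peak: {combined_pflops:.0f} PFLOPS at {combined_kw} kW")
print(f"CloudMatrix 384: {CM384_PFLOPS:.0f} PFLOPS at {CM384_KW} kW")
```

On paper, two racks beat one CloudMatrix 384 on both throughput and power. Whether that holds in practice depends on how much a workload suffers from being split across two scale-up domains, which this arithmetic cannot capture.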
 
Joined
Sep 6, 2013
Messages
3,738 (0.88/day)
Location
Athens, Greece
System Name 3 desktop systems: Gaming / Internet / HTPC
Processor Ryzen 5 7600 / Ryzen 5 4600G / Ryzen 5 5500
Motherboard X670E Gaming Plus WiFi / MSI X470 Gaming Plus Max (1) / MSI X470 Gaming Plus Max (2)
Cooling Aigo ICE 400SE / Segotep T4 / Noctua U12S
Memory Kingston FURY Beast 32GB DDR5 6000 / 16GB JUHOR / 32GB G.Skill RIPJAWS 3600 + Aegis 3200
Video Card(s) ASRock RX 6600 / Vega 7 integrated / Radeon RX 580
Storage NVMes, ONLY NVMes / NVMes, SATA Storage / NVMe, SATA, external storage
Display(s) Philips 43PUS8857/12 UHD TV (120Hz, HDR, FreeSync Premium) / 19'' HP monitor + BlitzWolf BW-V5
Case Sharkoon Rebel 12 / CoolerMaster Elite 361 / Xigmatek Midguard
Audio Device(s) onboard
Power Supply Chieftec 850W / Silver Power 400W / Sharkoon 650W
Mouse CoolerMaster Devastator III Plus / CoolerMaster Devastator / Logitech
Keyboard CoolerMaster Devastator III Plus / CoolerMaster Devastator / Logitech
Software Windows 10 / Windows 10 & Windows 11 / Windows 10
It takes roughly five times as many Ascend 910C accelerators to deliver nearly twice the GB200 NVL72 system performance, which is not good on a per-accelerator basis but excellent at the per-system level of deployment.
That says nothing without power consumption and price per chip.
However, that scalability does come with a trade‑off, as Huawei's setup draws nearly four times more total power. A single GB200 NVL72 draws 145 kW of power, while a single Huawei CloudMatrix 384 draws ~560 kW.
And here is where Nvidia is ahead. But is this efficiency advantage because of architecture or because of node advantage?
So, NVIDIA is your go-to if you need peak efficiency in a single GPU.
So, only Nvidia and Huawei exist in the AI market. No AMD, no Intel, no Broadcom, no Google, nobody else.
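The size of the efficiency gap raised in this post can be estimated from the quoted numbers alone. This is a minimal sketch assuming peak BF16 throughput and the quoted system power (145 kW and ~560 kW); it says nothing about whether the gap comes from architecture or process node, and the function name is made up for illustration.

```python
def pflops_per_kw(chips, tflops_per_chip, system_kw):
    """Peak BF16 PFLOPS per kW of total system power (paper figures only)."""
    return chips * tflops_per_chip / 1000 / system_kw


# Inputs are the per-chip throughput and system power quoted in the article.
nvl72_eff = pflops_per_kw(chips=72, tflops_per_chip=2500, system_kw=145)
cm384_eff = pflops_per_kw(chips=384, tflops_per_chip=780, system_kw=560)

# How many times more peak compute NVIDIA gets per kW.
efficiency_gap = nvl72_eff / cm384_eff

print(f"GB200 NVL72:     {nvl72_eff:.2f} PFLOPS/kW")
print(f"CloudMatrix 384: {cm384_eff:.2f} PFLOPS/kW")
print(f"NVIDIA advantage: {efficiency_gap:.1f}x")
```

By this measure the NVL72 delivers roughly 2.3 times more peak BF16 compute per kW than the CloudMatrix 384.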
 
Joined
Jan 19, 2023
Messages
462 (0.56/day)
So it doesn't really outperform it, but the CCP needs something to brag about, so they spin it: if you just use more Huawei chips, it will eventually perform better. Amazing. Now add more Nvidia chips and see what happens.
 

AleksandarK

For the layperson, this makes little sense (and all sense). If system A uses more parts and more energy but produces more output than a smaller more efficient system B, you'd ask: can't system B just be doubled up? I'm guessing the system itself is a holistic unit, and it can't be made to interconnect (without penalty) to another 'sister' unit?
We are past scaling from nodes; we are at the point where systems are the unit of computing. So delivering a better system == better solution. If you have lots of bandwidth and enough compute, higher power consumption is nothing to worry about. China can absorb electricity requirements far better than the US can. Read the SemiAnalysis source, it's very interesting.
No AMD, no Intel, no Broadcom, no Google, nobody else.
Google TPU is in a league of its own. AMD doesn't have an equivalent to the NVL72 system yet, IIRC.
 

the54thvoid

So it doesn't really outperform it, but the CCP needs something to brag about, so they spin it: if you just use more Huawei chips, it will eventually perform better. Amazing. Now add more Nvidia chips and see what happens.

Clearly you missed the point (like me). See reply #6, especially:

we are at the point where systems are the unit of computing. So delivering a better system == better solution
 
Joined
Jan 19, 2023
Messages
462 (0.56/day)
Clearly you missed the point (like me). See reply #6, especially:
Yes, I know what AleksandarK wrote: since China doesn't really care about the current eco craze, they can just burn more coal (which they actively do), get the electricity needed for the power-hungry Huawei system, and scale it up as needed. I still wouldn't call that a Huawei win over Nvidia, or call it a better system.
 