
AMD Releases its CDNA2 MI250X "Aldebaran" HPC GPU Block Diagram

btarunr

Editor & Senior Moderator
In its Hot Chips 2022 presentation, AMD released a block diagram of its biggest AI-HPC processor, the Instinct MI250X. Based on the CDNA2 compute architecture, at the heart of the MI250X is the "Aldebaran" MCM (multi-chip module). This MCM contains two logic dies (GPU dies) and eight HBM2E stacks, four per GPU die. The two GPU dies are connected by a 400 GB/s Infinity Fabric link. Each die also has up to 500 GB/s of external Infinity Fabric bandwidth for inter-socket communication, and PCI-Express 4.0 x16 as the host system bus for AIC form-factors. The two GPU dies together add up to 58 billion transistors, and are fabricated on the TSMC N6 (6 nm) node.
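For a sense of how that two-die layout looks to software, here is a minimal HIP sketch (our illustration, not AMD's code) of how the two "Aldebaran" GCDs typically enumerate as separate devices, with the in-package Infinity Fabric link surfaced through peer access; the exact enumeration behaviour is an assumption on our part:

```cpp
// Minimal HIP sketch: each "Aldebaran" GCD typically shows up as its own
// device, with the in-package Infinity Fabric link exposed as peer access.
// Assumes a ROCm/HIP install; error handling kept minimal for brevity.
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    hipGetDeviceCount(&count);                     // one MI250X card -> usually two devices
    std::printf("HIP devices: %d\n", count);

    for (int i = 0; i < count; ++i) {
        hipDeviceProp_t prop;
        hipGetDeviceProperties(&prop, i);
        std::printf("device %d: %s, %d CUs\n", i, prop.name, prop.multiProcessorCount);
    }

    if (count >= 2) {
        int canAccess = 0;
        hipDeviceCanAccessPeer(&canAccess, 0, 1);  // die-to-die path over Infinity Fabric
        std::printf("GCD0 -> GCD1 peer access: %s\n", canAccess ? "yes" : "no");
    }
    return 0;
}
```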

The component hierarchy of each GPU die sees eight Shader Engines share a last-level L2 cache. The eight Shader Engines total 112 Compute Units, or 14 CUs per engine. The CDNA2 compute unit contains 64 stream processors making up the Shader Core, and four Matrix Core Units, which are specialized hardware for matrix/tensor math operations. That works out to 7,168 stream processors per GPU die, and 14,336 per package. AMD claims a 100% increase in double-precision compute performance over CDNA (MI100), which it attributes to increases in frequencies, efficient data paths, extensive operand reuse and forwarding, and power optimizations that enable those higher clocks. The MI200 already powers the Frontier supercomputer, and AMD is chasing more design wins in the HPC space. The company also dropped a major hint that the MI300, based on CDNA3, will be an APU. It will incorporate GPU dies, core logic, and CPU CCDs onto a single package, a rival solution to NVIDIA's Grace Hopper Superchip.
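Purely as back-of-the-envelope arithmetic on those shader counts (the clock and the two-FLOPs-per-cycle FMA figure below are our assumptions for illustration, not numbers from AMD's presentation):

```cpp
// Back-of-the-envelope math for the shader counts quoted above.
// The 1.7 GHz clock and FMA = 2 FLOPs/cycle per stream processor are
// assumed illustrative values, not figures from AMD's slides.
#include <cstdio>

int main() {
    const int shaderEngines = 8;
    const int cusPerEngine  = 14;     // 8 x 14 = 112 CUs per die
    const int spsPerCU      = 64;
    const int dies          = 2;

    const int spsPerDie     = shaderEngines * cusPerEngine * spsPerCU;  // 7,168
    const int spsPerPackage = spsPerDie * dies;                         // 14,336

    const double clockGHz   = 1.7;    // assumed boost clock
    const double fp64Tflops = spsPerPackage * 2.0 * clockGHz / 1000.0;  // FMA = 2 FLOPs

    std::printf("SPs per die: %d, per package: %d\n", spsPerDie, spsPerPackage);
    std::printf("rough FP64 vector peak: %.1f TFLOPS (all CUs, assumed clock)\n", fp64Tflops);
    return 0;
}
```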



View at TechPowerUp Main Site | Source
 
AMD: We are better
Nvidia: No we are!!
:nutkick::laugh:
 
In-package Infinity Fabric slower than external? I expected the in-package fabric to be faster than the HBM, but then again, it's compute.
 
In-package Infinity Fabric slower than external? I expected the in-package fabric to be faster than the HBM, but then again, it's compute.
That 400 GB/s link could be lower power/latency. But yes, that is strange. It's probably also why it's still recognized as two independent chips and not a single chip (among other things, like the scheduler, etc.).

Also, some features could be enabled on the 400 GB/s link that require additional bandwidth for control. Still, they will have to improve that in the future, because Apple and IBM have much better die-to-die interfaces than AMD right now.

Double that (or half the HBM bandwidth per die) would have made more sense. From initial benchmarks floating around the internet, they are super fast when your code can run independently on each die, but performance starts to collapse if you need die-to-die access.
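If anyone wants to see that die-to-die cliff for themselves, a rough HIP timing loop like this would show it (a sketch only, assuming the two GCDs enumerate as devices 0 and 1; the buffer size and iteration count are arbitrary):

```cpp
// Rough HIP sketch: time GCD0 -> GCD1 copies to estimate the in-package
// Infinity Fabric bandwidth. Assumes devices 0 and 1 are the two dies of
// one MI250X; sizes and iteration counts are arbitrary illustration values.
#include <hip/hip_runtime.h>
#include <chrono>
#include <cstdio>

int main() {
    const size_t bytes = 1ULL << 30;   // 1 GiB per copy
    const int iters = 20;

    void *src = nullptr, *dst = nullptr;
    hipSetDevice(0); hipMalloc(&src, bytes);
    hipSetDevice(1); hipMalloc(&dst, bytes);

    hipSetDevice(0);
    hipDeviceEnablePeerAccess(1, 0);   // allow die-to-die access over Infinity Fabric
    hipDeviceSynchronize();

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i)
        hipMemcpyPeer(dst, 1, src, 0, bytes);   // cross-die copy, GCD0 -> GCD1
    hipDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();

    const double sec = std::chrono::duration<double>(t1 - t0).count();
    std::printf("GCD0 -> GCD1: %.1f GB/s\n", bytes * iters / sec / 1e9);

    hipSetDevice(0); hipFree(src);
    hipSetDevice(1); hipFree(dst);
    return 0;
}
```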
 
Pretty cool to see Nvidia and AMD going at it in this space (and AMD actually getting some wins), going for sort of the same overall design but from each other's opposite areas of expertise.
 
Looks promising, but those odd IF bandwidth numbers might point to some continuing inter-die latency issues, with higher external fabric speeds to compensate. Either way, very nice to see team red get serious about HPC.
 
Bodes well for RDNA3 which is also MCP and TSMC 6nm

In-package Infinity Fabric slower than external? I expected the in-package fabric to be faster than the HBM, but then again, it's compute.
HBM2 is exceptionally wide but quite slow, so whilst the bandwidth HBM2 offers is very good, that bandwidth comes mostly from the bus width, meaning that latencies will likely be order(s) of magnitude higher than Infinity Fabric's.
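Quick numbers to back that up (the per-pin data rate below is an assumed illustrative value, not something AMD has broken out):

```cpp
// Width-vs-speed arithmetic for the HBM2E bandwidth point above.
// The 3.2 Gbps per-pin rate is an assumed illustrative value, not a
// figure from AMD's presentation.
#include <cstdio>

int main() {
    const int stacks        = 8;      // per MI250X package
    const int bitsPerStack  = 1024;   // HBM2E stack bus width
    const double gbpsPerPin = 3.2;    // assumed per-pin data rate

    const double tbPerSec = stacks * bitsPerStack * gbpsPerPin / 8.0 / 1000.0;
    std::printf("aggregate HBM2E bandwidth: ~%.1f TB/s from a %d-bit bus\n",
                tbPerSec, stacks * bitsPerStack);
    return 0;
}
```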
 
Bodes well for RDNA3 which is also MCP and TSMC 6nm


HBM2 is exceptionally wide but quite slow, so whilst the bandwidth HBM2 offers is very good, that bandwidth comes mostly from the bus width, meaning that latencies will likely be order(s) of magnitude higher than Infinity Fabric's.
RDNA3 GCDs (graphics core dies) are 5 nm, while the cache dies and the IOD are 6 nm, at least on Navi 31 and 32 (which each have their own unique GCD; in other words, Navi 31 is NOT just two Navi 32 GCDs like many initially believed). Navi 33 is monolithic and is on 6 nm...at least according to the most recent, agreed-upon leaks.

The tile structure should allow RDNA3 to be relatively much cheaper to manufacture than Nvidia's monolithic Lovelace. In the latest leaks, the RDNA3 GCDs for Navi 31 and 32 are really small, less than 250 mm² if I remember correctly.
 
RDNA3 GCDs (graphics core dies) are 5 nm, while the cache dies and the IOD are 6 nm, at least on Navi 31 and 32 (which each have their own unique GCD; in other words, Navi 31 is NOT just two Navi 32 GCDs like many initially believed). Navi 33 is monolithic and is on 6 nm...at least according to the most recent, agreed-upon leaks.

The tile structure should allow RDNA3 to be relatively much cheaper to manufacture than Nvidia's monolithic Lovelace. In the latest leaks, the RDNA3 GCDs for Navi 31 and 32 are really small, less than 250 mm² if I remember correctly.
RDNA3 for desktop is already much cheaper for AMD to produce than Lovelace is for Nvidia. AMD won't be under any pressure on price; Nvidia will need to slash margins to compete on price.
 