Wednesday, August 24th 2022

AMD Releases its CDNA2 MI250X "Aldebaran" HPC GPU Block Diagram

AMD, in its Hot Chips 2022 presentation, released a block diagram of its biggest AI-HPC processor, the Instinct MI250X. At the heart of the MI250X, which is based on the CDNA2 compute architecture, is the "Aldebaran" MCM (multi-chip module). This MCM contains two logic dies (GPU dies) and eight HBM2E stacks, four per GPU die. The two GPU dies are connected by a 400 GB/s Infinity Fabric link. Each die has up to 500 GB/s of external Infinity Fabric bandwidth for inter-socket communication, and PCI-Express 4.0 x16 serves as the host system bus for AIC form-factors. The two GPU dies together make up 58 billion transistors, and are fabricated on the TSMC N6 (6 nm) node.
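
For readers who want the package layout spelled out, below is a minimal Python sketch that simply encodes the figures quoted above (two GPU dies, four HBM2E stacks per die, a 400 GB/s die-to-die link, and up to 500 GB/s of external Infinity Fabric per die). The variable names and the derived totals are purely illustrative, not figures from AMD's presentation.

```python
# Minimal sketch of the "Aldebaran" MCM as described in this article.
# All inputs come from the article text; the derived totals are illustrative.
gpu_dies = 2
hbm2e_stacks_per_die = 4
die_to_die_if_gb_s = 400          # Infinity Fabric link between the two GPU dies
external_if_gb_s_per_die = 500    # external Infinity Fabric bandwidth, per die

total_hbm2e_stacks = gpu_dies * hbm2e_stacks_per_die
total_external_if_gb_s = gpu_dies * external_if_gb_s_per_die  # naive aggregate

print(total_hbm2e_stacks)      # -> 8 HBM2E stacks on the package
print(total_external_if_gb_s)  # -> up to 1,000 GB/s if both dies' external links are counted
```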

The component hierarchy of each GPU die sees eight Shader Engines share a last-level L2 cache. The eight Shader Engines total 112 Compute Units, or 14 CUs per engine. The CDNA2 compute unit contains 64 stream processors making up the Shader Core, along with four Matrix Core Units, which are specialized hardware for matrix/tensor math operations. There are hence 7,168 stream processors per GPU die, and 14,336 per package. AMD claims a 100% increase in double-precision compute performance over CDNA (MI100), attributing it to higher frequencies, more efficient data paths, extensive operand reuse and forwarding, and power optimizations that enable those higher clocks. The MI200 already powers the Frontier supercomputer, and AMD is working toward more design wins in the HPC space. The company also dropped a major hint that the MI300, based on CDNA3, will be an APU: it will incorporate GPU dies, core logic, and CPU CCDs onto a single package, a rival solution to NVIDIA's Grace Hopper Superchip.
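
As a quick sanity check of the hierarchy described above, the sketch below (plain Python, with illustrative variable names) reproduces the compute-unit and stream-processor counts from the per-engine figures:

```python
# Back-of-the-envelope check of the CDNA2 compute hierarchy described above.
# Figures are taken from the article; variable names are illustrative only.
shader_engines_per_die = 8
cus_per_shader_engine = 14        # 8 x 14 = 112 Compute Units per GPU die
stream_processors_per_cu = 64     # each CU also carries four Matrix Core Units
gpu_dies_per_package = 2

cus_per_die = shader_engines_per_die * cus_per_shader_engine
sps_per_die = cus_per_die * stream_processors_per_cu
sps_per_package = sps_per_die * gpu_dies_per_package

print(cus_per_die)       # -> 112 Compute Units per die
print(sps_per_die)       # -> 7,168 stream processors per die
print(sps_per_package)   # -> 14,336 stream processors per MI250X package
```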
Source: Wccftech

9 Comments on AMD Releases its CDNA2 MI250X "Aldebaran" HPC GPU Block Diagram

#1
P4-630
AMD: We are better
Nvidia: No we are!!
:nutkick::laugh:
#2
thegnome
In package Infinity Fabric slower than external? I expected the in package fabric to be faster than the HBM, but then again, it's compute.
#3
Punkenjoy
thegnome: In package Infinity Fabric slower than external? I expected the in package fabric to be faster than the HBM, but then again, it's compute.
That 400 GB/s link could be lower power/latency, but yes, that is strange. It's probably also why it's still recognized as two independent chips and not a single chip (among other things, like the scheduler, etc.).

Also, some features could be enabled on the 400 GB/s link that would require additional bandwidth for control. Still, they will have to improve that in the future, because Apple and IBM have far better die-to-die interfaces than AMD right now.

Double that (or half the HBM bandwidth per die) would have made more sense. From initial benchmarks floating around the internet, they are super fast when your code can run independently on each tile, but performance starts to collapse if you need die-to-die access.
#4
Operandi
Pretty cool to see Nvidia and AMD going at it in this space (and AMD actually getting some wins), going for sort of the same overall design but from each other's opposite areas of expertise.
#5
Tropick
Looks promising, but those odd IF bandwidth numbers might point to some continuing inter-die latency issues, with higher external fabric speeds to compensate. Either way, very nice to see team red get serious about HPC.
#6
Chrispy_
Bodes well for RDNA3, which is also MCP and TSMC 6nm
thegnome: In package Infinity Fabric slower than external? I expected the in package fabric to be faster than the HBM, but then again, it's compute.
HBM2 is exceptionally wide but quite slow, so whilst the bandwidth HBM2 offers is very good, that bandwidth comes mostly from the bus width, meaning that latencies will likely be order(s) of magnitude higher than Infinity Fabric.
#7
delshay
I want to see HBM3 products.
#8
AnarchoPrimitiv
Chrispy_: Bodes well for RDNA3, which is also MCP and TSMC 6nm

HBM2 is exceptionally wide but quite slow, so whilst the bandwidth HBM2 offers is very good, that bandwidth comes mostly from the bus width, meaning that latencies will likely be order(s) of magnitude higher than Infinity Fabric.
RDNA3 GCDs (graphics core dies) are 5nm, while the cache dies and the IOD are 6nm, at least on Navi 31 and 32 (which each have their own unique GCD; in other words, Navi 31 is NOT just two Navi 32 GCDs, as many initially believed). Navi 33 is monolithic and on 6nm... at least according to the most recent, agreed-upon leaks.

The tile structure should allow RDNA3 to be much cheaper to manufacture than Nvidia's monolithic Lovelace. In the latest leaks, the RDNA3 GCDs for Navi 31 and 32 are really small, less than 250mm^2 if I remember correctly.
#9
Minus Infinity
AnarchoPrimitiv: RDNA3 GCDs (graphics core dies) are 5nm, while the cache dies and the IOD are 6nm, at least on Navi 31 and 32 (which each have their own unique GCD; in other words, Navi 31 is NOT just two Navi 32 GCDs, as many initially believed). Navi 33 is monolithic and on 6nm... at least according to the most recent, agreed-upon leaks.

The tile structure should allow RDNA3 to be much cheaper to manufacture than Nvidia's monolithic Lovelace. In the latest leaks, the RDNA3 GCDs for Navi 31 and 32 are really small, less than 250mm^2 if I remember correctly.
RDNA3 for desktop is already much cheaper for AMD to produce than Lovelace is for Nvidia. AMD won't be under any pressure on price; Nvidia will need to slash margins to compete.