Thursday, July 1st 2021

AMD CDNA2 "Aldebaran" MI200 HPC Accelerator with 256 CU (16,384 cores) Imagined

AMD Instinct MI200 will be an important product for the company in the HPC and AI supercomputing market. It debuts the CDNA2 compute architecture, and is based on a multi-chip module (MCM) codenamed "Aldebaran." PC enthusiast Locuza, who conjures highly detailed architecture diagrams based on public information, imagined what "Aldebaran" could look like. The MCM contains two logic dies and eight HBM2E stacks. Each of the two dies has a 4096-bit HBM2E interface, which talks to 64 GB of memory (128 GB per package). A silicon interposer provides microscopic wiring among the ten dies.
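The memory figures above can be cross-checked with a little arithmetic. The sketch below is only illustrative: the die count, stack count, and 64 GB per die come from the article, while the 1024-bit width per HBM stack is the standard HBM interface width and is brought in here as an assumption. All variable names are made up for the example.

```python
# Figures taken from the article.
LOGIC_DIES = 2
HBM_STACKS = 8
GB_PER_DIE = 64

# Assumption: each HBM2E stack exposes the standard 1024-bit interface.
BUS_BITS_PER_STACK = 1024

stacks_per_die = HBM_STACKS // LOGIC_DIES                # stacks attached to each logic die
bus_bits_per_die = stacks_per_die * BUS_BITS_PER_STACK   # memory bus width per logic die
gb_per_stack = GB_PER_DIE // stacks_per_die              # implied capacity of a single stack
gb_per_package = GB_PER_DIE * LOGIC_DIES                 # total memory on the MCM
dies_on_interposer = LOGIC_DIES + HBM_STACKS             # everything the interposer connects

print(f"{stacks_per_die} stacks per die, {bus_bits_per_die}-bit bus per die")
print(f"{gb_per_stack} GB per stack, {gb_per_package} GB per package")
print(f"{dies_on_interposer} dies on the interposer")
```

Under these assumptions the numbers line up with the article: a 4096-bit interface per die, 128 GB per package, and ten dies on the interposer, which in turn implies 16 GB per stack.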

Each of the two logic dies, or chiplets, has eight shader engines with 16 compute units (CU) each. The CDNA2 compute unit is capable of full-rate FP64 and packed FP32 math, and includes Matrix Engines V2 (fixed-function hardware for matrix multiplication that accelerates DNN building, training, and AI inference). With 128 CUs per chiplet, and assuming the CDNA2 CU has 64 stream processors, one arrives at 8,192 SP. Two such dies add up to a whopping 16,384, more than three times that of the "Navi 21" RDNA2 silicon. Each die further features its own independent PCIe interface and XGMI (AMD's rival to CXL), an interconnect designed for high-density HPC scenarios. A rudimentary VCN (Video Core Next) component is also present. It's important to note here that the CDNA2 CU, as well as the "Aldebaran" MCM itself, doesn't double as a GPU, since it lacks much of the hardware needed for graphics processing. The MI200 is expected to launch later this year.
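The core-count arithmetic can be laid out as a short sketch. The 128 CUs per chiplet, 16 CUs per shader engine, and the assumed 64 stream processors per CU all come from the article; the Navi 21 figure of 5,120 stream processors (80 CUs x 64) is added here for the comparison. Working backwards, 128 CUs per chiplet at 16 CUs per engine implies eight shader engines per chiplet. Variable names are illustrative only.

```python
# Figures taken from the article.
CHIPLETS = 2
CUS_PER_CHIPLET = 128
CUS_PER_SHADER_ENGINE = 16
SP_PER_CU = 64            # the article's own assumption for the CDNA2 CU

# Added for comparison: full "Navi 21" has 80 CUs x 64 SPs.
NAVI21_SP = 80 * 64

shader_engines_per_chiplet = CUS_PER_CHIPLET // CUS_PER_SHADER_ENGINE
sp_per_chiplet = CUS_PER_CHIPLET * SP_PER_CU
total_cus = CUS_PER_CHIPLET * CHIPLETS
total_sp = sp_per_chiplet * CHIPLETS
ratio_vs_navi21 = total_sp / NAVI21_SP

print(f"{shader_engines_per_chiplet} shader engines per chiplet")
print(f"{sp_per_chiplet} SP per chiplet, {total_sp} SP across {total_cus} CUs")
print(f"{ratio_vs_navi21:.1f}x the stream processors of Navi 21")
```

This only counts stream processors; it says nothing about relative performance, since the two architectures schedule work differently.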
Source: Locuza_ (Twitter)

9 Comments on AMD CDNA2 "Aldebaran" MI200 HPC Accelerator with 256 CU (16,384 cores) Imagined

#1
IbaChiba
I wonder how well this could mine, just curious of course.:twitch:
#2
dragontamer5788
btarunr: Two such dies add up to a whopping 16,384, more than three times that of the "Navi 21" RDNA2 silicon
RDNA and CDNA cannot be easily compared with each other in this manner. CDNA uses the compute units of old (the same system from GCN 1.0 through Vega and now CDNA). That is: 16-wide native x 4 clock ticks x 4 ALUs per CU == 64 physical lanes executing 256 threads every 4 clock ticks.

RDNA had extremely major changes: 32-wide native x 4 ALUs per WGP which executes 128 threads every 1 clock tick. In RDNA terms, they call the WGP a "dual-compute unit", because 128-threads per RDNA clock tick is kinda-sorta like 2x256-threads every 4 CDNA clock ticks.

--------

RDNA2 also has 1024 x 32-bit registers per ALU. CDNA only has 256 x 32-bit registers per hardware thread (but given the 4x clock ticks for 4x different threads, it's kinda-sorta like having 1024 registers across 4 different threads). There are similarities between the two because they're both made by AMD, but... the differences are quite striking and will probably lead to major performance differences between the two platforms.

RDNA2 quite possibly is faster in some scenarios, while CDNA is faster in other scenarios. It's really difficult to compare the two on any microarchitectural level. AMD really did make a huge number of changes.
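
The lane and register arithmetic in this comment can be written out as a rough sketch. All figures are the ones quoted in the comment itself, not independently verified, and the variable names are made up for illustration.

```python
# CDNA / GCN-style compute unit, using the comment's figures.
CDNA_SIMD_WIDTH = 16        # native lanes per SIMD
CDNA_SIMDS_PER_CU = 4
CDNA_TICKS_PER_WAVE = 4     # each SIMD works through 4 groups of 16 threads

cdna_lanes_per_cu = CDNA_SIMD_WIDTH * CDNA_SIMDS_PER_CU
cdna_threads_per_4_ticks = cdna_lanes_per_cu * CDNA_TICKS_PER_WAVE

# RDNA workgroup processor (WGP), using the comment's figures.
RDNA_SIMD_WIDTH = 32
RDNA_SIMDS_PER_WGP = 4

rdna_threads_per_tick = RDNA_SIMD_WIDTH * RDNA_SIMDS_PER_WGP
rdna_threads_per_4_ticks = rdna_threads_per_tick * 4

# How many CDNA CUs one WGP resembles over the same 4 clock ticks.
wgp_in_cu_equivalents = rdna_threads_per_4_ticks // cdna_threads_per_4_ticks

# Register comparison from the comment.
RDNA2_REGS_PER_ALU = 1024
CDNA_REGS_PER_THREAD = 256
cdna_regs_across_4_threads = CDNA_REGS_PER_THREAD * CDNA_TICKS_PER_WAVE

print(cdna_lanes_per_cu, cdna_threads_per_4_ticks)
print(rdna_threads_per_tick, rdna_threads_per_4_ticks, wgp_in_cu_equivalents)
print(cdna_regs_across_4_threads == RDNA2_REGS_PER_ALU)
```

On these numbers a WGP handles as many threads in 4 ticks as two CDNA-style CUs, which is the "dual-compute unit" analogy; the sketch captures thread counts only, not latency or scheduling differences.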
#3
TheoneandonlyMrK
dragontamer5788: RDNA and CDNA cannot be easily compared with each other in this manner. [...]
I have not seen the CDNA architecture analyzed; do you have links? Also, given this is CDNA gen 2 (maybe 3?), it is possible they have made more changes, at least this time. This is a projection.
#5
TheoneandonlyMrK
Fouquin: This includes some good technical info amongst the marketing.

www.servethehome.com/amd-radeon-instinct-mi100-32gb-cdna-gpu-launched/
I just read the AMD whitepaper here.

www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf

Between the two sources, some things are clearly essentially the same as classical GCN, but every element has been considered, cut, or improved, and they are vague in some areas. I am only suggesting that some minor updates could have been made, given that they had to retouch everything anyway.

Cheers though, more info is always nice.
#6
dragontamer5788
TheoneandonlyMrK: I have not seen the CDNA architecture analyzed; do you have links? Also, given this is CDNA gen 2 (maybe 3?), it is possible they have made more changes, at least this time. This is a projection.
developer.amd.com/wp-content/resources/CDNA1_Shader_ISA_14December2020.pdf

CDNA 1.0 had its ISA released late last year. It's clear that AMD believes that GCN (16 SIMD-lanes x 4 clock ticks x 4 per compute unit) is a worthwhile architecture (even if RDNA / Graphics require something with lower latency). The ISA doc only has basic information on performance, and the ISA itself is almost identical to GCN documents from the past.

CDNA has new matrix multiplication instructions, and that's about it. Otherwise, its programming model is much more like GCN than RDNA.
#7
z1n0x
What's this beast built on, 7nm or 5nm?
#8
btarunr
Editor & Senior Moderator
z1n0x: What's this beast built on, 7nm or 5nm?
I expect the logic dies to be 5 nm.
#9
dragontamer5788
z1n0x: What's this beast built on, 7nm or 5nm?
Hopes, dreams, and imagination.

MCM is somewhat confirmed, but I don't think any of the other specs are confirmed.
Copyright © 2004-2021 www.techpowerup.com. All rights reserved.
All trademarks used are properties of their respective owners.