
AMD CDNA2 "Aldebaran" MI200 HPC Accelerator with 256 CU (16,384 cores) Imagined

btarunr

AMD Instinct MI200 will be an important product for the company in the HPC and AI supercomputing market. It debuts the CDNA2 compute architecture, and is based on a multi-chip module (MCM) codenamed "Aldebaran." PC enthusiast Locuza, who conjures highly detailed architecture based on public information, imagined what "Aldebaran" could look like. The MCM contains two logic dies, and eight HBM2E stacks. Each of the two dies has a 4096-bit HBM2E interface, which talks to 64 GB of memory (128 GB per package). A silicon interposer provides microscopic wiring among the ten dies.
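The memory figures above can be cross-checked with a few lines of Python. This is only a back-of-envelope sketch: the 1024-bit width per HBM2E stack is the standard figure for that memory type rather than something the article states, the even split of stacks between the two dies is assumed, and all variable names are made up for illustration.

```python
# Back-of-envelope check of the "Aldebaran" memory layout described above.
# Assumptions (not stated in the article): each HBM2E stack has a 1024-bit
# interface, and the eight stacks are split evenly between the two logic dies.
HBM_STACKS = 8              # from the article
LOGIC_DIES = 2              # from the article
PACKAGE_MEMORY_GB = 128     # from the article
BITS_PER_STACK = 1024       # assumed standard HBM2E stack width

stacks_per_die = HBM_STACKS // LOGIC_DIES
bus_width_per_die = stacks_per_die * BITS_PER_STACK
memory_per_die_gb = PACKAGE_MEMORY_GB // LOGIC_DIES
memory_per_stack_gb = PACKAGE_MEMORY_GB // HBM_STACKS
dies_on_interposer = LOGIC_DIES + HBM_STACKS

print(f"{stacks_per_die} stacks per die, {bus_width_per_die}-bit interface per die")
print(f"{memory_per_die_gb} GB per die, {memory_per_stack_gb} GB per stack")
print(f"{dies_on_interposer} dies on the interposer")
```

Under those assumptions the numbers line up with the article: a 4096-bit interface and 64 GB per logic die, and ten dies in total on the interposer.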

Each of the two logic dies, or chiplets, has eight shader engines with 16 compute units (CU) each. The CDNA2 compute unit is capable of full-rate FP64, packed FP32 math, and Matrix Engines V2 (fixed-function hardware for matrix multiplication, accelerating DNN building, training, and AI inference). With 128 CUs per chiplet, assuming the CDNA2 CU has 64 stream processors, one arrives at 8,192 SP. Two such dies add up to a whopping 16,384, more than three times that of the "Navi 21" RDNA2 silicon. Each die further features its own independent PCIe interface, and XGMI (AMD's rival to CXL), an interconnect designed for high-density HPC scenarios. A rudimentary VCN (Video CoreNext) component is also present. It's important to note here that the CDNA2 CU, as well as the "Aldebaran" MCM itself, doesn't have a dual use as a GPU, since it lacks much of the hardware needed for graphics processing. The MI200 is expected to launch later this year.
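For anyone who wants to replay the core-count arithmetic, here is a minimal sketch. It takes 128 CUs per chiplet as eight shader engines of 16 CUs each, keeps the article's own assumption of 64 stream processors per CU, and uses 5,120 SPs (80 CUs x 64) for "Navi 21", a figure the article does not spell out. The variable names are purely illustrative.

```python
# Stream-processor count implied by the article's figures.
# Assumptions: 64 SPs per CDNA2 CU (the article's own assumption) and
# 5,120 SPs for "Navi 21" (80 CUs x 64 SPs), which the article does not state.
SHADER_ENGINES_PER_DIE = 8
CUS_PER_SHADER_ENGINE = 16
SPS_PER_CU = 64
DIES = 2
NAVI21_SPS = 80 * 64

cus_per_die = SHADER_ENGINES_PER_DIE * CUS_PER_SHADER_ENGINE
sps_per_die = cus_per_die * SPS_PER_CU
total_cus = cus_per_die * DIES
total_sps = sps_per_die * DIES
ratio_to_navi21 = total_sps / NAVI21_SPS

print(f"{cus_per_die} CUs and {sps_per_die} SPs per die")
print(f"{total_cus} CUs and {total_sps} SPs per package")
print(f"{ratio_to_navi21:.1f}x the SP count of Navi 21")
```

This only counts ALUs; it says nothing about how the two architectures compare in delivered performance.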



 
I wonder how well this could mine, just curious of course.:twitch:
 
Two such dies add up to a whopping 16,384, more than three times that of the "Navi 21" RDNA2 silicon

RDNA and CDNA cannot be easily compared with each other in this manner. CDNA uses the compute units of old (the same system from GCN 1.0 through Vega and now CDNA). That is: 16-wide native x 4 clock ticks x 4 ALUs per CU == 64 physical lanes executing 256 threads every 4 clock ticks.

RDNA had extremely major changes: 32-wide native x 4 ALUs per WGP, which executes 128 threads every 1 clock tick. In RDNA terms, they call the WGP a "dual-compute unit", because 128 threads per RDNA clock tick is kinda-sorta like 2x 256 threads every 4 CDNA clock ticks.
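The thread counts in this comparison can be laid out as a quick sketch. It uses only the lane and tick figures from this post; it is a simplified counting model, not a performance model, and it ignores clocks, latency, and occupancy. All names are invented for illustration.

```python
# Counting model for threads issued per unit, using the figures quoted above.
# Simplification: counts lanes and ticks only; ignores clocks and latency.
GCN_LANES_PER_SIMD = 16     # "16-wide native"
GCN_SIMDS_PER_CU = 4        # "4 ALUs per CU"
GCN_TICKS_PER_WAVE = 4      # each wavefront takes 4 clock ticks

RDNA_LANES_PER_SIMD = 32    # "32-wide native"
RDNA_SIMDS_PER_WGP = 4      # "4 ALUs per WGP"

# GCN/CDNA compute unit
gcn_physical_lanes = GCN_LANES_PER_SIMD * GCN_SIMDS_PER_CU
gcn_threads_per_4_ticks = gcn_physical_lanes * GCN_TICKS_PER_WAVE

# RDNA workgroup processor
rdna_threads_per_tick = RDNA_LANES_PER_SIMD * RDNA_SIMDS_PER_WGP
rdna_threads_per_4_ticks = rdna_threads_per_tick * 4

# How many GCN-style CUs one WGP resembles over the same 4-tick window
cu_equivalents_per_wgp = rdna_threads_per_4_ticks // gcn_threads_per_4_ticks

print(f"GCN/CDNA CU: {gcn_physical_lanes} lanes, {gcn_threads_per_4_ticks} threads per 4 ticks")
print(f"RDNA WGP: {rdna_threads_per_tick} threads per tick, {rdna_threads_per_4_ticks} per 4 ticks")
print(f"One WGP ~ {cu_equivalents_per_wgp} CUs by this count")
```

By this count one WGP covers as many threads in four ticks as two GCN-style CUs, which is where the "dual-compute unit" name fits.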

--------

RDNA2 also has 1024 x 32-bit registers per ALU. CDNA only has 256 x 32-bit registers per hardware thread (but given the 4x clock ticks for 4x different threads, it's kinda-sorta like having 1024 registers across 4 different threads). There are similarities between the two because they're both made by AMD, but... the differences are quite striking and will probably lead to major performance differences between the two platforms.
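The register comparison works out the same way on paper. A tiny sketch, using only the counts quoted in this post and treating the "4 threads over 4 clock ticks" interleaving as a plain multiplier (an assumption made for illustration):

```python
# Register counts as quoted in the post; the 4x interleaving multiplier is a
# simplification of "4x clock ticks for 4x different threads".
RDNA2_REGS_PER_ALU = 1024           # 32-bit registers
CDNA_REGS_PER_HW_THREAD = 256       # 32-bit registers
CDNA_THREADS_INTERLEAVED = 4        # one per clock tick over a 4-tick window
BYTES_PER_REGISTER = 4              # 32 bits

cdna_regs_across_threads = CDNA_REGS_PER_HW_THREAD * CDNA_THREADS_INTERLEAVED
same_total = cdna_regs_across_threads == RDNA2_REGS_PER_ALU
bytes_total = cdna_regs_across_threads * BYTES_PER_REGISTER

print(f"CDNA: {cdna_regs_across_threads} registers across {CDNA_THREADS_INTERLEAVED} threads")
print(f"Same total as RDNA2's {RDNA2_REGS_PER_ALU}: {same_total} ({bytes_total} bytes)")
```

The totals match, but the split differs: each CDNA thread sees only a quarter of that budget at a time.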

RDNA2 quite possibly is faster in some scenarios, while CDNA is faster in other scenarios. It's really difficult to compare the two on any microarchitectural level. AMD really did make a huge number of changes.
 
I have not seen the CDNA architecture analyzed; do you have links? Also, given this is CDNA gen 2 (maybe 3?), it is possible they have made more changes this time. At least, this is a projection.
 

This includes some good technical info amongst the marketing.

 

I just read the AMD whitepaper here.


Between the two sources, some things are clearly essentially the same as classical GCN, but every element has been considered, cut, or improved, and they are vague in some areas. I am only suggesting that some minor updates could have been made, given they had to retouch everything anyway.

Cheers though, more info is always nice.
 


CDNA 1.0 had its ISA released late last year. It's clear that AMD believes that GCN (16 SIMD lanes x 4 clock ticks x 4 per compute unit) is a worthwhile architecture (even if RDNA / graphics require something with lower latency). The ISA doc only has basic information on performance, and the ISA itself is almost identical to GCN documents from the past.

CDNA has new matrix multiplication instructions, and that's about it. Otherwise, its programming model is much more like GCN than RDNA.
 
What's this beast built on, 7nm or 5nm?
 