
AMD Instinct MI350X Series AI GPU Silicon Detailed

btarunr

Editor & Senior Moderator
AMD today unveiled its Instinct MI350X series AI GPU. Based on the company's latest CDNA 4 compute architecture, the MI350X is designed to compete with NVIDIA's B200 "Blackwell" AI GPU series, with AMD comparing the top-spec Instinct MI355X to the B200 in its presentation. The chip debuts not just the CDNA 4 architecture, but also the latest ROCm 7 software stack, and a hardware ecosystem based on the industry-standard Open Compute Project specification, which combines AMD EPYC "Zen 5" CPUs, Instinct MI350 series GPUs, AMD-Pensando Pollara scale-out NICs supporting Ultra Ethernet, and industry-standard racks and nodes in both air- and liquid-cooled form-factors.

The MI350 is a gigantic chiplet-based AI GPU built from stacked silicon. There are two base tiles called I/O dies (IODs), each built on the 6 nm TSMC N6 process. Each IOD has microscopic wiring for up to four Accelerator Compute Die (XCD) tiles stacked on top, besides the 128-channel HBM3E memory controllers, 256 MB of Infinity Cache memory, the Infinity Fabric interfaces, and a PCI-Express 5.0 x16 root complex. The XCDs are built on the 3 nm TSMC N3P foundry node. Each contains a 4 MB L2 cache and four shader engines, each with 9 compute units. Each XCD hence has 36 CU, and each IOD seats 144 CU, for a total of 288 CU on the package. The two IODs are joined at the hip by a 5.5 TB/s bidirectional interconnect that enables full cache coherency between them. Each IOD controls four HBM3E stacks for 144 GB of memory, giving the package a total of 288 GB.
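The CU and memory totals follow directly from the per-tile figures above; here is a quick back-of-the-envelope check in Python (all values are taken from the paragraph above, with the 36 GB per HBM3E stack derived from 144 GB per IOD across four stacks):

```python
# Sanity check of the MI350 package totals quoted above.
shader_engines_per_xcd = 4   # four shader engines per XCD
cus_per_shader_engine  = 9   # nine compute units per shader engine
xcds_per_iod           = 4   # up to four XCDs stacked on each IOD
iods_per_package       = 2   # two base I/O dies per package

cus_per_xcd = shader_engines_per_xcd * cus_per_shader_engine   # 36 CU
cus_per_iod = cus_per_xcd * xcds_per_iod                       # 144 CU
cus_total   = cus_per_iod * iods_per_package                   # 288 CU

hbm3e_stacks_per_iod = 4
gb_per_hbm3e_stack   = 36    # derived: 144 GB per IOD / 4 stacks

hbm_per_iod = hbm3e_stacks_per_iod * gb_per_hbm3e_stack        # 144 GB
hbm_total   = hbm_per_iod * iods_per_package                   # 288 GB

print(f"CUs: {cus_per_xcd}/XCD, {cus_per_iod}/IOD, {cus_total}/package")
print(f"HBM3E: {hbm_per_iod} GB/IOD, {hbm_total} GB/package")
```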



While the MI350, with its 288 CU and 288 GB of memory, can function as a single GPU, AMD has devised ways to partition the GPU and its physical memory along the IODs and along the XCDs.
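The article does not enumerate the specific partition modes, but as an illustration of what partitioning along the IODs and XCDs means for per-partition resources, here is a minimal sketch, assuming resources divide evenly and that compute splits along the XCDs while memory splits along the IODs (those assumptions are illustrative, not AMD's):

```python
# Illustrative only: the GPU can be partitioned along its IODs and XCDs;
# the even-split assumption below is for illustration, not an AMD spec.
TOTAL_CUS = 288   # 2 IODs x 4 XCDs x 36 CU
TOTAL_HBM = 288   # GB, 144 GB per IOD

def cus_per_compute_partition(n_groups: int) -> float:
    """CUs per partition if the 8 XCDs are split into equal groups."""
    return TOTAL_CUS / n_groups

def hbm_per_memory_partition(n_groups: int) -> float:
    """HBM capacity per partition if memory is split along the 2 IODs."""
    return TOTAL_HBM / n_groups

# Compute partitioning along the XCDs: 1, 2, 4, or 8 groups
for groups in (1, 2, 4, 8):
    print(f"{groups} compute partition(s): {cus_per_compute_partition(groups):.0f} CU each")

# Memory partitioning along the IODs: one pool, or one pool per IOD
for groups in (1, 2):
    print(f"{groups} memory partition(s): {hbm_per_memory_partition(groups):.0f} GB each")
```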

At the platform level, each blade supports up to eight MI350 series GPUs, with memory pooling enabled across a point-to-point network of 153.6 GB/s links connecting each package with every other package on the node. In addition, each package has a PCI-Express 5.0 x16 link to one of the node's two EPYC "Turin" processors, which handle serial processing.
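To put the fabric numbers in perspective, here is a short sketch of the per-GPU bandwidth this topology implies, assuming one 153.6 GB/s point-to-point link from each package to each of the other seven (the PCIe figure is the usual ~64 GB/s per direction for a Gen 5 x16 link, not an AMD-quoted number):

```python
# Per-GPU bandwidth on an 8-GPU node, assuming one point-to-point
# 153.6 GB/s link from each package to each of the other seven.
gpus_per_node = 8
link_bw_gbs   = 153.6                  # GB/s per point-to-point link
links_per_gpu = gpus_per_node - 1

scale_up_bw = links_per_gpu * link_bw_gbs
print(f"Links per GPU: {links_per_gpu}")
print(f"Aggregate scale-up bandwidth per GPU: {scale_up_bw:.1f} GB/s")

# Each package also gets a PCIe 5.0 x16 link to one of the two EPYC
# "Turin" CPUs; a Gen 5 x16 link is roughly 64 GB/s per direction.
pcie5_x16_gbs = 64.0
print(f"Host link (PCIe 5.0 x16): ~{pcie5_x16_gbs:.0f} GB/s per direction")
```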



 
Still amazes me that 8 years later... GCN5 compute units are still being refined and used.
 
@Flyordie
They are really good for this stuff. Always were. They were just rather inefficient in consumer GPUs for game graphics. That was always the issue with GCN - massive compute on paper, but lackluster results.
 
Interestingly, rather than increasing the CU count of each chiplet, they opted to beef up the CUs' lower-precision matrix multiplication units. They are also beginning to reap the benefits of the chiplet strategy by moving to new nodes faster than NVIDIA. Ryan Smith, formerly of AnandTech, has an article detailing yesterday's announcement:

The biggest change here is that AMD has doubled the throughput of the matrix engines that are responsible for providing matrix operation support. So clock-for-clock, for FP16 and below data types, a CDNA4 compute unit (CU) can process twice as many matrix operations as a CDNA3 compute unit.
FP6 performance seems to be a priority (emphasis added by me):
And, aiming to one-up NVIDIA at their own game here, AMD has even beefed up FP6 performance on their architecture so that it processes at twice the rate of FP8, unlike NVIDIA’s architecture where it processes at the same rate as FP8. AMD in essence built a better FP4 unit to support FP6, rather than reusing an FP8 unit to support FP6. This carries a die area penalty, but the upshot is double the performance.
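To make those ratios concrete, here's a small sketch of relative per-clock matrix throughput by data type, normalized to FP8. The 2x relationships come from the quotes above; the 1000 TFLOPS baseline is a made-up placeholder, not an AMD figure:

```python
# Relative per-clock matrix throughput on CDNA 4, normalized to FP8 = 1.0.
# The 2x ratios are from the quoted analysis; the baseline is a placeholder.
RELATIVE_TO_FP8 = {
    "FP8": 1.0,   # baseline
    "FP6": 2.0,   # CDNA 4 runs FP6 at twice the FP8 rate
    "FP4": 2.0,   # FP6 rides on the FP4 datapath, so FP4 matches it
}

def scaled_throughput(fp8_tflops: float) -> dict:
    """Scale a hypothetical FP8 matrix throughput to the other data types."""
    return {dtype: fp8_tflops * ratio for dtype, ratio in RELATIVE_TO_FP8.items()}

# Example with a made-up 1000 TFLOPS FP8 baseline, purely for illustration.
for dtype, tflops in scaled_throughput(1000.0).items():
    print(f"{dtype}: {tflops:.0f} TFLOPS ({RELATIVE_TO_FP8[dtype]}x FP8)")
```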
 