
AMD Instinct MI350X Series AI GPU Silicon Detailed

btarunr

Editor & Senior Moderator
AMD today unveiled its Instinct MI350X series AI GPU. Based on the company's latest CDNA 4 compute architecture, the MI350X is designed to compete with NVIDIA's B200 "Blackwell" AI GPU series, with AMD comparing the top-spec Instinct MI355X to the B200 in its presentation. The chip debuts not just the CDNA 4 architecture, but also the latest ROCm 7 software stack, and a hardware ecosystem based on the industry-standard Open Compute Project specification, which combines AMD EPYC "Zen 5" CPUs, Instinct MI350 series GPUs, AMD-Pensando Pollara scale-out NICs supporting Ultra Ethernet, and industry-standard racks and nodes in both air- and liquid-cooled form-factors.

The MI350 is a gigantic chiplet-based AI GPU built from stacked silicon. There are two base tiles called I/O dies (IODs), each built on the 6 nm TSMC N6 process. Each IOD has microscopic wiring for up to four Accelerator Compute Die (XCD) tiles stacked on top, besides the 128-channel HBM3E memory controllers, 256 MB of Infinity Cache memory, the Infinity Fabric interfaces, and a PCI-Express 5.0 x16 root complex. The XCDs are built on the 3 nm TSMC N3P foundry node. Each contains a 4 MB L2 cache and four shader engines, each with 9 compute units. Each XCD hence has 36 CU, and each IOD seats 144 CU, for a total of 288 CU on the package. The two IODs are joined at the hip by a 5.5 TB/s bidirectional interconnect that enables full cache coherency between them. Each IOD controls four HBM3E stacks for 144 GB of memory, giving the package a total of 288 GB.
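The CU and memory totals follow directly from the per-tile figures above; here is a quick back-of-the-envelope check in Python (all values are taken from the paragraph above, with the 36 GB per HBM3E stack derived from 144 GB per IOD across four stacks):

```python
# Sanity check of the MI350 package totals quoted above.
shader_engines_per_xcd = 4   # four shader engines per XCD
cus_per_shader_engine  = 9   # nine compute units per shader engine
xcds_per_iod           = 4   # up to four XCDs stacked on each IOD
iods_per_package       = 2   # two base I/O dies per package

cus_per_xcd = shader_engines_per_xcd * cus_per_shader_engine   # 36 CU
cus_per_iod = cus_per_xcd * xcds_per_iod                       # 144 CU
cus_total   = cus_per_iod * iods_per_package                   # 288 CU

hbm3e_stacks_per_iod = 4
gb_per_hbm3e_stack   = 36    # derived: 144 GB per IOD / 4 stacks

hbm_per_iod = hbm3e_stacks_per_iod * gb_per_hbm3e_stack        # 144 GB
hbm_total   = hbm_per_iod * iods_per_package                   # 288 GB

print(f"CUs: {cus_per_xcd}/XCD, {cus_per_iod}/IOD, {cus_total}/package")
print(f"HBM3E: {hbm_per_iod} GB/IOD, {hbm_total} GB/package")
```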



While the MI350, with its 288 CU and 288 GB of memory, can function as a single GPU, AMD has devised ways to partition the GPU and its physical memory along the IODs and along the XCDs.
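The article does not enumerate the specific partition modes, but as an illustration of what partitioning along the IODs and XCDs means for per-partition resources, here is a minimal sketch, assuming resources divide evenly and that compute splits along the XCDs while memory splits along the IODs (those assumptions are illustrative, not AMD's):

```python
# Illustrative only: the GPU can be partitioned along its IODs and XCDs;
# the even-split assumption below is for illustration, not an AMD spec.
TOTAL_CUS = 288   # 2 IODs x 4 XCDs x 36 CU
TOTAL_HBM = 288   # GB, 144 GB per IOD

def cus_per_compute_partition(n_groups: int) -> float:
    """CUs per partition if the 8 XCDs are split into equal groups."""
    return TOTAL_CUS / n_groups

def hbm_per_memory_partition(n_groups: int) -> float:
    """HBM capacity per partition if memory is split along the 2 IODs."""
    return TOTAL_HBM / n_groups

# Compute partitioning along the XCDs: 1, 2, 4, or 8 groups
for groups in (1, 2, 4, 8):
    print(f"{groups} compute partition(s): {cus_per_compute_partition(groups):.0f} CU each")

# Memory partitioning along the IODs: one pool, or one pool per IOD
for groups in (1, 2):
    print(f"{groups} memory partition(s): {hbm_per_memory_partition(groups):.0f} GB each")
```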

At the platform level, each blade supports up to eight MI350 series GPUs, with memory pooling enabled across a point-to-point network of 153.6 GB/s links connecting each package with every other package on the node. In addition, each package has a PCI-Express 5.0 x16 link to one of the node's two EPYC "Turin" processors, which handle serial processing.
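To put the fabric numbers in perspective, here is a short sketch of the per-GPU bandwidth this topology implies, assuming one 153.6 GB/s point-to-point link from each package to each of the other seven (the PCIe figure is the usual ~64 GB/s per direction for a Gen 5 x16 link, not an AMD-quoted number):

```python
# Per-GPU bandwidth on an 8-GPU node, assuming one point-to-point
# 153.6 GB/s link from each package to each of the other seven.
gpus_per_node = 8
link_bw_gbs   = 153.6                  # GB/s per point-to-point link
links_per_gpu = gpus_per_node - 1

scale_up_bw = links_per_gpu * link_bw_gbs
print(f"Links per GPU: {links_per_gpu}")
print(f"Aggregate scale-up bandwidth per GPU: {scale_up_bw:.1f} GB/s")

# Each package also gets a PCIe 5.0 x16 link to one of the two EPYC
# "Turin" CPUs; a Gen 5 x16 link is roughly 64 GB/s per direction.
pcie5_x16_gbs = 64.0
print(f"Host link (PCIe 5.0 x16): ~{pcie5_x16_gbs:.0f} GB/s per direction")
```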



 
Still amazes me that 8 years later... GCN5 compute units are still being refined and used.
 
@Flyordie
They are really good for this stuff. Always were. They were just rather inefficient in consumer GPUs for game graphics. That was always the issue with GCN - massive compute on paper, but lackluster results.
 
Interestingly, rather than increasing the CU count of each chiplet, they opted to beef up the CUs' lower-precision matrix multiplication units. They are also beginning to reap the benefits of the chiplet strategy by moving to new nodes faster than NVIDIA. Ryan Smith, formerly of AnandTech, has an article detailing yesterday's announcement:

The biggest change here is that AMD has doubled the throughput of the matrix engines that are responsible for providing matrix operation support. So clock-for-clock, for FP16 and below data types, a CDNA4 compute unit (CU) can process twice as many matrix operations as a CDNA3 compute unit.
FP6 performance seems to be a priority (emphasis added by me):
And, aiming to one-up NVIDIA at their own game here, AMD has even beefed up FP6 performance on their architecture so that it processes at twice the rate of FP8, unlike NVIDIA’s architecture where it processes at the same rate as FP8. AMD in essence built a better FP4 unit to support FP6, rather than reusing an FP8 unit to support FP6. This carries a die area penalty, but the upshot is double the performance.
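To make those ratios concrete, here's a small sketch of relative per-clock matrix throughput by data type, normalized to FP8. The 2x relationships come from the quotes above; the 1000 TFLOPS baseline is a made-up placeholder, not an AMD figure:

```python
# Relative per-clock matrix throughput on CDNA 4, normalized to FP8 = 1.0.
# The 2x ratios are from the quoted analysis; the baseline is a placeholder.
RELATIVE_TO_FP8 = {
    "FP8": 1.0,   # baseline
    "FP6": 2.0,   # CDNA 4 runs FP6 at twice the FP8 rate
    "FP4": 2.0,   # FP6 rides on the FP4 datapath, so FP4 matches it
}

def scaled_throughput(fp8_tflops: float) -> dict:
    """Scale a hypothetical FP8 matrix throughput to the other data types."""
    return {dtype: fp8_tflops * ratio for dtype, ratio in RELATIVE_TO_FP8.items()}

# Example with a made-up 1000 TFLOPS FP8 baseline, purely for illustration.
for dtype, tflops in scaled_throughput(1000.0).items():
    print(f"{dtype}: {tflops:.0f} TFLOPS ({RELATIVE_TO_FP8[dtype]}x FP8)")
```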
 