
AMD Instinct MI200: Dual-GPU Chiplet; CDNA2 Architecture; 128 GB HBM2E

A100s were sold in packs of 10, if I'm not mistaken, for $200k. I don't see why AMD would ask half that sum for a vastly faster product.

I've seen PCIe versions of the A100 quoted at $10k. The HGX version probably costs more, and may be the one you're talking about at 10-for-$200k (I've never seen the HGX version quoted personally).

The PCIe versions won't have cache coherency and will have fewer links. Anyone who wants two or fewer MI200s (or A100s) probably wants the PCIe version. The HGX A100 / OAM MI200 is really for customers who run 4x GPUs, 8x GPUs, or more (which is probably why it makes sense to sell them in packs of 10).
 
I don't get how that math works.
A100 - 54 billion transistors
MI200 - 58 billion transistors (29 + 29), yet it runs circles around A100.


You mean 22% faster is "barely faster"?
ADL was recently praised like a miracle for being exactly this kind of "barely faster", and in ST only, lmao.
 
It's basically four RX 6900 XTs glued together; of course it's gonna beat it.
 
Still, it's dual GPUs acting as one. Yes, faster with the new fabric, but it's still the same basic idea. When you get into 4K or even 8K gaming, a single card can't handle pushing over 100 FPS. I tried that with one 1070 Ti: forget it. My two in SLI can push 100 FPS easily, but Nvidia and AMD dropped that tech. I bought a 3080 Ti and tried it at 4K, and it could not push 100 FPS consistently. Yeah, it looks great, but my eyes can see the lag and frame buffering trying to keep up. A lucky friend of mine has two 3090s in SLI, and man, 8K at 150 FPS looks so clean and perfect. But I don't have 5 grand lying around to afford such nice things.
This isn't CrossFire. It doesn't even have a video output, it isn't running games, and it can't and doesn't work how you think. This IS new tech.
 
You are right about it not being CrossFire. But here, the thing is, the system still sees both chips independently, not as one. The advantage is that they're linked by a very fast Infinity Fabric link (800 GB/s), which is much faster than going through PCIe (where the second card, or frequently both, was running at PCIe 3.0 x8, i.e. 8 GB/s with much higher latency).

And that is one of the key things here: latency. With the two chips so close, each can access the other chip's memory with minimal impact compared to going through the PCIe bus or any other external connection. And lastly, something the spec sheet doesn't really tell you is what AMD implemented for cache coherency and memory sharing.
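For anyone curious what that looks like from the software side, here's a minimal HIP sketch (my own illustration, assuming a ROCm box where the two dies show up as devices 0 and 1; not taken from any AMD sample). It just checks and enables peer access, which is the mechanism that lets one die read and write the other die's HBM directly instead of staging copies through the host.

#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    // Ask the runtime whether device 0 can map device 1's memory directly.
    int canAccess = 0;
    hipDeviceCanAccessPeer(&canAccess, 0, 1);
    std::printf("device 0 -> device 1 peer access: %s\n", canAccess ? "yes" : "no");

    if (canAccess) {
        hipSetDevice(0);
        hipDeviceEnablePeerAccess(1, 0); // flags must be 0
        // From here, kernels running on device 0 can dereference pointers that were
        // allocated on device 1; on an MI200 module that traffic goes over the
        // die-to-die Infinity Fabric links instead of out across PCIe.
    }
    return 0;
}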
 
I meant
Except that it's not: RDNA and CDNA are different architectures. CDNA is more compute-oriented and designed for compute-heavy workloads, whereas RDNA is designed for graphics workloads.
In transistor count, the RX 6900 XT has 26.8 billion, and this has 29 billion per chiplet (29 + 29).
 
Read the white paper: it's seen as one chip, with core one as the master and number two as its slave. Not like anything else, I might add. New tech, new IP, new ways.

Hopefully they will carry over well to consumer cards.
 
Do you have that whitepaper?

Most people I see say it's shown to the OS as two devices with 64 GB each (but with many tools for memory coherency).

In the whitepaper, nothing says what you're saying.

I know what you mean is what leakers said RDNA 3 will be, but it doesn't look like that's the case for this architecture. These are made to be grouped together in large clusters anyway, so in the end it doesn't really matter that much, as long as you're able to split your code and data into chunks that each GPU can digest.
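If it helps, this is roughly how you'd check that yourself. A minimal HIP sketch (my illustration, not from the whitepaper): if the "two chips of 64 GB" reading is right, a node with one MI250X module should report two devices of about 64 GB each rather than one 128 GB device.

#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    hipGetDeviceCount(&count);
    std::printf("visible GPU devices: %d\n", count);

    for (int i = 0; i < count; ++i) {
        hipDeviceProp_t prop;
        hipGetDeviceProperties(&prop, i);
        // Expectation under the "two devices" reading: two entries of ~64 GB each.
        std::printf("  device %d: %s, %.1f GB\n", i, prop.name,
                    prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    }
    return 0;
}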
 
That's gonna be hard for RDNA 3, getting the OS to read the GPU as one and not as SLI... plus, aren't games having a hard time splitting the workload across combined GPUs?
 
Let's say CDNA2 is similar to first-gen Threadripper, where full Zen 1 chips were put on the same socket. For the OS it was similar to having multiple sockets, since each CPU had its own memory controller.

From what we're hearing, RDNA3 might look a bit more like Zen 2/3, where there is some kind of I/O die. In that case, part of one chip could act as a bridge similar to the I/O die, or there could be a bridge between the two chips that does that.

The main thing is how to handle the different memory zones. In Zen 1 Threadripper, there are multiple memory controllers to deal with (although the OS can see them as one with NUMA). In Zen 2 Threadripper, there is just one memory controller and NUMA is not needed.

If RDNA 3 has just one die with memory and the second accesses it via a bridge, or there is an I/O die that also holds the memory controller, it could be seen by the OS as one chip. There is also the question of how they communicate with the OS, whether it's hidden behind an I/O die or it has to go through the first "master die" to access the PCIe bus.

Everything is still rumours, but it looks like AMD figured it out for RDNA 3. They didn't need to implement it as much for CDNA2, since most software running on it is already made to scale across multiple GPUs. That doesn't mean they won't do something similar for CDNA3.
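To illustrate that last point about software that already scales across multiple GPUs: here's a rough HIP sketch of the usual "split the data into chunks, one per device" pattern (my own example, with made-up names like scale_chunk; nothing MI200-specific about it).

#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

__global__ void scale_chunk(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int N = 1 << 20;
    std::vector<float> host(N, 1.0f);

    int devices = 0;
    hipGetDeviceCount(&devices);
    if (devices == 0) return 1;
    const int chunk = N / devices;              // assume N divides evenly, for brevity

    for (int d = 0; d < devices; ++d) {         // each GPU/die gets its own slice
        hipSetDevice(d);
        float* dbuf = nullptr;
        hipMalloc(reinterpret_cast<void**>(&dbuf), chunk * sizeof(float));
        hipMemcpy(dbuf, host.data() + d * chunk, chunk * sizeof(float), hipMemcpyHostToDevice);
        hipLaunchKernelGGL(scale_chunk, dim3((chunk + 255) / 256), dim3(256), 0, 0,
                           dbuf, chunk, 2.0f);
        hipMemcpy(host.data() + d * chunk, dbuf, chunk * sizeof(float), hipMemcpyDeviceToHost);
        hipFree(dbuf);
    }
    std::printf("host[0] after scaling: %f\n", host[0]);
    return 0;
}

For brevity this loops over the devices with blocking copies, so they actually run one after another; real code would use one stream per device and async copies so both dies work at the same time.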
 
Rumours have RDNA 3 taped out as well.
I could be getting this confused with RDNA3, good point.

And the white paper is light on details, too.
 
I meant

In transistor count, the RX 6900 XT has 26.8 billion, and this has 29 billion per chiplet (29 + 29).
So, how is that "four 6900s glued together", then? :D

I also recall that Intel's "glued together" comment didn't age well...
 
Does anyone know why the FP64 performance is the same as FP32? My understanding was that single precision can get a 2x speedup compared to double precision for free.
 
Because they designed it that way.

Usually, 32-bit performance is more important. However, it seems like ORNL asked for double-precision performance.

It should be noted that CPUs usually do 64-bit scalar math at the same speed as 32-bit scalar math, due to the sizing of their 64-bit registers.
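For what it's worth, the spec-sheet numbers line up with the "full-rate FP64" explanation. A quick back-of-the-envelope check using AMD's published MI250X figures (220 CUs, 64 lanes per CU, ~1.7 GHz peak clock; treat the exact clock as approximate):

#include <cstdio>

int main() {
    const double compute_units = 220.0;  // MI250X, both dies combined
    const double lanes_per_cu  = 64.0;   // stream processors per CU
    const double clock_ghz     = 1.7;    // peak engine clock (approximate)
    const double ops_per_lane  = 2.0;    // one FMA counts as 2 floating-point ops

    const double peak_tflops = compute_units * lanes_per_cu * ops_per_lane * clock_ghz / 1000.0;
    // CDNA2's vector FP64 pipes run at full rate, so the same formula gives both
    // the FP32 and the FP64 vector peak.
    std::printf("peak vector throughput: ~%.1f TFLOPS (FP32 or FP64)\n", peak_tflops);
    return 0;
}

That lands right on AMD's quoted 47.9 TFLOPS for both FP32 and FP64 vector, which is why the two columns match (the matrix numbers are roughly double that for both as well).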
 