
AMD Granite Ridge "Zen 5" Processor Annotated

Very interesting. It seems like they optimized the dies very well this time; the only problem is the aged and slow way they are connected. Why are the two dies so far from each other when it would seem faster and more efficient to place them close together? On TR/Epyc it's acceptable because of the heat and the much more capable IO die.
Density of wires. You can see part of the complexity in the pic that @Tek-Check attached. Here you see one layer but the wires are spread across several layers.

That's also the probable reason why AMD couldn't move the IOD farther to the edge, and CCDs closer to the centre. There are just too many wires for signals running from the IOD to the contacts at the other side of the substrate. The 28 PCIe lanes take four wires each, for example.
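A quick sketch of the wire count, using only the figures from the post (28 PCIe lanes, four wires per lane, i.e. one differential pair each for TX and RX):

```python
# Rough wire-count illustration for signals routed from the IOD across the
# substrate. Figures are from the post; this is just the arithmetic.

PCIE_LANES = 28      # PCIe lanes exposed by the client IOD
WIRES_PER_LANE = 4   # one differential pair each for TX and RX

pcie_wires = PCIE_LANES * WIRES_PER_LANE
print(pcie_wires)  # 112 wires for PCIe alone, before DDR5, USB, display, etc.
```

And that is only PCIe; the memory channels and the rest of the IO fan out from the same die, which is why routing density constrains where the IOD can sit.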
 
They created the cIOD so it spares them development costs for the uncore for at least 2 generations (worked for Ryzen 3000 and Ryzen 5000).

So, if they stick with AM5 for Zen 6, they might develop a new cIOD. Maybe switch to N5, give it an RDNA 3.5 iGPU, faster memory controllers, and maybe even an NPU.
Strix Halo is getting a new cIOD made on 3 nm. One would expect Zen 6 to do so too. Zen 6 is about fixing all the failings of the current design, including high CCD-to-CCD core latency.
 
Strix Halo is getting a new cIOD made on 3 nm. One would expect Zen 6 to do so too. Zen 6 is about fixing all the failings of the current design, including high CCD-to-CCD core latency.
Strix Halo is arguably closer to a respectably capable GPU that happens to have CPU chiplets hanging off the bus than to a regular CPU. Zen 5's CCD-to-CCD latency regression is said to have been fixed back to Zen 4 levels in the latest firmware, though it remains to be seen whether Zen 6 will do better without trade-offs elsewhere.

They could well implement some IBM-like evict-to-other-chiplet virtual-L4-cache scheme, if they could do significantly better than memory latency with that. DRAM latency is only going to get worse.
 
Strix Halo is getting a new cIOD made on 3 nm. One would expect Zen 6 to do so too. Zen 6 is about fixing all the failings of the current design, including high CCD-to-CCD core latency.

CCD-to-CCD latency doesn't really matter, though, and is always something you want to avoid anyway; CCD-to-IOD latency does matter. The former is already fixed and back to Zen 4 numbers: it was a simple power-saving feature that they turned off.

Zen 6 will get a new IOD which will help the IO bottlenecks.
 
In the image of the article I count 7 cores annotated. Is that… correct?

To me it doesn't make sense. I don't think I've ever seen a 7-core SKU, and even if they eventually release one, it will surely be an 8-core CCD with one core disabled.
 
In the image of the article I count 7 cores annotated. Is that… correct?

To me it doesn't make sense. I don't think I've ever seen a 7-core SKU, and even if they eventually release one, it will surely be an 8-core CCD with one core disabled.
It's 8 cores, with the first core annotated. Here you go:
[attachment: annotated CCD die shot]
 
CCD-to-CCD latency doesn't really matter, though, and is always something you want to avoid anyway; CCD-to-IOD latency does matter. The former is already fixed and back to Zen 4 numbers: it was a simple power-saving feature that they turned off.
CCD to CCD latency is fixed but still huge. It's really hard to understand where those ~80 ns come from. There must be some very complex switching logic and cache coherency logic on the IOD, or something.
CCD to IOD latency can't be observed directly but CCD to RAM latency seems fine, no issues here (68 ns in AIDA64 on launch day review).
Zen 6 will get a new IOD which will help the IO bottlenecks.
Hopefully. That should be high on AMD's priority list. Another possibility for improvement would be a direct CCD to CCD connection in addition to the existing ones.
 
CCD to CCD latency is fixed but still huge. It's really hard to understand where those ~80 ns come from. There must be some very complex switching logic and cache coherency logic on the IOD, or something.
CCD to IOD latency can't be observed directly but CCD to RAM latency seems fine, no issues here (68 ns in AIDA64 on launch day review).
I assume that the second interconnect on the CCD is only used on Epyc, but not on Ryzen. So, inter-CCD communication on Ryzen is still done via the IO die.
 
I assume that the second interconnect on the CCD is only used on Epyc, but not on Ryzen. So, inter-CCD communication on Ryzen is still done via the IO die.
There are no direct CCD-to-CCD links in any of the AMD processors.
 
There are no direct CCD-to-CCD links in any of the AMD processors.
So what's this?

[attachment: annotated CCD die shot]


As I understand it, there's one "IFOP PHY" responsible for communication with the IO die, and the other one is inactive on Ryzen. Or is it some kind of double-wide communication bus with the Epyc IO die?
 
EPYC processors with 4 or fewer chiplets use both GMI links (wide GMI) to increase the bandwidth from 36 GB/s to 72 GB/s (page 11 of the file attached). By analogy, that is the case for Ryzen processors too. On the image below, both wide GMI3 links on both chiplets connect to two GMI ports on IOD, two links (wide GMI) from chiplet 1 to GMI3 port 0 and another two links (wide GMI) from chiplet 2 to GMI port 1 on IOD. We can see four clusters of links.
That IFOP or GMI is still a bit of a mystery, with too little documentation available (some is here). May I ask you to do a tek-check of the data I compiled, calculated and listed here below?

A single (not wide) IFOP (GMI) interface in the Zen 4 architecture has:

- a 32-bit wide bus in each direction (single-ended, 1 wire per bit)
- 3000 MHz default clock in Ryzen CPUs (2250 MHz in Epyc CPUs)
- quad data rate transfers, which calculates to...
- 12 GT/s in Ryzen (9 GT/s in Epyc)
- 48 GB/s per direction in Ryzen (36 GB/s in Epyc)
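As a quick sanity check of those figures (the 32-bit width, default clocks, and quad data rate are the compiled assumptions above, not confirmed AMD specs), the arithmetic works out:

```python
# Sanity check of the listed single (not wide) IFOP/GMI numbers:
# bandwidth = bus width x transfer rate, with quad data rate transfers.
# All input figures are the post's compiled assumptions.

BUS_WIDTH_BITS = 32  # per direction, single-ended, 1 wire per bit
QDR = 4              # quad data rate: 4 transfers per clock cycle

def ifop_bandwidth(clock_mhz):
    """Return (GT/s, GB/s per direction) for one IFOP (GMI) interface."""
    gts = clock_mhz * QDR / 1000    # mega-transfers/s -> giga-transfers/s
    gbs = gts * BUS_WIDTH_BITS / 8  # bits -> bytes
    return gts, gbs

print(ifop_bandwidth(3000))  # Ryzen: (12.0, 48.0)
print(ifop_bandwidth(2250))  # Epyc:  (9.0, 36.0)
```

So the 12 GT/s / 48 GB/s (Ryzen) and 9 GT/s / 36 GB/s (Epyc) lines are internally consistent with the stated width and clocks.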
 
So what's this?

[attachment: annotated CCD die shot]

As I understand it, there's one "IFOP PHY" responsible for communication with the IO die, and the other one is inactive on Ryzen. Or is it some kind of double-wide communication bus with the Epyc IO die?
Indeed it is so. AFAIK client Ryzen does not use wide GMI, only one IFOP per die; I haven't seen any documentation that says it uses wide GMI.

There are no direct chiplet-to-chiplet interconnects, that is correct. Everything goes through IF/IOD. I should have been more explicit in wording, but replied quickly, on-the-go.

EPYC processors with 4 or fewer chiplets use both GMI links (wide GMI) to increase the bandwidth from 36 GB/s to 72 GB/s (page 11 of the file attached). By analogy, that is the case for Ryzen processors too. On the image below, both wide GMI3 links on both chiplets connect to two GMI ports on IOD, two links (wide GMI) from chiplet 1 to GMI3 port 0 and another two links (wide GMI) from chiplet 2 to GMI port 1 on IOD. We can see four clusters of links.

We do not have a shot of a single chiplet CPU that exposes GMI link, but the principle should be the same, aka IF bandwidth should be 72 GB/s, like on EPYCs with four and fewer chiplets, and not 36 GB/s.

* from page 11
INTERNAL INFINITY FABRIC INTERFACES connect the I/O die with each CPU die using 36 Gb/s Infinity Fabric links. (This is known internally as the Global Memory Interface [GMI] and is labeled this way in many figures.) In EPYC 9004 and 8004 Series processors with four or fewer CPU dies, two links connect to each CPU die for up to 72 Gb/s of connectivity
I don't think client Ryzen uses wide GMI; that's reserved only for certain EPYCs.
 
CCD to CCD latency is fixed but still huge. It's really hard to understand where those ~80 ns come from. There must be some very complex switching logic and cache coherency logic on the IOD, or something.
CCD to IOD latency can't be observed directly but CCD to RAM latency seems fine, no issues here (68 ns in AIDA64 on launch day review).

Hopefully. That should be high on AMD's priority list. Another possibility for improvement would be a direct CCD to CCD connection in addition to the existing ones.
How long is 80 nanoseconds? How many nanoseconds are in one second?
 
How long is 80 nanoseconds? How many nanos are 1 second?
It's 80 billionths of a second. It's apparently enough time for some folks to make breakfast or something.
 
I assume that the second interconnect on the CCD is only used on Epyc, but not on Ryzen. So, inter-CCD communication on Ryzen is still done via the IO die.
As I understand it, there's one "IFOP PHY" responsible for communication with the IO die, and the other one is inactive on Ryzen. Or is it some kind of double-wide communication bus with the Epyc IO die?
Indeed it is so. AFAIK client Ryzen does not use wide GMI, only one IFOP per die; I haven't seen any documentation that says it uses wide GMI.
I don't think client Ryzen uses wide GMI; that's reserved only for certain EPYCs.
Let's try to get to the bottom of this by focusing on what we know, what is visible on Ryzen die and what could be inferred.
[images: Zen 5 die shot (left), Zen 2 die shot (right)]
- first of all, it looks like High Yields overlaid an old layer of Zen 2 communication lanes onto the Zen 5 photo
- we can clearly see this at the CCD level, as the GMIs were placed in the middle of the CCD, between the two 4-core CCXs, on Zen 2
- so the left image is not a genuine Zen 5 communication diagram from the video; the actual lanes must be positioned differently
- with that out of the way, let's move on

- what do you mean by "wide GMI"? Both GMI ports?
- each CCD has two GMI ports, and the IOD has the same two GMI ports. Each GMI port is '9-wide', which means each GMI PHY has nine logic areas.
- each of the nine logic areas within a GMI port translates into a PHY that could get one or two communication lanes. This is visible.
- what we do not know is whether all those IF lanes are wired to one GMI port only; it is not visible in the image, and topology documentation is scarce
- it'd be great to see the EPYC topology of IF lanes on CPUs that have 4 or fewer CCDs; this would bring us closer to the answer
That IFOP or GMI is still a bit of a mystery, with too little documentation available (some is here). May I ask you to do a tek-check of the data I compiled, calculated and listed here below? A single (not wide) IFOP (GMI) interface in the Zen 4 architecture has:

- a 32-bit wide bus in each direction (single-ended, 1 wire per bit)
- 3000 MHz default clock in Ryzen CPUs (2250 MHz in Epyc CPUs)
- quad data rate transfers, which calculates to...
- 12 GT/s in Ryzen (9 GT/s in Epyc)
- 48 GB/s per direction in Ryzen (36 GB/s in Epyc)
- Zen 4 is 32B/cycle read, 16B/cycle write
- more on this here: https://chipsandcheese.com/p/amds-zen-4-part-3-system-level-stuff-and-igpu
- Gigabyte leak: https://chipsandcheese.com/p/details-on-the-gigabyte-leak
- read speeds are much faster than write speeds over IF
- read bandwidth does not increase much when two CCDs operate
- write bandwidth almost doubles with two CCDs
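One way the ~63 GB/s read figure could fall out of the 32B/cycle read and 16B/cycle write widths above is simple width x FCLK arithmetic. The FCLK = 2000 MHz value here is my assumption (a common setting with DDR5-6000), not something stated in the measurements:

```python
# Sketch: theoretical per-CCD IF bandwidth from the port widths listed above.
# FCLK = 2000 MHz is an assumed value (typical with DDR5-6000), not measured.

FCLK_MHZ = 2000
READ_BYTES_PER_CYCLE = 32   # Zen 4 IF link: 32B/cycle read
WRITE_BYTES_PER_CYCLE = 16  # Zen 4 IF link: 16B/cycle write

read_gbs = FCLK_MHZ * READ_BYTES_PER_CYCLE / 1000    # GB/s theoretical read
write_gbs = FCLK_MHZ * WRITE_BYTES_PER_CYCLE / 1000  # GB/s theoretical write
print(read_gbs, write_gbs)  # 64.0 32.0
```

Under that assumption, the measured ~63 GB/s read from one CCD sits just below the 64 GB/s ceiling, and write bandwidth nearly doubling with two CCDs fits a per-CCD 16B/cycle write limit.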

[screenshots: Infinity Fabric bandwidth charts from Chips and Cheese, "AMD's Zen 4 Part 3: System Level Stuff and iGPU"]
 
it'd be great to see the EPYC topology of IF lanes on CPUs that have 4 or fewer CCDs; this would bring us closer to the answer
You mean this slide, or are you looking for something more detailed?

[image: EPYC internal Infinity Fabric interfaces slide]

(taken from Tom's)

Also, you've mentioned 36 GB/s and 72 GB/s before, and 9 bits here. It's obvious that the 9 bits include a parity bit, but I don't understand what numbers AMD took to calculate 36 and 72 GB/s, unless those include parity too.
 
You mean this slide, or are you looking for something more detailed?
I have this slide and the entire presentation as a PDF. The ideal image or diagram would be an EPYC with 4 or fewer CCDs.
I am looking for those detailed die shots where we can see the physical wiring, or a diagram that shows it all.
This would allow us to see how they wire the two GMI ports from each CCD.
When AMD says "GMI-Narrow", do they mean one GMI port only? And does "GMI-Wide" mean both GMI ports? It would make sense.
The next question is whether they wire all nine logic parts of a single GMI port to Infinity Fabric. If so, how many single wires, and how many double ones?
I do not know.

Of course, AMD does not want to give up some secrets about their IF sauce and wiring, such as the speed of the fabric itself and how they wire it to CCDs and IOD. This is beyond my pay grade.

What we know from AMD:
1. CCD on >32C EPYC Zen4 was configured for 36 Gbps throughput per link on one GMI port to IF ("GMI-Narrow"). 36 Gbps = 4.5 GB/s per link
2. CCD on ≤32C EPYC Zen4 was configured for 36x2 Gbps throughput per link on two GMI ports to IF ("GMI-Wide"). 72 Gbps = 9 GB/s per dual link

The answer we need here is how many IF links one GMI port provides. Is it 9? There are 9 pieces of logic on the die per GMI port. Are they all used at the PHY level? If 9 links are used, the throughput would be 9x36 Gbps = 324 Gbps = 40.5 GB/s for one CCD, and 648 Gbps = 81 GB/s for "GMI-Wide".
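Those scenarios as plain arithmetic (the 9-links-per-port count is exactly the assumption being tested here, not a confirmed figure):

```python
# Scenario arithmetic for GMI port throughput. The per-link 36 Gbps figure is
# from AMD's document; the 9-links-per-port count is an unconfirmed assumption.

GBPS_PER_LINK = 36

def total_gbs(links, wide=False):
    """Total GB/s for a given link count; 'wide' doubles each link (GMI-Wide)."""
    per_link_gbps = GBPS_PER_LINK * (2 if wide else 1)
    return links * per_link_gbps / 8  # Gbps -> GB/s

print(total_gbs(9))             # 40.5 GB/s, one GMI port ("GMI-Narrow")
print(total_gbs(9, wide=True))  # 81.0 GB/s, both ports   ("GMI-Wide")
```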

Chips&Cheese testing of IF on Ryzen:
3. one CCD on Ryzen Zen4 has throughput of ~63 GB/s towards DDR5-6000 memory via IF; two CCDs ~77 GB/s
This shows the speed of 504 Gbps for one CCD and 616 Gbps for two CCDs.

- if only one GMI port is used on Ryzen and we assumed there are 9 links in each GMI port, this gives us 9x36 Gbps = 324 Gbps = 40.5 GB/s.
- the measured throughput by C&C was 63 GB/s on read speed, so more links would be needed on one CCD to achieve this throughput
- it seems physically impossible to use both GMI ports on Ryzen CCD and connect those to one GMI port on IOD.
- therefore, it could be the case that IF was configured to run faster on Ryzen CPUs

How does this sit?
 
You mean this slide, or are you looking for something more detailed?
We now have a confirmation from EPYC Zen 5 file:
"INTERNAL INFINITY FABRIC INTERFACES connect the I/O die with each CPU die using a total of 16 36 Gb/s Infinity Fabric links."

Going back to our previous considerations. What we know from AMD about theoretical bandwidth:
1. CCD on >32C EPYC Zen4 was configured for 36 Gbps throughput per link on one GMI port to IF ("GMI-Narrow"). 36 Gbps = 4.5 GB/s per link
- 16 links x 36 Gbps = 576 Gbps = 72 GB/s
2. CCD on ≤32C EPYC Zen4 was configured for 36x2 Gbps throughput per link on two GMI ports to IF ("GMI-Wide"). 72 Gbps = 9 GB/s per dual link
- 16 links x 72 Gbps = 1152 Gbps = 144 GB/s

3. one CCD on Ryzen Zen4 has throughput of ~63 GB/s towards DDR5-6000 memory via IF; two CCDs ~77 GB/s
So yes, it looks like one GMI port is used.
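Running the 16-link figure from the EPYC Zen 5 document through the same arithmetic (only the quoted numbers are used here):

```python
# Totals with 16 links per CCD at 36 Gbps each, per the EPYC Zen 5 document.

LINKS = 16
narrow = LINKS * 36 / 8  # GB/s, one link per lane  ("GMI-Narrow")
wide = LINKS * 72 / 8    # GB/s, doubled links      ("GMI-Wide")

measured_one_ccd = 63    # GB/s, Chips and Cheese read bandwidth, one Ryzen CCD
print(narrow, wide, measured_one_ccd)  # 72.0 144.0 63
```

The measured ~63 GB/s fits under the 72 GB/s narrow ceiling but well short of the 144 GB/s wide one, which is consistent with a single GMI port being used on Ryzen.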
 