Saturday, November 25th 2017

AMD Doubles L3 Cache Per CCX with Zen 2 "Rome"

A SiSoft SANDRA results database entry for a 2P AMD "Rome" EPYC machine sheds light on the lower cache hierarchy. Each 64-core EPYC "Rome" processor is made up of eight 7 nm 8-core "Zen 2" CPU chiplets, which converge at a 14 nm I/O controller die that handles the processor's memory and PCIe connectivity. The result lists the cache hierarchy as 512 KB of dedicated L2 cache per core, and "16 x 16 MB L3." Like CPU-Z, SANDRA can report L3 cache by arrangement. For the Ryzen 7 2700X, it reads the L3 cache as "2 x 8 MB L3," corresponding to the per-CCX L3 cache amount of 8 MB.

Each 64-core "Rome" processor has a total of 8 chiplets. With SANDRA detecting "16 x 16 MB L3" for 64-core "Rome," it becomes highly likely that each of the 8-core chiplets features two 16 MB L3 cache slices, and that its 8 cores are split into two quad-core CCX units with 16 MB of L3 cache each. This doubling in L3 cache per CCX could help the processors better cushion data transfers between the chiplets and the I/O die. This becomes particularly important since the I/O die controls memory with its monolithic 8-channel DDR4 memory controller.
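
The arithmetic behind that reading can be checked in a few lines. The Python sketch below assumes only the figures quoted above (64 cores, eight chiplets, and SANDRA's "16 x 16 MB L3" string); the helper names `parse_l3_arrangement` and `describe` are made up for illustration, and none of this reflects how SANDRA itself derives its output.

```python
import re


def parse_l3_arrangement(text):
    """Parse a SANDRA-style string such as '16 x 16 MB L3' into (slices, mb_per_slice)."""
    match = re.fullmatch(r"\s*(\d+)\s*x\s*(\d+)\s*MB\s*L3\s*", text)
    if match is None:
        raise ValueError(f"unrecognized L3 arrangement: {text!r}")
    return int(match.group(1)), int(match.group(2))


def describe(text, cores, chiplets):
    """Derive per-chiplet and per-slice figures from an arrangement string."""
    slices, mb = parse_l3_arrangement(text)
    return {
        "total_l3_mb": slices * mb,
        "slices_per_chiplet": slices // chiplets,
        "cores_per_slice": cores // slices,
        "l3_mb_per_chiplet": (slices // chiplets) * mb,
    }


# Figures quoted in the article; the 2700X is treated as a single 8-core die.
rome = describe("16 x 16 MB L3", cores=64, chiplets=8)
ryzen = describe("2 x 8 MB L3", cores=8, chiplets=1)
print("Rome:", rome)
print("Ryzen 7 2700X:", ryzen)
```

Under these assumptions, both parts come out to four cores per L3 slice and two slices per die, with the per-slice size going from 8 MB to 16 MB.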
Source: SiSoft SANDRA Database

24 Comments on AMD Doubles L3 Cache Per CCX with Zen 2 "Rome"

#1
_Flare
In fact, it shows something different.
A single-die contains 2 Quad-Core CCX, like Zen1 and Zen1+ before.
Those are showing 2x 8MB L3-Cache in SiSoft Sandra, today. (1700, 1700X, 1800X, 2700, 2700X)

This leads to Zen2 having 2x 16MB L3-Cache per die/chiplet,
SiSoft shows the 2P hint before the CPU-Name, but thereafter the Single-CPU-Stats,
resulting in 2P, each 64 Cores and 16x 16MB (256MB L3) (2x 16MB per 8-Core-Chiplet)
#3
btarunr
Editor & Senior Moderator
_Flare said:
In fact, it shows something different.
A single-die contains 2 Quad-Core CCX, like Zen1 and Zen1+ before.
Those are showing 2x 8MB L3-Cache in SiSoft Sandra, today. (1700, 1700X, 1800X, 2700, 2700X)

This leads to Zen2 having 2x 16MB L3-Cache per die/chiplet,
SiSoft shows the 2P hint before the CPU-Name, but thereafter the Single-CPU-Stats,
resulting in 2P, each 64 Cores and 16x 16MB (256MB L3) (2x 16MB per 8-Core-Chiplet)
You are absolutely correct. I've revised the article.
#4
Reeves81x
_Flare said:
In fact, it shows something different.
A single-die contains 2 Quad-Core CCX, like Zen1 and Zen1+ before.
Those are showing 2x 8MB L3-Cache in SiSoft Sandra, today. (1700, 1700X, 1800X, 2700, 2700X)

This leads to Zen2 having 2x 16MB L3-Cache per die/chiplet,
SiSoft shows the 2P hint before the CPU-Name, but thereafter the Single-CPU-Stats,
resulting in 2P, each 64 Cores and 16x 16MB (256MB L3) (2x 16MB per 8-Core-Chiplet)
That's interesting, I see what you mean, with only 16 clusters for 64 cores. You are right.
#5
king of swag187
Imagine replacing half the CPU cores with GPU cores, you could fit an RX 570 in there.
#6
mumar1
Taking into account the architectural changes on Zen 2 that are already confirmed by AMD, I would not bet 1 € on SiSoft SANDRA being able to detect the correct cache configuration.
#7
Vayra86
king of swag187 said:
Imagine replacing half the CPU cores with GPU cores, you could fit an RX 570 in there.
Great, so you can play 1080p medium on a top-end CPU.
#8
Xajel
king of swag187 said:
Imagine replacing half the CPU cores with GPU cores, you could fit an RX 570 in there.
You can't put in such a powerful GPU without worrying about memory bandwidth. In theory you could use HBM there as well, but you'd need to make space for it, for example by using a smaller IO die. But again, packaging would be a big issue here, especially with the silicon interposer and different Z-heights.
#9
TheinsanegamerN
king of swag187 said:
Imagine replacing half the CPU cores with GPU cores, you could fit an RX 570 in there.
You already have that, it's called the 2400G, and it's already bandwidth limited.
#10
First Strike
So Rome comes with 256MB L3 cache? not the previously rumored 128MB L3?

This is getting increasingly interesting in terms of cache hierarchy. 256MB L3 = definitely no L4 as LLC on the IO chip, because the IO chip is definitely not large enough to cram in 512MB of L4. So how will they arrange and manage all this L3 cache?
#11
Mysteoa
There is some speculation regarding the doubled L3 cache. It's possible that the IO die holds a duplicate of the L3 and SiSoft SANDRA doesn't read it correctly. This may be to keep latency low: a core that needs something from the L3 on a different chiplet would make only one hop to the IO die, rather than the two hops it would need without it.
#12
Captain_Tom
Mysteoa said:
There is some speculation regarding the doubled L3 cache. It's possible that the IO die holds a duplicate of the L3 and SiSoft SANDRA doesn't read it correctly. This may be to keep latency low: a core that needs something from the L3 on a different chiplet would make only one hop to the IO die, rather than the two hops it would need without it.
Yes, that is a speculated explanation made by a few Techtubers and rumor guys a week or two ago.

In fact I also remember prevalent rumors that AMD has completely done away with their current NUMA design, and yet this new architecture is supposed to gain IPC while arguably spreading out the resources more than before. This puzzled me until now.

Doubling the L3 cache might allow them to uniformly design all of their product lines (AM4, TR, EPYC) in a manner that effectively works in the same way (as opposed to now, where AM4 uses only one die, but TR and EPYC use multiple dies). The exciting prospect of this is that there would no longer be ANY need to "localize" memory for certain games that only use 4 cores, and Threadripper would have the same gaming IPC as AM4 chips. It would just work.
#13
_Flare
Of course SiSoft could be mistaken by statically dividing the cores into groups of four, yes.
But I think the OS should get correct reports about the segmentation etc., so it's unlikely but indeed possible.

So there is little chance that every Chiplet is One big 8-Core CCX with 32MB L3-Cache.

@btarunr "you are very welcome"
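
For what it's worth, the OS-side view mentioned above can be inspected directly on Linux, which exposes per-CPU cache sharing under sysfs. This is a minimal sketch, assuming a Linux system with the usual `/sys/devices/system/cpu/cpu*/cache/index*/` layout; the function name `l3_groups` is made up for illustration, and what an actual "Rome" system would report is not known from this thread.

```python
import glob
import os


def l3_groups(sysfs_root="/sys/devices/system/cpu"):
    """Return {shared_cpu_list: size} for every distinct level-3 cache the OS reports."""
    groups = {}
    pattern = os.path.join(sysfs_root, "cpu[0-9]*", "cache", "index[0-9]*")
    for index_dir in glob.glob(pattern):
        try:
            with open(os.path.join(index_dir, "level")) as f:
                level = f.read().strip()
            if level != "3":
                continue  # only L3 entries are of interest here
            with open(os.path.join(index_dir, "shared_cpu_list")) as f:
                shared = f.read().strip()
            with open(os.path.join(index_dir, "size")) as f:
                size = f.read().strip()
        except OSError:
            continue  # entry missing or unreadable; skip it
        # CPUs that share one L3 report the same shared_cpu_list, so they collapse to one key.
        groups[shared] = size
    return groups


if __name__ == "__main__":
    for cpus, size in sorted(l3_groups().items()):
        print(f"L3 {size} shared by CPUs {cpus}")
```

If each chiplet really were one big 8-core CCX, every L3 group reported this way would list eight cores (plus their SMT siblings) instead of four.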
#14
Nkd
btarunr said:
You are absolutely correct. I've revised the article.
I saw this on reddit last week. The other theory is that AdoredTV might have been on to something. It could be an 8-core CCX, with the IO die holding a copy of the L3 cache to improve latency. He mentioned that in his video, and that would make sense too. He said it would make too much sense as a way to reduce latency between the cores, given the IO die is so massive. It's either what Adored was saying in his video, or 4-core CCXs and a massive IO chip. But I am honestly leaning towards an 8-core CCX with a copy of the L3 cache on the IO die. The die seems massive and there has to be something going on there.

Mysteoa said:
There is some speculation regarding the doubled L3 cache. It's possible that the IO die holds a duplicate of the L3 and SiSoft SANDRA doesn't read it correctly. This may be to keep latency low: a core that needs something from the L3 on a different chiplet would make only one hop to the IO die, rather than the two hops it would need without it.
Yep! I am leaning towards this; the Adored guy was mentioning the same thing. He said it made so much sense given the IO die is so massive and they can improve latency by copying the L3 cache. The guy knows his shit about chips; he said he was speculating but it made too much sense.

_Flare said:
Of course SiSoft could be mistaken by statically dividing the cores into groups of four, yes.
But I think the OS should get correct reports about the segmentation etc., so it's unlikely but indeed possible.

So there is little chance that every Chiplet is One big 8-Core CCX with 32MB L3-Cache.

@btarunr "you are very welcome"
It doesn't have to be. The massive IO die could basically have a copy of each L3 cache to reduce latency, and SANDRA might just be reading it wrong.
#15
sergionography
Nkd said:

It doesn't have to be. The massive IO die could basically have a copy of each L3 cache to reduce latency, and SANDRA might just be reading it wrong.
I was just thinking this. Who knows if any of the L3 cache is even on the chiplets at all with that massive IO die...
#16
btarunr
Editor & Senior Moderator
Nkd said:
I saw this on reddit last week. The other theory is that AdoredTV might have been on to something. It could be an 8-core CCX, with the IO die holding a copy of the L3 cache to improve latency. He mentioned that in his video, and that would make sense too. He said it would make too much sense as a way to reduce latency between the cores, given the IO die is so massive. It's either what Adored was saying in his video, or 4-core CCXs and a massive IO chip. But I am honestly leaning towards an 8-core CCX with a copy of the L3 cache on the IO die. The die seems massive and there has to be something going on there.
Not sure you need caches on both ends to reduce latencies (it should in turn increase latencies). The caches will almost never be coherent. If that concept worked, they'd have placed caches on discrete northbridges a long time ago.
#17
_Flare
Without the I/O and DRAM synchronization, maybe the CCX-to-CCX traffic is now more like pure CPU instruction and data synchronization, with way more bandwidth.
How the R/W buffering is maintained on the I/O die will be interesting to see; I don't think there will be any big compromise.
Maybe this whole layout is even better suited by default to something like the Infinity Fabric; maybe Zen1 was only there to see if it's even possible.

AnandTech measured the power consumption ratio of cores vs. fabric, Intel mesh vs. Zen.
In conclusion, the next battle in servers is not the efficiency of the better core, it's the fabric that counts.
The charts showed the mesh is better at low load, but gets beaten by EPYC at higher load.

Now remember all the possibilities of the Zen2 layout? Chiplet power-gating, anyone? I/O-die segment power-gating? That's only possible if you have no MC or IO etc. on the chiplets.
----------------------------------------------------------------
I/O-die will be something like this,
without GPU, Multimedia and Display

This thing will be big, smart and very fast

#18
Shatun_Bear
Nkd said:
I saw this on reddit last week. The other theory is that AdoredTV might have been on to something. It could be an 8-core CCX, with the IO die holding a copy of the L3 cache to improve latency. He mentioned that in his video, and that would make sense too. He said it would make too much sense as a way to reduce latency between the cores, given the IO die is so massive. It's either what Adored was saying in his video, or 4-core CCXs and a massive IO chip. But I am honestly leaning towards an 8-core CCX with a copy of the L3 cache on the IO die. The die seems massive and there has to be something going on there.


Yep! I am leaning towards this; the Adored guy was mentioning the same thing. He said it made so much sense given the IO die is so massive and they can improve latency by copying the L3 cache. The guy knows his shit about chips; he said he was speculating but it made too much sense.



It doesn't have to be. The massive IO die could basically have a copy of each L3 cache to reduce latency, and SANDRA might just be reading it wrong.
I wouldn't listen to that guy much regarding Zen 2 after watching his latest video on his 'predictions' for Ryzen 3000-series. He said he believes the Ryzen 3000-series flagship will have a base clock of 4.4Ghz. Base clock that high would be absolutely mental and is not happening. So I question whether he understands what he's talking about half the time. That prediction was absurd but no-one in the comments seemed to question it.
#19
TheGuruStud
Shatun_Bear said:
I wouldn't listen to that guy much regarding Zen 2 after watching his latest video on his 'predictions' for Ryzen 3000-series. He said he believes the Ryzen 3000-series flagship will have a base clock of 4.4Ghz. Base clock that high would be absolutely mental and is not happening. So I question whether he understands what he's talking about half the time. That prediction was absurd but no-one in the comments seemed to question it.
Makes perfect sense if you mean average clock. Intel claims 3.6 base clock and NOT A SINGLE ONE even on TDP limited thin machines run at it. Mature 7nm with EUV sounds like a solid bet for high clocks on lower core count parts, so it could very well be a base clock for 8-12 cores (boost similar to current intel) depending on the wall with a refresh. There's plenty of power to be wasted at 95w/125w tdp when power reduction is so good. For first spin silicon...idk. 4.0 base?
#20
Shatun_Bear
TheGuruStud said:
Makes perfect sense if you mean average clock. Intel claims 3.6 base clock and NOT A SINGLE ONE even on TDP limited thin machines run at it. Mature 7nm with EUV sounds like a solid bet for high clocks on lower core count parts, so it could very well be a base clock for 8-12 cores (boost similar to current intel) depending on the wall with a refresh. There's plenty of power to be wasted at 95w/125w tdp when power reduction is so good. For first spin silicon...idk. 4.0 base?
4.4Ghz base clock he said and it makes no sense at all. It's a ludicrous prediction. Even 4Ghz would be a sky-high base clock for the 3700X/3800X.

Remember the 2700X has a base clock of 3.7Ghz. So he thinks 7nm will allow AMD to just slap 700Mhz on top of that!
#21
sergionography
Shatun_Bear said:
I wouldn't listen to that guy much regarding Zen 2 after watching his latest video on his 'predictions' for Ryzen 3000-series. He said he believes the Ryzen 3000-series flagship will have a base clock of 4.4Ghz. Base clock that high would be absolutely mental and is not happening. So I question whether he understands what he's talking about half the time. That prediction was absurd but no-one in the comments seemed to question it.
Technically you can't discredit someone's predictions without facts (i.e. before the Ryzen 3000 series is announced/released).
A 4.4ghz base is about 20% higher than the Ryzen 7 2700X base clock (3.7ghz), which doesn't look impossible from a high level, especially when we already know that Zen+ can clock up to 4.3-4.4.
The challenge is to scale max clocks. Also, 14nm GloFo/Samsung had decent density but didn't scale well at higher voltage. So here you are going from a 14nm process that has its efficiency sweet spot at lower voltages to a 7nm high-performance process.
#22
Shatun_Bear
sergionography said:
Technically you can't discredit someone's predictions without facts (i.e. before the Ryzen 3000 series is announced/released).
A 4.4ghz base is about 20% higher than the Ryzen 7 2700X base clock (3.7ghz), which doesn't look impossible from a high level, especially when we already know that Zen+ can clock up to 4.3-4.4.
The challenge is to scale max clocks. Also, 14nm GloFo/Samsung had decent density but didn't scale well at higher voltage. So here you are going from a 14nm process that has its efficiency sweet spot at lower voltages to a 7nm high-performance process.
Oh come on. What's the point of posting all that - just to disagree for the sake of it? You CAN and SHOULD discredit such a prediction unless you are stupid. There's no chance any CPU from AMD released next year will have a base clock of 4.4Ghz. Trying to justify it shows you have a complete lack of understanding of CPUs.
#23
Harry Lloyd
Sad to see the 4-core CCX design again. Gaming performance will still be affected, and it will be even worse if desktop chips get a separate I/O die as well (which is almost certain if they want to put 16 cores on AM4).
#24
sergionography
Shatun_Bear said:
Oh come on. What's the point of posting all that - just to disagree for the sake of it? You CAN and SHOULD discredit such a prediction unless you are stupid. There's no chance any CPU from AMD released next year will have a base clock of 4.4Ghz. Trying to justify it shows you have a complete lack of understanding of CPUs.
Bulldozer had a base clock of 4ghz on 32nm, so what makes you think 4.4ghz on 7nm is impossible? Now again, I'm not saying I agree or disagree, just that it's not as impossible as you make it sound. What is more likely to happen is a lower base clock, but with an all-core turbo sustaining a constant 4.4ghz+ frequency.