
I have a question about caches in CPU cores.

Sorry I made another thread. I don't mean to spam.:(

Btw I like discussion here first and foremost and interacting with people here makes me happy. :3

A question about the L1, L2 and L3 caches in a modern CPU core. OK, so I'm interested in Skylake and Zen for now.

It's a simple question:

Are the caches inside the core, L1D and L1I, tied to the core's clock domain? I.e. do the SRAMs run at the same speed as the core? What I mean is: does increasing the CPU clock rate also increase cache bandwidth?

I ask the same about the L2 cache. IDK about the L3; I did hear somewhere that Zen's L3 cache is tied to core speed. Actually, I guess it would be easy to find out with the AIDA64 Extreme memory and cache benchmark.

But some information would be helpful. It just occurs to me that the overclocked Coffee Lake results having MUCH higher L1 and L2 bandwidth than Zen is mainly because of the clock rate advantage, right?

thanks

Also, please tell me if I am making too many threads. I don't mean it negatively. Actually I won't post any more today :x
 
Are the caches inside the core, L1D and L1I, tied to the core's clock domain? I.e. do the SRAMs run at the same speed as the core? What I mean is: does increasing the CPU clock rate also increase cache bandwidth?
Yes, bandwidth is specified in bytes per clock, and it scales with clock speed until you run into timing issues. But bandwidth and timing are highly dependent on the architecture.
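As a back-of-the-envelope illustration of "bytes per clock": peak cache bandwidth is just the per-cycle width times the clock. The 64-byte width below is an assumed example figure, not a vendor-confirmed spec:

```python
def peak_bandwidth_gbs(bytes_per_cycle: int, clock_ghz: float) -> float:
    """Peak cache bandwidth in GB/s: bytes moved per cycle times cycles per second."""
    # bytes/cycle * clock_ghz (1e9 cycles/s) = 1e9 bytes/s = GB/s (decimal)
    return bytes_per_cycle * clock_ghz

# Hypothetical 64-byte/cycle L1D port: overclocking from 4.0 to 5.0 GHz
# scales peak bandwidth proportionally, as long as the SRAM timing still closes.
print(peak_bandwidth_gbs(64, 4.0))  # 256.0 GB/s
print(peak_bandwidth_gbs(64, 5.0))  # 320.0 GB/s
```

That's only the peak figure, of course; sustained bandwidth also depends on latency and how well the prefetcher keeps the cache fed.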

Edit:
But some information would be helpful. It just occurs to me that the overclocked Coffee Lake results having MUCH higher L1 and L2 bandwidth than Zen is mainly because of the clock rate advantage, right?
Actually no; cache efficiency mainly comes down to three things: cache structure, latency and the prefetcher.

Cache doesn't work the way most people think it does. The cache is a "streaming buffer"; its contents are typically overwritten within a few microseconds. Memory is divided into what we call cache lines, which are 64 bytes on current x86 architectures. This means that whenever the CPU reads even one byte from a cache line, the entire line is cached, so anything else within those 64 bytes is cached too. If data is more spread out, the cache becomes less efficient, but this depends on the program.
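The 64-byte line granularity described above can be sketched with a bit of integer arithmetic; this is only an illustration of the mapping, not a hardware model:

```python
LINE_SIZE = 64  # bytes per cache line on current x86 CPUs

def line_index(address: int) -> int:
    """Byte addresses sharing this index live on the same cache line."""
    return address // LINE_SIZE

# Reading one byte at address 130 fills the whole line covering 128..191,
# so later touches anywhere in that 64-byte window are already cached:
assert line_index(130) == line_index(128) == line_index(191)

# One byte further (192) falls on the next line and needs a separate fill.
assert line_index(192) == line_index(191) + 1
```

This is why tightly packed, sequential data uses the cache more efficiently than scattered data: each 64-byte fill brings in more bytes the program will actually touch.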

Caches are also set-associative. E.g. a 256kB 8-way cache isn't one big fully-searchable store; it's divided into sets, each holding 8 "ways" (slots for one cache line each). A specific memory address always maps to one specific set, determined by its address bits, but within that set the line can occupy any of the 8 ways. This also means the sets might not be evenly used, depending on the alignment of data in memory. Higher associativity generally improves hit rate, since more lines that collide on the same set can coexist, but each lookup has to compare more ways, which can cost latency and power; lower associativity is simpler to implement but suffers more conflict misses.
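For the curious, the address-to-slot mapping can be sketched as follows. (Strictly speaking, hardware calls these slot groups "sets", and the "N-way" figure is how many lines each set holds; the parameters below are example values in the shape of Skylake's client L2.)

```python
LINE = 64                            # bytes per cache line
CACHE_SIZE = 256 * 1024              # 256 kB total (example: Skylake client L2)
WAYS = 4                             # associativity: lines that can share one set
SETS = CACHE_SIZE // (LINE * WAYS)   # = 1024 sets in this configuration

def set_index(address: int) -> int:
    """Each address maps to exactly one set: line number modulo set count."""
    return (address // LINE) % SETS

# Addresses exactly SETS * LINE bytes apart collide in the same set:
stride = SETS * LINE                 # 64 kB for this configuration
assert set_index(0) == set_index(stride) == set_index(2 * stride)

# Up to WAYS such colliding lines can coexist in the set; one more would
# evict one of those already there (a "conflict miss").
```

This also shows why data alignment matters: arrays laid out at power-of-two strides can hammer a few sets while leaving the rest idle.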

Let's look at Skylake vs. Zen:
Skylake:
L1I: 32kB 8-way
L1D: 32kB 8-way (64 bytes per cycle, bidirectional?)
L2: 256kB 4-way (64 bytes per cycle, bidirectional?)

Zen:
L1I: 64kB 4-way (32 bytes per cycle?)
L1D: 32kB 8-way
L2: 512kB 8-way (32 bytes per cycle?)
This still doesn't tell the whole story, like how many clock cycles of latency each level has.

There is also one last thing: the prefetcher, which controls how the cache is used, but that's a subject of its own.
 
Last edited:
Sorry I made another thread. I don't mean to spam.:(
Also, please tell me if I am making too many threads. I don't mean it negatively. Actually I won't post any more today :x
Hahaha you should just create an @ArbitraryAffection curiosity and general questions thread :p

But yes, the cache speeds are linked to the FSB and multiplier of the CPU. You can see this quite simply by running the AIDA64 cache and memory benchmark at stock settings and then again after overclocking the CPU.
 
But yes, the cache speeds are linked to the FSB and multiplier of the CPU. You can see this quite simply by running the AIDA64 cache and memory benchmark at stock settings and then again after overclocking the CPU.
This isn't true for all CPUs. Sometimes the memory controller (which would include the L3 cache) has its own multiplier. I know the Phenom II CPUs I had worked this way, and X58 chips could individually alter the uncore multiplier as well. It's more accurate to say that the L1 and likely the L2 run at the core frequency; the L3 really varies from CPU to CPU.
 
L3 really varies from CPU to CPU.

Yep, I know with the latest Intel CPUs, the L3 has its own multiplier and hence its own clock speed separate from the CPU.
 
This isn't true for all CPUs. Sometimes the memory controller (which would include the L3 cache) has its own multiplier. I know the Phenom II CPUs I had worked this way, and X58 chips could individually alter the uncore multiplier as well. It's more accurate to say that the L1 and likely the L2 run at the core frequency; the L3 really varies from CPU to CPU.
I was referring more to his setup being Ryzen and my own experience of cache speed and overclocking on the Ryzen platform, but you are indeed correct. I remember things like uncore from my time with i7 920s, and before that AMD's HTT bus speed, which would all affect things like cache speed, latencies and other buses. It all seems a little oversimplified these days, though I probably wouldn't have a clue on either of those chipsets now, it's been so long lol.
 
Thanks for the replies ^^ OK, so I also heard somewhere that the L3 cache speed on Ryzen is potentially what is holding back the overall clock speed potential of the CPU cores, as it is tied to them (but I didn't know if that was a valid claim, as I was under the impression the cache was somehow clocked separately). So maybe L3 cache timing issues are the reason Zen doesn't clock as high. I am sure GloFo 14nm/12nm is capable of higher frequencies than ~4.2GHz, right?
 
I am sure GloFo 14nm/12nm is capable of higher frequencies than ~4.2GHz, right?
Nah. AMD actually has excellent quality control on these dies, considering 4.2GHz seems to be the magic number for just about every Zen-based die, from everything I've read. The L3 tends to be an issue because it shares a clock domain with the memory controller, which is dictated by the speed of your DRAM. So faster DRAM makes the L3 run faster, which translates into better performance regardless of core clock speed; however, higher core clocks might not reach their full potential if the IMC is the bottleneck.
 
Nah. AMD actually has excellent quality control on these dies, considering 4.2GHz seems to be the magic number for just about every Zen-based die, from everything I've read. The L3 tends to be an issue because it shares a clock domain with the memory controller, which is dictated by the speed of your DRAM. So faster DRAM makes the L3 run faster, which translates into better performance regardless of core clock speed; however, higher core clocks might not reach their full potential if the IMC is the bottleneck.
Oh, so the L3 runs at the same speed as the Infinity Fabric bus and DRAM? So it's 1.6GHz with my RAM. Hm. Slightly off topic, but do you think Zen 2 will put the IF and L3 in a separate clock domain from DRAM? I heard crossing clock domains can increase latency, but with potentially much, much higher IF and L3 speeds it could be beneficial? I mean, being able to set the Infinity Fabric clock independently of DRAM in the BIOS would be nice :)

Actually no; cache efficiency mainly comes down to three things: cache structure, latency and the prefetcher.

Cache doesn't work the way most people think it does. The cache is a "streaming buffer"; its contents are typically overwritten within a few microseconds. Memory is divided into what we call cache lines, which are 64 bytes on current x86 architectures. This means that whenever the CPU reads even one byte from a cache line, the entire line is cached, so anything else within those 64 bytes is cached too. If data is more spread out, the cache becomes less efficient, but this depends on the program.

Caches are also set-associative. E.g. a 256kB 8-way cache isn't one big fully-searchable store; it's divided into sets, each holding 8 "ways" (slots for one cache line each). A specific memory address always maps to one specific set, determined by its address bits, but within that set the line can occupy any of the 8 ways. This also means the sets might not be evenly used, depending on the alignment of data in memory. Higher associativity generally improves hit rate, since more lines that collide on the same set can coexist, but each lookup has to compare more ways, which can cost latency and power; lower associativity is simpler to implement but suffers more conflict misses.

Let's look at Skylake vs. Zen:
Skylake:
L1I: 32kB 8-way
L1D: 32kB 8-way (64 bytes per cycle, bidirectional?)
L2: 256kB 4-way (64 bytes per cycle, bidirectional?)

Zen:
L1I: 64kB 4-way (32 bytes per cycle?)
L1D: 32kB 8-way
L2: 512kB 8-way (32 bytes per cycle?)
This still doesn't tell the whole story, like how many clock cycles of latency each level has.

There is also one last thing: the prefetcher, which controls how the cache is used, but that's a subject of its own.
Oh wow, thanks for this explanation. I'm not sure I fully understand how it all works, but this is really informative, thanks. So is it safe to say Skylake has a better cache system than Zen? If so, does this also explain why Skylake is faster in games? I mean, are games really sensitive to cache performance?

I also heard about victim vs. inclusive caches. So essentially Zen's L3 is like a huge overflow for its L2, right? Whereas with Skylake, a program can load stuff directly into the L3, bypassing the L2, right? Also, SKL-X is like Zen in this regard, I think. Do you have any idea if this could also impact gaming performance? I would like to know the advantages and disadvantages of victim/inclusive caches, though.
 
Nah. AMD actually has excellent quality control on these dies, considering 4.2GHz seems to be the magic number for just about every Zen-based die, from everything I've read. The L3 tends to be an issue because it shares a clock domain with the memory controller, which is dictated by the speed of your DRAM. So faster DRAM makes the L3 run faster, which translates into better performance regardless of core clock speed; however, higher core clocks might not reach their full potential if the IMC is the bottleneck.
That's not to say that CPU clock speed alone doesn't affect the L3 cache on Ryzen. Case in point, my AIDA64 benches: same RAM speed, timings, etc., the only difference is stock CPU clock (3.7GHz single-core boost) versus a 3.9GHz all-core overclock. As you can see, there's a significant difference in all cache speeds between default and overclocked CPU speeds, even with the same RAM timings.

l3.png
 
That's not to say that CPU clock speed alone doesn't affect the L3 cache on Ryzen. Case in point, my AIDA64 benches: same RAM speed, timings, etc., the only difference is stock CPU clock (3.7GHz single-core boost) versus a 3.9GHz all-core overclock. As you can see, there's a significant difference in all cache speeds between default and overclocked CPU speeds, even with the same RAM timings.

l3.png
This says the Zen L3 cache is part of the core clock domain, so yeah, that kinda proves it, I think.
 
So is it safe to say Skylake has a better cache system than Zen? If so, does this also explain why Skylake is faster in games? I mean, are games really sensitive to cache performance?
To my understanding, cache is part of the reason, and I do believe Skylake has lower L1 latency and higher bandwidth, but a larger factor is its more efficient prefetcher.

I also heard about victim vs. inclusive caches. So essentially Zen's L3 is like a huge overflow for its L2, right? Whereas with Skylake, a program can load stuff directly into the L3, bypassing the L2, right? Also, SKL-X is like Zen in this regard, I think. Do you have any idea if this could also impact gaming performance? I would like to know the advantages and disadvantages of victim/inclusive caches, though.
An inclusive L3 stores a copy of what's in the L2, which is more wasteful, but helps if another core needs the data; that's rare, though, since as I said the caches are overwritten very quickly.
A victim cache means data is not stored directly (prefetched) into the L3, but only placed there when it's discarded from the L2. While there are machine instructions for prefetching, this is generally not controlled by the program, and the program is definitely not aware of where things are stored in the various caches. From the program's perspective, everything is stored in RAM.
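The victim-cache flow can be sketched with a toy LRU model; this is a simplification for intuition only, not how real replacement policies or tags work:

```python
from collections import OrderedDict

class TinyCache:
    """Toy LRU cache holding line tags, for illustration only."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def insert(self, tag):
        """Insert a line tag; return the evicted tag, if any."""
        if tag in self.lines:
            self.lines.move_to_end(tag)   # refresh LRU position on a hit
            return None
        evicted = None
        if len(self.lines) >= self.capacity:
            evicted, _ = self.lines.popitem(last=False)  # drop least-recent line
        self.lines[tag] = True
        return evicted

l2 = TinyCache(2)
l3 = TinyCache(4)

# Victim policy: the L3 receives only the lines that the L2 discards.
for tag in ["A", "B", "C"]:          # inserting "C" evicts "A" from L2
    victim = l2.insert(tag)
    if victim is not None:
        l3.insert(victim)

assert "A" in l3.lines               # the evicted line became an L3 "victim"
assert "C" not in l3.lines           # fresh fills bypass the L3 entirely
# An inclusive L3 would instead also be holding copies of "B" and "C".
```

The final comment is the whole difference: the victim L3 stores only unique (evicted) lines, while an inclusive L3 duplicates everything in the L2s below it.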

I don't think the victim cache is a disadvantage for gaming; there might be edge cases of course, but Skylake-X has performed well with this solution. The main advantage is of course storage efficiency, which means the capacity can be used for something else, effectively giving a larger L3.
 
To my understanding, cache is part of the reason, and I do believe Skylake has lower L1 latency and higher bandwidth, but a larger factor is its more efficient prefetcher.


An inclusive L3 stores a copy of what's in the L2, which is more wasteful, but helps if another core needs the data; that's rare, though, since as I said the caches are overwritten very quickly.
A victim cache means data is not stored directly (prefetched) into the L3, but only placed there when it's discarded from the L2. While there are machine instructions for prefetching, this is generally not controlled by the program, and the program is definitely not aware of where things are stored in the various caches. From the program's perspective, everything is stored in RAM.

I don't think the victim cache is a disadvantage for gaming; there might be edge cases of course, but Skylake-X has performed well with this solution. The main advantage is of course storage efficiency, which means the capacity can be used for something else, effectively giving a larger L3.
Thanks for explanation!!

One last question, can I ask please? Some Skylake-X CPUs perform much worse in gaming than others. I'm talking about the 7800X and, to a lesser extent, the 7820X. In many games the 7800X is worse than even a 2600X. Do you think this is because of the mesh interconnect between the cores, or the way the die is cut down? Btw, I heard the 7920X suffers performance issues too. That's the 12-core, IIRC the most cut down of the HCC die. Thanks so much for taking the time to explain to me. :love:
 
I am sure GloFo 14nm/12nm is capable of higher frequencies than ~4.2GHz, right?
The fabrication process is a low-power process, hence the clock speeds are lower and don't scale up.
 
Screenshot_20170718-015517.png
Screenshot_20170718-015539.png


LanOC.org has more recent results as well, still these will do.
One thing of note is that the L1-L2 channels work differently between the Intel and AMD architectures. Both have latency improvements in the L1 channels, but more so on Intel, especially read amplification. AMD's L2 is almost equal in throughput; the only benefit is latency, not read access time.
 
It's just marketing. Ever read one of our news posts about a new product? It's always something like "such and such manufacturer, an industry leader in whatever this article is about, today announced so and so new product...".

Even the node names themselves are marketing. Intel's 10nm (if it worked properly) is actually pretty close to TSMC's 7nm, despite the names suggesting it's a whole "3nm" larger.

I've read over and over again that GloFo's process, which has been used for making Zen chips so far, was not designed for high performance parts. It's a low power node, typically for smartphone chips and such.
 
Thanks for explanation!!

One last question, can I ask please? Some Skylake-X CPUs perform much worse in gaming than others. I'm talking about the 7800X and, to a lesser extent, the 7820X. In many games the 7800X is worse than even a 2600X. Do you think this is because of the mesh interconnect between the cores, or the way the die is cut down? Btw, I heard the 7920X suffers performance issues too. That's the 12-core, IIRC the most cut down of the HCC die. Thanks so much for taking the time to explain to me. :love:
My pleasure.

As far as I've seen, i7-7800X(6-core) performs in line with what we should expect for games; ahead of Broadwell-E, Haswell and Zen, but slightly behind higher clocked Kaby and Coffee Lake. Perhaps what you've seen is some kind of edge case? Or were you talking about highly overclocked CPUs?
i7-7800X(6-core) is a bit odd compared to its bigger brothers, it has the lowest boost clocks in the family, and also lacks the more aggressive turbo boost 3.0.
 
My pleasure.

As far as I've seen, i7-7800X(6-core) performs in line with what we should expect for games; ahead of Broadwell-E, Haswell and Zen, but slightly behind higher clocked Kaby and Coffee Lake. Perhaps what you've seen is some kind of edge case? Or were you talking about highly overclocked CPUs?
i7-7800X(6-core) is a bit odd compared to its bigger brothers, it has the lowest boost clocks in the family, and also lacks the more aggressive turbo boost 3.0.
:love:

I did see some seriously bad results for the 7800X but I can't dig them out right now as my mum wants me to do some housework :cry:. I will try to look up where I saw it when I'm done later today. But I did find this

https://www.techpowerup.com/235267/...core-i7-7700k-better-than-i7-7800x-for-gaming

and it shows the 7700K much faster, I think faster than the clock speed difference alone would allow, honestly. The 8700K seems to do much better. What is the all-core boost for the 7800X? The 8700K is 4.3 AFAIK and the 7700K is 4.4; not sure about the 7800X.
 
BTW, most Intel CPUs don't depend on the RAM controller/speed for L3, as this cache is usually on-die and runs at full processor speed. What is equally important as cache bandwidth is its size. The more cache you have, the more instructions and data can be kept close to the core and prefetched. But the architecture depends on many more factors, like the system agent, QPI, and the processor engineering itself; for instance, the differences between Ryzen and Intel in the main design. If we look at the AIDA64 processor section screenshot, we can see if the cache is running at full processor speed, and in CPU-Z you can see its properties, like 8-way, 16-way, etc.

03102019-111536.jpg
 
BTW, most Intel CPUs don't depend on the RAM controller/speed for L3, as this cache is usually on-die and runs at full processor speed. What is equally important as cache bandwidth is its size. The more cache you have, the more instructions and data can be kept close to the core and prefetched. But the architecture depends on many more factors, like the system agent, QPI, and the processor engineering itself; for instance, the differences between Ryzen and Intel in the main design. If we look at the AIDA64 processor section screenshot, we can see if the cache is running at full processor speed, and in CPU-Z you can see its properties, like 8-way, 16-way, etc.

View attachment 118425
Zen has 512kB of L2 though. I wonder why Skylake client can get away with half the L2 cache? A more efficient prefetcher? Well, actually, looking at Skylake server with 1MB of L2 per core, I'm not sure it makes a huge difference for gaming.
 
Cache is expensive, in every sense of the word. It's expensive to produce, sucks down power and kicks out a lot of heat. You don't want more cache than you need.

C2D had a lot of cache... the venerable E8400 had a whopping 6MB of L2, and that was just a dual core. Then the OG i7 came along with 256kB of L2 per core (1MB total). I assume that was because it sat on a much faster bus than its predecessors, plus the addition of an L3 cache.

Why Zen has more cache than Skylake, I'm not sure. Maybe there's more "stuff" in the Zen cores than Skylake cores, which warrants having more cache?
 
Cache is expensive, in every sense of the word. It's expensive to produce, sucks down power and kicks out a lot of heat. You don't want more cache than you need.

C2D had a lot of cache... the venerable E8400 had a whopping 6MB of L2, and that was just a dual core. Then the OG i7 came along with 256kB of L2 per core (1MB total). I assume that was because it sat on a much faster bus than its predecessors, plus the addition of an L3 cache.

Why Zen has more cache than Skylake, I'm not sure. Maybe there's more "stuff" in the Zen cores than Skylake cores, which warrants having more cache?
Come to think of it, I think Zen is a wider design than Skylake. For sure it has a wider FPU (not in total vector width, but more, narrower FPUs; higher granularity). And it does gain a bit more from SMT than Skylake in my testing and reading; the bigger L2 cache surely helps with keeping those 4 FP pipes fed. But IDK for sure.

edit: maybe SKL-X also needs the huge 1MB L2 cache for its AVX-512 FPU
 
Regardless of what you do, there is a basic trade-off with every cache: speed vs. capacity.
What is better for which level is up to the architecture.
In short:
Speed, both in terms of bandwidth and latency (clock cycles), may be more important than having more data stored on-die.
Think about it this way: getting any data to the execution units faster can be more important than the amount they actually get.
That's why Intel stuck to 32kB/256kB of L1/L2 for so long; it was probably the best compromise between capacity and speed for their architectures.

PS. @er557 Haswell(-E) has an L3 cache multiplier; the last CPU architecture with core-clock L3 cache (no multiplier) is Ivy Bridge(-E). Here's an X99 UEFI screenshot:
attachment.php

The AIDA64 tab you showcased only changes if the cache runs at half or quarter speed.
It doesn't detect whether the cache is actually linked to core speed.
See what you have under "NB Frequency" in the "Memory" tab of CPU-Z (it's usually the uncore/L3 cache clock).
 
Come to think of it, I think Zen is a wider design than Skylake. For sure it has a wider FPU (not in total vector width, but more, narrower FPUs; higher granularity). And it does gain a bit more from SMT than Skylake in my testing and reading; the bigger L2 cache surely helps with keeping those 4 FP pipes fed. But IDK for sure.

edit: maybe SKL-X also needs the huge 1MB L2 cache for its AVX-512 FPU

Well, servers are big and slow, but do a lot of work. That's why you see 32 core EPYC (server) chips. By contrast, desktops have smaller, faster cores. It's like comparing a Mack truck to a Ferrari. You don't use a fleet of Ferraris to haul cargo (like lots of web traffic), and you don't take an 18 wheeler to a race track (like running your favorite game at 165hz).
 