Monday, August 14th 2023

Intel Arrow Lake-S to Feature 3 MB of L2 Cache per Performance Core

Intel's next-generation designs are nearing launch, and information about the upcoming generations is already trickling out. Today we learn that Arrow Lake-S, the desktop/client implementation of the Arrow Lake family, will feature as much as 3 MB of level-two (L2) cache per performance core. Intel's current 13th-generation Raptor Lake and 14th-generation Raptor Lake Refresh processors feature 2 MB of L2 cache per performance core; however, the 15th-generation Arrow Lake, scheduled for launch in 2024, will bump that up by 50% to 3 MB. Given that the P-cores are getting a capacity boost, we expect the E-cores to receive one as well, though likely a smaller one.

Arrow Lake will utilize the Lion Cove P-core microarchitecture, while the E-core design will be based on Skymont. Intel plans to use its 20A node for this CPU, and more details will be presented next year.
Source: VideoCardz

36 Comments on Intel Arrow Lake-S to Feature 3 MB of L2 Cache per Performance Core

#26
Eskimonster
ToothlessPricing for it is actually fair.
The 5800X3D was just on sale for $269 at Amazon, that's why I ask.
Posted on Reply
#27
Space Lynx
Astronaut
EskimonsterThe 5800X3D was just on sale for $269 at Amazon, that's why I ask.
I paid $200 after tax because I had a coupon for Microcenter. So I saved 75 bucks and it runs cooler. I have no complaints.
Posted on Reply
#28
Eskimonster
Space LynxThe 5800X3D was just on sale for $269 at Amazon, that's why I ask. I paid $200 after tax because I had a coupon for Microcenter. So I saved 75 bucks and it runs cooler. I have no complaints.
That's very cheap for such a new CPU. I would also be happy with it.
Posted on Reply
#29
Minus Infinity
skatesSo, a 10% increase in performance vs. last generation, as per usual for decades now? I'm not being sarcastic as I've not delved into the numbers and just assume more of the same I've grown accustomed to over the years from Intel.
I think you'll be very surprised by Arrow Lake vs. Raptor Lake, not just in performance but in power efficiency. Time will tell, but I would expect Arrow Lake to thrash Raptor Lake. It'll need to, as Zen 5 is looking very strong with >20% IPC and a lot of architectural changes. Anyway, we'll get an idea soon with Meteor Lake, even if it's mobile-only, as Arrow Lake is a much-refined, more performant version of that.
Posted on Reply
#30
R0H1T
skatesSo, a 10% increase in performance vs. last generation, as per usual for decades now? I'm not being sarcastic as I've not delved into the numbers and just assume more of the same I've grown accustomed to over the years from Intel.
10% more has not been the pattern for decades; in fact, even last decade Intel couldn't get 10% more for five generations before they switched from SKL and its derivatives to RKL. Same goes for AMD with Zen -> Zen+, although that was just a minor shrink.
Posted on Reply
#31
efikkan
PunkenjoyThe larger the cache, the longer it takes to look it up. It's why, for example, the largest cache is the L3 for both Intel and AMD.
It depends on many factors.
Caches are usually organized in banks, which increases bandwidth substantially and offsets latency, but decreases overall cache efficiency (per cache line).
New node improvements may also lead to latency decreases.
And so on.
Punkenjoy...The smaller the region, the higher the hit ratio will be, but at the same time, the longer it will take to see if the data is in there. Working with cache isn't just "more is better"; it's a balance you strike when you design a CPU. For example, AMD frequently went with larger L1 and L2 but had slower cache speed. And for example, the Core 2 Duo had 3 MB of L2 per core (merged into a 6 MB shared L2). So 3 MB of L2 isn't new, but at that time they didn't have an L3.
Both AMD and Intel have increased and decreased their L1D/L1I and L2 caches over various generations; it all depends on the cache design and the priorities of the architecture. Comparing caches across CPU architectures solely by size is nearly pointless. And as I often say, performance is what ultimately matters.

Pretty much all current CPU architectures cache memory in regions of the same size, called a "cache line", which is currently 64 bytes on most x86 and ARM architectures.
I do expect them to move to 128 bytes eventually, as this would greatly benefit dense data accesses (which is where you have good hit rates anyway). Implementing e.g. 3 MB of L2 (128-byte cache lines) would not cost anywhere near 50% more die space than 2 MB of L2 (64-byte cache lines), and would have approximately the same latency, so a huge win in hit rates. This would also allow for 1024-bit SIMD, which is probably coming "soon".
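To make the cache-line idea above concrete, here is a minimal sketch (with illustrative parameters, not those of any specific CPU) of how a set-associative cache splits an address into tag, set index, and byte offset, and why doubling the line size halves the number of tags to store:

```python
# Split an address into (tag, set, offset) for a set-associative cache.
# Parameters are illustrative only, not those of any real CPU.
def decompose(addr, size=2 * 1024 * 1024, ways=16, line=64):
    sets = size // (line * ways)       # 2 MB / (64 B * 16 ways) = 2048 sets
    offset = addr % line               # byte within the cache line
    set_index = (addr // line) % sets  # which set the line maps to
    tag = addr // (line * sets)        # high bits, stored once per cache line
    return tag, set_index, offset

# Doubling the line size halves the number of lines (and thus stored tags)
# for the same capacity: 2 MB / 64 B = 32768 lines vs 2 MB / 128 B = 16384.
print(decompose(0x1234_5678))  # -> (2330, 345, 56)
```

With 128-byte lines the data array is the same size, but there are half as many tags and set-index bits shift accordingly; that is the overhead saving being described.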
Posted on Reply
#32
AnotherReader
efikkanIt depends on many factors.
Caches are usually organized in banks, which increases bandwidth substantially and offsets latency, but decreases overall cache efficiency (per cache line).
New node improvements may also lead to latency decreases.
And so on.


Both AMD and Intel have increased and decreased their L1D/L1I and L2 caches over various generations; it all depends on the cache design and the priorities of the architecture. Comparing caches across CPU architectures solely by size is nearly pointless. And as I often say, performance is what ultimately matters.

Pretty much all current CPU architectures cache memory in regions of the same size, called a "cache line", which is currently 64 bytes on most x86 and ARM architectures.
I do expect them to move to 128 bytes eventually, as this would greatly benefit dense data accesses (which is where you have good hit rates anyway). Implementing e.g. 3 MB of L2 (128-byte cache lines) would not cost anywhere near 50% more die space than 2 MB of L2 (64-byte cache lines), and would have approximately the same latency, so a huge win in hit rates. This would also allow for 1024-bit SIMD, which is probably coming "soon".
The data arrays won't decrease in size due to a larger line size, but the tag arrays would be smaller: you would need only 3*1024*1024/128 tags versus 2*1024*1024/64, i.e. 24K vs. 32K. Also note that the Pentium 4 had 128-byte lines for the L2 while the L1 stayed at 64 bytes.
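The tag-count arithmetic above is easy to verify; a quick back-of-the-envelope sketch (deliberately ignoring tag width, coherence state bits, and ECC):

```python
# One tag entry per cache line, so tag count = capacity / line size.
MiB = 1024 * 1024
tags_3mb_128b = 3 * MiB // 128  # proposed 3 MB L2 with 128-byte lines
tags_2mb_64b = 2 * MiB // 64    # current 2 MB L2 with 64-byte lines

# 50% more capacity, yet fewer tags to store and compare: 24576 vs 32768.
print(tags_3mb_128b, tags_2mb_64b)
```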
Posted on Reply
#33
Dan.G
I would like to see CPUs with L4 cache. So... how come that's not a thing? No performance gains? Design challenges? Unjustified prices?
Posted on Reply
#34
AnotherReader
Dan.GWould like to see CPUs with L4 cache. So... how come that's not a thing? No performance gains? Design challenges? Unjustified prices?
POWER8 and some SKUs of Haswell, Broadwell, and Skylake had L4 caches. For workloads with very large memory footprints, it can make sense, but for most workloads, a large L3 is better.
Posted on Reply
#35
skates
More cache and wider lanes please.
Posted on Reply
#36
efikkan
Dan.GWould like to see CPUs with L4 cache. So... how come that's not a thing? No performance gains? Design challenges? Unjustified prices?
To explain this, we first need to address how L3 works.
As you might already know, L3 is a spillover cache, which means it only contains cache lines discarded from L2. L3 is also accessible across cores, which is why it has some effect on multithreaded workloads. There is a tremendous amount of data constantly flowing through the caches, including lots of prefetched data that was ultimately unnecessary.

In terms of cache lines, the largest volume is data, while a smaller volume is instructions, but the chance of a single cache line being needed before it is evicted from L3 is much higher for instructions, especially from other cores. (The chance of another core needing much of the same data within nanoseconds is slim, except for explicit synchronization.) This is why CPUs need such large L3 caches before it starts to matter; in most cases where we see sensitivity to L3, it's due to instruction cache lines being shared, not data.

But we usually don't see significant gains from huge L3 caches in most computationally intense tasks, even though they churn through large amounts of data. This is because such applications are cache-optimized, which is one of the most important types of low-level optimization. As any low-level programmer can tell you, sensitivity to L3 usually means the code is too large, bloated and unpredictable, which is why the CPU evicts it from cache.
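The spillover behavior described above can be sketched with a toy model (a hypothetical two-level LRU hierarchy, nothing like a real implementation): lines evicted from the small L2 are inserted into the L3, so the L3 only ever holds what the L2 discarded.

```python
from collections import OrderedDict

class VictimHierarchy:
    """Toy model: L3 is filled only by L2 evictions (a victim/spillover cache)."""
    def __init__(self, l2_lines=4, l3_lines=8):
        self.l2 = OrderedDict()  # line address -> True, kept in LRU order
        self.l3 = OrderedDict()
        self.l2_lines, self.l3_lines = l2_lines, l3_lines

    def access(self, line):
        if line in self.l2:                 # L2 hit
            self.l2.move_to_end(line)
            return "L2"
        hit = "L3" if self.l3.pop(line, None) else "MEM"
        self.l2[line] = True                # fill the line into L2
        if len(self.l2) > self.l2_lines:    # evict the LRU line from L2 ...
            victim, _ = self.l2.popitem(last=False)
            self.l3[victim] = True          # ... and spill it into L3
            if len(self.l3) > self.l3_lines:
                self.l3.popitem(last=False)
        return hit

h = VictimHierarchy()
for line in [0, 1, 2, 3, 4, 0]:
    h.access(line)
# Line 0 was evicted from L2 when line 4 arrived, so the final access
# to line 0 hits in L3 rather than going all the way to memory.
```

Note that nothing enters this L3 directly from memory; it only catches L2 victims, which is the sense in which a larger L3 mostly helps code and data that keeps getting evicted.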

Even though huge L3s make an appreciable difference in some games and select applications, I don't believe it's a good direction for CPU development. It costs a tremendous amount of die space and doesn't yield meaningful gains for most heavy workloads. That die space and development effort could be spent on much more useful improvements that would benefit most workloads. But I guess this is what we get when people are more focused on synthetic benchmarks than real-world results. Just think about it: slapping a whole extra cache die on the CPU makes less of a difference than a minor architectural upgrade (~10% IPC gains). That's a crude brute-force approach that extracts very little overall. And this is why I'm not for L4; its usefulness would be even lower, especially with a larger L3. I do believe there is one way L3 could become more cost-effective, though: splitting instructions and data. Then a much smaller L3 pool could have the same effect as 100 MB or so, at a small cost.

I'm much more excited about real architectural improvements, such as much wider execution. The difference between well-written and poorly written software will only become clearer over time, as well-written software will continue to scale.
skatesMore cache and wider lanes please.
PCIe lanes?
Posted on Reply