
AMD "Zen 7" Rumors: Three Core Classes, 2 MB L2, 7 MB V‑Cache, and TSMC A14 Node

I am just looking at the differences between the 7950X and 9950X and the lack of clock improvements even though there is a node change. In actual fact, the 9950X has a 200 MHz DECREASE in base clock over the 7950X even though it's on the better node.
It's not a full node change. N4 is basically an optimized version of the main node, N5. Minor change. A full node change (N7->N5, N5/N4->N3, N3->N2) or a major architectural change, such as unifying two 4-core CCXs into one 8-core CCX with shared L3 cache, brings significant improvements. We will need to wait and see how Zen6 pans out. The 9600X also has a lower base frequency than the 7600X, by 800 MHz.
 
It's so "awful" that buyers must be masochists to dare to enjoy it.


Did you look into the article and diagrams? For Zen7, 7MB of cache per core on V-cache chiplets. On 33-core EPYC die, that's 231MB per V-cache chiplet.
We don't know whether the same V-cache chiplets would be used for desktop CPUs.
I did look, hence why I was asking, speculating about desktop chips. 7MB of cache per core is actually a regression in cache size per core compared to the current variants.
 
Currently, each core has 4MB of primary L3 cache on the 8-core CCD and 2MB on the 16-core dense CCD. Where is the 'regression'?

7MB of cache per core is planned for 33-core chiplets on EPYCs. Did you watch Tom's video, which is the basis for this article?

You might be confusing server chips with client chips.
 
I'm talking about client, not server-based chips (maybe we are getting our wires crossed). V-cache chiplets have 12MB of L3 per chiplet (or 16 on the 6-core X3D CCD variants).

I'm too daft to dig into server CPUs atm.
 
Don't you mean per core?
A chiplet/CCD in current desktop Zen CPUs has 32MB of L3; for the 8-core parts, that's 4MB per core.
I don't think the stacked V-cache is being taken into consideration, since that's a different piece that can be bolted on. Anyhow, the V-cache adds another 64MB to the CCD it's bolted onto, and it's the same cache die that's used in both Ryzen and EPYC CPUs.
 
Yeah, I mean per core, but I now think we have a lost-in-translation thing here, where I am assuming that @Tek-Check is calling cores "chiplets"; if that's the case, I am talking about cores.
If Tek is saying the chiplets are NOT cores and are something else, then we have a lost-in-translation/terminology issue and I've made a mistake.

But

Did you look into the article and diagrams? For Zen7, 7MB of cache per core on V-cache chiplets. On 33-core EPYC die, that's 231MB per V-cache chiplet.
We don't know whether the same V-cache chiplets would be used for desktop CPUs.

This is where I am confused: we are talking about V-cache, not standard parts with 32MB of cache.
 
Yeah, fair enough, I guess there were some communication issues indeed.

7MB per core with V-cache would actually be a regression compared to the performance offerings we currently have (which, as you mentioned, have 12MB per core on a full CCD). But we don't know how many cores those rumored chiplets are going to have, and it depends on what core config we're comparing it against, given that dense offerings only have 2MB of L3 per core, with the perf non-X3D at 4MB and the X3D ones at 12MB. The source of this post has wondered about the same thing as well:
An expansion of the L3 cache to 7 MB per core is planned via 3D V-Cache. Depending on the CCD variant, however, the current L3 cache is 12 MB per core (32 + 64 MB for the CCD with 3D V-Cache and Zen 5 cores) or 2 MB per core for a CCD with 16 Zen 5c cores and a total of 32 MB of L3 cache.
If I were to guess, I believe those 7MB per core with V-cache is referring to a dense config.

Anyhow, given that the rumors say the V-cache is going to move to 4nm instead of the current 7nm, I guess we can expect the amount of cache to actually increase as well.
Fun fact: the V-cache die has stayed almost the same, made on 7nm, since Zen 3 (source)
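For a rough sense of what that node move could buy, here's a back-of-the-envelope sketch. The ~1.3x SRAM density factor is purely an assumption for illustration, not a figure from the article or the rumor.

Code:
# Rough sketch: how much SRAM could fit in roughly the same footprint if the
# V-cache die moved from 7nm to a 4nm-class node. The density factor is an
# assumption; real macro density depends on the actual bit cell and layout.
current_vcache_mb = 64           # 3D V-cache capacity since Zen 3
assumed_sram_density_gain = 1.3  # hypothetical N7 -> N4-class SRAM scaling

iso_area_capacity = current_vcache_mb * assumed_sram_density_gain
print(f"~{iso_area_capacity:.0f} MB at roughly the same die area")  # ~83 MB
# A meaningfully larger V-cache capacity would therefore need either a somewhat
# larger die or a denser cache macro than this naive estimate assumes.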
 
This is where I am confused: we are talking about V-cache, not standard parts with 32MB of cache.
If I were to guess, I believe those 7MB per core with V-cache is referring to a dense config.
Ryzen
1. Zen3/4/5 - 32MB of L3 per CCD chiplet (plus 64MB on 3D V-cache chiplet for gaming CPU), so 4MB per core on vanilla and 12MB per core on gaming
2. Zen6 - 48MB of L3 per 12-core CCD chiplet, so the same 4MB per core, but 3D V-cache chiplet is rumoured to increase from 64MB to 96 MB; we will see
3. Zen7 - we don't know now, but minimally the same as on Zen6

EPYC vanilla - the same as above
EPYC dense c cores
1. Zen5c - 32MB of L3 per 16-core CCD chiplet, so 2MB per c core; no 3D V-cache version has been planned this generation on dense c cores
2. Zen6c - 128MB of L3 per 32-core CCD chiplet is rumoured, so 4MB per c core
3. Zen7c - leaked diagram shows 231MB of L3 on the cache chiplet beneath the 33-core CCD chiplet, so a 7MB slice of L3 per c core; there are 33 slices on the cache die; this is the instance where the CCD does not have its own L3 cache at all, and the entire L3 becomes a separate die, aka a stacked L3 chiplet

In all instances, L3 cache increases per core.
Does that make sense now?
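To make the arithmetic in the list above easy to check, here is a minimal sketch; the Zen 6 and Zen 7 figures are the rumoured numbers from this thread, not confirmed specs.

Code:
# Per-core L3 arithmetic for the configurations listed above.
configs = {
    # name: (base L3 per CCD in MB, stacked V-cache in MB, cores per CCD)
    "Zen 3/4/5 vanilla":       (32,   0,  8),
    "Zen 3/4/5 X3D":           (32,  64,  8),
    "Zen 5c dense (EPYC)":     (32,   0, 16),
    "Zen 6 (rumoured)":        (48,   0, 12),
    "Zen 6 X3D (rumoured)":    (48,  96, 12),
    "Zen 6c dense (rumoured)": (128,  0, 32),
    "Zen 7c dense (rumoured)": (0,  231, 33),  # all L3 on the stacked die
}

for name, (base, stacked, cores) in configs.items():
    total = base + stacked
    print(f"{name:26s} {total:4d} MB total -> {total / cores:.1f} MB per core")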
 

So I assume it's going to remain 12MB per core if we get 12-core 3D V-cache variants then. Expected, but I was secretly hoping they'd amp the cache up (50% more, maybe double, depending on how much the process node shrink lets them add without eating more space than it currently does).

Yeah it makes sense now, the Epyc chip was the misunderstanding on my part.
 
Plenty for those who need it.
I need ALL of the cache.

No, but seriously, I would love to know how much doubling the cache (for example) would help gaming benchmark numbers. If 3D V-cache gave you, say, 20%, but doubling the cache again only made it 25%, then I agree, 12MB is probably enough (for now).
 
Plenty for those who need it.
I think it also depends on whether they fix the memory controller and bring the latency down. At least it's obvious the non-3D-cache variants are cache starved, and the low memory bandwidth exacerbates this - hence the need for 3D cache in the first place.
 
Hmm. The current memory controller, any latency issues, and the amount of cache are interesting for tech nerds to debate, but they are still not hampering AMD's lead, both in productivity and especially in gaming. As AMD has several different lines of desktop CPUs, that allows everyone to find something for their desired use case. Arrow Lake does not sell much, which allowed AMD to collect another billion in client CPU revenue in Q1.
[Attached image: 3DCenter Zen 5 / 9950X3D meta-review chart]
 
My guess, based on a number of factors, would be Autumn 2026.

Yeah, I might do a Microcenter campout in Fall 2026 for launch day, upgrade my 7800X3D to an 11800X3D, and possibly do a GPU upgrade on the same day.
 
Also, remember that Ryzen 9000 was rushed, even if nobody talks about it anymore. No reason to repeat that.. does anyone remember this 800 post thread lol
https://www.techpowerup.com/forums/threads/why-everyone-say-zen-5-is-bad.325345/
Yeah, but I also don't think even some substantial competition would make a difference; what matters is that AMD is selling.
I don't think there's really that much headroom for rushing it anyway; it messes up so many things for them, and they need their shiny, pricey EPYCs first and foremost.
What part of the design was rushed?
Designing CPUs is a tedious multi-year process with deadlines and more and more constraints, so anything major must be designed early on, years ahead of the final release.
What I see is an underpowered front-end not especially capable of saturating the two extra ALUs they added, leading to lower performance gains than "expected".

Base L3 cache should increase in line with the core counts, but I am thinking more that the extra cache die should increase as well (hopefully)
That would be a very costly way to achieve very diminishing returns, resources which would be much better spent elsewhere. Slapping on a lot of L3 cache is more of a desperate move in the absence of other advancements, as it really doesn't help the overall throughput of the CPU much. If anything, increasing the L2 cache along with much better branch prediction would make more sense.

About time AMD fixed the latency and cache starvation issues of the architecture. Hopefully they fix the awful memory controller too!
The trend for the past decade or so is that the latency of caches is slightly increasing over time, meanwhile bandwidth is increasing rapidly, much faster than the gains in performance. Caches simply don't work the way most people expect; they are more like streaming buffers where loads of data constantly flow through, and just making them larger and faster is extremely hard, and keeping them synchronized is getting harder, especially with ever-increasing core counts. This is one of the reasons why we see cache sizes sometimes go down with a new architecture, as even the slightest improvement in branch prediction, prefetching etc. will have massive effects on the entire cache hierarchy. So even though cache "hides latency", it's not as simple as throwing more cache at everything. AMD's current design is, if anything, undersaturated due to a weak front-end, and improving this will alleviate the effects of said latency and cache starvation. So improving the front-end is key, along with improvements in ISA and software, to continue performance scaling in the long term, not just adding more and more cache forever.
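To put the diminishing-returns point in rough numbers, here is a toy average-memory-access-time (AMAT) calculation; all latencies and hit rates below are made-up illustrative values, not measurements of any real CPU.

Code:
# Illustrative only: AMAT with assumed latencies and hit rates, to show why
# extra L3 capacity helps less and less once the L3 hit rate is already high.
def amat(l1_hit, l2_hit, l3_hit,
         l1_lat=4, l2_lat=14, l3_lat=50, dram_lat=300):  # cycles (assumed)
    l1_miss = 1 - l1_hit
    l2_miss = 1 - l2_hit
    l3_miss = 1 - l3_hit
    return l1_lat + l1_miss * (l2_lat + l2_miss * (l3_lat + l3_miss * dram_lat))

base    = amat(0.95, 0.80, 0.50)   # hypothetical baseline L3 hit rate
bigger  = amat(0.95, 0.80, 0.70)   # more L3 -> better L3 hit rate
biggest = amat(0.95, 0.80, 0.80)   # doubling again buys much less
print(f"baseline: {base:.1f} cycles, more L3: {bigger:.1f}, even more L3: {biggest:.1f}")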

Waiting for reviews is for wimps.
Any good keyboard warrior knows to make the most bold predictions based on nothing but gibberish, and to defend them to the bitter end like their honor depended on it!
 
Since moving to the MCM design from Zen 2 onwards, the biggest weakness of the multi-CCD design is getting data from the IO die to a CCD, and especially from one CCD to another. They have worked on that aspect a lot with the 9xxx series, but it's still very obvious that when you really start loading the CCDs, the Infinity Fabric/memory controller is the key weakness, causing both cache and core starvation. X3D has mitigated this weakness by keeping more data on the CCD, meaning overall there is less traffic going across the IF.

That would be a very costly way to achieve very diminishing returns, resources which would be much better spent elsewhere. Slapping on a lot of L3 cache is more of a desperate move in the absence of other advancements, as it really doesn't help the overall throughput of the CPU much. If anything, increasing the L2 cache along with much better branch prediction would make more sense.

"Slapping on a lot of L3" is one of the most scalable answers AMD currently has. L1 cache is one of the most expensive in terms of die size to increase usually due to its width and L2 isnt much better either. Also L1 and L2 are a per core change where as the L3 is something either every core can share or be dedicated to a single thread on the fly so in single/lightly threaded workloads you may be commiting a lot of die size for only a small performance improvement.

In regards to branch prediction, it is always an area being improved upon, but there are always going to be prediction failures causing cache misses. IMO, until there is a major rework of IF with something like the patent I posted earlier, this will always be THE key limitation of the design, and it will only become more of an issue as core counts per CCD increase.

AMD have even highlighted in EPYC that this can be mitigated on certain SKUs:
The I/O die used in all 4th Gen AMD EPYC processors has 12 Infinity Fabric connections to CPU dies. Our CPU dies can support one or two connections to the I/O die
In processor models with four CPU dies, two connections can be used to optimize bandwidth to each CPU die. This is the case for some EPYC 9004 Series CPUs and all EPYC 8004 Series CPUs
In processor models with more than four CPU dies, such as in the EPYC 9004 Series, one Infinity Fabric connection ties each CPU die to the I/O die

They have used the same IO die in both Zen 4 and 5, so the above applies to both current architectures.
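As a rough, hedged illustration of why the number of links matters, here is a back-of-the-envelope bandwidth comparison; the fabric clock and per-cycle link widths below are commonly cited figures for recent Zen generations, used here as assumptions rather than official specs.

Code:
# Assumed figures, not official specs: per-CCD Infinity Fabric link bandwidth
# vs. dual-channel DDR5 bandwidth, to show why a single GMI link can bottleneck
# a loaded CCD and why the dual-link mode on some EPYC SKUs helps.
fclk_ghz        = 2.0    # assumed fabric clock
read_bytes_clk  = 32     # commonly cited read width per GMI link (assumption)
write_bytes_clk = 16     # commonly cited write width per GMI link (assumption)

link_read_gbs  = fclk_ghz * read_bytes_clk    # ~64 GB/s per link
link_write_gbs = fclk_ghz * write_bytes_clk   # ~32 GB/s per link

ddr5_mts = 6000
dram_gbs = ddr5_mts * 8 * 2 / 1000            # 2 channels x 8 bytes, ~96 GB/s peak

print(f"1x GMI link : ~{link_read_gbs:.0f} GB/s read, ~{link_write_gbs:.0f} GB/s write")
print(f"2x GMI links: ~{2 * link_read_gbs:.0f} GB/s read")
print(f"DDR5-{ddr5_mts} dual channel: ~{dram_gbs:.0f} GB/s theoretical peak")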
 
"Slapping on a lot of L3" is one of the most scalable answers AMD currently has. L1 cache is one of the most expensive in terms of die size to increase usually due to its width and L2 isnt much better either. Also L1 and L2 are a per core change where as the L3 is something either every core can share or be dedicated to a single thread on the fly so in single/lightly threaded workloads you may be commiting a lot of die size for only a small performance improvement.
That's not how it works at all.
Even though L3 is cheaper per MB, L2 and L3 work very differently, as L3 only contains recently discarded cache lines. Most workloads see no benefit beyond a certain L3 size (or even a performance penalty), because at that point you'll almost only get hits on instruction cache lines, which is why we mostly see this benefit in e.g. low-resolution gaming and certain other edge cases. Additionally, adding even more L3 would see even smaller benefits.

A larger L2, on the other hand, lets the prefetcher work completely differently, and going from e.g. 1->2 MB of L2 would, for most workloads, have a larger impact than adding 64 MB of L3. But L2 is much more deeply connected to the pipeline, so adding more isn't trivial.
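To make the "L3 only contains recently discarded cache lines" point concrete, here is a toy sketch of a victim-style L3 in which lines only enter L3 when they are evicted from L2; this is a didactic illustration, not a model of the actual Zen cache hierarchy.

Code:
# Toy victim-style L3: lines land in L3 only when evicted from L2, so L3 mostly
# holds recently discarded data. Didactic only, not real hardware behaviour.
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()          # address -> None, kept in LRU order

    def access(self, addr):
        """Return (hit, evicted_address_or_None)."""
        if addr in self.lines:
            self.lines.move_to_end(addr)
            return True, None
        self.lines[addr] = None
        evicted = None
        if len(self.lines) > self.capacity:
            evicted, _ = self.lines.popitem(last=False)
        return False, evicted

l2 = LRUCache(capacity=4)
l3 = LRUCache(capacity=8)

for addr in [1, 2, 3, 4, 5, 1, 6, 2]:
    in_l3 = addr in l3.lines                # check the victim L3 before touching L2
    hit, victim = l2.access(addr)
    if victim is not None:
        l3.access(victim)                   # the L2 victim is what lands in L3
    where = "L2 hit" if hit else ("L3 hit" if in_l3 else "memory")
    print(addr, where, "| L3 now holds:", list(l3.lines))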

In regards to branch prediction, it is always an area being improved upon, but there are always going to be prediction failures causing cache misses.
You are mixing up the terms a bit here.
A branch misprediction doesn't necessarily cause a cache miss, though it certainly can indirectly. If the block of code is dense and contains no function calls or pointers needing to be dereferenced, then the penalty will only be the stalled and flushed pipeline and decoding overhead. There can, however, be a function call there that isn't in cache, leading to even a sequence of cache misses, in a sense causing a chain reaction.

And obviously, cache misses can happen without any misprediction if the front-end simply can't keep up. This is a case where a larger L2 will help somewhat, but not more L3, as L3 is just discarded L2, so if the front-end is the bottleneck, more L3 isn't going to make you prefetch more.

While the front-end is always improved upon, AMD have stayed a couple of steps behind Intel for the entire Zen family. Their back-end is much better though, which is why we see Zen excel with large batch loads, but Intel retains an edge when it comes to responsiveness in user-interactive applications.

IMO, until there is a major rework of IF with something like the patent I posted earlier, this will always be THE key limitation of the design, and it will only become more of an issue as core counts per CCD increase.
I fail to see the correlation between performance data, specs, and how the IF should be the key limitation in the design. Zen 5 scales brilliantly with heavy AVX-512 loads on all cores, which is very data intensive but easy on the front-end, yet it comes up short in logic-heavy code. And while every tiny reduction in latency will help a tiny bit, improving the front-end and its associated cache which is fed into the pipeline (L2) will have several orders of magnitude more impact, as it doesn't just affect the cache hits in L2 (cache misses in L1), it also affects the cases when the front-end doesn't fail - so the front-end can keep up the prefetching and doesn't cause any cache miss, which is something L3 doesn't help with at all.
 
That's not how it works at all.
Even though L3 is cheaper per MB, L2 and L3 work very differently, as L3 only contains recently discarded cache lines. Most workloads see no benefit beyond a certain L3 size (or even a performance penalty), because at that point you'll almost only get hits on instruction cache lines, which is why we mostly see this benefit in e.g. low-resolution gaming and certain other edge cases. Additionally, adding even more L3 would see even smaller benefits.

A larger L2, on the other hand, lets the prefetcher work completely differently, and going from e.g. 1->2 MB of L2 would, for most workloads, have a larger impact than adding 64 MB of L3. But L2 is much more deeply connected to the pipeline, so adding more isn't trivial.
The performance penalty for a larger L3 cache is still very minuscule, however, due to the still relatively low sizes we are talking about currently. When we get up into the 256MB+ total L3 per CCD range, I can imagine it will be something interesting to start benchmarking in earnest, and I can imagine in things like high-speed trading that the latency penalty would actually be something that would exclude those parts from consideration in production.

I 100% agree that a larger L2 and L1 would definitely be more beneficial to everyone, but the die space cost is per core; you then have to multiply that by 8 cores currently, and going to 12 or 16 cores per CCD (which I believe is the plan for Zen7), it really starts adding up massively. This is why I am saying that, CURRENTLY, adding L3 is the best option for AMD from a business perspective, as doubling or quadrupling L1 and L2 would basically eliminate the benefits they have achieved by going down the MCM path, since the CCDs would just be physically too big to scale effectively the way they do now.
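To show the shape of that per-core versus per-CCD cost argument, here is a minimal sketch; the mm² figures are placeholders made up for illustration, not real die measurements.

Code:
# Purely hypothetical numbers: private L1/L2 growth is paid once per core,
# while extra shared L3 (or a stacked cache die) is paid once per CCD.
extra_l2_area_per_core_mm2 = 1.0   # hypothetical cost of doubling L2 per core
extra_l3_area_per_ccd_mm2  = 8.0   # hypothetical cost of extra shared L3

for cores_per_ccd in (8, 12, 16):
    l2_cost = extra_l2_area_per_core_mm2 * cores_per_ccd
    print(f"{cores_per_ccd:2d} cores: bigger L2 costs ~{l2_cost:.0f} mm^2 "
          f"vs ~{extra_l3_area_per_ccd_mm2:.0f} mm^2 for the shared L3 block")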

I wasn't aware that the L3 was purely a "victim" cache design in AMD's implementation, but that makes more sense as to why only certain workloads benefit from such a cache increase. Intel seems to have implemented an L4 cache that behaved like this previously, but their current L3 is more similar to AMD's L2, just accessible to all cores.

You are mixing up the terms a bit here.
A branch misprediction doesn't necessarily cause a cache miss, though it certainly can indirectly. If the block of code is dense and contains no function calls or pointers needing to be dereferenced, then the penalty will only be the stalled and flushed pipeline and decoding overhead. There can, however, be a function call there that isn't in cache, leading to even a sequence of cache misses, in a sense causing a chain reaction.

And obviously, cache misses can happen without any misprediction if the front-end simply can't keep up. This is a case where a larger L2 will help somewhat, but not more L3, as L3 is just discarded L2, so if the front-end is the bottleneck, more L3 isn't going to make you prefetch more.

Yeah, I should have worded that a lot better. What I was attempting to say is that, no matter how much the branch prediction/front-end improves with the current design, the moment the CCD needs to reach out beyond itself, it is hamstrung by a half-speed interconnect to the IO die as well as a subpar memory controller (vs Intel's offering).

While the front-end is always improved upon, AMD have stayed a couple of steps behind Intel for the entire Zen family. Their back-end is much better though, which is why we see Zen excel with large batch loads, but Intel retains an edge when it comes to responsiveness in user-interactive applications.
AMD have been behind in this aspect for a very long time; however, with Intel abandoning SMT, I do wonder how much they will still gain from this.


I fail to see the correlation between performance data, specs, and how the IF should be the key limitation in the design. Zen 5 scales brilliantly with heavy AVX-512 loads on all cores, which is very data intensive but easy on the front-end, yet it comes up short in logic-heavy code. And while every tiny reduction in latency will help a tiny bit, improving the front-end and its associated cache which is fed into the pipeline (L2) will have several orders of magnitude more impact, as it doesn't just affect the cache hits in L2 (cache misses in L1), it also affects the cases when the front-end doesn't fail - so the front-end can keep up the prefetching and doesn't cause any cache miss, which is something L3 doesn't help with at all.

The reason why I call IF the "key limitation" is that it is not a full-speed interconnect between the CCDs and the IO die; add to that the fact that AMD's memory controller has always been effectively a generation behind Intel's, and you have another layer of bottleneck. I just found it interesting that AMD offers, on certain parts, the ability to double up on the bandwidth by having 2 interconnects between each CCD and the IO die, so there is obviously performance to be gained by doing this.

I wish I could be in AMD's labs to test a 9980X setup with the dual links AND with X3D equipped, just out of complete curiosity.
 
Desktop applications are poorly optimized for multicore processing or cannot be optimized for more than 1 core or thread. They require cores with very high IPC.

Pile up "E-cores" on a desktop CPU is a dumb idea.

The ideal has always been to develop cores with very high IPC so they perform tasks as quickly as possible and then enter a low-power-consumption state.
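As a quick illustration of why lightly threaded desktop code favors fewer, faster cores, here is Amdahl's law with an arbitrary 60% parallel fraction (an example number, not a measurement of any real application).

Code:
# Amdahl's law: if only part of a workload is parallel, adding cores quickly
# stops helping, while a faster single core speeds up everything.
def amdahl_speedup(parallel_fraction, n_cores):
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_cores)

p = 0.60  # arbitrary example parallel fraction
for n in (2, 4, 8, 16, 32):
    print(f"{n:2d} cores: {amdahl_speedup(p, n):.2f}x")
# With p = 0.6 the speedup can never exceed 1 / (1 - 0.6) = 2.5x,
# no matter how many cores are added.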
 
The smart thing about E-cores, c-cores, and little cores is that they allow a CPU to have really large cores for making lightly threaded tasks fast, while also having a lot of cores for heavily threaded tasks. It's what allowed Alder Lake to have the single-threaded and multi-threaded performance to match or beat Vermeer, when Vermeer had chiplets and Alder Lake was monolithic.
 