Thursday, May 15th 2025

AMD "Zen 7" Rumors: Three Core Classes, 2 MB L2, 7 MB V‑Cache, and TSMC A14 Node
AMD is already looking ahead to its Zen 7 generation and is planning the final details of its next-generation Zen IP. The first hints come from YouTuber "Moore's Law Is Dead," who points to a few interesting decisions. AMD plans to extend the multi-class core strategy that began with Zen 4c and continued into Zen 5. Zen 7 will reportedly include three types of cores: the familiar performance cores, dense cores built for maximum throughput, and a new low-power variant aimed at energy-efficient tasks, similar to Intel's LP E-cores. The leak even mentions unspecified "PT" and "3D" core variants. By swapping out pipeline modules and tweaking their internal libraries, AMD can fine-tune each core so it performs best in its intended role, from running virtual machines in the cloud to handling AI workloads at the network edge.
On the manufacturing front, Zen 7 compute chiplets (CCDs) are expected to be made on TSMC's A14 process, which will now include a backside power delivery network. This was initially slated for the N2 node but got shifted to the A16/A14 line. The 3D V‑Cache SRAM chiplets underneath the CCDs will remain on TSMC's N4 node. It is a conservative choice, since TSMC has talked up using N2‑based chiplets for stacked memory in advanced packaging, but AMD appears to be playing it safe. Cache sizes should grow, too. Each core will get 2 MB of L2 cache instead of the current 1 MB, and L3 cache per core could expand to 7 MB through stacked V‑Cache slices. Standard CCDs without V‑Cache will still have around 32 MB of shared L3. A bold rumor suggests an EPYC model could feature 33 cores per CCD, totaling 264 cores across eight CCDs. Zen 7 tape‑out is planned for late 2026 or early 2027, and we probably won't see products on shelves until 2028 or later. As always with early-stage plans, take these details with a healthy dose of skepticism. The final Zen 7 lineup could look quite different once AMD locks down its roadmap.
Sources:
Moore's Law Is Dead, via HardwareLuxx
113 Comments on AMD "Zen 7" Rumors: Three Core Classes, 2 MB L2, 7 MB V‑Cache, and TSMC A14 Node
7MB of cache per core is planned for 33-core chiplets on EPYCs. Did you watch Tom's video, which is the basis for this article?
You might be confusing server chips with client chips.
I'm too daft to dig into server CPUs atm.
A chiplet/CCD in current desktop Zen CPUs has 32MB of L3 per CCD; for the 8-core parts, that's 4MB per core.
I don't think the stacked V-cache is being taken into consideration, since that's a separate piece that can be bolted on. Anyhow, the V-cache adds another 64MB to each CCD it's bolted onto, and it's the same cache die used in both Ryzen and EPYC CPUs.
If Tek is saying the chiplets are NOT cores and are something else, then we have a lost-in-translation/terminology issue and I've made a mistake.
But this is where I am confused: we are talking about V-cache, not standard parts with 32MB of cache.
7MB per core with V-cache would actually be a regression compared to the performance offerings we currently have (which, as you mentioned, have 12MB per core on a full CCD). But we don't know how many cores those rumored chiplets are going to have, and it depends on which core config we're comparing against, given that dense offerings only have 2MB of L3 per core, the non-X3D performance parts 4MB, and the X3D ones 12MB. The source of this post has wondered about the same thing as well: if I were to guess, I believe those 7MB per core with V-cache refer to a dense config.
Anyhow, given that the rumors say the V-cache is going to move to N4 instead of the current 7nm, I guess we can expect the amount of cache to actually increase as well.
Fun fact: the V-cache die has stayed almost the same, made on 7nm, since Zen 3 (source)
1. Zen3/4/5 - 32MB of L3 per CCD chiplet (plus 64MB on 3D V-cache chiplet for gaming CPU), so 4MB per core on vanilla and 12MB per core on gaming
2. Zen6 - 48MB of L3 per 12-core CCD chiplet, so the same 4MB per core, but 3D V-cache chiplet is rumoured to increase from 64MB to 96 MB; we will see
3. Zen7 - we don't know now, but minimally the same as on Zen6
EPYC vanilla - the same as above
EPYC dense c cores
1. Zen5c - 32MB of L3 per 16-core CCD chiplet, so 2MB per c core; no 3D V-cache version has been planned this generation on dense c cores
2. Zen6c - 128MB of L3 per 32-core CCD chiplet is rumoured, so 4MB per c core
3. Zen7c - leaked diagram shows 231MB of L3 on a cache chiplet beneath the 33-core CCD chiplet, so a 7MB slice of L3 per c core; there are 33 slices on the cache die; this is the instance where the CCD does not have its own L3 cache at all, and the entire L3 becomes a separate die, aka a stacked L3 chiplet
In all instances, L3 cache increases per core.
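Since the per-core arithmetic tripped a few of us up, here's a minimal sketch that just runs the numbers from the list above. The Zen 6/7 rows are rumored/leaked figures, not confirmed specs, and the struct layout is my own illustration:

```c
/* Per-core L3 arithmetic for the CCD configs listed above.
 * Rows marked "rumored" are leaked figures, not confirmed specs. */
#include <stdio.h>

struct ccd { const char *name; int cores; int l3_mb; int vcache_mb; };

int main(void) {
    struct ccd configs[] = {
        {"Zen 3/4/5 vanilla",     8,  32,   0},
        {"Zen 3/4/5 X3D",         8,  32,  64},
        {"Zen 6 (rumored)",      12,  48,   0},
        {"Zen 6 X3D (rumored)",  12,  48,  96},
        {"Zen 5c",               16,  32,   0},
        {"Zen 6c (rumored)",     32, 128,   0},
        {"Zen 7c (rumored)",     33,   0, 231}, /* all L3 on the stacked die */
    };
    for (size_t i = 0; i < sizeof configs / sizeof configs[0]; i++) {
        int total = configs[i].l3_mb + configs[i].vcache_mb;
        printf("%-20s %2d cores, %3d MB L3 -> %4.1f MB/core\n",
               configs[i].name, configs[i].cores, total,
               (double)total / configs[i].cores);
    }
    return 0;
}
```

Running it reproduces the figures above: 4 MB/core vanilla, 12 MB/core with X3D, 2 MB/core on Zen 5c, and 231/33 = 7 MB/core on the rumored Zen 7c.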
Does that make sense now?
Yeah it makes sense now, the Epyc chip was the misunderstanding on my part.
No, but seriously, I would love to know how much doubling the cache (for example) would help gaming benchmark numbers. If 3D V-cache gave you, say, 20% but doubling the cache again only made it 25%, then I agree, 12MB is probably enough (for now)
Designing CPUs is a tedious multi-year process with deadlines and more and more constraints, so anything major must be designed early on, years ahead of the final release.
What I see is an underpowered front-end not capable of saturating the two extra ALUs they added, leading to lower performance gains than "expected". That would be a very costly way to achieve very diminishing returns, resources which would have been much better spent elsewhere. Slapping on a lot of L3 cache is more of a desperate move in the absence of other advancements, as it really doesn't help the overall throughput of the CPU much. If anything, increasing the L2 cache along with much better branch prediction would make more sense.

The trend for the past decade or so is that cache latency is slightly increasing over time, while bandwidth is increasing rapidly, much faster than the gains in performance. Caches simply don't work the way most people expect; they are more like streaming buffers through which loads of data constantly flow, and just making them larger and faster is extremely hard. Keeping them synchronized is getting harder too, especially with ever-increasing core counts. This is one of the reasons why we sometimes see cache sizes go down with a new architecture: even the slightest improvement in branch prediction, prefetching etc. has massive effects on the entire cache hierarchy. So even though cache "hides latency", it's not as simple as throwing more cache at everything.

AMD's current design is, if anything, undersaturated from a weak front-end, and improving this will alleviate the effects of said latency and cache starvation. So improving the front-end, along with improvements in ISA and software, is key to continued performance scaling in the long term, not just adding more and more cache forever. Any good keyboard warrior knows to make the boldest predictions based on nothing but gibberish, and to defend them to the bitter end like their honor depended on it!
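As an aside on the branch-prediction point: a classic way to see how much predictability matters is to time the same data-dependent branch over random vs. sorted data. This is a generic illustration I'm adding (not from the video or this thread), and the exact numbers depend entirely on your machine and compiler; at high optimization levels the compiler may turn the branch into a cmov and hide the effect:

```c
/* Times the same data-dependent branch over random vs. sorted data.
 * Once sorted, the branch predictor stops missing and the loop speeds
 * up dramatically. An illustration, not a rigorous benchmark. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24) /* 16M elements */

static volatile long long sink; /* keeps the loop from being optimized away */

static int cmp(const void *a, const void *b) {
    return (int)*(const unsigned char *)a - (int)*(const unsigned char *)b;
}

static double time_sum(const unsigned char *data) {
    clock_t t0 = clock();
    long long sum = 0;
    for (size_t i = 0; i < N; i++)
        if (data[i] >= 128)   /* the branch the predictor must guess */
            sum += data[i];
    sink = sum;
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

int main(void) {
    unsigned char *data = malloc(N);
    if (!data) return 1;
    srand(42);
    for (size_t i = 0; i < N; i++) data[i] = (unsigned char)(rand() & 0xFF);

    printf("unsorted: %.3f s\n", time_sum(data)); /* ~50% mispredict rate */
    qsort(data, N, 1, cmp);
    printf("sorted:   %.3f s\n", time_sum(data)); /* near-perfect prediction */
    free(data);
    return 0;
}
```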
In regards to branch prediction, it is always an area being improved upon, but there are always going to be prediction failures causing cache misses. IMO, until there is a major rework of IF with something like the patent I posted earlier, this will always be THE key limitation of the design, and it will only become more of an issue as core counts per CCD increase.
AMD have even highlighted this in EPYC, where it can be mitigated on certain SKUs.
www.amd.com/content/dam/amd/en/documents/epyc-business-docs/white-papers/221704010-B_en_4th-Gen-AMD-EPYC-Processor-Architecture---White-Paper_pdf.pdf They have used the same IO die in both Zen 4 and 5, so the above applies to both current architectures.
Even though L3 is cheaper per MB, L2 and L3 work very differently, as L3 only contains recently discarded cache lines. Most workloads see no benefit beyond a certain L3 size (or even a performance penalty), because at that point you'll almost only get hits on instruction cache lines, which is why we mostly see this benefit in e.g. low-resolution gaming and certain other edge cases. Additionally, adding even more L3 would see even smaller benefits.
A larger L2, on the other hand, lets the prefetcher work completely differently, and going from e.g. 1 MB to 2 MB of L2 would for most workloads have a larger impact than adding 64 MB of L3. But L2 is much more deeply connected to the pipeline, so adding more isn't trivial. You are mixing up the terms a bit here.
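For anyone who wants to see the "working set vs. cache level" effect directly, here's a minimal pointer-chasing sketch of my own (not from the thread or its sources): measured load latency steps up each time the chased ring outgrows a cache level. The sizes and iteration counts below are arbitrary assumptions:

```c
/* Pointer-chasing latency sketch: a random single-cycle permutation
 * defeats the prefetcher, and each load depends on the previous one,
 * so the loop measures latency, not bandwidth. Output is machine-
 * dependent; treat it as an illustration. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static volatile size_t chase_sink; /* keeps the chase loop alive */

static double ns_per_access(size_t n_ptrs, long iters) {
    size_t *perm = malloc(n_ptrs * sizeof *perm);
    size_t *next = malloc(n_ptrs * sizeof *next);
    /* Build a random single-cycle permutation. */
    for (size_t i = 0; i < n_ptrs; i++) perm[i] = i;
    for (size_t i = n_ptrs - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (size_t i = 0; i < n_ptrs; i++)
        next[perm[i]] = perm[(i + 1) % n_ptrs];

    clock_t t0 = clock();
    size_t p = 0;
    for (long i = 0; i < iters; i++) p = next[p]; /* serial dependent loads */
    double ns = 1e9 * (clock() - t0) / CLOCKS_PER_SEC / (double)iters;

    chase_sink = p;
    free(perm); free(next);
    return ns;
}

int main(void) {
    srand(1);
    for (size_t kb = 16; kb <= 64 * 1024; kb *= 4)
        printf("%6zu KB working set: %5.1f ns/access\n",
               kb, ns_per_access(kb * 1024 / sizeof(size_t), 10 * 1000 * 1000L));
    return 0;
}
```

On a typical current Zen part you'd expect the latency plateaus to sit roughly at the 32 KB L1, 1 MB L2, and 32 MB L3 boundaries, with the jump to DRAM after that.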
A branch misprediction doesn't necessarily cause a cache miss, though it certainly can indirectly. If the block of code is dense and contains no function calls or pointers needing to be dereferenced, then the penalty is only the stalled and flushed pipeline plus decoding overhead. There can, however, be a function call there that isn't in cache, leading to a whole sequence of cache misses, so in a sense it can cause a chain reaction.
And obviously, cache misses can happen without any misprediction if the front-end simply can't keep up. This is a case where a larger L2 will help somewhat, but more L3 won't: L3 is just discarded L2, so if the front-end is the bottleneck, more L3 isn't going to make you prefetch more.
While the front-end is always being improved upon, AMD has stayed a couple of steps behind Intel for the entire Zen family. Their back-end is much better though, which is why we see Zen excel at large batch loads, while Intel retains an edge when it comes to responsiveness in user-interactive applications. I fail to see the correlation between performance data, specs, and how the IF should be the key limitation in the design. Zen 5 scales brilliantly with heavy AVX-512 loads on all cores, which is very data-intensive but easy on the front-end, yet it comes up short in logic-heavy code. And while every tiny reduction in latency helps a tiny bit, improving the front-end and its associated cache feeding the pipeline (L2) will have several orders of magnitude more impact: it doesn't just speed up cache hits in L2 (misses in L1), it also covers the cases where the front-end doesn't fail, so the prefetching keeps up and no cache miss occurs at all, which is something L3 doesn't help with.
I 100% agree that a larger L2 and L1 would definitely be more beneficial to everyone, but the die space cost per core, which you then have to multiply by 8 cores currently, and by the 12 or 16 cores per CCD I believe Zen 7 will bring, really starts adding up massively. This is why, CURRENTLY, I am saying that adding L3 is the best option for AMD from a business perspective: doubling or quadrupling L1 and L2 would basically eliminate the benefits they have achieved by going down the MCM path, as the CCDs would simply be physically too big to scale as effectively as they do now.
I wasn't aware that the L3 is purely a "victim" cache in AMD's implementation, but that makes more sense as to why only certain workloads benefit from such a cache increase. Intel seems to have implemented an L4 cache that behaved like this previously, but their current L3 is more similar to AMD's L2, just accessible to all cores.

Yeah, I should have worded that a lot better. What I was attempting to say is that no matter how much the branch prediction/front end improves, with the current design the moment the CCD needs to reach out beyond itself it is hamstrung by a half-speed interconnect to the IO die as well as a subpar memory controller (vs. Intel's offering). AMD has been behind in this aspect for a very long time, though with Intel abandoning SMT I do wonder how much they will still gain from it. The reason I call IF the "key limitation" is that it is not a full-speed interconnect between the CCDs and the IO die; add to that the fact that AMD's memory controller has always been an effective generation behind Intel's, and you get another layer of bottleneck. I just found it interesting that on certain parts AMD offers the ability to double the bandwidth by having two interconnects between each CCD and the IO die, so there is obviously performance to be gained by doing this.
I wish I could be in AMD's labs to test a 9980X setup with the dual links AND X3D, just out of pure curiosity.
Piling up "E-cores" on a desktop CPU is a dumb idea.
The ideal has always been to develop cores with very high IPC so they perform tasks as quickly as possible and then enter a low-power state.
Something like this: