
Intel Plans to Copy AMD's 3D V-Cache Tech in 2025, Just Not for Desktops

It was based on AMD's interposer technology for the first HBM stacks in 2015. Intel copied that too, and so did Nvidia.

Yes, AMD designed a chip around packaging technology that is TSMC's; the 3D packaging technology is TSMC's, not AMD's.

The below is from my second link.

If you want TSMC's official link, you can find it here: https://3dfabric.tsmc.com/english/dedicatedFoundry/technology/3DFabric.htm

In fact, here is TSMC's press release introducing the packaging: Introducing TSMC 3DFabric: TSMC’s Family of 3D Silicon Stacking, Advanced Packaging Technologies and Services - Taiwan Semiconductor Manufacturing Company Limited

 
That's just plainly wrong.
Most core AVX operations are within 1-5 cycles on recent architectures. Haswell and Skylake did a lot to improve AVX throughput, but there have been several improvements since then too. E.g. add operations are now down from 4 to 2 cycles on Alder Lake and Sapphire Rapids. Shift operations are down to a single cycle. That is as fast as scalar integer operations. And FYI, all floating point operations go through the vector units; whether it's a scalar operation, SSE or AVX, the latency will be the same. ;)
Oh, I think I confused them with the x87 ones that take forever. But I was mostly talking about the complex AVX instructions, not just add/multiply, because I am pretty sure there are a few that take 20-40 cycles.
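For anyone who wants to sanity-check the latency numbers above on their own machine, here is a minimal sketch (not a rigorous microbenchmark, and the build flags and iteration count are just assumptions): it times a dependent chain of AVX adds, and dividing the nanoseconds per iteration by your core's cycle time gives a rough latency in cycles.

```cpp
// Minimal latency sketch: each _mm256_add_ps depends on the previous result,
// so the loop runs at roughly one add-latency per iteration.
// Build (assumption): g++ -O2 -mavx2 avx_latency.cpp
#include <immintrin.h>
#include <chrono>
#include <cstdio>

int main() {
    const long iters = 200'000'000;
    __m256 acc = _mm256_set1_ps(1.0f);
    const __m256 inc = _mm256_set1_ps(1e-7f);

    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < iters; ++i)
        acc = _mm256_add_ps(acc, inc);        // serial dependency -> latency bound
    auto t1 = std::chrono::steady_clock::now();

    float out[8];
    _mm256_storeu_ps(out, acc);               // keep the result live so the loop isn't removed
    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    std::printf("%.2f ns per dependent AVX add (check: %f)\n", ns / iters, out[0]);
}
```
A real measurement would pin the thread, warm up the clocks and use rdtsc, but this is enough to see the difference between latency-bound and throughput-bound code.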

Yes, AMD designed a chip around packaging technology that is TSMC's; the 3D packaging technology is TSMC's, not AMD's.

The below is from my second link.

If you want TSMC's official link, you can find it here: https://3dfabric.tsmc.com/english/dedicatedFoundry/technology/3DFabric.htm

In fact, here is TSMC's press release introducing the packaging: Introducing TSMC 3DFabric: TSMC’s Family of 3D Silicon Stacking, Advanced Packaging Technologies and Services - Taiwan Semiconductor Manufacturing Company Limited

Wasn't Micron the first one to release 3D stacking with HBM (4 stacks, IIRC)?

With a product released in 2015 by AMD, with packaging on an interposer?
Same as what Intel later calls "Foveros", just with an active vs. passive interposer.
 
You don't grasp the difference between L2 and L3 caches. L3 only contains data recently discarded by L2.
L3 victim cache*
A bit pedantic, given how almost all (if not all) CPUs use the L3 as a victim cache, but I think it's important to explain that that's what causes the behaviour you mentioned w.r.t. the L2<->L3 relationship.
 
Your PC has something wrong, maybe slow RAM?
On my second PC with a 9900K and a 4090 there is no difference in 4K gaming vs. my main system with a 7800X3D and the same GPU.
Even at 1440p there are no big differences.

That depends on your software.

And whether it's visible to a human, which is a big topic of discussion: yes vs. no.
I do not see those 45 FPS on my "FreeSync" ASUS PA278QV monitor with a Radeon 7800 XT in Windows 11 Pro.

--

For a few months in 2023 I had a Ryzen 3 3100, at the time I sold my Ryzen 5800X and my B550 mainboard. For daily usage this cheap CPU, bought second hand for 30 € and sold for 30 €, was totally fine. I did not see any difference in GNU Gentoo Linux. The software is always compiling alongside my PC usage; it does not really matter if it takes a few more minutes in the background, the Linux kernel handles the load quite well.

--

I see the issue more with badly designed compilers. There should be more optimisations per processor. I think only a few packages use the AVX-512 instruction set of my Ryzen 7600X in GNU Gentoo Linux, either while compiling or while executing the software.
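To make the compiler point concrete, here is a small sketch: the same loop can end up as SSE, AVX2 or AVX-512 code depending purely on -march, which on Gentoo is what your CFLAGS/CXXFLAGS control. The file name and flags below are illustrative assumptions, not a recommendation.

```cpp
// saxpy.cpp (hypothetical file name) - a trivially vectorizable loop.
// Build examples (assumptions, adjust to your toolchain):
//   g++ -O3 -march=x86-64-v2 -c saxpy.cpp   # SSE-class code
//   g++ -O3 -march=znver4    -c saxpy.cpp   # may emit AVX-512 for a Ryzen 7600X
// Inspect the result with `objdump -d saxpy.o` and look for xmm/ymm/zmm registers.
#include <cstddef>

void saxpy(float a, const float* x, float* y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];   // the vector ISA the compiler uses depends on -O/-march
}
```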
 
It was based on AMD's interposer technology for the first HBM stacks in 2015. Intel copied that too, and so did Nvidia.


I have never understood the point of R&D in a company. As you mention here, AMD invested in something, but then Intel and Nvidia just copied it: a win-win for them, saving on R&D costs. Yet if AMD tries to copy the CUDA programming language, they will die... I will never understand that world. Sounds shady as hell if you ask me.
 
I always thought the eDRAM in those Intel processors was for the iGPU...
It typically is, but if you disable the iGPU in the BIOS, the CPU cores get exclusive access to the eDRAM. I have heard it does shave off another ~2 MB of L3 cache in order to store the tags necessary to address the 128 MB slice, though. Funnily enough, I've recently been messing around with an old i7-5775C.
 
I have never understood the point of R&D in a company. As you mention here, AMD invested in something, but then Intel and Nvidia just copied it: a win-win for them, saving on R&D costs. Yet if AMD tries to copy the CUDA programming language, they will die... I will never understand that world. Sounds shady as hell if you ask me.
Much like Intel pays AMD for x86-64, I would wager their engineering costs are covered by royalties. How many years did AMD bleed money? They were propped up at least somewhat by their competitors through licensing fees.
 
What a shame! Intel's desktop lineup could really use such a boost.
To be fair, they don't. I checked TPU's latest review: a 13600K can deliver, on average, 2 to 3 times more frames at 720p than the 4090 can do at 4K. It depends on the game, of course, but if you include all the games tested in the 9800X3D review, we need much, much faster GPUs for CPUs to play any important role. There are only two games where a CPU faster than the 13600K would matter, and in both of those the framerate was at 120 and above (BG3 and Starfield).
 
I have never understood the point of R&D in a company. As you mention here, AMD invested in something, but then Intel and Nvidia just copied it: a win-win for them, saving on R&D costs. Yet if AMD tries to copy the CUDA programming language, they will die... I will never understand that world. Sounds shady as hell if you ask me.

There is a lot of software written in CUDA, and the quality of the tools for CUDA ensures that this trend won't stop anytime soon.

AMD doesn't copy CUDA; they just want to be able to execute CUDA code.
 
To be fair, they don't.
Sure they do. Why wouldn't they? Are you kidding? The reason AMD currently has the gaming throne is directly because of the Ryzen CPUs with stacked 3D cache. Without it, well, where are the non-3D-cache AMD CPUs ranked? Right, there's the answer. If Intel were to employ something similar and get it right, their CPU lineup would return rather handily to the top spot.
 
On my second PC with a 9900K and a 4090 there is no difference in 4K gaming vs. my main system with a 7800X3D and the same GPU.
This could be for a few reasons, like the games you play, or the FPS you're comfy with or targeting.

The 7800X3D is undeniably a much faster gaming CPU and will give you a significantly higher FPS ceiling (and better 1%/0.1% lows) vs. a 9900K @ 5 GHz. There absolutely can be situations where both provide beyond the FPS you target in certain games, or where you're GPU-limited, but that does nothing to tell you the true gaming performance of a CPU.

Excellent video covering it here. This literally has absolutely nothing to do with the brand of the CPU.
 
As someone who got the i7-5775C five years ago and the Ryzen 5800X3D two and a half years ago, I still can't understand why there is even a conversation about the big cache. It reminds me of back in the day: is a 2-core CPU better than a higher-clocked single-core CPU? Does Hyper-Threading actually do something? Why should I use a 64-bit CPU? I mean, the benchmarks are clear and there is nothing to argue about with them.

The i7-5775C was a good CPU, end of story. If someone doesn't know how to properly configure it or doesn't want to bother, just get yourself an iMac this year and then the next iMac next year.

 
The 7800X3D is undeniably a much faster gaming CPU and will give you a significantly higher FPS ceiling (and better 1%/0.1% lows) vs. a 9900K @ 5 GHz.
But that's only for non-4K resolutions. At 4K, a 4090 is going to render the same general results with ANY CPU that doesn't bottleneck it, and that's a big damn list. The 7800X3D only counts for resolutions under 4K. The benchmarks here at TPU and elsewhere bear that out.
 
But that's only for non-4K resolutions. At 4K, a 4090 is going to render the same general results with ANY CPU that doesn't bottleneck it, and that's a big damn list. The 7800X3D only counts for resolutions under 4K. The benchmarks here at TPU and elsewhere bear that out.
The video makes some great points, however. Say you're GPU-limited at 70 FPS on all ultra settings with a 4090 in a given game, and the 9900K can give you 80 FPS; then sure, you're all good. But what if you want 120 FPS and are willing to lower settings/use upscaling etc. to get there? The 9900K will make your FPS ceiling 80 FPS, and no amount of lowering settings, reducing resolution or using upscaling will improve that. It heavily depends on the game and the user's preferences, but it absolutely applies to 4K too if you want to game at higher FPS. For someone happy with 30/40/60 it matters a fair bit less, for sure, but in many games I'm absolutely willing to lower visuals to get the balance of visuals and FPS to my taste, and a CPU absolutely can matter there.

I also recall that W1zzard's 4090 vs. 53 games test on a 5800X vs. 5800X3D had some games showing differences at 4K. There's very much a large dose of "it depends" on this one, but I do find the math/science of it undeniable; it's more a case of whether that matters to the individual or not.
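A toy model of that ceiling argument, just to make the arithmetic explicit (all numbers below are made up for illustration): the frame rate you see is roughly set by the slower of the CPU and GPU per-frame times, so lowering settings only helps until the CPU becomes the limiter.

```cpp
// Toy "FPS ceiling" model, not a benchmark: FPS ~= 1000 / max(cpu_ms, gpu_ms).
// Lowering settings/resolution mostly shrinks the GPU time, so once the CPU is
// the bottleneck, no settings change raises the ceiling.
#include <algorithm>
#include <cstdio>

double fps(double cpu_ms_per_frame, double gpu_ms_per_frame) {
    return 1000.0 / std::max(cpu_ms_per_frame, gpu_ms_per_frame);
}

int main() {
    double cpu_ms        = 1000.0 / 80.0;   // CPU alone could deliver ~80 FPS (illustrative)
    double gpu_ultra_ms  = 1000.0 / 70.0;   // GPU-bound at ultra: ~70 FPS
    double gpu_medium_ms = 1000.0 / 160.0;  // GPU could do ~160 FPS at lower settings

    std::printf("Ultra:  %.0f FPS\n", fps(cpu_ms, gpu_ultra_ms));   // ~70, GPU-bound
    std::printf("Medium: %.0f FPS\n", fps(cpu_ms, gpu_medium_ms));  // ~80, stuck at the CPU ceiling
}
```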
 
Intel's been copying AMD's tech for decades, nothing new here people. :laugh:

Sure they do. Why wouldn't they? Are you kidding? The reason AMD currently has the gaming throne is directly because of the Ryzen CPUs with stacked 3D cache. Without it, well, where are the non-3D-cache AMD CPUs ranked? Right, there's the answer. If Intel were to employ something similar and get it right, their CPU lineup would return rather handily to the top spot.
Not everybody is looking for the X3D CPUs, though they do provide the best gaming.
The top 15 best sellers on Amazon have 2 Zen CPUs with 3D V-Cache; the rest are all Ryzens and 2 Intel CPUs.
 
Intel's been copying AMD's tech for decades, nothing new here people. :laugh:


Not everybody is looking for the X3D CPUs, though they do provide the best gaming.
The top 15 best sellers on Amazon have 2 Zen CPUs with 3D V-Cache; the rest are all Ryzens and 2 Intel CPUs.

Really, so why did Intel thrash AMD for over 10 years then?
 
The claim that X3D cache improves performance only in games is busted with Zen 5.
You can see improvements in any workload that relies heavily on memory, e.g. rendering.
 
They didn't manage to reverse engineer Bulldozer, so they couldn't copy it, that's why.

Bulldozer wasn't worth reverse engineering

Anyway, this is OT, so let's just stop this, OK?
 
We need moar cache to be the new moar cores and moar ghz
 
The claim that X3D cache improves performance only in games is busted with Zen 5.
You can see improvements in any workload that relies heavily on memory, e.g. rendering.
Was it? AFAIK performance is pretty much the same with or without the extra cache for most of those. The difference between a 9800X3D and a 9700X in those tasks was mostly due to the V-Cache model having a higher power limit and clocking a bit higher because of that.
Rendering is one of the cases where the extra cache makes no difference, as far as I've seen.

The only cases where the extra cache is really worth it are CFD/HPC stuff and some other specific database workloads.
 
Oh, I think I confused them with the x87 ones that take forever. But I was mostly talking about the complex AVX instructions, not just add/multiply, because I am pretty sure there are a few that take 20-40 cycles.
There are many instructions that are very slow, although they are usually a tiny fraction of the workload, if present at all.
Interestingly enough, Ice Lake/Rocket Lake brought the legacy FMUL down from 5 to 4 cycles, as well as integer division (IDIV) from 97 cycles down to 18 cycles.

For comparison, Intel's current CPUs have 4 cycles for multiplication, 11 cycles for division of fp32 using AVX, and 5 cycles for integer multiplication using AVX. (official spec)

As for "worst case" performers of legacy x87 instructions: examples are FSQRT(square root) at 14-21 cycles, sin/cos/tan ~50-160 cycles and the most complex; FBSTP at 264, but this one is probably not very useful today. FDIV is 14-16 cycles (so slightly slower than its AVX counterpart). And for comparison, in Zen 4, legacy x87 instructions seems to be overall lower latency than Intel. All of these figures are from agner.org and are benchmarked, so a grain of salt, but they are probably good approximations.

Many think "legacy" instructions are holding back the performance of modern x86 CPUs, but that's not true. Since the mid 90s, they've all translated the x86 ISA to their own specific micro-operations, and this is also how they support x87/MMX/SSE/AVX through the same execution ports; the legacy instructions are translated to micro-ops anyways. This allows them to design the CPUs to be as efficient as possible with the new features, yet support the old ones. If the older ones happens to have worse latency, it's usually not an issue, as applications that rely on those are probably very old. One thing of note is that x87 instructions are rounding off differently than normal IEEE 754 fp32/fp64 does.

L3 victim cache*
A bit pedantic, given how almost all (if not all) CPUs use the L3 as a victim cache, but I think it's important to explain that that's what causes the behaviour you mentioned w.r.t. the L2<->L3 relationship.
It's not pedantic at all; you missed the point.
The prefetcher only feeds the L2, not the L3, so anything in L3 must first be prefetched into L2 and then eventually evicted to L3, where it remains for a short window before being evicted from there too. Adding a lot of extra L3 only means the "garbage dump" will be larger, while adding just a tiny bit more L2 would allow the prefetcher to work differently. In other words, a larger L3 doesn't mean you can prefetch a lot more; it just means that the data you've already prefetched anyway stays a little longer.
Secondly, as I said, the stream of data flowing through L3 all comes from memory->L2, so the overall bandwidth here is limited by memory, even though the small portion you read back will have higher burst speed.

Software that will be more demanding in the coming years will be more computationally intensive, so over the long term the faster CPUs will be favored over those with more L3 cache. Those that are very L3 sensitive will remain outliers.
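If anyone wants to see those cache levels directly, here is a rough pointer-chasing sketch; the sizes, iteration count and the single-cycle permutation trick are just illustrative assumptions, and a real measurement would pin the thread and repeat runs. Expect latency steps roughly where the working set outgrows L2 and then L3.

```cpp
// Dependent (pointer-chasing) loads over a random cycle of a given size.
// Build (assumption): g++ -O2 chase.cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

static double ns_per_load(std::size_t bytes) {
    const std::size_t n = bytes / sizeof(std::size_t);
    std::vector<std::size_t> order(n), next(n);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::shuffle(order.begin(), order.end(), std::mt19937_64{42});
    for (std::size_t i = 0; i < n; ++i)
        next[order[i]] = order[(i + 1) % n];   // one random cycle through the whole buffer

    std::size_t idx = 0;
    const long iters = 20'000'000;
    const auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < iters; ++i)
        idx = next[idx];                        // each load depends on the previous one
    const auto t1 = std::chrono::steady_clock::now();

    static volatile std::size_t sink;           // keep idx live so the loop isn't removed
    sink = idx;
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / iters;
}

int main() {
    for (std::size_t kib : {256, 1024, 4096, 16384, 65536, 262144})
        std::printf("%7zu KiB working set: %5.1f ns per load\n", kib, ns_per_load(kib * 1024));
}
```
The random access order is there precisely because of the point above: it defeats the prefetchers, so what you measure is where the data actually sits, not how well it was prefetched.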
 
In other words, a larger L3 doesn't mean you can prefetch a lot more; it just means that the data you've already prefetched anyway stays a little longer.
First, excellent explanation as a whole. Second, I think the quote above is a key point and likely the biggest reason why a larger L3 cache shows improvement in some workloads. Keeping data resident in cache longer only helps hit rates, assuming latency remains constant. I'm sure there is a tipping point: in a relatively tight loop, if you have slightly too much data for the cache, you'll be evicting things before the loop starts from the beginning again, and a little more cache might get you over that hurdle. An example of this might be a tight rendering loop in a game.
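To illustrate that tipping point, here is a toy LRU simulation (purely illustrative; real CPU caches are set-associative with smarter replacement policies, so the real cliff is softer): a loop cycling over N lines against a cache of C lines goes from nearly all hits to nearly all misses the moment N exceeds C.

```cpp
// Toy LRU cache model: cyclically touch `working_set_lines` lines against a
// cache holding `cache_lines`. With N <= C almost everything hits after the
// first pass; with N = C + 1 each line is evicted just before it is reused.
#include <cstdio>
#include <list>
#include <unordered_map>

double lru_hit_rate(int cache_lines, int working_set_lines, int passes) {
    std::list<int> lru;                                   // front = most recently used
    std::unordered_map<int, std::list<int>::iterator> where;
    long hits = 0, accesses = 0;
    for (int p = 0; p < passes; ++p) {
        for (int line = 0; line < working_set_lines; ++line) {
            ++accesses;
            auto it = where.find(line);
            if (it != where.end()) {                      // hit: move line to the front
                ++hits;
                lru.erase(it->second);
            } else if ((int)lru.size() == cache_lines) {  // miss with a full cache: evict LRU
                where.erase(lru.back());
                lru.pop_back();
            }
            lru.push_front(line);
            where[line] = lru.begin();
        }
    }
    return double(hits) / accesses;
}

int main() {
    const int cache = 1024;                               // cache capacity in lines (illustrative)
    for (int ws : {512, 1024, 1025, 2048})
        std::printf("working set %4d lines: hit rate %.2f\n", ws, lru_hit_rate(cache, ws, 50));
}
```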
 