AMD "Zen 2" IPC 29 Percent Higher than "Zen"

looncraz · Nov 13, 2018

HTC said:
Are you referring to the 1st chip on 14 nm or to the 1st Zen chip on 14 nm? IIRC, when Zen was introduced, there were several chips being manufactured @ 14 nm, meaning the process was much more mature than 7 nm, where Zen 2 is the 2nd chip (the 1st is Apple's A12 chip).

What is your source for the Zen 2 CCX chiplet size? From what I've read, the Zen 2 CCX chiplet measures roughly 73 mm², while yours is almost 10 mm² smaller.

For reference, I got those measurements from this post @ Anandtech forums.

When I made the pic in my previous reply, I was under the impression the CCX chiplet size was 72 mm² and that the chiplet was a square instead of a rectangle.

According to the die calculator page, those scribe values are invalid: it accepts either 0.1 or 0.15, but not 0.12.

Based on the current information, and with a defect density of 0.25, we get this (7.3 is also an invalid number for width, so I improvised):

View attachment 110433

The range is from ~64mm^2 to ~72mm^2, which is why I also include a good range in my estimates, but I focused on the smaller size since I was using it to estimate the IO die size to see how much room was left over after everything we know had to be there was inside (a healthy 120mm^2 of die space with an unknown purpose...).

I'll redo the measurements using a better image of Rome.

The 4094 package is 58.5x75.4mm. About 10.5~10.75 chiplet widths fit along the 75.4mm edge, giving a range of 7.01~7.18mm for the width. About 5.75~6.0 chiplet heights fit along the 58.5mm edge, giving a range of 9.75~10.17mm for the height. The range is a necessity to correct for perspective (minor), pixelation (minor), and lack of detail for the edges (moderate).

...But this isn't the true die size (despite being the cut chip size) as far as the calculators are concerned....

Each die is surrounded by the cut edge, so each edge potentially has 0.05~0.15mm of extra material the die calculator removes since it's as good a way as any (the fact that some of the material becomes part of the cut chip doesn't concern the calculator - it knows that edge can't be used for anything) ... something that is usually immaterial for these calculations, but these things are pretty small, so it suddenly matters. That's 0.1~0.3mm extra width and height that should be subtracted before placing into the calculator (or you can set the scribe size to zero, I suppose).

That gives a chiplet die size (as far as the calculator is concerned) of 6.71~7.08mm for width and 9.45~10.07 for height. Which is 63.4 ~ 71.3mm^2, which pretty much everyone rounds to 64~72mm^2 since there's so much room for error.
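The arithmetic above can be reproduced with a short sketch. All inputs are the figures from this post; the variable and function names are just mine for illustration.

```python
# Package edge lengths in mm, as quoted above for the 4094-pin package.
PKG_LONG_MM, PKG_SHORT_MM = 75.4, 58.5

# Eyeballed range of how many chiplets fit along each package edge.
FIT_ALONG_LONG = (10.5, 10.75)   # gives the chiplet width
FIT_ALONG_SHORT = (5.75, 6.0)    # gives the chiplet height

# Cut-edge material per dimension (both sides combined), in mm.
EDGE_TOTAL_MM = (0.1, 0.3)


def dim_range(package_edge_mm, fit_range):
    """Return (smallest, largest) die dimension after removing the cut edge."""
    smallest = package_edge_mm / max(fit_range) - max(EDGE_TOTAL_MM)
    largest = package_edge_mm / min(fit_range) - min(EDGE_TOTAL_MM)
    return smallest, largest


w_lo, w_hi = dim_range(PKG_LONG_MM, FIT_ALONG_LONG)
h_lo, h_hi = dim_range(PKG_SHORT_MM, FIT_ALONG_SHORT)
area_lo, area_hi = w_lo * h_lo, w_hi * h_hi

print(f"width  {w_lo:.2f} ~ {w_hi:.2f} mm")
print(f"height {h_lo:.2f} ~ {h_hi:.2f} mm")
print(f"area   {area_lo:.1f} ~ {area_hi:.1f} mm^2")
```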

At the smallest size, with a defect density of 0.3/cm^2 (more on that later), there are 772 perfect dies and 931 total candidates per wafer, with 82.9% yield.
At the largest size, with the same defect density rate, there are 669 perfect dies and 825 total candidates per wafer, with 81.1% yield.
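For anyone wanting to sanity-check those percentages without the calculator: the post doesn't say which yield model the calculator uses, so the sketch below assumes Murphy's model, which happens to land within a fraction of a percent of the quoted figures. Treat the model choice as my assumption, not a statement about the calculator.

```python
import math


def murphy_yield(die_area_mm2, defects_per_cm2):
    """Murphy's yield model: Y = ((1 - exp(-A*D)) / (A*D)) ** 2."""
    ad = (die_area_mm2 / 100.0) * defects_per_cm2  # convert mm^2 to cm^2
    if ad == 0:
        return 1.0  # no area or no defects means every die is good
    return ((1.0 - math.exp(-ad)) / ad) ** 2


# (die area in mm^2, perfect dies, total candidates) as quoted above
CASES = ((63.4, 772, 931), (71.3, 669, 825))

for area, perfect, total in CASES:
    model = murphy_yield(area, 0.3)
    quoted = perfect / total
    print(f"{area} mm^2: model {model:.1%}, quoted {quoted:.1%}")
```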

Since there are 8 cores per chiplet, likely 16MiB of L3 taking up a good chunk of the die space, and so on, AMD will likely be able to use 95%+ of all chiplets made. If half the cores or L3 is damaged, they can likely still salvage the die. AMD achieved nearly perfect effective yields with 14nm right from the start because of their harvesting - I wouldn't expect them to change when moving to a much more expensive process... especially when pretty much betting the company's future on its success.

At 95% effective yield, the range is 783~884 chiplets per wafer. A very minor adjustment to my original estimated range of 800~900.
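That range is just the candidate counts multiplied by the assumed harvest rate. A tiny sketch (the 95% is my estimate from above, not a measured number):

```python
HARVEST_RATE = 0.95  # assumed fraction of candidates that are sellable after salvage


def usable_chiplets(candidates_per_wafer, harvest_rate=HARVEST_RATE):
    """Whole chiplets per wafer that can be used in some product."""
    return int(candidates_per_wafer * harvest_rate)


for candidates in (825, 931):  # largest and smallest die size cases
    print(candidates, "candidates ->", usable_chiplets(candidates), "usable")
```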

--

Regarding defect density. 14nm LPE, this early in its life, had a defect density of less than 0.2/cm^2. By the time production began, it was under 0.1/cm^2. It was 0.08/cm^2 on Ryzen's launch and is now believed to be slightly lower. No reason why TSMC can't manage the same with their 7nm processes.

A process will never make it to production with less than a 60% yield... unless they have some very high margin products for it... IBM can charge so much that they can throw away 60% of a wafer. AMD can't do that - they need 80%+ yields given the high price of 7nm. And that includes Vega 20's yields - which is a much larger die on the same process, which gives us a hint about how low TSMC's 7nm defect rate probably is (0.2 or under would not be surprising - 0.1 would be exceptional at this point). AMD's confidence in the process is telling.

bug said:
Oh, gee, that's so simple to explain. Try saying that to the average buyer, see how it fares

AMD calls it cTDP.

You usually have 35W, 45W, 65W, and Unlimited options depending on the process and motherboard. Most boards are now not restricting the 2400G, for example, so it pretty much always runs at 3.9GHz with full graphics power available... and pulls 100W (a good amount for a 65W APU). AMD has specified that the boards should implement the power limiting, but few do because AMD doesn't really do a good job balancing between the power draw of the GPU and CCX, causing issues in games.

efikkan · Nov 13, 2018

looncraz said:
Together, this CPU was being hammered during testing by two workloads that do quite well with instruction level parallelism (ILP) - the magic behind IPC with x86.

We can't read anything more from these results other than Zen 2 is ~30% faster when doing mixed integer and floating point workloads.

Doing one benchmark which is highly superscalar doesn't give us IPC, that gives us a best case performance for one type of workload.
Not to mention that the clock speeds are unknown. They have to be completely fixed to benchmark anything close to IPC.
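To illustrate why fixed clocks matter, here is a minimal sketch with made-up numbers (the instruction count and clock speeds are hypothetical, not from AMD's test). A 30% shorter runtime looks identical whether it comes from IPC or from frequency.

```python
def runtime_seconds(instructions, ipc, clock_hz):
    """Runtime = instructions / (instructions per cycle * cycles per second)."""
    return instructions / (ipc * clock_hz)


INSTRUCTIONS = 1e10   # hypothetical workload size
BASE_IPC = 1.0        # hypothetical baseline IPC
BASE_CLOCK = 3.6e9    # hypothetical baseline clock in Hz

baseline = runtime_seconds(INSTRUCTIONS, BASE_IPC, BASE_CLOCK)
ipc_gain = runtime_seconds(INSTRUCTIONS, BASE_IPC * 1.3, BASE_CLOCK)
clock_gain = runtime_seconds(INSTRUCTIONS, BASE_IPC, BASE_CLOCK * 1.3)

# Both speedups come out the same, so runtime alone cannot tell them apart.
print(f"speedup from IPC:   {baseline / ipc_gain:.2f}x")
print(f"speedup from clock: {baseline / clock_gain:.2f}x")
```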

looncraz said:
However, that particular scenario is actually very common. For games, specifically, we should see a large jump - mixed integer, branch, and floating point work loads with significant cross communication is exactly what the cores see in heavy gaming loads - Intel has won here because they have a unified scheduler, making it easier to get FPU results back to dependent instructions which will execute on an ALU (which might even be the same port on Intel...), it looks like AMD has aimed for superiority on this front.

Games are in fact one of the workloads that are the least superscalar and have the most branch mispredictions and cache misses. The reason why AMD scales well for certain superscalar workloads (like Blender and certain encoding and compression tasks) is that Zen has more "brute force" through ALUs/FPUs on execution ports, but falls short in gaming due to a weaker front-end/prefetcher. Intel has fewer execution ports but achieves higher efficiency through better prediction and caching.

bug · Nov 13, 2018

looncraz said:
AMD calls it cTDP.

You usually have 35W, 45W, 65W, and Unlimited options depending on the process and motherboard. Most boards are now not restricting the 2400G, for example, so it pretty much always runs at 3.9GHz with full graphics power available... and pulls 100W (a good amount for a 65W APU). AMD has specified that the boards should implement the power limiting, but few do because AMD doesn't really do a good job balancing between the power draw of the GPU and CCX, causing issues in games.

Yes, I've mentioned that briefly before. The thing is, any motherboard that can run the chip at full TDP can be restricted to cTDP. But that doesn't work the other way around. I don't expect all (and especially the cheaper) motherboards that can run CPUs at 95W to be able to also push in excess of 150W even momentarily. Imagine the outrage that would stem from Intel printing 150W on the CPU box when users find out they can't hit that with every motherboard. The alternative would be to mandate 150W+ support on all motherboards and drive prices up for everyone.

Bottom line, I just don't see a problem here. Clearly everyone into tech can figure out how to run these CPUs. And those who can't probably aren't spending the money on these. All I see here is a (typical by now) "omg! Intel did X, they're screwing end users!!!" reaction. When CPUs get this complex, specs get this complex too, that's all there is to it.

londiste · Nov 13, 2018

What makes this hard for Intel and confusing for us is AVX (well, technically AVX2 and all the 256-bit stuff). That is about 40% more power on top of what the CPU uses without it.
From what I can see from reviews, without AVX the 9900K actually does consume close enough to the rated 95W.

That is also something we can look forward to analyzing in Zen2.

looncraz said:
AMD calls it cTDP.

You usually have 35W, 45W, 65W, and Unlimited options depending on the process and motherboard. Most boards are now not restricting the 2400G, for example, so it pretty much always runs at 3.9GHz with full graphics power available... and pulls 100W (a good amount for a 65W APU). AMD has specified that the boards should implement the power limiting, but few do because AMD doesn't really do a good job balancing between the power draw of the GPU and CCX, causing issues in games.

You would think it is that simple and straightforward especially with the perfect power management (at least on the package level) on Ryzens. I have a 2400G on that same Gigabyte board that most reviewers got. It used to have cTDP option in the BIOS but that went missing after a BIOS update. Similarly, the damn board actually had an MCE-like setting that indeed used to run my poor 2400G at 95-100W.

looncraz · Nov 13, 2018

efikkan said:
Doing one benchmark which is highly superscalar doesn't give us IPC, that gives us a best case performance for one type of workload.
Not to mention that the clock speeds are unknown. They have to be completely fixed to benchmark anything close to IPC.

AMD gave actual IPC values (they have perfcounters, so they know exactly the IPC). In a single workload, it's tough to get high IPC without SIMD. Still, these workloads were designed to exploit integer to FPU communication and the ability to keep the units fed.

efikkan said:
Games are in fact one of the workloads that are the least superscalar and have the most branch mispredictions and cache misses. The reason why AMD scales well for certain superscalar workloads (like Blender and certain encoding and compression tasks) is that Zen has more "brute force" through ALUs/FPUs on execution ports, but falls short in gaming due to a weaker front-end/prefetcher. Intel has fewer execution ports but achieves higher efficiency through better prediction and caching.

Superscalar isn't what I was talking about regarding games (my comment probably did make that a bit confusing, though) - it's the ability of the FPU to get results back to the ALUs that games need most. AMD showed they can do that at least 29% better. In all likelihood, it's probably exactly 33% higher and the 29% is a result of an ALU bottleneck (adds, subtractions, movs, etc. have likely not improved - but you can't really get those to be any better, anyway).

Ryzen uses the FPU as a coprocessor, so there's a good chunk of delay from when instructions are decoded and when they are executed... and dependent integer or memory operations are processed in parallel as far as possible. This isn't strictly superscalar simply because the execution steps become disjointed after classification as floating point, memory, or integer, and the front end generates multiple instruction streams from a single instruction stream.

efikkan · Nov 13, 2018

looncraz said:
AMD gave actual IPC values (they have perfcounters, so they know exactly the IPC).

They gave results from a cherry-picked benchmark; that does not equate to IPC. This is all marketing BS, and we should know better than that. People are usually very (and rightfully) sceptical of Intel's and Nvidia's cherry-picked performance claims; we should have the same standard for AMD.

looncraz said:
it's the ability of the FPU to get results back to the ALUs that games need most. AMD showed they can do that at least 29% better. In all likelihood, it's probably exactly 33% higher and the 29% is a result of an ALU bottleneck (adds, subtractions, movs, etc. have likely not improved - but you can't really get those to be any better, anyway).

Games are in no way bottlenecked by ALU or FPU throughput. Rendering threads in games have very little computational load (as a percentage of total instructions), and are mainly bottlenecked by cache misses. To even come close to saturating ALUs and FPUs, like the benchmark referred to in this thread, you need the majority of the code to be a tight loop of nearly pure math operations. On modern ~4 GHz CPUs the cost of a cache miss is ~250 cycles or more, and with all the execution ports you should be able to imagine how much potential is actually wasted by a single cache miss, or how many simple ALU/FPU operations one cache miss is "worth". Game rendering code is usually a huge amount of function calls and a small amount of calculations (relatively speaking). Most of these function calls will cause at least one cache miss, which is why this type of code usually leaves the CPU stalled >95% of clock cycles. A small improvement in prediction and/or OoOE can give decent benefits without adding any additional computational resources.
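A back-of-the-envelope sketch of what one miss is "worth". The ~250 cycle penalty and the 95% stall figure are from this post; the issue width of 4 operations per cycle is only an illustrative assumption, not a spec for any particular core.

```python
MISS_PENALTY_CYCLES = 250   # from the post: cost of one cache miss at ~4 GHz
STALL_FRACTION = 0.95       # from the post: share of cycles spent stalled
ASSUMED_ISSUE_WIDTH = 4     # assumption: simple ops the core could retire per cycle


def ops_lost_per_miss(penalty_cycles, issue_width):
    """Simple ALU/FPU operations that could have executed during one miss."""
    return penalty_cycles * issue_width


def effective_ipc(peak_ipc, stall_fraction):
    """Average IPC if the core only does useful work outside of stalls."""
    return peak_ipc * (1.0 - stall_fraction)


print("ops forgone per miss:", ops_lost_per_miss(MISS_PENALTY_CYCLES, ASSUMED_ISSUE_WIDTH))
print(f"effective IPC: {effective_ipc(ASSUMED_ISSUE_WIDTH, STALL_FRACTION):.2f}")
```

With numbers like these, adding execution units barely moves the result, while shaving the stall fraction does.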

Valantar · Nov 13, 2018

bug said:
Oh, gee, that's so simple to explain. Try saying that to the average buyer, see how it fares

"This number," *points at label showing 95W/3.6GHz* "is baseline performance, what you get with a cheap mobo and cooler. This other number," *points to where it says 180W/4.7GHz* "is what you can reach if you invest in better cooling and a more solidly built motherboard. It's around 30% faster on paper, though YMMV."

That wasn't so hard, now was it?

R0H1T · Nov 13, 2018

efikkan said:
They gave results from a cherry-picked benchmark; that does not equate to IPC. This is all marketing BS, and we should know better than that. People are usually very (and rightfully) sceptical of Intel's and Nvidia's cherry-picked performance claims; we should have the same standard for AMD.

Games are in no way bottlenecked by ALU or FPU throughput. Rendering threads in games have very little computational load (as a percentage of total instructions), and are mainly bottlenecked by cache misses. To even come close to saturating ALUs and FPUs, like the benchmark referred to in this thread, you need the majority of the code to be a tight loop of nearly pure math operations. On modern ~4 GHz CPUs the cost of a cache miss is ~250 cycles or more, and with all the execution ports you should be able to imagine how much potential is actually wasted by a single cache miss, or how many simple ALU/FPU operations one cache miss is "worth". Game rendering code is usually a huge amount of function calls and a small amount of calculations (relatively speaking). Most of these function calls will cause at least one cache miss, which is why this type of code usually leaves the CPU stalled >95% of clock cycles. A small improvement in prediction and/or OoOE can give decent benefits without adding any additional computational resources.

That's because it's nearly impossible to replicate test conditions from one test bench to another, one chip to another. AMD can't lie to their investors, their claims are true even if cherry picked. Yes ~ let's assume best case, or best of 3 (runs), but what you seem to be saying here is that Intel/Nvidia (AMD?) lied in the past, so this is also a lie.

looncraz · Nov 13, 2018

efikkan said:
They gave results from a cherry-picked benchmark; that does not equate to IPC. This is all marketing BS, and we should know better than that. People are usually very (and rightfully) sceptical of Intel's and Nvidia's cherry-picked performance claims; we should have the same standard for AMD.

No doubt it was cherry picked - there's a reason those tests specifically leveraged the hardware AMD mentioned as being improved. This is the peak improvement (outside of 256 bit floating point, which should roughly double in performance) we should expect. It's the upper bound... but it's still a valid test because we can glean several things from it.

dkern can be extremely useful for testing the front end... but you need to have a rather large vector upon which to operate for that to happen... and we just don't know what AMD was doing with dkern since it's just a function that operates on data. However, dkern has branches, does integer or floating point comparisons, decrements, subtraction, division, and operates on potentially large amounts of data (usually image or scientific data, being a statistical smoothing method).

RSA in this situation could be used to decrypt or encrypt the data being accessed or could be an entirely different program being used... or AMD ran two benchmarks and averaged the results... they were very unclear.

RSA has a few tight loops, multiplication, division, comparisons and branches within loops, and potentially significant bandwidth utilization (simply jumping to L2 counts as significant in this context).

efikkan said:
Games are in no way bottlenecked by ALU or FPU throughput. Rendering threads in games have very little computational load (as a percentage of total instructions), and are mainly bottlenecked by cache misses. To even come close to saturating ALUs and FPUs, like the benchmark referred to in this thread, you need the majority of the code to be a tight loop of nearly pure math operations. On modern ~4 GHz CPUs the cost of a cache miss is ~250 cycles or more, and with all the execution ports you should be able to imagine how much potential is actually wasted by a single cache miss, or how many simple ALU/FPU operations one cache miss is "worth". Game rendering code is usually a huge amount of function calls and a small amount of calculations (relatively speaking). Most of these function calls will cause at least one cache miss, which is why this type of code usually leaves the CPU stalled >95% of clock cycles. A small improvement in prediction and/or OoOE can give decent benefits without adding any additional computational resources.

They are usually bottlenecked by ALU->FPU communication or, as you say, cache miss penalties. Ryzen's sequential cache performance is very good, but it falls flat with random accesses, so that definitely plays a significant role - and is hopefully something AMD has resolved with Zen 2. A 33% improvement in ALU->FPU throughput means ~10% improvement for many CPU bottlenecked games per cycle. That puts them roughly on par with Intel for those games. Others that are thrashing the cache (which would be a bad game engine - of which there are plenty (I'm looking at you Hitman!)) won't care at all about that improvement (or very little... or even "dislike" it). Here, of course, Zen 2 will need to have reduced semi-random access latencies. It doesn't help that each core advertises access to 8MiB of L3 but only seems to search 4MiB worth of tags before jumping to the IMC. An L4 would help here - we wouldn't be hitting memory latencies for in-page random access, at the very least, but would be much closer to Intel's ~20ns figures.
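One way to read the "33% there means ~10% overall" step is through Amdahl's law. This is only a sketch of that reasoning; the fraction of frame time bound by ALU->FPU traffic is not something I measured, it is simply what falls out of the two percentages.

```python
def amdahl_speedup(bound_fraction, local_speedup):
    """Overall speedup when only bound_fraction of the time gets local_speedup."""
    return 1.0 / ((1.0 - bound_fraction) + bound_fraction / local_speedup)


def implied_fraction(overall_speedup, local_speedup):
    """Invert Amdahl's law: what share of time must be affected?"""
    return (1.0 - 1.0 / overall_speedup) / (1.0 - 1.0 / local_speedup)


fraction = implied_fraction(overall_speedup=1.10, local_speedup=1.33)
print(f"implied share of time bound by ALU->FPU traffic: {fraction:.0%}")
print(f"check: {amdahl_speedup(fraction, 1.33):.2f}x overall")
```

So the ~10% figure corresponds to a bit over a third of per-frame CPU time being limited by that path, and a cache-thrashing engine with a much smaller share would see far less.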

Valantar · Nov 13, 2018

efikkan said:
They gave results from a cherry-picked benchmark; that does not equate to IPC. This is all marketing BS, and we should know better than that. People are usually very (and rightfully) sceptical of Intel's and Nvidia's cherry-picked performance claims; we should have the same standard for AMD.

Calling something solely mentioned in an endnote and not used as a marketing point whatsoever "marketing BS" is... a stretch. Yes, we should hold AMD to the same standard as Nvidia and Intel, but this isn't even close to the crap they've pulled earlier (or AMD, for that matter). This is "presenting" (well, not really, more like "not quite omitting") very specific numbers from a very specific benchmark in a very specific context, and not making a fuss about it. AMD hasn't at any point said "Zen2 has 29% improved IPC from Zen." If they did, that would be marketing BS. And, again:

Valantar said:
We need to remember that IPC is workload dependent. There might be a 29% increase in IPC in certain workloads, but generally, when we talk about IPC it is average IPC across a wide selection of workloads. This also applies when running test suites like SPEC or GeekBench, as they run a wide variety of tests stressing various parts of the core. What AMD has "presented" (it was in a footnote, it's not like they're using this for marketing) is from two specific workloads. This means that a) this can very likely be true, particularly if the workloads are FP-heavy, and b) this is very likely not representative of total average IPC across most end-user-relevant test suites. In other words, this can be both true (in the specific scenarios in question) and misleading (if read as "average IPC over a broad range of workloads").

Given that AMD hasn't pushed this as a selling point, they haven't done anything even remotely wrong. They can't be blamed for inept journalists and/or fanboys taking specific statements out of context.

efikkan · Nov 13, 2018

R0H1T said:
That's because it's nearly impossible to replicate test conditions from one test bench to another, one chip to another. AMD can't lie to their investors, their claims are true even if cherry picked. Yes ~ let's assume best case, or best of 3 (runs), but what you seem to be saying here is that Intel/Nvidia (AMD?) lied in the past, so this is also a lie.

You know very well what I'm talking about; all the vendors choose benchmarks which favor them at any given time and put their product in the best possible position.

looncraz said:
They are usually bottlenecked by ALU->FPU communication or, as you say, cache miss penalties.

ALU->FPU communication? What do you mean by that? Conversion of ints to floats?

looncraz said:
Ryzen's sequential cache performance is very good, but it falls flat with random accesses, so that definitely plays a significant role - and is hopefully something AMD has resolved with Zen 2.

You do know how cache works, right? In sequential reads cache should be "transparent".
Cache is organized in banks. Zen's 8-way 512kB L2 is actually 8 separate 64kB caches (Skylake has 4-way 256kB, Haswell 8-way 256kB). Memory is stored in 64B cache lines; for sequential reads the cache lines will be evenly spread across the banks. Zen having 8×64kB L2 caches vs. Skylake's 4×64kB caches should not give Zen any disadvantage in latency or throughput.
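A small sketch of the textbook set-associative mapping, reading the "separate 64kB caches" as the ways of the cache (that reading is an assumption on my part). Real hardware may hash address bits when picking a set, so take this as the simplified model only.

```python
LINE_BYTES = 64  # cache line size


def num_sets(cache_bytes, ways, line_bytes=LINE_BYTES):
    """Number of sets: total size divided by (ways * line size)."""
    return cache_bytes // (ways * line_bytes)


def set_index(address, sets, line_bytes=LINE_BYTES):
    """Set an address maps to in the simple modulo-indexed model."""
    return (address // line_bytes) % sets


zen_l2_sets = num_sets(512 * 1024, ways=8)
skylake_l2_sets = num_sets(256 * 1024, ways=4)
print("Zen L2 sets:", zen_l2_sets, "| Skylake L2 sets:", skylake_l2_sets)

# A sequential read walks through consecutive sets, one per cache line.
sequential = [set_index(addr, zen_l2_sets) for addr in range(0, 4 * LINE_BYTES, LINE_BYTES)]
print("sets touched by four consecutive lines:", sequential)
```

In this model both L2 designs end up with the same number of sets, so a linear walk spreads across them the same way; the extra ways only matter once lines start competing for a set.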

Intel's advantage isn't a faster cache, it's a better front-end/prefetcher to detect linear accesses which improves cache hit ratio.

What do random accesses have to do with this? Nothing can ever predict random accesses, they will fall through the cache and read directly from memory. The only thing that can marginally help here is the OoOE trying to dereference a pointer etc. as early as possible, but there are usually limits to how far ahead the prefetcher can "see", and of course all branching logic and other pointers may limit the room for early execution here. Once again this has to do with the efficiency of the prediction, not the latency of the cache.

looncraz said:
It doesn't help that each core advertises access to 8MiB of L3 but only seems to search 4MiB worth of tags before jumping to the IMC. An L4 would help here - we wouldn't be hitting memory latencies for in-page random access, at the very least, but would be much closer to Intel's ~20ns figures.

You mean that every four cores share one L3 cache?
L3 cache is largely a "spillover cache", cache lines which have been recently used but kicked out of L2. Even in heavy multithreaded workloads, very little L3 is ever shared among cores. And when it is, it's mostly code, not data. When it comes to writing, the CPU engages a write-lock to discard a cache line from all caches; any latency here comes down to the entire memory structure, not the L3.
I also want to remind you that Intel switched their memory structure in Skylake-X/-SP vs. Skylake, making L2 larger and L3 smaller, but making L3 exclusive, and they improved overall efficiency.
I don't see any evidence to support that Zen is disadvantaged from having 4MB L3 per core. If they add an L4, it will basically be a larger "spillover cache", and they would also have to be careful not to increase overall latency by the added complexity.

looncraz · Nov 14, 2018

efikkan said:
ALU->FPU communication? What do you mean by that? Conversion of ints to floats?

Bandwidth/latency between the integer and floating point PRFs, muxes, L1D, DTLB, load buffer, etc...

It can be a little easy to forget that Zen's FPU is a dedicated unit that has to have specific points of communication with the integer+memory complex whereas Intel's floating point units are on the same pipelines as their integer units.

efikkan said:
In sequential reads cache should be "transparent".

Ideally, yes, you should never have a stall with streaming data... but there's a difference between operating with data right off a data bus, within the register file, hitting the 1ns latency of L1D, or hitting the 3~4ns latency of the L2. On Zen, at least, there's no real bandwidth penalty for hitting the L2.

efikkan said:
Zen having 8×64kB L2 caches vs. Skylake's 4×64kB caches should not give Zen any disadvantage in latency or throughput.

That's not the issue... the L2 is really good... it's when we get inside the L3 that issues begin... and they explode once we hit the IMC.

efikkan said:
Intel's advantage isn't a faster cache, it's a better front-end/prefetcher to detect linear accesses which improves cache hit ratio.

Most of Intel's front end isn't necessarily better than Zen's (just different). Remember that the 6900k has a 20MiB L3. Intel's main advantage is a tightly coupled low latency IMC... AMD's game is more than on point until it hits the IMC (see above graphic), which happens at any access above 8MiB...

Games, for their part, often do a pretty good job of imitating a random memory access pattern... which is why Ryzen's game performance can jump 15% or more with overclocked memory. Give Zen the same memory latency as Intel cores have and I think the Ryzen 2700X would be the gaming king per clock.

efikkan said:
You mean that every four cores share one L3 cache?
L3 cache is largely a "spillover cache", cache lines which have been recently used but kicked out of L2. Even in heavy multithreaded workloads, very little L3 is ever shared among cores.

Data sharing is insanely common in multi-threaded programs. If it weren't, we wouldn't have to worry so much about lock contention. I have a good ~20 years of MT programming experience (from way back in the BeOS 4.5 era) with numerous programming languages, operating systems, and devices. I directly tested Zen's inter-core and inter-CCX communications to discover that there was a fast-path (low latency, low bandwidth) communication path for small data packets before AMD detailed the command fabric. I discovered it by accident because I couldn't explain how I was getting data from core 0 to core 7 (on different CCXes) with only something like a 20ns penalty versus going from core 0 to core 1 (same CCX, neighboring cores).... but I digress...
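For anyone who hasn't run into it, this is the kind of pattern I mean: several threads updating one shared value behind a lock. It's a generic sketch (the function name and counts are made up), not the test I ran on Zen.

```python
import threading


def shared_counter_total(num_threads=4, increments=10_000):
    """Have several threads bump one shared counter, guarded by a lock."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments):
            # Every thread contends for the same lock and the same data,
            # so the shared state keeps moving between cores.
            with lock:
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter


print(shared_counter_total())
```

The lock keeps the total correct, and the time threads spend waiting on it is exactly the contention being talked about.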

efikkan said:
I also want to remind you that Intel switched their memory structure in Skylake-X/-SP vs. Skylake, making L2 larger and L3 smaller, but making L3 exclusive, and they improved overall efficiency.

Yes, they copied Zen (kind of a joke...). Though AMD uses a 'mostly' exclusive design - though we don't fully know, AFAIK, what they mean by that.

efikkan said:
I don't see any evidence to support that Zen is disadvantaged from having 4MB L3 per core. If they add an L4, it will basically be a larger "spillover cache", and they would also have to be careful not to increase overall latency by the added complexity.

It's not the 4MiB per core - it's what happens in many scenarios when a core tries to access beyond that basic block...

londiste · Nov 14, 2018

looncraz said:
Games, for their part, often do a pretty good job of imitating a random memory access pattern... which is why Ryzen's game performance can jump 15% or more with overclocked memory. Give Zen the same memory latency as Intel cores have and I think the Ryzen 2700X would be the gaming king per clock.

I would argue Ryzen gets a jump in game performance due to increased inter-die communication speed, not the lower memory latency.
On Intel CPUs there are games where faster memory helps, but this is far from common and in many cases memory speed makes a negligible difference. The same does not apply to Ryzens; these will get a jump from faster memory across the board.

HTC · Nov 14, 2018

londiste said:
I would argue Ryzen gets a jump in game performance due to increased inter-die communication speed, not the lower memory latency.
On Intel CPUs there are games where faster memory helps, but this is far from common and in many cases memory speed makes a negligible difference. The same does not apply to Ryzens; these will get a jump from faster memory across the board.

More likely, a combination of both.

R0H1T · Nov 14, 2018

efikkan said:
You know very well what I'm talking about; all the vendors choose benchmarks which favor them at any given time and put their product in the best possible position.

And that's fine so long as the benchmarks aren't outright fudged or in some cases the competition's system deliberately gimped. Best case implies just that & should always be taken with a grain of salt.

What you're forgetting though is that without these "best case" benches there would be little to no IPC gain, for Intel, in the last few years. Take the FP numbers out, heavily influenced by AVX2 or AVX512, & you have virtually 0 IPC gains for close to 4 years, if not more. That's because x86 has pretty much reached the end of the line so far as IPC gains are concerned. The biggest performance gains this decade have come from tweaking cache hierarchy, DDR4, AVX & clock speeds. That's not the case for ARM but it's also not a part of this debate.

londiste · Nov 14, 2018

ARM is not magic. The gains ARM has made have already been there for a long time in x86 and other architectures (VFP, NEON, SVE, 64-bit, multiple cores, out-of-order, prediction). In addition to that, due to it being small(er), simple(r) and cheap(er), ARM has had a process node advantage for a couple of generations now. As ARM improves, it will start running into similar problems as other architectures.

WikiFM · Nov 14, 2018

Mark Little said:
Anandtech does a great job with their Bench tool on their website. It helps with conversations like this one. Here are the Full package load power measurements for the last four Intel generations:
i7-6700K 82.55W
i7-7700K 95.14W
i7-8700K 150.91W
i9-9900K 168.48W
There is a major change between the 7th and 8th generations. However, Intel rates them all as 95W. You don't see a problem with this?
Source: https://www.anandtech.com/bench/CPU-2019/2194

EDIT: And if you look at all the CPUs at Full package load at that link, you will see almost all fall below or within +10% of the rated TDP across desktop, HEDT and server chips from both AMD and Intel. Only the 8700K and the 9900K are way off. This is deceptive advertising at its worst to try and look competitive and cover up being stuck on the same process node.
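
To put numbers on "way off", a small Python sketch can compare the four measurements quoted above against the 95 W rating cited in the post. The helper name `percent_over_tdp` and the +10% threshold check are assumptions for illustration. Intel's listed TDP for the 6700K and 7700K is 91 W rather than 95 W, so pass `rated=91.0` for those two if you want their own rating.

```python
RATED_TDP_W = 95.0  # the rating cited in the post, used as the common reference

# Full package load figures as quoted from the Anandtech Bench link above.
measured_w = {
    "i7-6700K": 82.55,
    "i7-7700K": 95.14,
    "i7-8700K": 150.91,
    "i9-9900K": 168.48,
}


def percent_over_tdp(measured, rated=RATED_TDP_W):
    """Signed percentage by which measured power exceeds the rated TDP."""
    return (measured / rated - 1.0) * 100.0


for cpu, watts in measured_w.items():
    delta = percent_over_tdp(watts)
    verdict = "within +10%" if delta <= 10.0 else "way off"
    print(f"{cpu}: {watts:.2f} W, {delta:+.1f}% vs {RATED_TDP_W:.0f} W ({verdict})")
```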

In that same chart, the 8086K, which is a tiny bit faster than the 8700K, consumes 100W. Why is that?

Valantar said:
The 8121U has its AVX512 units (as well as pretty much everything else) disabled.

None of us will be buying EPYC. That's why we're taking what they've said about it and attempting to extrapolate what this means for Ryzen 3000 and TR3. Also, it's interesting to discuss when someone makes some actual innovations in this space, even if we're not in the target market.

I'd actually consider the same, even if I'm very happy with my 1600X.

The 8121U's AVX512 units are not disabled: https://ark.intel.com/products/136863/Intel-Core-i3-8121U-Processor-4M-Cache-up-to-3_20-GHz. And even if they were, the chip still has them.

R0H1T · Nov 14, 2018

londiste said:
ARM is not magic. The gains ARM has made have already been there for a long time in x86 and other architectures (VFP, NEON, SVE, 64-bit, multiple cores, out-of-order, prediction). In addition to that, due to it being small(er), simple(r) and cheap(er), ARM has had a process node advantage for a couple of generations now. As ARM improves, it will start running into similar problems as other architectures.

Alright & where did I say that? The reason ARM is doing better than x86 atm is because x86 has been tweaked & improved over 4 decades, the gains decade on decade have been massive. ARM is only realizing these (massive) gains since 2k, so they may still have a decade or half to hit the wall which x86 seems to be running into right now. They could, of course hit the physics wall first.

Why do I see this dismissive tone when talking about ARM here, do you buy phones with Intel Inside? Why do you think that is, do you treat (one of) AMD vs Nvidia the same way?

londiste · Nov 14, 2018

R0H1T said:
Why do I see this dismissive tone when talking about ARM here, do you buy phones with Intel Inside?

I apologize, this wasn't so much about your post but the apparent general impression that ARM is something completely different and has a long way to go. It doesn't.

The other part is why are you dismissive about x86? There have been some tests on single core performance here and there from Sandy Bridge forward (7 years). For example:
https://m.sweclockers.com/test/23426-amd-ryzen-7-1800x-och-7-1700x/29
23% in Cinebench is not too bad. And Cinebench does only 128-bit AVX (which is from Sandy Bridge/Bulldozer era).

Intel has been stuck on Skylake and derivatives for 3 years but I would not necessarily put this down to inability to improve IPC. Core count is the clear focus for the last few years.
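
For scale, the 23% over 7 years mentioned above can be turned into a compound yearly rate. This is only a sketch: the helper name `annualized_gain` is made up here, and it assumes the gain accumulated evenly year over year, which real CPU generations do not do.

```python
def annualized_gain(total_gain, years):
    """Compound yearly rate that reaches total_gain after the given number of years."""
    return (1.0 + total_gain) ** (1.0 / years) - 1.0


# 23% single-thread Cinebench gain, Sandy Bridge forward (7 years), per the post.
rate = annualized_gain(0.23, 7)
print(f"{rate * 100:.1f}% per year")  # roughly 3% per year
```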

Valantar · Nov 14, 2018

WikiFM said:
The 8121U's AVX512 units are not disabled: https://ark.intel.com/products/136863/Intel-Core-i3-8121U-Processor-4M-Cache-up-to-3_20-GHz. And even if they were, the chip still has them.

I stand corrected. I guess something on that chip had to avoid getting cut, lord knows everything else is.

R0H1T · Nov 14, 2018

londiste said:
I apologize, this wasn't so much about your post but the apparent general impression that ARM is something completely different and has a long way to go. It doesn't.

The other part is why are you dismissive about x86? There have been some tests on single core performance here and there from Sandy Bridge forward (7 years). For example:
https://m.sweclockers.com/test/23426-amd-ryzen-7-1800x-och-7-1700x/29
23% in Cinebench is not too bad. And Cinebench does not do AVX.

Intel has been stuck on Skylake and derivatives for 3 years but I would not necessarily put this down to inability to improve IPC. Core count is the clear focus for the last few years.

So far as raw performance is concerned, no, x86 is still the team (Intel & AMD) to beat. However, as I've noted, IPC gains outside of AVX-assisted (FP) workloads have been hard to come by. Can you deny that?

I've also noted that the biggest changes have been in cache, memory, clock speeds & arguably HT or SMT for AMD. Admittedly ARM also benefits from that, but again the point is ARM are coming from a much smaller base (number) & so their gains are incredible. The biggest servers and supercomputers will still be overwhelmingly x86 based, but as you've said, that's down to more cores. IMO (chip) interconnect technologies like UPI or IF are the next & perhaps the last hurdle before x86 reaches its peak. I don't see the same kind of progress in the next decade as we've seen since 2010, unless there's some major breakthrough. The future is dedicated (hardware) accelerators; that is where I see the computing realm headed. The core wars have just begun, but even there physics will catch up pretty soon.

bucketface · Nov 14, 2018

This sounds very impressive, 29% IPC for integer workloads... but that is one specific workload type; this is not a general use scenario with 29% improvement, so don't get too hyped. And for those trying to call this out, well, it's pretty honest in its information, but only if your workload is integer heavy. Overall, hopefully they can get a 10%+ improvement in IPC, with clocks going up as well.
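
One hedged way to see why a 29% best case shrinks in general use is an Amdahl-style estimate. The fractions below are purely hypothetical, not measurements, and `overall_speedup` is a name invented for this sketch. It assumes only the affected share of run time gets the full 29% and the rest gets nothing.

```python
def overall_speedup(fraction_affected, local_speedup=1.29):
    """Amdahl-style whole-program speedup when only part of the run time benefits."""
    return 1.0 / ((1.0 - fraction_affected) + fraction_affected / local_speedup)


# Hypothetical shares of run time spent in code that sees the full 29% gain.
for fraction in (0.25, 0.50, 0.75, 1.00):
    gain = (overall_speedup(fraction) - 1.0) * 100.0
    print(f"{fraction:.0%} of time affected -> {gain:.1f}% overall")
```

Under these assumptions the overall figure only reaches 29% when the whole run is the favourable case.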

londiste · Nov 14, 2018

R0H1T said:
I've also noted that the biggest changes have been in cache, memory, clock speeds & arguably HT or SMT for AMD.

Caches are an integral part of any contemporary CPU.
Memory speeds have increased, yes. Well, latency not so much, but bandwidth for sure. Memory improvement will continue. DDR5 is on its way and, for some implementations, GDDR/HBM with their respective up- and downsides. How much this affects results depends on the benchmark. Cinebench, for one, scales very little with faster memory.
The Cinebench test I linked is at the same clock speeds and single core/thread, so no HT/SMT in play.

R0H1T said:
So far as raw performance is concerned, no, x86 is still the team (Intel & AMD) to beat. However, as I've noted, IPC gains outside of AVX-assisted (FP) workloads have been hard to come by. Can you deny that?

I do not know if I would want to separate FP from general CPU performance. FP has been a part of x86 for a long time: in coprocessors since the beginning, and integrated from the Pentium onwards. Improving parts of the instruction set is part of CPU evolution.

R0H1T said:
Admittedly ARM also benefits from that, but again the point is ARM are coming from a much smaller base (number) & so their gains are incredible.

Comparisons with ARM are difficult. In addition to integrating performance-improving aspects (many of them tried-and-true) ARM has been moving to scale its architecture(s) higher and higher, largely funded and driven by smartphones. Higher-performing ARM CPUs are not small and they do consume considerable amounts of power.

Actually, since you mentioned AVX and other instruction-level improvements being a bit suspect when comparing IPC, ARM has gained a lot of its IPC and almost all of its FP performance from just that. Since ARM's focus is different, they are also going for more cores rather than more performance, especially in the last few years. They also benefit from not having to be compatible

The rest of your post I wholeheartedly agree with.

looncraz · Nov 14, 2018

londiste said:
I would argue Ryzen gets a jump in the game performance due to increased inter-die communication speed, not the lower memory latency.
On Intel CPUs there are games where faster memory benefits but this is far from common and in many cases memory speed makes a negligible difference. The same does not apply to Ryzens, these will get a jump from faster memory across the board.

Ryzen basically double-dips when you have higher memory clocks. Intel gains, Zen gains double.
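
A sketch of the double-dip, under an assumption the thread does not spell out: on first-generation Zen the inter-die fabric is clocked at the memory clock, which is half the DDR transfer rate. The function names are made up for illustration, and the bandwidth figure is the theoretical peak for a standard 64-bit channel, not a measured number.

```python
def fabric_clock_mhz(ddr_rate_mts):
    """Assumed fabric clock: equal to the memory clock, half the DDR transfer rate."""
    return ddr_rate_mts / 2.0


def peak_bandwidth_gbs(ddr_rate_mts, channels=2, bytes_per_transfer=8):
    """Theoretical peak DRAM bandwidth in GB/s for 64-bit (8-byte) channels."""
    return ddr_rate_mts * bytes_per_transfer * channels / 1000.0


for rate in (2133, 2666, 3200):
    print(
        f"DDR4-{rate}: {peak_bandwidth_gbs(rate):.1f} GB/s peak, "
        f"fabric at {fabric_clock_mhz(rate):.0f} MHz"
    )
```

Under that assumption, both columns rise together with faster memory, where a design with a fixed interconnect clock would only gain the first.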

bug · Nov 14, 2018

If it seems too good to be true, it usually is.
AMD have clarified in the meantime that the 29% was based on one particular benchmark. So we're basically back to square one.
