Sunday, November 12th 2017

AMD "Zen 2" IPC 29 Percent Higher than "Zen"

AMD reportedly put out its IPC (instructions per clock) performance guidance for its upcoming "Zen 2" micro-architecture in a version of its Next Horizon investor meeting, and the numbers are staggering. The next-generation CPU architecture provides a massive 29 percent IPC uplift over the original "Zen" architecture. While not developed for the enterprise segment, the stopgap "Zen+" architecture brought about 3-5 percent IPC uplifts over "Zen" on the backs of faster on-die caches and improved Precision Boost algorithms. "Zen 2" is being developed for the 7 nm silicon fabrication process, and on the "Rome" MCM, is part of the 8-core chiplets that aren't subdivided into CCX (8 cores per CCX).

According to Expreview, AMD conducted DKERN + RSA test for integer and floating point units, to arrive at a performance index of 4.53, compared to 3.5 of first-generation Zen, which is a 29.4 percent IPC uplift (loosely interchangeable with single-core performance). "Zen 2" goes a step beyond "Zen+," with its designers turning their attention to critical components that contribute significantly toward IPC - the core's front-end, and the number-crunching machinery, FPU. The front-end of "Zen" and "Zen+" cores are believed to be refinements of previous-generation architectures such as "Excavator." Zen 2 gets a brand-new front-end that's better optimized to distribute and collect workloads between the various on-die components of the core. The number-crunching machinery gets bolstered by 256-bit FPUs, and generally wider execution pipelines and windows. These come together yielding the IPC uplift. "Zen 2" will get its first commercial outing with AMD's 2nd generation EPYC "Rome" 64-core enterprise processors.

Update Nov 14: AMD has issued the following statement regarding these claims.
As we demonstrated at our Next Horizon event last week, our next-generation AMD EPYC server processor based on the new 'Zen 2' core delivers significant performance improvements as a result of both architectural advances and 7nm process technology. Some news media interpreted a 'Zen 2' comment in the press release footnotes to be a specific IPC uplift claim. The data in the footnote represented the performance improvement in a microbenchmark for a specific financial services workload which benefits from both integer and floating point performance improvements and is not intended to quantify the IPC increase a user should expect to see across a wide range of applications. We will provide additional details on 'Zen 2' IPC improvements, and more importantly how the combination of our next-generation architecture and advanced 7nm process technology deliver more performance per socket, when the products launch.
Source: Expreview
Add your own comment

162 Comments on AMD "Zen 2" IPC 29 Percent Higher than "Zen"

#126
efikkan
looncraz said:

Together, this CPU was being hammered during testing by two workloads that do quite well with instruction level parallelism (ILP) - the magic behind IPC with x86.

We can't read anything more from these results other than Zen 2 is ~30% faster when doing mixed integer and floating point workloads.
Doing one benchmark which is highly superscalar doesn't give us IPC, that gives us a best case performance for one type of workload.
Not to mention that the clock speeds are unknown. They have to be completely fixed to benchmark anything close to IPC.

looncraz said:

However, that particular scenario is actually very common. For games, specifically, we should see a large jump - mixed integer, branch, and floating point work loads with significant cross communication is exactly what the cores see in heavy gaming loads - Intel has won here because they have a unified scheduler, making it easier to get FPU results back to dependent instructions which will execute on an ALU (which might even be the same port on Intel...), it looks like AMD has aimed for superiority on this front.
Games are in fact one of the workloads that is the least superscalar and have the most branch and cache mispredictions. The reason why AMD scale well for certain superscalar workloads (like Blender and certain encoding and compression tasks) is that Zen have more "brute force" through ALUs/FPUs on execution ports, but fall short in gaming due a weaker front-end/prefetcher. Intel have few execution ports but achieve higher efficiency through better prediction and caching.
Posted on Reply
#127
bug
looncraz said:
AMD calls it cTDP.

You usually have 35W, 45W, 65W, and Unlimited options depending on the process and motherboard. Most boards are now not restricting the 2400G, for example, so it pretty much always runs at 3.9GHz with full graphics power available... and pulls 100W (a good amount for a 65W APU). AMD has specified that the boards should implement the power limiting, but few do because AMD doesn't really do a good job balancing between the power draw of the GPU and CCX, causing issues in games.
Yes, I've mentioned that briefly before. The thing is, any motherboard that can run the chip at full TDP can be restricted to cTDP. But that doesn't work the other way around. I don't expect all (and especially the cheaper) motherboards that can run CPUs at 95W to be able to also push in excess of 150W even momentarily. Imagine the outrage that would stem from Intel printing 150W on the CPU box when users find out they can't hit that with every motherboard. The alternative would be to mandate 150W+ support on all motherboards and drive prices up for everyone.

Bottom line, I just don't see a problem here. Clearly everyone into tech can figure out how to run these CPUs. And those who can't aren't probably spending the money on these. All I see here is a (typical by now) "omg! Intel did X, they're screwing end users!!!" reaction. When CPUs get this complex, specs get this complex too, that's all there is to it.
Posted on Reply
#128
londiste
What makes this hard for Intel and confusing for us is AVX (well, technically AVX2 and all the 256-bit stuff). That is about 40% power on top of what CPU uses without it.
From what I can see from reviews, without AVX 9900K actually does consume close enough to the rated 95W.

That is also something we can look forward to analyzing in Zen2.

looncraz said:
AMD calls it cTDP.

You usually have 35W, 45W, 65W, and Unlimited options depending on the process and motherboard. Most boards are now not restricting the 2400G, for example, so it pretty much always runs at 3.9GHz with full graphics power available... and pulls 100W (a good amount for a 65W APU). AMD has specified that the boards should implement the power limiting, but few do because AMD doesn't really do a good job balancing between the power draw of the GPU and CCX, causing issues in games.
You would think it is that simple and straightforward especially with the perfect power management (at least on the package level) on Ryzens. I have a 2400G on that same Gigabyte board that most reviewers got. It used to have cTDP option in the BIOS but that went missing after a BIOS update. Similarly, the damn board actually had an MCE-like setting that indeed used to run my poor 2400G at 95-100W.
Posted on Reply
#129
looncraz
efikkan said:
Doing one benchmark which is highly superscalar doesn't give us IPC, that gives us a best case performance for one type of workload.
Not to mention that the clock speeds are unknown. They have to be completely fixed to benchmark anything close to IPC.
AMD gave actual IPC values (they have perfcounters, so they know exactly the IPC). In a single workload, it's tough to get high IPC without SIMD. Still, the type of workloads were designed to exploit integer to FPU communications and the ability to keep them fed.


efikkan said:

Games are in fact one of the workloads that is the least superscalar and have the most branch and cache mispredictions. The reason why AMD scale well for certain superscalar workloads (like Blender and certain encoding and compression tasks) is that Zen have more "brute force" through ALUs/FPUs on execution ports, but fall short in gaming due a weaker front-end/prefetcher. Intel have few execution ports but achieve higher efficiency through better prediction and caching.
Superscalar isn't what I was talking about regarding games (my comment probably did make that a bit confusing, though) - it's the ability of the FPU to get results back to the ALUs that games need most. AMD showed they can do that at least 29% better. In all likelihood, it's probably exactly 33% higher and the 29% is a result of an ALU bottleneck (adds, subtractions, movs, etc. have likely not improved - but you can't really get those to be any better, anyway).

Ryzen uses the FPU as a coprocessor, so there's a good chunk of delay from when instructions are decoded and when they are executed... and dependent integer or memory operations are processed in parallel as far as possible. This isn't strictly superscalar simply because the execution steps become disjointed after classification as floating point, memory, or integer, and the front end generates multiple instruction streams from a single instruction stream.
Posted on Reply
#130
efikkan
looncraz said:
AMD gave actual IPC values (they have perfcounters, so they know exactly the IPC).
They gave results from a cherry-picked benchmark, that does not equivalate IPC. This is all marketing BS, and we should know better than that. People are usually very (and rightfully) sceptical of Intel's and Nvidia's cherry-picked performance claims, we should have the same standard for AMD.

looncraz said:

it's the ability of the FPU to get results back to the ALUs that games need most. AMD showed they can do that at least 29% better. In all likelihood, it's probably exactly 33% higher and the 29% is a result of an ALU bottleneck (adds, subtractions, movs, etc. have likely not improved - but you can't really get those to be any better, anyway).
Games are in no way bottlenecked by ALU or FPU throughput. Rendering threads in games have very little computational load (percentage of total instructions), and is mainly bottlenecked by cache misses. To even come close to saturating ALUs and FPUs, like the benchmark referred to in this thread, you need the majority of the code to be a tight loop of nearly pure math operations. On modern ~4 GHz CPUs the cost of a cache miss is ~250 cycles or more, and with all the execution ports you should be able to imagine how much potential is actually wasted by a single cache miss, or how many simple ALU/FPU operations one cache miss is "worth". Game rendering code is usually a huge amount of function calls and a small amount of calculations (relatively speaking), most of these function calls will cause at least one cache miss, which is why this type of code usually leaves the CPU stalled >95% of clock cycles. A small improvement in prediction and/or OoOE can give decent benefits without adding any additional computational resources.
Posted on Reply
#131
Valantar
bug said:
Oh, gee, that's so simple to explain. Try saying that to the average buyer, see how it fares :D
"This number," *points at label showing 95W/3.6GHz* "is baseline performance, what you get with a cheap mobo and cooler. This other number," *points to where it says 180W/4.7GHz* "is what you can reach if you invest in better cooling and a more solidly built motherboard. It's around 30% faster on paper, though YMMV."

That wasn't so hard, now was it?
Posted on Reply
#132
R0H1T
efikkan said:
They gave results from a cherry-picked benchmark, that does not equivalate IPC. This is all marketing BS, and we should know better than that. People are usually very (and rightfully) sceptical of Intel's and Nvidia's cherry-picked performance claims, we should have the same standard for AMD.


Games are in no way bottlenecked by ALU or FPU throughput. Rendering threads in games have very little computational load (percentage of total instructions), and is mainly bottlenecked by cache misses. To even come close to saturating ALUs and FPUs, like the benchmark referred to in this thread, you need the majority of the code to be a tight loop of nearly pure math operations. On modern ~4 GHz CPUs the cost of a cache miss is ~250 cycles or more, and with all the execution ports you should be able to imagine how much potential is actually wasted by a single cache miss, or how many simple ALU/FPU operations one cache miss is "worth". Game rendering code is usually a huge amount of function calls and a small amount of calculations (relatively speaking), most of these function calls will cause at least one cache miss, which is why this type of code usually leaves the CPU stalled >95% of clock cycles. A small improvement in prediction and/or OoOE can give decent benefits without adding any additional computational resources.
That's because it's nearly impossible to replicate test conditions from one test bench to another, one chip to another. AMD can't lie to their investors, their claims are true even if cherry picked. Yes ~ let's assume best case, or best of 3 (runs) but what you seem to be doing here is Intel/Nivdia (AMD?) lied in the past so this is also a lie.
Posted on Reply
#133
looncraz
efikkan said:
They gave results from a cherry-picked benchmark, that does not equivalate IPC. This is all marketing BS, and we should know better than that. People are usually very (and rightfully) sceptical of Intel's and Nvidia's cherry-picked performance claims, we should have the same standard for AMD.
No doubt it was cherry picked - there's a reason those tests specifically leveraged the hardware AMD mentioned as being improved. This is the peak improvement (outside of 256 bit floating point, which should roughly double in performance) we should expect. It's the upper bounds... but it's still a valid test because we can gleam from it several things.

dkern can be extremely useful for testing the front end... but you need to have a rather large vector upon which to operate for that to happen... and we just don't know what AMD was doing with dkern since it's just a function that operates on data. However, dkern has branches, does integer or floating point comparisons, decrements, subtraction, division, and operates on potentially large amounts of data (usually image or scientific data, being a statistical smoothing method).

RSA in this situation could be used to decrypt or encrypt the data being accessed or could be an entirely other program being used... or AMD ran two benchmarks and averaged the results... they were very unclear.

RSA has a few tight loops, multiplication, division, comparisons and branches within loops, and potentially significant bandwidth utilization (simply jumping to L2 counts as significant in this context).

efikkan said:

Games are in no way bottlenecked by ALU or FPU throughput. Rendering threads in games have very little computational load (percentage of total instructions), and is mainly bottlenecked by cache misses. To even come close to saturating ALUs and FPUs, like the benchmark referred to in this thread, you need the majority of the code to be a tight loop of nearly pure math operations. On modern ~4 GHz CPUs the cost of a cache miss is ~250 cycles or more, and with all the execution ports you should be able to imagine how much potential is actually wasted by a single cache miss, or how many simple ALU/FPU operations one cache miss is "worth". Game rendering code is usually a huge amount of function calls and a small amount of calculations (relatively speaking), most of these function calls will cause at least one cache miss, which is why this type of code usually leaves the CPU stalled >95% of clock cycles. A small improvement in prediction and/or OoOE can give decent benefits without adding any additional computational resources.
They are usually bottlenecked by ALU->FPU communication or, as you say, cache miss penalties. Ryzen's sequential cache performance is very good, but it falls flat with random accesses, so that is definitely a significant role - and hopefully something AMD has resolved with Zen 2. A 33% improvement in ALU->FPU throughput means ~10% improvement for many CPU bottlenecked games per cycle. That puts them roughly on par with Intel for those games. Others that are thrashing the cache (which would be a bad game engine - of which there are plenty (I'm looking at you Hitman!)) won't care at all about that improvement (or very little... or even "dislike" it). Here, of course, Zen 2 will need to have reduced semi-random access latencies. It doesn't help that each core advertises access to 8MiB of L3 but only seems to search 4MiB worth of tags before jumping to the IMC. An L4 would help here - we wouldn't be hitting memory latencies for in-page random access, in the very least, but would be much closer to Intel's ~20ns figures.
Posted on Reply
#134
Valantar
efikkan said:
They gave results from a cherry-picked benchmark, that does not equivalate IPC. This is all marketing BS, and we should know better than that. People are usually very (and rightfully) sceptical of Intel's and Nvidia's cherry-picked performance claims, we should have the same standard for AMD.
Calling something solely mentioned in an endnote and not used as a marketing point whatsoever "marketing BS" is... a stretch. Yes, we should hold AMD to the same standard as Nvidia and Intel, but this isn't even close to the crap they've pulled earlier (or AMD, for that matter). This is "presenting" (well, not really, more like "not quite omitting") very specific numbers from a very specific benchmark in a very specific context, and not making a fuss about it. AMD hasn't at any point said "Zen2 has 29% improved IPC from Zen." If they did, that would be marketing BS. And, again:
Valantar said:
We need to remember that IPC is workload dependent. There might be a 29% increase in IPC in certain workloads, but generally, when we talk about IPC it is average IPC across a wide selection of workloads. This also applies when running test suites like SPEC or GeekBench, as they run a wide variety of tests stressing various parts of the core. What AMD has "presented" (it was in a footnote, it's not like they're using this for marketing) is from two specific workloads. This means that a) this can very likely be true, particularly if the workloads are FP-heavy, and b) this is very likely not representative of total average IPC across most end-user-relevant test suites. In other words, this can be both true (in the specific scenarios in question) and misleading (if read as "average IPC over a broad range of workloads").
Given that AMD hasn't pushed this as a selling point, they haven't done anything even remotely wrong. They can't be blamed for inept journalists and/or fanboys taking specific statements out of context.
Posted on Reply
#135
efikkan
R0H1T said:
That's because it's nearly impossible to replicate test conditions from one test bench to another, one chip to another. AMD can't lie to their investors, their claims are true even if cherry picked. Yes ~ let's assume best case, or best of 3 (runs) but what you seem to be doing here is Intel/Nivdia (AMD?) lied in the past so this is also a lie.
You know very well what I'm talking about; all the vendors choose benchmarks which favors them at any time, which puts their product in the best possible position.

looncraz said:

They are usually bottlenecked by ALU->FPU communication or, as you say, cache miss penalties.
ALU->FPU communication? What do you mean by that? Conversion of ints to floats?

looncraz said:

Ryzen's sequential cache performance is very good, but it falls flat with random accesses, so that is definitely a significant role - and hopefully something AMD has resolved with Zen 2.
You do know how cache works, right? In sequential reads cache should be "transparent".
Cache is organized in banks. Zen's 8-way 512kB L2 is actually 8 separate 64kB caches (Skylake have 4-way 256kB, Haswell 8-way 256kB). Memory is stored in 64b cache lines, for sequential reads the cache lines will be evenly spread across the banks. Zen having 8×64kB L2 caches vs. Skylake's 4x64kB caches should not give Zen any disadvantage in latency or throughput.

Intel's advantage isn't a faster cache, it's a better front-end/prefetcher to detect linear accesses which improves cache hit ratio.

What does random accesses have to do with this? Nothing can ever predict random accesses, they will fall through the cache and read directly from memory. The only thing that can marginally help here is the OoOE trying to dereference a pointer etc. as early as possible, but usually the limits to how far ahead the prefetcher can "see", and of course all branching logic and other pointers may limit the room for early execution here. Once again this has to do with the efficiency of the prediction, not the latency of the cache.

looncraz said:

It doesn't help that each core advertises access to 8MiB of L3 but only seems to search 4MiB worth of tags before jumping to the IMC. An L4 would help here - we wouldn't be hitting memory latencies for in-page random access, in the very least, but would be much closer to Intel's ~20ns figures.
You mean that each four cores shares one L3 cache?
L3 cache is largely a "spillover cache", cache lines which have been recently used but kicked out of L2. Even in heavy multithreaded workloads, very little L3 is ever shared among cores. And when it is, it's mostly code, not data. When it comes to writing, the CPU engages a write-lock to discard a cache line from all caches, any latency here comes down to the entire memory structure, not the L3.
I also want to remind you that Intel switched their memory structure in Skylake-X/-SP vs. Skylake, making L2 larger and L3 smaller, but making L3 exclusive, and they improved overall efficiency.
I don't see any evidence to support that Zen is disadvantaged from having 4MB L3 per core. If they add an L4, it will basically be a larger "spillover cache", and they would also have to be careful not to increase overall latency by the added complexity.
Posted on Reply
#136
looncraz
efikkan said:

ALU->FPU communication? What do you mean by that? Conversion of ints to floats?
Bandwidth/latency between the integer and floating point PRFs, muxes, L1D, DTLB, load buffer, etc...

It can be a little easy to forget that Zen's FPU is a dedicated unit that has to have specific points of communication with the integer+memory complex whereas Intel's floating point units are on the same pipelines as their integer units.

efikkan said:

In sequential reads cache should be "transparent".
Ideally, yes, you should never have a stall with streaming data... but there's a difference between operating with data right off a data bus, within the register file, hitting the 1ns latency of L1D, or hitting the 3~4ns latency of the L2. On Zen, at least, there's no real bandwidth penalty for hitting the L2.

efikkan said:

Zen having 8×64kB L2 caches vs. Skylake's 4x64kB caches should not give Zen any disadvantage in latency or throughput.
That's not the issue... the L2 is really good... it's when we get inside the L3 that issues begin... and they explode once we hit the IMC.



efikkan said:

Intel's advantage isn't a faster cache, it's a better front-end/prefetcher to detect linear accesses which improves cache hit ratio.
Most of Intel's front end isn't necessarily better than Zen's (just different). Remember that the 6900k has a 20MiB L3. Intel's main advantage is a tightly coupled low latency IMC... AMD's game is more than on point until it hits the IMC (see above graphic), which happens at any access above 8MiB...

Games, for their part, often do a pretty good job of imitating a random memory access pattern... which is why Ryzen's game performance can jump 15% or more with overclocked memory. Give Zen the same memory latency as Intel cores have and I think the Ryzen 2700X would be the gaming king per clock.


efikkan said:

You mean that each four cores shares one L3 cache?
L3 cache is largely a "spillover cache", cache lines which have been recently used but kicked out of L2. Even in heavy multithreaded workloads, very little L3 is ever shared among cores.
Data sharing is insanely common in multi-threaded programs. If it wasn't we wouldn't have to worry so much about lock contention. I have a good ~20 years of MT programming experience (from way back in the BeOS 4.5 era) with numerous programming languages, operating systems, and devices. I directly tested Zen's inter-core and inter-CCX communications to discover that there was a fast-path (low latency, low bandwidth) communication path for small data packets before AMD detailed the command fabric. I discovered it by accident because I couldn't explain how I was getting data between Core 0 to Core7 (on different CCXes) with only something like a 20ns penalty versus going from core 0 to core 1 (same CCX, neighboring cores).... but I digress...

efikkan said:

I also want to remind you that Intel switched their memory structure in Skylake-X/-SP vs. Skylake, making L2 larger and L3 smaller, but making L3 exclusive, and they improved overall efficiency.
Yes, they copied Zen (kind of a joke...). Though AMD uses a 'mostly' exclusive design - though we don't fully know, AFAIK, what they mean by that.


efikkan said:


I don't see any evidence to support that Zen is disadvantaged from having 4MB L3 per core. If they add an L4, it will basically be a larger "spillover cache", and they would also have to be careful not to increase overall latency by the added complexity.
It's not the 4MiB per core - it's what happens in many scenarios when a core tries to access beyond that basic block...

Posted on Reply
#137
londiste
looncraz said:
Games, for their part, often do a pretty good job of imitating a random memory access pattern... which is why Ryzen's game performance can jump 15% or more with overclocked memory. Give Zen the same memory latency as Intel cores have and I think the Ryzen 2700X would be the gaming king per clock.
I would argue Ryzen gets a jump in the game performance due to increased inter-die communication speed, not the lower memory latency.
On Intel CPUs there are games where faster memory benefits but this is far from common and in many cases memory speed makes a negligible difference. The same does not apply to Ryzens, these will get a jump from faster memory across the board.
Posted on Reply
#138
HTC
londiste said:
I would argue Ryzen gets a jump in the game performance due to increased inter-die communication speed, not the lower memory latency.
On Intel CPUs there are games where faster memory benefits but this is far from common and in many cases memory speed makes a negligible difference. The same does not apply to Ryzens, these will get a jump from faster memory across the board.
More likely, a combination of both.
Posted on Reply
#139
R0H1T
efikkan said:
You know very well what I'm talking about; all the vendors choose benchmarks which favors them at any time, which puts their product in the best possible position.


ALU->FPU communication? What do you mean by that? Conversion of ints to floats?


You do know how cache works, right? In sequential reads cache should be "transparent".
Cache is organized in banks. Zen's 8-way 512kB L2 is actually 8 separate 64kB caches (Skylake have 4-way 256kB, Haswell 8-way 256kB). Memory is stored in 64b cache lines, for sequential reads the cache lines will be evenly spread across the banks. Zen having 8×64kB L2 caches vs. Skylake's 4x64kB caches should not give Zen any disadvantage in latency or throughput.

Intel's advantage isn't a faster cache, it's a better front-end/prefetcher to detect linear accesses which improves cache hit ratio.

What does random accesses have to do with this? Nothing can ever predict random accesses, they will fall through the cache and read directly from memory. The only thing that can marginally help here is the OoOE trying to dereference a pointer etc. as early as possible, but usually the limits to how far ahead the prefetcher can "see", and of course all branching logic and other pointers may limit the room for early execution here. Once again this has to do with the efficiency of the prediction, not the latency of the cache.


You mean that each four cores shares one L3 cache?
L3 cache is largely a "spillover cache", cache lines which have been recently used but kicked out of L2. Even in heavy multithreaded workloads, very little L3 is ever shared among cores. And when it is, it's mostly code, not data. When it comes to writing, the CPU engages a write-lock to discard a cache line from all caches, any latency here comes down to the entire memory structure, not the L3.
I also want to remind you that Intel switched their memory structure in Skylake-X/-SP vs. Skylake, making L2 larger and L3 smaller, but making L3 exclusive, and they improved overall efficiency.
I don't see any evidence to support that Zen is disadvantaged from having 4MB L3 per core. If they add an L4, it will basically be a larger "spillover cache", and they would also have to be careful not to increase overall latency by the added complexity.
And that's fine so long as the benchmarks aren't outright fudged or in some cases the competition's system deliberately gimped. Best case implies just that & should always be taken with a grain of salt.

What you're forgetting though is that without these "best case" benches there would be little to no IPC gain, for Intel, in the last few years. Take the FP numbers out, heavily influenced by AVX2 or AVX512, & you have virtually 0 IPC gains for close to 4 years, if not more. That's because x86 has pretty much reached the end of the line so far as IPC gains are concerned. The biggest performance gains this decade have come from tweaking cache hierarchy, DDR4, AVX & clock speeds. That's not the case for ARM but it's also not a part of this debate.
Posted on Reply
#140
londiste
ARM is not magic. The gains ARM has done have already been there for a long time in x86 and other architectures (VFP, NEON, SVE, 64-bit, multiple cores, out-of-order, prediction). In addition to that due to it being small(er), simple(r) and cheap(er) ARM has had a process node advantage for a couple generations now. As ARM improves, it will start running into similar problems as other architectures.
Posted on Reply
#141
WikiFM
Mark Little said:
Anandtech does a great job with their Bench tool on their website. It helps with conversations like this one. Here are the Full package load power measurements for the last four Intel generations:
i7-6700K 82.55W
i7-7700K 95.14W
i7-8700K 150.91W
i9-9900K 168.48W
There is a major change between the 7th and 8th generations. However, Intel rates them all as 95W. You don't see a problem with this?
Source: https://www.anandtech.com/bench/CPU-2019/2194

EDIT: And if you look at all the CPUS at Full package load at that link, you will see almost all fall below or within +10% of the rated TDP across desktop, HEDT and server chips from both AMD and Intel. Only the 8700K and the 9900K are way off. This is deceptive advertising at its worst to try and look competitive and cover up being stuck on the same process node.
In that same chart the 8086K which is a tiny bit faster than 8700K consumes 100W, why is that?

Valantar said:
The 8121U has its AVX512 units (as well as pretty much everything else) disabled.

None of us will be buying EPYC. That's why we're taking what they've said about it and attempting to extrapolate what this means for Ryzen 3000 and TR3. Also, it's interesting to discuss when someone makes some actual innovations in this space, even if we're not in the target market.

I'd actually consider the same, even if I'm very happy with my 1600X.
8121U is not disabled: https://ark.intel.com/products/136863/Intel-Core-i3-8121U-Processor-4M-Cache-up-to-3_20-GHz. And even if it was the case it has them.
Posted on Reply
#142
R0H1T
londiste said:
ARM is not magic. The gains ARM has done have already been there for a long time in x86 and other architectures (VFP, NEON, SVE, 64-bit, multiple cores, out-of-order, prediction). In addition to that due to it being small(er), simple(r) and cheap(er) ARM has had a process node advantage for a couple generations now. As ARM improves, it will start running into similar problems as other architectures.
Alright & where did I say that? The reason ARM is doing better than x86 atm is because x86 has been tweaked & improved over 4 decades, the gains decade on decade have been massive. ARM is only realizing these (massive) gains since 2k, so they may still have a decade or half to hit the wall which x86 seems to be running into right now. They could, of course hit the physics wall first.

Why do I see this dismissive tone when talking about ARM here, do you buy phones with Intel Inside? Why do you think that is, do you treat (one of) AMD vs Nvidia the same way?
Posted on Reply
#143
londiste
R0H1T said:
Why do I see this dismissive tone when talking about ARM here, do you buy phones with Intel Inside?
I apologize, this wasn't so much about your post but the apparent general impression that ARM is something completely different and has long way to go. It doesn't.

The other part is why are you dismissive about x86? There have been some tests on single core performance here and there from Sandy Bridge forward (7 years). For example:
https://m.sweclockers.com/test/23426-amd-ryzen-7-1800x-och-7-1700x/29
23% in Cinebench is not too bad. And Cinebench does only 128-bit AVX (which is from Sandy Bridge/Bulldozer era).

Intel has been stuck on Skylake and derivatives for 3 years but I would not necessarily put this down to inability to improve IPC. Core count is the clear focus for the last few years.
Posted on Reply
#145
R0H1T
londiste said:
I apologize, this wasn't so much about your post but the apparent general impression that ARM is something completely different and has long way to go. It doesn't.

The other part is why are you dismissive about x86? There have been some tests on single core performance here and there from Sandy Bridge forward (7 years). For example:
https://m.sweclockers.com/test/23426-amd-ryzen-7-1800x-och-7-1700x/29
23% in Cinebench is not too bad. And Cinebench does not do AVX.

Intel has been stuck on Skylake and derivatives for 3 years but I would not necessarily put this down to inability to improve IPC. Core count is the clear focus for the last few years.
So far as raw performance is concerned, no x86 is still the team (Intel & AMD) to beat. Howsoever, as I've noted, IPC gains outside of AVX assisted (FP) workload have been hard to come by. Can you deny that?

I've also noted that the biggest changes have been in cache, memory, clock speeds & arguably HT or SMT for AMD. Admittedly ARM also benefits from that, but again the point is ARM are coming from a much smaller base (number) & so their gains are incredible. The biggest servers, supercomputers will still be vastly x86 based, but as you've said that's down to more cores. IMO (chip) interconnect technologies like UPI or IF are the next & perhaps the last hurdle before x86 reaches it's peak. I don't see the same kind of progress in the next decade, as we've seen since 2010 unless there's some major breakthrough. The future is dedicated (hardware) accelerators, that is where I see computing realm headed. The core wars have just begun but even there physics will catch up pretty soon.
Posted on Reply
#146
bucketface
This sounds very impressive, 29% ipc for integer workloads... but that is one specific workload type, this is not a general use scenario with 29% improvement so dont get too hype and also for those trying to call this out, well it's pretty honest in it's information, but only if your workload is integer heavy. Overall hopefully they can get a 10% + improvement on ipc and clocks go up as well.
Posted on Reply
#147
londiste
R0H1T said:
I've also noted that the biggest changes have been in cache, memory, clock speeds & arguably HT or SMT for AMD.
Caches are integral part of any contemporary CPU.
Memory speeds have increased, yes. Well, latency not so much but bandwidth for sure. Memory improvement will continue. DDR5 is on its way and for some implementations, GDDR/HBM with their respective up- and downsides. How much this affects results depends on benchmark. In case of Cinebench, it has a very low scaling with faster memory.
The Cinebench test I linked is at the same clock speeds and single core/thread so no HT/SMP in play.

R0H1T said:
So far as raw performance is concerned, no x86 is still the team (Intel & AMD) to beat. Howsoever, as I've noted, IPC gains outside of AVX assisted (FP) workload have been hard to come by. Can you deny that?
I do not know if I would want to separate FP from general CPU performance. FP has been a part of x86 for a long time - in coprocessors since the beginning, integrated in and Pentium onwards. Improving parts of the instruction set is part of CPU evolution.

R0H1T said:
Admittedly ARM also benefits from that, but again the point is ARM are coming from a much smaller base (number) & so their gains are incredible.
Comparisons with ARM are difficult. In addition to integrating performance-improving aspects (many of them tried-and-true) ARM has been moving to scale its architecture(s) higher and higher, largely funded and driven by smartphones. Higher-performing ARM CPUs are not small and they do consume considerable amounts of power.

Actually, since you mentioned AVX and other instruction-level improvements being a bit suspect when comparing IPC, ARM has gained a lot of its IPC and almost all of its FP performance from just that. Since ARM's focus is different, they are also going for more cores rather than more performance, especially in the last few years. They also benefit from not having to be compatible :)

Rest of your post I wholeheartely agree with.
Posted on Reply
#148
looncraz
londiste said:
I would argue Ryzen gets a jump in the game performance due to increased inter-die communication speed, not the lower memory latency.
On Intel CPUs there are games where faster memory benefits but this is far from common and in many cases memory speed makes a negligible difference. The same does not apply to Ryzens, these will get a jump from faster memory across the board.
Ryzen basically double-dips when you have higher memory clocks. Intel gains, Zen gains double.
Posted on Reply
#149
bug
If it seems too good to be true, it usually is.
AMD have clarified in the meantime 29% was based on one particular benchmark. So we're basically back to square one.
Posted on Reply
#150
looncraz
bucketface said:
This sounds very impressive, 29% ipc for integer workloads... but that is one specific workload type, this is not a general use scenario with 29% improvement so dont get too hype and also for those trying to call this out, well it's pretty honest in it's information, but only if your workload is integer heavy. Overall hopefully they can get a 10% + improvement on ipc and clocks go up as well.
It's a mixed floating point and integer workload with what should be a pretty good amount of hits through the L2. It tells us what the core can do on its own in a roughly ideal situation to extract IPC. Papermaster showed exactly why... there's no missing explanation for the specific benchmark result.

From what we know, the breakdown in performance improvements. for this workload, probably looks something such as this:
  • Fetch: . . . . . . 0-5% . . . . . . . .(from L2/L3/IMC)
  • Dispatch: . . . 30~35% . . . . . (next instruction counter, larger uop cache, wider dispatch width)
  • ALU: . . . . . . . 5-15% . . . . . . . (instructions in play are all too simple to see much improvement, so this would be the predictor improvement as it relates to these simple tests)
  • FPU: . . . . . . . 15-33% . . . . . . (non-AVX workload, advantage comes from load bandwidth doubling).
  • Retire: . . . . . .70~80% . . . . . .(from doubling of retirement bandwidth - 128-bit to 256-bit - not 100% because of naturally imperfect scaling)
These values would average together to become the IPC increase for this particular workload. These should be the ranges to expect for any program going through the CPU... with some major caveats - such as the fetch and ALU performance not being well represented in this workload - and the dispatch and retire ruling the day.

___________________

Also, x86 has plenty of room for improvement. We just have to start walking away from relative energy efficiency.

If we had a process that allowed us to execute and fetch memory with almost no power usage, we would easily double IPC. Everything in a modern CPU is a compromise for power efficiency... including how aggressively you do predictive computation.

Heck, if we created a semi-dedicated pipeline for predictions and left another dedicated path for in-order execution (leaving instruction bubbles and all, but with power gating), we would see cache miss penalties drop close to zero as we could execute both possibilities for a branch outcome then just move over each stage results after a branch prediction is shown true - removing the instruction bubble with a single cycle latency and resulting in nearly perfect prediction performance. This is insane in the world where power consumption is important... you will be executing (partly or in full) nearly every instruction in a program - even for branches not taken... we're talking about potentially more than doubling how much is executed for every clock cycle. Still, this would be something like a 50% IPC increase.
Posted on Reply
Add your own comment