Sunday, November 12th 2017

AMD "Zen 2" IPC 29 Percent Higher than "Zen"

AMD reportedly put out its IPC (instructions per clock) performance guidance for its upcoming "Zen 2" micro-architecture in a version of its Next Horizon investor meeting, and the numbers are staggering. The next-generation CPU architecture provides a massive 29 percent IPC uplift over the original "Zen" architecture. While not developed for the enterprise segment, the stopgap "Zen+" architecture brought about 3-5 percent IPC uplifts over "Zen" on the backs of faster on-die caches and improved Precision Boost algorithms. "Zen 2" is being developed for the 7 nm silicon fabrication process, and on the "Rome" MCM, is part of the 8-core chiplets that aren't subdivided into CCX (8 cores per CCX).

According to Expreview, AMD conducted DKERN + RSA test for integer and floating point units, to arrive at a performance index of 4.53, compared to 3.5 of first-generation Zen, which is a 29.4 percent IPC uplift (loosely interchangeable with single-core performance). "Zen 2" goes a step beyond "Zen+," with its designers turning their attention to critical components that contribute significantly toward IPC - the core's front-end, and the number-crunching machinery, FPU. The front-end of "Zen" and "Zen+" cores are believed to be refinements of previous-generation architectures such as "Excavator." Zen 2 gets a brand-new front-end that's better optimized to distribute and collect workloads between the various on-die components of the core. The number-crunching machinery gets bolstered by 256-bit FPUs, and generally wider execution pipelines and windows. These come together yielding the IPC uplift. "Zen 2" will get its first commercial outing with AMD's 2nd generation EPYC "Rome" 64-core enterprise processors.

Update Nov 14: AMD has issued the following statement regarding these claims.
As we demonstrated at our Next Horizon event last week, our next-generation AMD EPYC server processor based on the new 'Zen 2' core delivers significant performance improvements as a result of both architectural advances and 7nm process technology. Some news media interpreted a 'Zen 2' comment in the press release footnotes to be a specific IPC uplift claim. The data in the footnote represented the performance improvement in a microbenchmark for a specific financial services workload which benefits from both integer and floating point performance improvements and is not intended to quantify the IPC increase a user should expect to see across a wide range of applications. We will provide additional details on 'Zen 2' IPC improvements, and more importantly how the combination of our next-generation architecture and advanced 7nm process technology deliver more performance per socket, when the products launch.
Source: Expreview
Add your own comment

162 Comments on AMD "Zen 2" IPC 29 Percent Higher than "Zen"

#151
Valantar
bug said:
If it's too good to be true, it usually is.
AMD have clarified in the meantime 29% was based on one particular benchmark. So we're basically back to square one.
Clarified? They have never claimed otherwise.
Posted on Reply
#152
bug
Valantar said:
Clarified? They have never claimed otherwise.
Maybe "put an end to speculations" would have been clearer?
Posted on Reply
#153
GlacierNine
bug said:
If it's too good to be true, it usually is.
AMD have clarified in the meantime 29% was based on one particular benchmark. So we're basically back to square one.
I like that even when you're quoting a decades old, common saying, you still manage to get it completely wrong by omitting the word "seems" and replacing it with a contraction that makes your sentence appear to read "If it is too good to be true, it usually is".

Congratulations on your newfound grasp of this tautology.
Posted on Reply
#154
bug
GlacierNine said:
I like that even when you're quoting a decades old, common saying, you still manage to get it completely wrong by omitting the word "seems" and replacing it with a contraction that makes your sentence appear to read "If it is too good to be true, it usually is".

Congratulations on your newfound grasp of this tautology.
Yeah, well, posting in a hurry between two compiles will do that ;)

Edit: fixed
Posted on Reply
#155
looncraz
bug said:
Maybe "put an end to speculations" would have been clearer?
They wanted to make crystal clear that the benchmark wasn't designed to be a representative workload.

It's like using Cinebench as your only performance metric... not such a good idea unless all you do is run Cinema4D.
Posted on Reply
#156
bug
looncraz said:
They wanted to make crystal clear that the benchmark wasn't designed to be a representative workload.

It's like using Cinebench as your only performance metric... not such a good idea unless all you do is run Cinema4D.
At least it puts an upper bound on expectations, so I'm good.
Posted on Reply
#157
looncraz
bug said:
At least it puts an upper bound on expectations, so I'm good.
That it does - it looks around 30% for any real world task is going to be max. It's kind of a way to temper the whole "we doubled the FPU" thing, I think.

Much better for headlines to read 29% improvement rather than 100% improvement.. especially when some programs will see 5% benefit and others 20%, with a few as high as 30%.

Sadly, it's hard to predict Cinebench scores for this since Cinebench relies much more heavily on branch prediction and prefetch, but we can guess it will be at least 10%, but no more than 30% - and very very unlikely to see 30%. But that's where we were before, except with a lower upper bound. I still think it will be about 15% on average.
Posted on Reply
#158
efikkan
looncraz said:
Bandwidth/latency between the integer and floating point PRFs, muxes, L1D, DTLB, load buffer, etc...

It can be a little easy to forget that Zen's FPU is a dedicated unit that has to have specific points of communication with the integer+memory complex whereas Intel's floating point units are on the same pipelines as their integer units.
You do know that any data coming out of an ALU or FPU needs to finish the pipeline before it can be fed again?
Let's say you write A + B + C + D in your code,
this will be executed as ((A + B) + C) + D,
and while each addition should only take a few cycles, the CPU would have to wait up to 18 cycles before the next operation can even start. While the exact timing and bandwidth slightly vary between CPU architectures, this principle should largely be the same for any pipelined architecture.

looncraz said:


That's not the issue... the L2 is really good... it's when we get inside the L3 that issues begin... and they explode once we hit the IMC.
[IMG]https://www.techpowerup.com/forums/proxy.php?image=http%3A%2F%2Fwww.legitreviews.com%2Fwp-content%2Fuploads%2F2017%2F02%2F256stride.jpg&hash=d773e945ff3b8d7744693d707498599a[/IMG]
Judging by that image, there is no issue with L3 cache at all.

looncraz said:

Intel's main advantage is a tightly coupled low latency IMC... AMD's game is more than on point until it hits the IMC (see above graphic), which happens at any access above 8MiB...
Sure, the memory latency can be much worse due to Zen's core structure. But you were the one arguing for Zen needing better cache. I'm the one pointing out that Zen have a decent cache on paper, but Intel is much better at utilizing their cache through a better front-end.

looncraz said:

Games, for their part, often do a pretty good job of imitating a random memory access pattern... which is why Ryzen's game performance can jump 15% or more with overclocked memory. Give Zen the same memory latency as Intel cores have and I think the Ryzen 2700X would be the gaming king per clock.
This has to do with AMD's Infinity Fabric being tied to the memory speed.
Intel see no significant difference with speeds above 2666 MHz, even memory with slightly better timings have virtually no impact.

looncraz said:

Data sharing is insanely common in multi-threaded programs.
Data sharing between L3 caches, which was what we were talking about, is very rare. The lifecycle of any piece of data in cache is in microseconds or less. Cache is just a streaming-buffer for RAM, it's not like the "most important stuff" stays there. Cache is usually completely overwritten thousands of times per second.

Zen is however penalized when having to access a different memory controller through the Infinity Fabric, which of course is common in multithreaded workloads.
Posted on Reply
#159
looncraz
efikkan said:
You do know that any data coming out of an ALU or FPU needs to finish the pipeline before it can be fed again?
It actually doesn't, though that used to be the case. Today, fetch and decode chunks of instructions and then determine dependencies. We tag instructions as dependent upon others and try to reroute non-dependent instructions around them before then processing the dependent instruction.

For most instructions, the core knows (within reason) how long it should take to execute and get the result back, so we don't wait for the result - we schedule the instruction so that it is already ready to be fed into a pipeline when the result is available... we want the decoded instruction tagged and sitting in the scheduler - and we want the instruction whose result we need to carry a matching tag.

An add instruction takes a single cycle, but it takes time to get the result. Intel uses a bus they call, unimaginatively, but accurately, the "result bus." This bus is fed by the store pipelines and each execution pipeline. The load and execution pipelines can read results directly off this bus if the timing works out correctly.

So, A + B + C + D would execute as mov A, result; add B, result; add C, result; add D, result;.

One trick here, which is by no means obvious or necessarily done, would be to keep the instructions in two places. You keep a copy in the scheduler, tagged, as you send the dependent instruction down the pipelines one after the other, so the next instruction can get its result from the previous instruction from the result bus.

The way Intel describes what they do (despite admitting that the execution pipelines can read from the result bus) is to send the result back to the scheduler, where the dependent instruction is waiting for the data. I genuinely suspect they do both (otherwise there's no need for an execution pipeline to read from the result bus..)... it just depends on how dependably the execution while occur within a given time frame.

efikkan said:

Judging by that image, there is no issue with L3 cache at all.
That image is using a 256-byte stride which hides the full effect of the random-access issue until it exceeds about the cache size as Ryzen can predict the access pattern well enough.



You can see the (unsurprising) excellent sequential performance (which is only 2.8ns on my 2700X) ... and them the abysmal random access performance.

Intel's in-page random access performance is several times better. This is the cache performance issue that is hurting Ryzen - and it relates to how often the L3 prefetch ends up hitting the IMC instead of being able to stay within the L3. This happens increasingly more after a single core uses more than 4MiB of data. By 6MiB you have a ~50% miss rate that result in hitting memory latency.

My 2700X results are better - because my IMC latency is only 61.9ns with 3600MT/s memory.

If Zen 2 can bring that down by another 20ns while increasing how much L3 each core can access, it's going to be a big deal. My 2700X only has 9.5ns latency to the L3 - if I had 40ns latency to main memory and 16MiB of cache to access, in-page random access should fall to the 20~30ns region (depending on page size).

efikkan said:

Intel is much better at utilizing their cache through a better front-end.
Zen's front end is extremely good. As is Intel's.

Zen's has higher throughput potential (8 uops vs 4 uops), but Intel has fusion - so that 4 uops is sometimes 7 uops... and Zen's 8 uops is sometimes 4 uops...
Intel's branch predictor seems to be better, but that's about it.

The first Zen bottleneck (if you want to call it that) is when Intel can dispatch 7 uops and Zen can only dispatch 6. Intel isn't always able to dispatch 7 uops, but Zen can never exceed 6. That's a potential 16.7% advantage to Intel.

The next Intel advantage is in their unified scheduler - which allows accessing results without going back to the scheduler. Zen, AFAICT, needs to send results back to the forwarding muxes or the PRF. This is only a couple cycles - and AMD makes up for it by having four ALU full featured pipelines and 6 independent schedulers. Being only 14 deep keeps things simple to manage, but it may mean results could need to be fetched from the L1D (3 cycle penalty) more often.

efikkan said:

Data sharing between L3 caches, which was what we were talking about, is very rare. The lifecycle of any piece of data in cache is in microseconds or less. Cache is just a streaming-buffer for RAM, it's not like the "most important stuff" stays there. Cache is usually completely overwritten thousands of times per second.
If you mean between L3 caches to mean between each CCX or die - yes, that's true. Everything is always referenced as a memory address, the LLC acts as the insulator to main memory. However, cross communication does occur for certain volatile memory. This seems to happen via the command bus, but it also happens via the data fabric. This is probably magic that happens through the IMC without going to system memory, which would explain the latency results with core to core communication (simple test - fixed affinity, each core accessing the same memory addresses, just reading and updating a simple struct - time, accessing core, and a mutex... each core that gains the mutex records the time difference between the last access, which core made that access, updates the struct, and moves on). This showed that handling the mutex could, PEAK, take only 10ns between the CCXes (this could even be a timing mechanism inaccuracy, since this all reanalysis), but usually took way more... strong clusters at 20ns, 40~50ns, and a good half at 100ns or more (which means out to main memory).

Multi-threaded apps share data across cores, it's as simple as that, and mutexes and volatile memory are all something the CPU can figure out with ease, so optimizing for those, in very least, has been done.

efikkan said:

Zen is however penalized when having to access a different memory controller through the Infinity Fabric, which of course is common in multithreaded workloads.
Yes, it will be extremely interesting to see how the newly unified IMC that's spread far across the IO die will work to solve some of these issues.
Posted on Reply
#160
Vario
'The data in the footnote represented the performance improvement in a microbenchmark for a specific financial services workload which benefits from both integer and floating point performance improvements and is not intended to quantify the IPC increase a user should expect to see across a wide range of applications,' AMD's clarification continues. 'We will provide additional details on "Zen 2" IPC improvements, and more importantly how the combination of our next-generation architecture and advanced 7nm process technology deliver more performance per socket, when the products launch.'
https://bit-tech.net/news/tech/cpus/amd-downplays-29-percent-zen-2-ipc-boost-reports/1/
Posted on Reply
#161
$ReaPeR$
well.. based on their recent track record its quite plausible that with zen 2 they will reach and maybe even overtake (by a small margin) intel on the single core ipc, their only disadvantage/problem is basically clocks and that can be fixed relatively easily with a smaller node.
Posted on Reply
#162
Adam Krazispeed
btarunr said:
AMD's "59% higher" claims for Zen1 over Excavator invited the same ridicule.

Lisa Su is very careful about the guidance she puts out.
actually.... it was an IPC uplift of 52% from "excavator"

if this is another 29% MORE IPC Over zen 1, then Intel is done.....
Posted on Reply
Add your own comment