• Welcome to TechPowerUp Forums, Guest! Please check out our forum guidelines for info related to our community.

AMD "Zen 2" IPC 29 Percent Higher than "Zen"

Joined
Jun 5, 2016
Messages
69 (0.05/day)
This sounds very impressive, 29% ipc for integer workloads... but that is one specific workload type, this is not a general use scenario with 29% improvement so dont get too hype and also for those trying to call this out, well it's pretty honest in it's information, but only if your workload is integer heavy. Overall hopefully they can get a 10% + improvement on ipc and clocks go up as well.

It's a mixed floating point and integer workload with what should be a pretty good amount of hits through the L2. It tells us what the core can do on its own in a roughly ideal situation to extract IPC. Papermaster showed exactly why... there's no missing explanation for the specific benchmark result.

From what we know, the breakdown in performance improvements. for this workload, probably looks something such as this:
  • Fetch: . . . . . . 0-5% . . . . . . . .(from L2/L3/IMC)
  • Dispatch: . . . 30~35% . . . . . (next instruction counter, larger uop cache, wider dispatch width)
  • ALU: . . . . . . . 5-15% . . . . . . . (instructions in play are all too simple to see much improvement, so this would be the predictor improvement as it relates to these simple tests)
  • FPU: . . . . . . . 15-33% . . . . . . (non-AVX workload, advantage comes from load bandwidth doubling).
  • Retire: . . . . . .70~80% . . . . . .(from doubling of retirement bandwidth - 128-bit to 256-bit - not 100% because of naturally imperfect scaling)

These values would average together to become the IPC increase for this particular workload. These should be the ranges to expect for any program going through the CPU... with some major caveats - such as the fetch and ALU performance not being well represented in this workload - and the dispatch and retire ruling the day.

___________________

Also, x86 has plenty of room for improvement. We just have to start walking away from relative energy efficiency.

If we had a process that allowed us to execute and fetch memory with almost no power usage, we would easily double IPC. Everything in a modern CPU is a compromise for power efficiency... including how aggressively you do predictive computation.

Heck, if we created a semi-dedicated pipeline for predictions and left another dedicated path for in-order execution (leaving instruction bubbles and all, but with power gating), we would see cache miss penalties drop close to zero as we could execute both possibilities for a branch outcome then just move over each stage results after a branch prediction is shown true - removing the instruction bubble with a single cycle latency and resulting in nearly perfect prediction performance. This is insane in the world where power consumption is important... you will be executing (partly or in full) nearly every instruction in a program - even for branches not taken... we're talking about potentially more than doubling how much is executed for every clock cycle. Still, this would be something like a 50% IPC increase.
 
Last edited:
Joined
May 2, 2017
Messages
1,643 (1.72/day)
Processor AMD Ryzen 5 1600X
Motherboard Biostar X370GTN
Cooling Custom CPU+GPU water loop
Memory 16GB G.Skill TridentZ DDR4-3200 C16
Video Card(s) AMD R9 Fury X
Storage 500GB 960 Evo (OS ++), 500GB 850 Evo (Games)
Display(s) Dell U2711
Case NZXT H200i
Power Supply EVGA Supernova G2 750W
Mouse Logitech G602
Keyboard Lenovo Compact Keyboard with Trackpoint
Software Windows 10 Pro
If it's too good to be true, it usually is.
AMD have clarified in the meantime 29% was based on one particular benchmark. So we're basically back to square one.
Clarified? They have never claimed otherwise.
 

bug

Joined
May 22, 2015
Messages
6,808 (4.08/day)
Processor Intel i5-6600k (AMD Ryzen5 3600 in a box, waiting for a mobo)
Motherboard ASRock Z170 Extreme7+
Cooling Arctic Cooling Freezer i11
Memory 2x16GB DDR4 3600 G.Skill Ripjaws V (@3200)
Video Card(s) EVGA GTX 1060 SC
Storage 500GB Samsung 970 EVO, 500GB Samsung 850 EVO, 1TB Crucial MX300 and 3TB Seagate
Display(s) HP ZR24w
Case Raijintek Thetis
Audio Device(s) Audioquest Dragonfly Red :D
Power Supply Seasonic 620W M12
Mouse Logitech G502 Proteus Core
Keyboard G.Skill KM780R
Software Arch Linux + Win10
Joined
Oct 5, 2017
Messages
465 (0.58/day)
If it's too good to be true, it usually is.
AMD have clarified in the meantime 29% was based on one particular benchmark. So we're basically back to square one.
I like that even when you're quoting a decades old, common saying, you still manage to get it completely wrong by omitting the word "seems" and replacing it with a contraction that makes your sentence appear to read "If it is too good to be true, it usually is".

Congratulations on your newfound grasp of this tautology.
 

bug

Joined
May 22, 2015
Messages
6,808 (4.08/day)
Processor Intel i5-6600k (AMD Ryzen5 3600 in a box, waiting for a mobo)
Motherboard ASRock Z170 Extreme7+
Cooling Arctic Cooling Freezer i11
Memory 2x16GB DDR4 3600 G.Skill Ripjaws V (@3200)
Video Card(s) EVGA GTX 1060 SC
Storage 500GB Samsung 970 EVO, 500GB Samsung 850 EVO, 1TB Crucial MX300 and 3TB Seagate
Display(s) HP ZR24w
Case Raijintek Thetis
Audio Device(s) Audioquest Dragonfly Red :D
Power Supply Seasonic 620W M12
Mouse Logitech G502 Proteus Core
Keyboard G.Skill KM780R
Software Arch Linux + Win10
I like that even when you're quoting a decades old, common saying, you still manage to get it completely wrong by omitting the word "seems" and replacing it with a contraction that makes your sentence appear to read "If it is too good to be true, it usually is".

Congratulations on your newfound grasp of this tautology.
Yeah, well, posting in a hurry between two compiles will do that ;)

Edit: fixed
 
Joined
Jun 5, 2016
Messages
69 (0.05/day)
Maybe "put an end to speculations" would have been clearer?
They wanted to make crystal clear that the benchmark wasn't designed to be a representative workload.

It's like using Cinebench as your only performance metric... not such a good idea unless all you do is run Cinema4D.
 

bug

Joined
May 22, 2015
Messages
6,808 (4.08/day)
Processor Intel i5-6600k (AMD Ryzen5 3600 in a box, waiting for a mobo)
Motherboard ASRock Z170 Extreme7+
Cooling Arctic Cooling Freezer i11
Memory 2x16GB DDR4 3600 G.Skill Ripjaws V (@3200)
Video Card(s) EVGA GTX 1060 SC
Storage 500GB Samsung 970 EVO, 500GB Samsung 850 EVO, 1TB Crucial MX300 and 3TB Seagate
Display(s) HP ZR24w
Case Raijintek Thetis
Audio Device(s) Audioquest Dragonfly Red :D
Power Supply Seasonic 620W M12
Mouse Logitech G502 Proteus Core
Keyboard G.Skill KM780R
Software Arch Linux + Win10
They wanted to make crystal clear that the benchmark wasn't designed to be a representative workload.

It's like using Cinebench as your only performance metric... not such a good idea unless all you do is run Cinema4D.
At least it puts an upper bound on expectations, so I'm good.
 
Joined
Jun 5, 2016
Messages
69 (0.05/day)
At least it puts an upper bound on expectations, so I'm good.
That it does - it looks around 30% for any real world task is going to be max. It's kind of a way to temper the whole "we doubled the FPU" thing, I think.

Much better for headlines to read 29% improvement rather than 100% improvement.. especially when some programs will see 5% benefit and others 20%, with a few as high as 30%.

Sadly, it's hard to predict Cinebench scores for this since Cinebench relies much more heavily on branch prediction and prefetch, but we can guess it will be at least 10%, but no more than 30% - and very very unlikely to see 30%. But that's where we were before, except with a lower upper bound. I still think it will be about 15% on average.
 
Joined
Jun 10, 2014
Messages
1,824 (0.91/day)
Bandwidth/latency between the integer and floating point PRFs, muxes, L1D, DTLB, load buffer, etc...

It can be a little easy to forget that Zen's FPU is a dedicated unit that has to have specific points of communication with the integer+memory complex whereas Intel's floating point units are on the same pipelines as their integer units.
You do know that any data coming out of an ALU or FPU needs to finish the pipeline before it can be fed again?
Let's say you write A + B + C + D in your code,
this will be executed as ((A + B) + C) + D,
and while each addition should only take a few cycles, the CPU would have to wait up to 18 cycles before the next operation can even start. While the exact timing and bandwidth slightly vary between CPU architectures, this principle should largely be the same for any pipelined architecture.

That's not the issue... the L2 is really good... it's when we get inside the L3 that issues begin... and they explode once we hit the IMC.
Judging by that image, there is no issue with L3 cache at all.

Intel's main advantage is a tightly coupled low latency IMC... AMD's game is more than on point until it hits the IMC (see above graphic), which happens at any access above 8MiB...
Sure, the memory latency can be much worse due to Zen's core structure. But you were the one arguing for Zen needing better cache. I'm the one pointing out that Zen have a decent cache on paper, but Intel is much better at utilizing their cache through a better front-end.

Games, for their part, often do a pretty good job of imitating a random memory access pattern... which is why Ryzen's game performance can jump 15% or more with overclocked memory. Give Zen the same memory latency as Intel cores have and I think the Ryzen 2700X would be the gaming king per clock.
This has to do with AMD's Infinity Fabric being tied to the memory speed.
Intel see no significant difference with speeds above 2666 MHz, even memory with slightly better timings have virtually no impact.

Data sharing is insanely common in multi-threaded programs.
Data sharing between L3 caches, which was what we were talking about, is very rare. The lifecycle of any piece of data in cache is in microseconds or less. Cache is just a streaming-buffer for RAM, it's not like the "most important stuff" stays there. Cache is usually completely overwritten thousands of times per second.

Zen is however penalized when having to access a different memory controller through the Infinity Fabric, which of course is common in multithreaded workloads.
 
Joined
Jun 5, 2016
Messages
69 (0.05/day)
You do know that any data coming out of an ALU or FPU needs to finish the pipeline before it can be fed again?
It actually doesn't, though that used to be the case. Today, fetch and decode chunks of instructions and then determine dependencies. We tag instructions as dependent upon others and try to reroute non-dependent instructions around them before then processing the dependent instruction.

For most instructions, the core knows (within reason) how long it should take to execute and get the result back, so we don't wait for the result - we schedule the instruction so that it is already ready to be fed into a pipeline when the result is available... we want the decoded instruction tagged and sitting in the scheduler - and we want the instruction whose result we need to carry a matching tag.

An add instruction takes a single cycle, but it takes time to get the result. Intel uses a bus they call, unimaginatively, but accurately, the "result bus." This bus is fed by the store pipelines and each execution pipeline. The load and execution pipelines can read results directly off this bus if the timing works out correctly.

So, A + B + C + D would execute as mov A, result; add B, result; add C, result; add D, result;.

One trick here, which is by no means obvious or necessarily done, would be to keep the instructions in two places. You keep a copy in the scheduler, tagged, as you send the dependent instruction down the pipelines one after the other, so the next instruction can get its result from the previous instruction from the result bus.

The way Intel describes what they do (despite admitting that the execution pipelines can read from the result bus) is to send the result back to the scheduler, where the dependent instruction is waiting for the data. I genuinely suspect they do both (otherwise there's no need for an execution pipeline to read from the result bus..)... it just depends on how dependably the execution while occur within a given time frame.

Judging by that image, there is no issue with L3 cache at all.
That image is using a 256-byte stride which hides the full effect of the random-access issue until it exceeds about the cache size as Ryzen can predict the access pattern well enough.



You can see the (unsurprising) excellent sequential performance (which is only 2.8ns on my 2700X) ... and them the abysmal random access performance.

Intel's in-page random access performance is several times better. This is the cache performance issue that is hurting Ryzen - and it relates to how often the L3 prefetch ends up hitting the IMC instead of being able to stay within the L3. This happens increasingly more after a single core uses more than 4MiB of data. By 6MiB you have a ~50% miss rate that result in hitting memory latency.

My 2700X results are better - because my IMC latency is only 61.9ns with 3600MT/s memory.

If Zen 2 can bring that down by another 20ns while increasing how much L3 each core can access, it's going to be a big deal. My 2700X only has 9.5ns latency to the L3 - if I had 40ns latency to main memory and 16MiB of cache to access, in-page random access should fall to the 20~30ns region (depending on page size).

Intel is much better at utilizing their cache through a better front-end.
Zen's front end is extremely good. As is Intel's.

Zen's has higher throughput potential (8 uops vs 4 uops), but Intel has fusion - so that 4 uops is sometimes 7 uops... and Zen's 8 uops is sometimes 4 uops...
Intel's branch predictor seems to be better, but that's about it.

The first Zen bottleneck (if you want to call it that) is when Intel can dispatch 7 uops and Zen can only dispatch 6. Intel isn't always able to dispatch 7 uops, but Zen can never exceed 6. That's a potential 16.7% advantage to Intel.

The next Intel advantage is in their unified scheduler - which allows accessing results without going back to the scheduler. Zen, AFAICT, needs to send results back to the forwarding muxes or the PRF. This is only a couple cycles - and AMD makes up for it by having four ALU full featured pipelines and 6 independent schedulers. Being only 14 deep keeps things simple to manage, but it may mean results could need to be fetched from the L1D (3 cycle penalty) more often.

Data sharing between L3 caches, which was what we were talking about, is very rare. The lifecycle of any piece of data in cache is in microseconds or less. Cache is just a streaming-buffer for RAM, it's not like the "most important stuff" stays there. Cache is usually completely overwritten thousands of times per second.
If you mean between L3 caches to mean between each CCX or die - yes, that's true. Everything is always referenced as a memory address, the LLC acts as the insulator to main memory. However, cross communication does occur for certain volatile memory. This seems to happen via the command bus, but it also happens via the data fabric. This is probably magic that happens through the IMC without going to system memory, which would explain the latency results with core to core communication (simple test - fixed affinity, each core accessing the same memory addresses, just reading and updating a simple struct - time, accessing core, and a mutex... each core that gains the mutex records the time difference between the last access, which core made that access, updates the struct, and moves on). This showed that handling the mutex could, PEAK, take only 10ns between the CCXes (this could even be a timing mechanism inaccuracy, since this all reanalysis), but usually took way more... strong clusters at 20ns, 40~50ns, and a good half at 100ns or more (which means out to main memory).

Multi-threaded apps share data across cores, it's as simple as that, and mutexes and volatile memory are all something the CPU can figure out with ease, so optimizing for those, in very least, has been done.

Zen is however penalized when having to access a different memory controller through the Infinity Fabric, which of course is common in multithreaded workloads.
Yes, it will be extremely interesting to see how the newly unified IMC that's spread far across the IO die will work to solve some of these issues.
 
Last edited:
Joined
Oct 21, 2005
Messages
5,667 (1.10/day)
Location
USA
System Name Small ATX Desktop
Processor Intel i5 8600K @ 4.6 GHz Core & 4.3 GHz Uncore @ 1.168V
Motherboard Asrock Z370 Taichi
Cooling Phanteks PH-TC14PE, MonoPlus, Fans: 2xThermalright TY143 / 2xCorsair SP-120L / 2xYate Loon D14BH-12
Memory G-Skill TridentZ 2X8GB DDR4 3200 CL14 F4-3200C14D-16GTZ @ 3200 14-14-14-34 2T @ 1.35V
Video Card(s) Zotac 1060 6GB Mini ZT-P10600A-10L with Arctic MonoPlus and Yate Loon D14BH-12 Fan
Storage OS Samsung 970 Pro 512GB NVMe, Games Phison E12 NVMe 1TB, HDD WD Black 4TB WD4005FZBX, 2xWD10EZEX
Display(s) 2x Asus PB258Q 2560x1440 25" IPS
Case Lian Li PC A05NB (Inverted Mobo)
Audio Device(s) Audiotechnica ATH M50X, Antlion Mod Mic 4, SYBA SD-CM-UAUD, Acoustic Research 2Ch Speakers
Power Supply Seasonic SS-660XP2 660 Watt Platinum
Mouse Zowie EC2A Mouse, Razer Naga Chroma '14, Corsair MM600, Inateck 900x300 XL pad, Tiger Gaming Skates.
Keyboard Filco Majestouch II Ninja TKL, Goldtouch GTC 0033 Ten Key, PS3 Controller
Software Win7 Pro 64 (Installed on Coffee Lake using AsRock's handy PS/2 Simulator in Bios)
'The data in the footnote represented the performance improvement in a microbenchmark for a specific financial services workload which benefits from both integer and floating point performance improvements and is not intended to quantify the IPC increase a user should expect to see across a wide range of applications,' AMD's clarification continues. 'We will provide additional details on "Zen 2" IPC improvements, and more importantly how the combination of our next-generation architecture and advanced 7nm process technology deliver more performance per socket, when the products launch.'
https://bit-tech.net/news/tech/cpus/amd-downplays-29-percent-zen-2-ipc-boost-reports/1/
 
Joined
Jul 14, 2008
Messages
646 (0.15/day)
Location
Copenhagen, Denmark
System Name Ryzen/Laptop
Processor R9 3900X/i7 6700HQ
Motherboard AsRock X470 Taichi/Acer
Cooling Corsair H115i pro with 2 Noctua NF-A14 chromax/OEM
Memory G.Skill Trident Z 32GB @3200/16GB DDR4 2666 HyperX impact
Video Card(s) TUL Red Dragon Vega 56/Intel HD 530 - GTX 950m
Storage Samsung 970pro NVMe 512GB,Samsung 860evo 1TB, 3x4TB WD gold enterprise /Transcend 830s, 1TB Toshiba
Display(s) Philips FTV 32 inch + Dell 2407WFP-HC/OEM
Case Phanteks Enthoo Luxe/Acer Barebone
Audio Device(s) SoundBlasterX AE-5 (Dell A525)(HyperX Cloud Alpha)/mojo
Power Supply Seasonic focus+ 850 platinum (SSR-850PX)/165 Watt power brick
Mouse G502 Hero/M705 Marathon
Keyboard G19
Software Win10 pro 64bit
well.. based on their recent track record its quite plausible that with zen 2 they will reach and maybe even overtake (by a small margin) intel on the single core ipc, their only disadvantage/problem is basically clocks and that can be fixed relatively easily with a smaller node.
 
Joined
Jan 29, 2016
Messages
38 (0.03/day)
System Name GAMER_TR4-PC
Processor AMD Ryzen ThreadRipper 1900X 4.10Ghz (8 Core's/16 Thread's) Cooled with Enermax Liqtek 240 TR4 AIO
Motherboard Gigabyte X399 AORUS Gaming 7 (TR4)
Cooling Enermax TR4 AIO (CPU) 500+ TDP (240mm)
Memory G.Skill Flaire X 3200mhz DDR4 4X8GB (32GB) DDR4 @3600 Mhz CL16, (16,18,18,40) 1.38v (Quad-Channel)
Video Card(s) EVGA Nvidia Geforce GTX 1080 Ti SC2 HYBRID 11GB GDDR5x(dead, using old gtx 1080 8gb) fuck you jensen
Storage Sys,Samsung 960 Evo NVMe, Plextor M8Pe NVME & samsung 850Evo SSD(games)*2x1TB *WDBlack,&Seagate SSHD
Display(s) Dell 27"AMD FREESYNC 144hz LED LCD Montior Connected Via Display Port
Case Corsair Graphite Series 760T Full Tower Windowed/hinged side panels Case
Audio Device(s) Realtek HD ALC1220 codec / Onboard HD Audio*
Power Supply Corsair RM1000i usb 1000 Watt 80 Plus PSU
Mouse Logitech USB Wireless KB & MOUSE
Keyboard Logitech
Software Windows 10 Home x64bit
AMD's "59% higher" claims for Zen1 over Excavator invited the same ridicule.

Lisa Su is very careful about the guidance she puts out.


actually.... it was an IPC uplift of 52% from "excavator"

if this is another 29% MORE IPC Over zen 1, then Intel is done.....
 
Top