Sunday, November 12th 2017

AMD "Zen 2" IPC 29 Percent Higher than "Zen"

AMD reportedly put out IPC (instructions per clock) performance guidance for its upcoming "Zen 2" micro-architecture in a version of its Next Horizon investor meeting, and the numbers are staggering. The next-generation CPU architecture provides a massive 29 percent IPC uplift over the original "Zen" architecture. The stopgap "Zen+" architecture, which was not developed for the enterprise segment, brought about 3-5 percent IPC uplifts over "Zen" on the backs of faster on-die caches and improved Precision Boost algorithms. "Zen 2" is being developed for the 7 nm silicon fabrication process, and on the "Rome" MCM it powers the 8-core chiplets, which are reportedly not subdivided into two CCXs (i.e., 8 cores per CCX).

According to Expreview, AMD conducted a DKERN + RSA test of the integer and floating-point units to arrive at a performance index of 4.53, compared to 3.5 for first-generation "Zen", which is a 29.4 percent IPC uplift (loosely interchangeable with single-core performance). "Zen 2" goes a step beyond "Zen+," with its designers turning their attention to the components that contribute most toward IPC: the core's front-end and the number-crunching machinery, the FPU. The front-end of the "Zen" and "Zen+" cores is believed to be a refinement of previous-generation architectures such as "Excavator." "Zen 2" gets a brand-new front-end that is better optimized to distribute and collect workloads between the various on-die components of the core. The number-crunching machinery is bolstered by 256-bit FPUs, and generally wider execution pipelines and windows. Together, these yield the IPC uplift. "Zen 2" will get its first commercial outing with AMD's 2nd-generation EPYC "Rome" 64-core enterprise processors.
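As a quick sanity check on that figure, the uplift follows directly from the two performance indices (a back-of-the-envelope calculation, assuming the 4.53 and 3.5 indices are directly comparable):

```python
# Back-of-the-envelope check of the reported "Zen 2" vs. "Zen" IPC uplift,
# using the DKERN + RSA performance indices cited by Expreview.
zen_index = 3.50   # first-generation "Zen"
zen2_index = 4.53  # "Zen 2"

uplift_pct = (zen2_index / zen_index - 1) * 100
print(f"IPC uplift: {uplift_pct:.1f}%")  # -> IPC uplift: 29.4%
```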

Update Nov 14: AMD has issued the following statement regarding these claims.
As we demonstrated at our Next Horizon event last week, our next-generation AMD EPYC server processor based on the new 'Zen 2' core delivers significant performance improvements as a result of both architectural advances and 7nm process technology. Some news media interpreted a 'Zen 2' comment in the press release footnotes to be a specific IPC uplift claim. The data in the footnote represented the performance improvement in a microbenchmark for a specific financial services workload which benefits from both integer and floating point performance improvements and is not intended to quantify the IPC increase a user should expect to see across a wide range of applications. We will provide additional details on 'Zen 2' IPC improvements, and more importantly how the combination of our next-generation architecture and advanced 7nm process technology deliver more performance per socket, when the products launch.
Source: Expreview
Add your own comment

162 Comments on AMD "Zen 2" IPC 29 Percent Higher than "Zen"

#26
WikiFM
dj-electric said:
It is clear as day from the design of new EPYC. It includes 8 chiplets of 8 cores each next to the IO controller to complete 64 cores.
The chiplets themselves are quite small, and 2 of them could very possibly fit into a dual-chiplet AM4 CPU with 16 cores.


It is clear that the chiplets have 8 cores each, but 8 cores per CCX hasn't been confirmed yet.

R0H1T said:
It could still be 4 cores per CCX, from AT ~

The biggest downside from this being the insane number of IF links to make Rome o_O
Very pretty topology, where does it come from?

bug said:
You're right to point out that historically, numbers released in advance didn't do AMD any favors. However, in this case we already know there was work left to do, mainly around the memory controller. Some at AMD confirmed as much around the Zen launch. So we knew there was (at least theoretical) untapped potential in Zen. Of course, the proof is still in the pudding, but unlike Bulldozer and Excavator (which everyone knew were built on shaky ground), I believe AMD is at least worth the benefit of the doubt this time around. Plus, even if on average the improvement isn't 29% but 20%, it would still be enough to gain a solid lead on Intel.
Could 20% be enough to have a lead on Intel? I thought Zen was still way behind Intel in single threaded performance or IPC.

Aquinus said:
The biggest benefit of moving I/O off to a different die is that it makes the CCXs smaller (if you don't make them bigger), because all of that logic isn't in the CCX anymore and is instead located in the centralized I/O hub. Smaller dies mean better yields, and better yields mean an opportunity to add more cores.

Personally, my concern is with latency, but I'm not sure whether that's an unfounded issue or not. It's likely the case that it's more beneficial to move the I/O components. It's also possible that the I/O hub doesn't need to be done on the same process as the CCXs, which might further improve yields if the larger die is made on a more mature process.
So what gives better yields, then? Smaller dies at 7 nm, or a huge one at 14 nm? Yes, the I/O die is done on GloFo's 14 nm.
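A rough way to frame that question is a first-order yield model: the fraction of defect-free dies falls off exponentially with die area, so a mature node's lower defect density can offset a larger die. The sketch below uses the common Poisson yield model with invented defect densities (real 7 nm / 14 nm figures are not public):

```python
import math

def poisson_yield(die_area_mm2: float, defects_per_mm2: float) -> float:
    """First-order Poisson yield model: fraction of defect-free dies,
    exp(-area * defect_density)."""
    return math.exp(-die_area_mm2 * defects_per_mm2)

# Invented defect densities for illustration only.
chiplet_7nm = poisson_yield(die_area_mm2=75, defects_per_mm2=0.004)   # small die, newer node
io_die_14nm = poisson_yield(die_area_mm2=400, defects_per_mm2=0.001)  # large die, mature node

print(f"~75 mm^2 chiplet on 7 nm:   {chiplet_7nm:.1%}")
print(f"~400 mm^2 I/O die on 14 nm: {io_die_14nm:.1%}")
```

With these made-up numbers both parts land in a comfortable yield range, which is the point of splitting the design this way.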
Posted on Reply
#27
Vayra86
WikiFM said:
It is clear that the chiplets have 8 cores each, but 8 cores per CCX hasn't been confirmed yet.



Very pretty topology, where does it come from?



Could 20% be enough to have a lead on Intel? I thought Zen was still way behind Intel in single threaded performance or IPC.



So what gives better yields then? Smaller dies at 7nm or a huge one at 14nm? Yes the I/O die is done in GloFo's 14 nm.
15-20% is what they need to catch Intel clock-for-clock. Zen was way behind on *clocks*, not on IPC. But combine the two and you have a gap, yes. I do believe Zen 2 will comfortably close that gap, if it can clock to 4.5 ~ 4.6, Intel has nothing left to offer.
Posted on Reply
#28
WikiFM
Vayra86 said:
20% will put them on the level of Coffee Lake, give or take some insignificant workload specific gaps. Way behind on IPC? Not at all. Zen was way behind on *clocks*.
So is CFL clock-for-clock similar to Zen in IPC? Or did they clock much faster in addition to having higher IPC? Anyway, if Zen 2 can catch CFL, Intel should cancel Cannon Lake and launch Ice Lake next year to retain the leadership. Intel should have published some preliminary data about Ice Lake's IPC gains by now.
Posted on Reply
#29
Vayra86
WikiFM said:
So is CFL clock-for-clock similar to Zen in IPC? Or did they clock much faster in addition to having higher IPC? Anyway, if Zen 2 can catch CFL, Intel should cancel Cannon Lake and launch Ice Lake next year to retain the leadership. Intel should have published some preliminary data about Ice Lake's IPC gains by now.
Excuse my ninja edits.

CFL is ahead of Zen (1) and Zen 2 will probably close that gap, yes. Hopefully not just IPC but also clocks.

Intel should do a lot of things, but the reality is they have nothing on the table unless they can move to a smaller node.
Posted on Reply
#30
Octopuss
I don't care if it's only 10% above Zen+. I already considered buying the +, so this will only be better.
Posted on Reply
#31
Caqde
WikiFM said:

Could 20% be enough to have a lead on Intel? I thought Zen was still way behind Intel in single threaded performance or IPC.
They trade blows in the IPC department, with AMD 15% behind in the worst case and 8% ahead in the best. So depending on how things go with Zen 2, it is possible that, depending on the task, it will at least be level with Intel and in most cases be ahead in IPC. In the case of a 20% average IPC increase, clock for clock AMD would always be faster than any Coffee Lake chip out there. But if this 29% increase is true, then Intel has problems: even in the worst case, starting at 85% of the performance, a 29% boost means AMD is now ~9.7% faster clock for clock (20% would mean 2% faster).

For the source of this info ->
https://www.techspot.com/article/1616-4ghz-ryzen-2nd-gen-vs-core-8th-gen/
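Caqde's clock-for-clock arithmetic checks out; here is the same calculation spelled out, taking the TechSpot worst case (AMD at 85% of Intel's per-clock performance) as a given:

```python
# Worst-case clock-for-clock scenario from the post above:
# AMD starts at 85% of Intel's per-clock performance,
# then gains either 29% or 20% IPC with Zen 2.
intel = 1.00
amd_worst = 0.85

for gain in (0.29, 0.20):
    zen2 = amd_worst * (1 + gain)
    lead_pct = (zen2 / intel - 1) * 100
    print(f"{gain:.0%} IPC gain -> AMD at {lead_pct:+.2f}% vs. Intel per clock")
```

This reproduces the ~9.7% (strictly 9.65%) and 2% figures quoted above.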
Posted on Reply
#32
bug
WikiFM said:
Could 20% be enough to have a lead on Intel? I thought Zen was still way behind Intel in single threaded performance or IPC.
Neah, Zen's IPC is neck and neck with Intel's. Intel wins in single-thread performance because, having fewer cores, they can push higher frequencies. But since they can't push 20% higher frequencies, 20% better IPC (even while maintaining the same clocks) will be enough to push AMD ahead.

(And yes, I'm aware there are specific scenarios where the IPC gap can be noticeable, but I'm talking about the average usecase here).
Posted on Reply
#33
WikiFM
Vayra86 said:
Excuse my ninja edits.

CFL is ahead of Zen (1) and Zen 2 will probably close that gap, yes. Hopefully not just IPC but also clocks.

Intel should do a lot of things, but the reality is they have nothing on the table unless they can move to a smaller node.
Intel should have (re)designed the Ice Lake arch for 14+(++,+++) nm. It would be on the market by now, but they are so stubborn that the next arch won't come until 10 nm. With that in mind, will the next arch after Ice Lake come on 7 nm by 2025 :eek:?
Posted on Reply
#34
beautyless
I want an AMD 8-core that is as fast as the 9900K and priced at 350 USD.
Posted on Reply
#35
Gungar
windwhirl said:
I think I'll keep my hopes for IPC improvement at 10-15 percent. Nearly 30% improvement is a bit too much to ask, although if it happens, well, that'd be nice.
Don't worry, a 10-15 percent IPC increase is already a pipe dream. And I am not talking about application-specific performance-bump bullshit.
Posted on Reply
#36
qcmadness
A 29% IPC uplift claim is too much if the previous claim of "no significant bottleneck" in Zen is true.
Posted on Reply
#37
Fabio
It will be an achievement if AMD ends up on par with Intel, IPC-wise. x86 is a more than mature architecture; any improvement can only be a small one. Yes, improving latencies etc. can be important in some scenarios, but 29% more IPC is madness. Sure, Zen delivered +40%, but there we had Excavator as the reference...
Posted on Reply
#38
WikiFM
Caqde said:
They trade blows in the IPC department, with AMD 15% behind in the worst case and 8% ahead in the best. So depending on how things go with Zen 2, it is possible that, depending on the task, it will at least be level with Intel and in most cases be ahead in IPC. In the case of a 20% average IPC increase, clock for clock AMD would always be faster than any Coffee Lake chip out there. But if this 29% increase is true, then Intel has problems: even in the worst case, starting at 85% of the performance, a 29% boost means AMD is now ~9.7% faster clock for clock (20% would mean 2% faster).

For the source of this info ->
https://www.techspot.com/article/1616-4ghz-ryzen-2nd-gen-vs-core-8th-gen/
Just read the review - very nice, but my conclusions are different from yours: the only win for the Ryzen 2600X was the PCMark Gaming Score, hahaha, that's the 8%. The Ryzen 2600X is 5% slower on average in productivity and apps, and 12% slower in gaming, against the 8700K, both at 4 GHz.

bug said:
Neah, Zen's IPC is neck and neck with Intel's. Intel wins in single-thread performance because, having fewer cores, they can push higher frequencies. But since they can't push 20% higher frequencies, 20% better IPC (even while maintaining the same clocks) will be enough to push AMD ahead.

(And yes, I'm aware there are specific scenarios where the IPC gap can be noticeable, but I'm talking about the average usecase here).
Check that review: https://www.techspot.com/article/1616-4ghz-ryzen-2nd-gen-vs-core-8th-gen/page2.html - you should find that the Ryzen 2600X is still behind the Intel 8700K.
Posted on Reply
#39
BorgOvermind
Rome: a 2x FP performance increase per core, and an FP increase per socket. That is significant even if it doesn't fully translate into real-world benchmarks.

At one point, Intel was in the lead by two manufacturing steps.
Now Intel has nothing to answer this with, and is behind in every aspect except marketing dirty tricks (oh... 'deals').
Posted on Reply
#41
Valantar
R0H1T said:
It could still be 4 cores per CCX, from AT ~

The biggest downside from this being the insane number of IF links to make Rome o_O
While you're right that we don't know yet that the CCXes have grown to 8 cores (though IMO this seems likely given that every other Zen2 rumor has been spot on), that drawing is ... nonsense. First off, it proposes using IF to communicate between CCXes on the same die, which even Zen1 didn't do. The sketch directly contradicts what AMD said about their design, and doesn't at all account for the I/O die and its role in inter-chiplet communication. The layout sketched out there is incredibly complicated, and wouldn't even make sense for a theoretical Zen1-based 8-die layout. Remember, IF uses PCIe links, and even in Zen1 the PCIe links were common across two CCXes. The CCXes thus do not have separate IF links, but share a common connection (through the L3 cache, IIRC) to the PCIe/IF complex. Making these separate would be a giant step backwards in terms of design and efficiency. Remember, the uncore part of even a 2-die Threadripper consumes ~60W - and that's with two internal links, 64 lanes of PCIe and a quad-channel memory controller. The layout in the sketch above would likely consume >200W for IF alone.

Now, let's look at that sketch. In it, any given CCX is one hop away from 3-4 other CCXes, 2 hops from 3-5 CCXes, and 3 hops away from the remaining 7-10 CCXes. In comparison, with EPYC (non-Rome) and TR, all cores are 1 hop away from each other (though the inter-CCX hop is shorter/faster than the die-to-die IF hop). Even if this is "reduced latency IF" as they call it, that would be ridiculous. And again: what role does the I/O die play in this? The IF layout in that sketch makes no use of it whatsoever, other than linking the memory controller and PCIe lanes to eight seemingly random CCXes. This would make NUMA management an impossible flustercuck on the software side, and substrate manufacturing impossible on the hardware side (seriously, there are six IF links in between each chiplet there! The chiplets are <100mm2! This is a PCB, not an interposer! You can't get that kind of trace density in a PCB). Then there's the issue of this design requiring each CCX to have 4 IF links, while 1/4 of the CCXes only get to use 3 links, wasting die area.

On the other hand, let's look at the layout that makes sense logically, hardware- and software-wise, and adds up with what AMD has said about EPYC: each chiplet has a single IF interface that connects to the I/O die. Only that, nothing more. The I/O die has a ring bus or similar interconnect that encompasses the 8 necessary IF links for the chiplets, an additional 8 for PCIe/external IF, and the memory controllers. This reduces the number of IF links running through the substrate from 30 in your sketch (6 per chiplet pair + 6 between them) to 8. It is blatantly obvious that the I/O die has been made specifically to make this possible. This would make every single core 1 hop (through the I/O die, but ultimately still 1 hop) away from any other core, while cutting the number of IF links to roughly a quarter. Why else would they design that massive die?
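For what it's worth, the link counts in the two layouts are easy to tally (following the numbers in the post above; the hub count assumes exactly one IF link per chiplet):

```python
# Substrate IF link tally for the two proposed "Rome" layouts.
chiplets = 8

# Hub layout: one IF link per chiplet, all terminating at the I/O die.
hub_links = chiplets * 1

# The AT-style sketch: "6 per chiplet pair + 6 between them" (4 pairs of chiplets).
sketch_links = 4 * 6 + 6

print(f"hub layout:  {hub_links} links")
print(f"mesh sketch: {sketch_links} links")
print(f"the hub needs {hub_links / sketch_links:.0%} of the sketch's links")
```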

(The red lines in my drawing above show this layout.) The I/O die handles low-latency shuffling of data between IF links, while also giving each chiplet "direct" access to DRAM and PCIe, all over the same single connection per chiplet. The I/O die is (at least at this time) a black box, so we don't know whether it uses some sort of ring bus, mesh topology, or large L4 cache (or some other solution) to connect these various components. But we do know that a layout like this is the only one that would actually work. (And yes, I know that my lines don't add up in terms of where the IF link is physically located on the chiplets. This is an illustration, not a technical drawing.)




More on-topic, we need to remember that IPC is workload dependent. There might be a 29% increase in IPC in certain workloads, but generally, when we talk about IPC it is average IPC across a wide selection of workloads. This also applies when running test suites like SPEC or GeekBench, as they run a wide variety of tests stressing various parts of the core. What AMD has "presented" (it was in a footnote, it's not like they're using this for marketing) is from two specific workloads. This means that a) this can very likely be true, particularly if the workloads are FP-heavy, and b) this is very likely not representative of total average IPC across most end-user-relevant test suites. In other words, this can be both true (in the specific scenarios in question) and misleading (if read as "average IPC over a broad range of workloads").
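To illustrate that last point with invented numbers: a couple of FP-heavy microbenchmarks can show a ~30% gain while the averaged figure across a broader suite lands much lower. The uplift ratios below are hypothetical, chosen only to show how the aggregate is usually computed (SPEC-style suites use a geometric mean):

```python
from statistics import geometric_mean

# Hypothetical per-workload IPC uplift ratios (made up for illustration):
# two FP-heavy kernels with big gains, five mixed workloads with smaller ones.
uplifts = [1.29, 1.31, 1.12, 1.10, 1.08, 1.15, 1.09]

suite_avg = geometric_mean(uplifts)
print(f"best single workload: +{max(uplifts) - 1:.0%}")
print(f"suite geometric mean: +{suite_avg - 1:.0%}")
```

Both figures are "true" at the same time, which is exactly the ambiguity AMD's footnote created.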
Posted on Reply
#42
btarunr
Editor & Senior Moderator
dj-electric said:
The chiplets themselves are quite small, and 2 of them could very possibly fit into a dual-chiplet AM4 CPU with 16 cores.
There are two ways AMD could build a 16-core AM4 processor:
  • Two 8-core chiplets with a smaller I/O die that has 2-channel memory, 32-lane PCIe gen 4.0 (with external redrivers), and the same I/O as current AM4 dies such as ZP or PiR.
  • A monolithic die with two 8-core CCXs, and a fully integrated chipset like ZP or PiR. Such a die wouldn't be any bigger than today's PiR.
I think option two is more feasible for low-margin AM4 products.
Posted on Reply
#43
bug
btarunr said:
There are two ways AMD could build a 16-core AM4 processor:
  • Two 8-core chiplets with a smaller I/O die that has 2-channel memory, 32-lane PCIe gen 4.0 (with external redrivers), and the same I/O as current AM4 dies such as ZP or PiR.
  • A monolithic die with two 8-core CCXs, and a fully integrated chipset like ZP or PiR. Such a die wouldn't be any bigger than today's PiR.
I think option two is more feasible for low-margin AM4 products.
At the same time, for low-margin parts, 8 cores are more than enough ;)
But let's wait and see.
Posted on Reply
#44
btarunr
Editor & Senior Moderator
bug said:
At the same time, for low-margin parts, 8 cores are more than enough ;)
But let's wait and see.
AMD wants to moar-koar the sh** out of Intel's R&D budget, so Intel has to spend its money on moar-koaring to keep up, because the software ecosystem is finally waking up to moar-koar. At the same time, AMD is mindful that when Intel gets its 10 nm off the ground, it will introduce its first major IPC uplifts since 2015, or perhaps even since Nehalem. So AMD needs double-digit percentage IPC increments in addition to 100% core-count increases across the board, while keeping the energy-efficiency edge from 7 nm.

It's somewhat like the USA-PRC military equation. For every dollar that China spends on developing a new military technology, the US probably spends $5 to keep its edge (thanks to lubricating K Street, the Hill, the MIC, higher costs, etc.).
Posted on Reply
#45
Smartcom5
bug said:
Neah, Zen's IPC is neck and neck with Intel's. Intel wins in single-thread performance because, having fewer cores, they can push higher frequencies. But since they can't push 20% higher frequencies, 20% better IPC (even while maintaining the same clocks) will be enough to push AMD ahead.

(And yes, I'm aware there are specific scenarios where the IPC gap can be noticeable, but I'm talking about the average usecase here).
Excuse me, sir, but you misspelled IPS! When will people finally learn the difference, ffs?!

There's the IPC, and then there's IPS.
IPC or I/c → Instructions per (Clock-) Cycle
IPS or I/s → Instructions per Second

The latter, IPS, is often used synonymously with actual single-thread performance – where AMD no longer lags behind Intel, and surely not to the extent it did back when Bulldozer was the pinnacle of the ridge.

Rule of thumb:
IPC does not scale with frequency but is rather fixed (within margins; it depends on context and the kind of [code] instructions¹, you get the idea).
IPS is the IPC put in relation to time – pretty much the formula → [ICODE]IPC×f[/ICODE] (with f the clock frequency), simply put.
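Put numerically (a toy example with made-up figures), the distinction is why a clock-speed advantage can mask an IPC deficit in single-thread benchmarks:

```python
# IPS = IPC * clock frequency; toy numbers for illustration only.
def ips_billion(ipc: float, clock_ghz: float) -> float:
    """Instructions per second, in billions: IPC times cycles per second."""
    return ipc * clock_ghz

chip_a = ips_billion(ipc=1.00, clock_ghz=4.3)  # lower IPC, higher clocks
chip_b = ips_billion(ipc=1.05, clock_ghz=4.0)  # higher IPC, lower clocks

# Despite the IPC deficit, chip A comes out ahead on single-thread throughput.
print(f"chip A: {chip_a:.2f} Ginstr/s, chip B: {chip_b:.2f} Ginstr/s")
```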

So your definition of IPC quoted above would rather be called "Instructions per Clock at the Wall", or IPC@W.
So please, stop using the right terms and definitions in the wrong contexts, learn the difference between those two, and get your shit together please!

¹ The value of IPC is (depending on kind) absolute² and fixed, yes.
However, it crucially depends on the type and kind of instructions, and can vary rather starkly between different kinds of instructions – since, per definition, the IPC figure only reflects how many instructions can be processed on average per (clock) cycle.

On synthetic code – instructions with low logical depth and algorithmic complexity, which can be processed rather quickly – the resulting value is obviously pretty high, whereas on instructions of rather high complexity and length, the IPC value can only reach rather low figures. The contrary can even be the case, so that it takes more than one, or even a multitude of, cycles to process a single complex instruction. In that case we're speaking of the reciprocal, thus the inverse value…
…which is also standardised as (clock) cycles per instruction, C/I, or CPI for short.
² In the sense of non-varying, as opposed to relative.

Read:
Wikipedia • Instructions per cycle
Wikipedia • Instructions per second
Wikipedia • Cycles per instruction



Smartcom
Posted on Reply
#46
intelzen
btarunr said:
when Intel gets its 10 nm off the ground, it will introduce its first major IPC uplifts since 2015, or perhaps even since Nehalem
Since Sandy Bridge – and that was 2011 – there have been no more than 5% IPC gains from Intel, and in the last two "generations" 0% IPC gains... let's hope it arrives in early 2020.
Posted on Reply
#47
Valantar
btarunr said:
AMD wants to moar-koar the sh** out of Intel's R&D budget, so Intel has to spend its money on moar-koaring to keep up, because the software ecosystem is finally waking up to moar-koar. At the same time, AMD is mindful that when Intel gets its 10 nm off the ground, it will introduce its first major IPC uplifts since 2015, or perhaps even since Nehalem. So AMD needs double-digit percentage IPC increments in addition to 100% core-count increases across the board, while keeping the energy-efficiency edge from 7 nm.

It's somewhat like the USA-PRC military equation. For every dollar that China spends on developing a new military technology, the US probably spends $5 to keep its edge (thanks to lubricating K Street, the Hill, the MIC, higher costs, etc.).
While you have a point, wouldn't that also mean using partially disabled 16-core dice even for 8-core and smaller chips (including the low end), given that this would then be the only die with the required I/O? That sounds too inflexible for the wide range of SKUs this market needs. Even if they push high-end MSDT to 16 cores, the majority of sales volume will be in the 4-6 core range (unless these chips are crazy cheap), with 8 cores likely being the enthusiast sweet spot. That would require a lot of partially disabled silicon. As such, doesn't it sound more likely that they keep the chiplets across the range (possibly excluding mobile)? This might be slightly more expensive in assembly, but on the other hand, disabling 50% or more of your die for 80-90% of your sales doesn't exactly make economic sense either. I'd bet the former would be cheaper than the latter, as you'd get more than 2x the usable dice out of a wafer this way.
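The wafer economics behind that argument can be roughed out. Everything below is invented for illustration (die sizes, yields, and edge losses are not real figures), but it shows why two small high-yield chiplets can beat one large die:

```python
import math

# Back-of-the-envelope: 16-core CPUs per 300 mm wafer,
# monolithic die vs. two chiplets. All figures are invented.
WAFER_AREA_MM2 = math.pi * (300 / 2) ** 2  # ignores edge loss for simplicity

def good_dies_per_wafer(die_mm2: float, yield_frac: float) -> int:
    """Crude good-die count: wafer area over die area, scaled by yield."""
    return int(WAFER_AREA_MM2 / die_mm2 * yield_frac)

mono_cpus = good_dies_per_wafer(die_mm2=150, yield_frac=0.60)         # 1 die per CPU
chiplet_cpus = good_dies_per_wafer(die_mm2=75, yield_frac=0.80) // 2  # 2 dies per CPU

print(f"monolithic: {mono_cpus} CPUs/wafer, chiplets: {chiplet_cpus} CPUs/wafer")
```

Under these assumptions the chiplet route yields noticeably more sellable 16-core parts per wafer, even after pairing dies two at a time.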
Posted on Reply
#48
Gasaraki
Prima.Vera said:
Bulldozer, Excavator, ... no thank you. No more hyping until the community benches are out. :rolleyes:
Remember when Ryzen first came out? That shit was hyped through the roof.

TheGuruStud said:
So 15% real world seems very doable. Oh, intel, luz. Better luck next time with your 15% in 8 yrs lol
So HOW long did AMD take to get "here" (Zen+)? They are still not ahead. We shall see with Zen 2.
Posted on Reply
#49
Vayra86
WikiFM said:
Intel should have (re)designed the Ice Lake arch for 14+(++,+++) nm. It would be on the market by now, but they are so stubborn that the next arch won't come until 10 nm. With that in mind, will the next arch after Ice Lake come on 7 nm by 2025 :eek:?
Should have... but would they be able to? A new node enables a new design, I think, and the compromises needed to do it on 14 nm would kill the advantage anyway. 14 nm is clearly pushed to the limit, and even beyond it for some parts if you look at their stock temps (9th gen, hi).

Smartcom5 said:
Excuse me sir, but you misspelled IPS! When people will finally learn the difference ffs?!


Eh... IPS in my mind is In Plane Switching for displays.

He spelled it fine, you didn't read it right.
Posted on Reply
#50
bug
Smartcom5 said:
Excuse me, sir, but you misspelled IPS! When will people finally learn the difference, ffs?!

There's the IPC, and then there's IPS.
IPC or I/c → Instructions per (Clock-) Cycle
IPS or I/s → Instructions per Second

The latter, IPS, is often used synonymously with actual single-thread performance – where AMD no longer lags behind Intel, and surely not to the extent it did back when Bulldozer was the pinnacle of the ridge.

Rule of thumb:
IPC does not scale with frequency but is rather fixed (within margins; it depends on context and the kind of [code] instructions¹, you get the idea).
IPS is the IPC put in relation to time – pretty much the formula → [ICODE]IPC×f[/ICODE] (with f the clock frequency), simply put.

So your definition of IPC quoted above would rather be called "Instructions per Clock at the Wall", or IPC@W.
So please, stop using the right terms and definitions in the wrong contexts, learn the difference between those two, and get your shit together please!

¹ The value of IPC is (depending on kind) absolute² and fixed, yes.
However, it crucially depends on the type and kind of instructions, and can vary rather starkly between different kinds of instructions – since, per definition, the IPC figure only reflects how many instructions can be processed on average per (clock) cycle.

On synthetic code – instructions with low logical depth and algorithmic complexity, which can be processed rather quickly – the resulting value is obviously pretty high, whereas on instructions of rather high complexity and length, the IPC value can only reach rather low figures. The contrary can even be the case, so that it takes more than one, or even a multitude of, cycles to process a single complex instruction. In that case we're speaking of the reciprocal, thus the inverse value…
…which is also standardised as (clock) cycles per instruction, C/I, or CPI for short.
² In the sense of non-varying, as opposed to relative.

Read:
Wikipedia • Instructions per cycle
Wikipedia • Instructions per second
Wikipedia • Cycles per instruction



Smartcom
No, I meant just what I said/wrote ;)
Posted on Reply
Add your own comment