Tuesday, September 24th 2019

AMD Could Release Next Generation EPYC CPUs with Four-Way SMT

AMD has completed the design phase of its "Zen 3" architecture, and rumors about its details are already appearing. This time, Hardwareluxx reports that AMD could bake four-way simultaneous multithreading (SMT) into its Zen 3 core to boost the parallel processing power of its data center CPUs. Expected to arrive sometime in 2020, Zen 3 server CPUs, codenamed "Milan", are expected to bring many architectural improvements and to use TSMC's 7 nm+ extreme ultraviolet (EUV) lithography, which brings as much as a 20% increase in transistor density.

Perhaps the biggest change we could see is the addition of four-way SMT, which would let each core present four hardware threads to software, improving parallel throughput and letting data center users run more virtual machines than ever before. In theory, four-way SMT boosts performance by letting up to four independent threads share a core's execution resources, so that cycles one thread would leave idle are filled by the others. That is only one application of four-way SMT, and we can expect AMD to leverage the feature in whatever way is most practical and delivers the best possible performance.
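How "threads per core" shows up to software is easy to check today. The following minimal C++ sketch (our own illustration, not AMD code) reads Linux's sysfs topology files and reports how many hardware threads share each physical core; current Zen 2 parts report two per core, while a four-way SMT design would report four.

// Illustrative only: counts hardware threads per physical core via Linux sysfs.
#include <cctype>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <map>
#include <string>

int main() {
    namespace fs = std::filesystem;
    std::map<std::string, int> threads_per_core;  // "package:core" -> hardware thread count

    for (const auto& entry : fs::directory_iterator("/sys/devices/system/cpu")) {
        const std::string name = entry.path().filename().string();
        // Only cpu0, cpu1, ... directories; skip cpufreq, cpuidle, etc.
        if (name.rfind("cpu", 0) != 0 || name.size() < 4 ||
            !std::isdigit(static_cast<unsigned char>(name[3])))
            continue;

        std::ifstream pkg(entry.path() / "topology" / "physical_package_id");
        std::ifstream core(entry.path() / "topology" / "core_id");
        std::string pkg_id, core_id;
        if (pkg >> pkg_id && core >> core_id)
            ++threads_per_core[pkg_id + ":" + core_id];
    }

    for (const auto& [core, threads] : threads_per_core)
        std::cout << "core " << core << " -> " << threads << " hardware thread(s)\n";
}

The same topology data is what an operating system's scheduler consults to tell SMT siblings apart from full cores.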
AMD isn't the first to implement this kind of solution in its processors. IBM has been making POWER-ISA CPUs with four-way or even eight-way SMT for years, and that capability is one of the key reasons why POWER CPUs are so powerful. In any case, we can hope to see more details about Zen 3 core design decisions as we approach 2020 and the launch of Milan CPUs. Source: Hardwareluxx

159 Comments on AMD Could Release Next Generation EPYC CPUs with Four-Way SMT

#126
efikkan
theoneandonlymrk, post: 4122562, member: 82332
You know, I half agree, but I disagree with your initial standpoint. This ideal software you speak of, do you have an example? Because I would be surprised. CPUs are not fixed function, they have built-in coprocessors and on-die coprocessors, and modern code is also broken up into micro-ops, so I can't imagine that's an easy bit of code to know how to write, never mind actually write.
All modern x86 microarchitectures convert the x86 ISA into "RISC-like" micro-operations. These are not only Intel- or AMD-specific, but microarchitecture-specific or even specific down to the die configuration. Exposing these to write targeted code is not feasible. So it will be up to the CPU front-end to convert the x86 machine code into the native micro-operations, assigning registers, etc.
#127
theoneandonlymrk
efikkan, post: 4122569, member: 150226
All modern x86 microarchitectures convert the x86 ISA into "RISC-like" micro-operations. These are not only Intel- or AMD-specific, but microarchitecture-specific or even specific down to the die configuration. Exposing these to write targeted code is not feasible. So it will be up to the CPU front-end to convert the x86 machine code into the native micro-operations, assigning registers, etc.
Yes, exactly where the optimisation is done.
And exactly the part that makes your point moot.
With a resource, any resource, it only gets used as much as it's regulated to at any moment. No code uses all of a CPU's possible circuit-based compute power; if that were allowed to happen, modern cores would not last long or be efficient.
Hence why power viruses are a thing and few of them really max out a CPU's full spectrum of processing abilities.

My point was and is that resource use is paramount. No code is perfect, none, yet Intel and AMD have to work their imperfect silicon into optimally running all sorts of code for many uses.
#128
efikkan
theoneandonlymrk, post: 4122570, member: 82332
Yes, exactly where the optimisation is done.
And exactly the part that makes your point moot.
With a resource, any resource, it only gets used as much as it's regulated to at any moment. No code uses all of a CPU's possible circuit-based compute power; if that were allowed to happen, modern cores would not last long or be efficient.
Hence why power viruses are a thing and few of them really max out a CPU's full spectrum of processing abilities.
Do you mean my point about the irrelevance of SMT?
Well, 100% of it can never be utilized, due to power gating and resources sharing execution ports.
But SMT is mostly about using idle cycles caused by cache misses and branch mispredictions, which leave part of the core, or the entire core, idle.

theoneandonlymrk, post: 4122570, member: 82332
My point was and is that resource use is paramount. No code is perfect, none, yet Intel and AMD have to work their imperfect silicon into optimally running all sorts of code for many uses.
"Optimal" code is about implementing an algorithm solving a particular task in the most efficient way, not about utilizing every possible CPU resource 100% every clock cycle.
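If it helps, here is a toy C++ sketch of the kind of load where SMT actually pays off (purely my own illustration, nothing vendor-specific): a dependent pointer chase stalls on cache misses most of the time, leaving execution resources idle that a second SMT thread on the same core could soak up.

// Toy latency-bound workload: every load depends on the previous one, so the
// core spends most cycles waiting on cache misses -- the stalls SMT can hide.
// Build: g++ -O2 -std=c++17 -pthread chase.cpp ; run with a thread count, e.g. ./a.out 2
#include <chrono>
#include <cstddef>
#include <iostream>
#include <numeric>
#include <random>
#include <string>
#include <thread>
#include <utility>
#include <vector>

std::size_t chase(const std::vector<std::size_t>& next, std::size_t steps) {
    std::size_t i = 0;
    for (std::size_t s = 0; s < steps; ++s) i = next[i];  // serial dependency chain
    return i;  // returned so the compiler can't drop the loop
}

int main(int argc, char** argv) {
    const std::size_t n = std::size_t{1} << 24;     // 16M entries (~128 MiB), far beyond cache
    const std::size_t steps = std::size_t{1} << 25;
    const unsigned threads = argc > 1 ? std::stoul(argv[1]) : 1;

    // Sattolo's algorithm: one big random cycle, so the chase never settles
    // into a small, cache-resident loop.
    std::vector<std::size_t> next(n);
    std::iota(next.begin(), next.end(), std::size_t{0});
    std::mt19937_64 rng{42};
    for (std::size_t i = n - 1; i > 0; --i)
        std::swap(next[i], next[std::uniform_int_distribution<std::size_t>(0, i - 1)(rng)]);

    const auto t0 = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t)
        pool.emplace_back([&] { volatile std::size_t sink = chase(next, steps); (void)sink; });
    for (auto& th : pool) th.join();
    const auto t1 = std::chrono::steady_clock::now();

    std::cout << threads << " thread(s) took "
              << std::chrono::duration<double>(t1 - t0).count() << " s\n";
}

Run it with one thread, then with two threads placed on SMT siblings of the same core; the two-thread run typically finishes in well under twice the single-thread time, whereas a dense, compute-bound kernel would show no such gain.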
#129
theoneandonlymrk
efikkan, post: 4122574, member: 150226
Do you mean my point about the irrelevance of SMT?
Well, 100% of it can never be utilized, due to power gating and resources sharing execution ports.
But SMT is mostly about using idle cycles caused by cache misses and branch mispredictions, which leave part of the core, or the entire core, idle.


"Optimal" code is about implementing an algorithm solving a particular task in the most efficient way, not about utilizing every possible CPU resource 100% every clock cycle.
But no modern PC is made to run, or actually runs, one piece of code like that, besides supercomputers. A modern PC has many processes on the fly with multiple threads each, over a thousand on a typical PC, and that's where SMT and HTT make their money, in optimizing core use.
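As a rough sanity check of that "over a thousand" figure, here is a small Linux-only C++ sketch (just an illustration) that walks /proc and adds up the Threads: field of every process:

// Counts running processes and their threads by reading /proc/<pid>/status.
// Linux-specific; meant only to show how many schedulable threads a typical
// system carries even at idle.
#include <cctype>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <string>

int main() {
    namespace fs = std::filesystem;
    unsigned processes = 0, threads = 0;

    for (const auto& entry : fs::directory_iterator("/proc")) {
        const std::string name = entry.path().filename().string();
        if (name.empty() || !std::isdigit(static_cast<unsigned char>(name[0]))) continue;

        std::ifstream status(entry.path() / "status");
        std::string line;
        while (std::getline(status, line)) {
            if (line.rfind("Threads:", 0) == 0) {
                ++processes;
                threads += std::stoul(line.substr(8));
                break;
            }
        }
    }
    std::cout << processes << " processes, " << threads << " threads\n";
}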
#130
Lionheart
Camm, post: 4122059, member: 110377
You answered your own question. They aren't in the desktop consumer scene as they don't make desktop consumer products.
Not really, I asked what they do... :wtf:
#131
londiste
theoneandonlymrk, post: 4122562, member: 82332
You know, I half agree, but I disagree with your initial standpoint. This ideal software you speak of, do you have an example? Because I would be surprised. CPUs are not fixed function, they have built-in coprocessors and on-die coprocessors, and modern code is also broken up into micro-ops, so I can't imagine that's an easy bit of code to know how to write, never mind actually write.
Linpack is known to perform the same or even worse with SMT. It is far from a perfect load, but it is good enough to negate the potential improvement from SMT.
#132
ratirt
londiste, post: 4122695, member: 169790
Linpack is known to perform the same or even worse with SMT. It is far from a perfect load, but it is good enough to negate the potential improvement from SMT.
Is it using SMT, and is it optimized? If an application can't use more threads and cores, then of course it will work less efficiently and won't scale with SMT.
#133
londiste
ratirt, post: 4122696, member: 165024
Is it using SMT, and is it optimized? If an application can't use more threads and cores, then of course it will work less efficiently and won't scale with SMT.
It is optimized. The problem is not with threads. One thread of Linpack running on one core is the same speed as, or faster than, two threads running on the same core with SMT enabled.

The idea of SMT is that this is done in hardware; you do not optimize for it, and there are not many generic ways of doing that. The main optimization on the software side is awareness at the operating-system level (the scheduler) of which cores are physical and which are logical. Threads are ideally assigned to physical cores first, then to logical ones, for best results.
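To show what "physical cores first" looks like when done by hand rather than by the scheduler, here is a Linux-only C++ sketch (my own illustration, using sysfs plus the GNU pthread affinity extension, not code from any real scheduler):

// Spawns one worker per physical core and pins it there ("physical cores
// first"), the placement a good scheduler aims for before doubling up on SMT
// siblings. Build: g++ -std=c++17 -pthread pin.cpp
#include <pthread.h>   // pthread_setaffinity_np (GNU extension)
#include <sched.h>     // cpu_set_t, CPU_ZERO, CPU_SET
#include <filesystem>
#include <fstream>
#include <iostream>
#include <set>
#include <string>
#include <thread>
#include <vector>

// Lowest-numbered logical CPU of each physical core, read from sysfs.
std::vector<int> first_thread_of_each_core() {
    namespace fs = std::filesystem;
    std::set<std::string> seen;   // "package:core" pairs already claimed
    std::vector<int> cpus;
    for (int cpu = 0; fs::exists("/sys/devices/system/cpu/cpu" + std::to_string(cpu)); ++cpu) {
        const std::string base =
            "/sys/devices/system/cpu/cpu" + std::to_string(cpu) + "/topology/";
        std::string pkg, core;
        if (!(std::ifstream(base + "physical_package_id") >> pkg) ||
            !(std::ifstream(base + "core_id") >> core))
            continue;  // CPU offline or topology not exposed
        if (seen.insert(pkg + ":" + core).second) cpus.push_back(cpu);
    }
    return cpus;
}

int main() {
    std::vector<std::thread> pool;
    for (int cpu : first_thread_of_each_core()) {
        pool.emplace_back([] { /* real work would go here */ });
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        // Bind the thread we just created to one logical CPU of a distinct core.
        pthread_setaffinity_np(pool.back().native_handle(), sizeof(set), &set);
    }
    for (auto& t : pool) t.join();
    std::cout << "spawned " << pool.size() << " threads, one per physical core\n";
}

std::thread has no portable affinity API, hence the drop to the native pthread handle; on Windows the rough equivalent would be SetThreadAffinityMask.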
#134
ratirt
londiste, post: 4122714, member: 169790
It is optimized. The problem is not with threads. One thread of Linpack running on one core is the same speed as, or faster than, two threads running on the same core with SMT enabled.

The idea of SMT is that this is done in hardware; you do not optimize for it, and there are not many generic ways of doing that. The main optimization on the software side is awareness at the operating-system level (the scheduler) of which cores are physical and which are logical. Threads are ideally assigned to physical cores first, then to logical ones, for best results.
What I know is that Linpack for the AMD 3000 series (for instance, and for other Ryzen processors) uses OpenMP, which is by no means optimized for AMD. An optimized compiler and libraries with full support for the new Ryzen architecture are also required. So there is still a lot to improve, and I'm not talking about the hardware now.
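For what that software-side tuning looks like in practice, here is a tiny OpenMP sketch in C++ (my own stand-in for the dense linear algebra Linpack actually runs; the binding variables are standard OpenMP, nothing AMD-specific):

// A Linpack-flavoured kernel (dense matrix-vector product) parallelised with
// OpenMP. Run with OMP_PLACES=cores OMP_PROC_BIND=close to keep one thread
// per physical core, which is usually the right choice for compute-dense code.
// Build: g++ -O3 -fopenmp gemv.cpp
#include <omp.h>
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
    const int n = 4096;
    std::vector<double> A(static_cast<std::size_t>(n) * n, 1.0), x(n, 1.0), y(n, 0.0);

    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i) {
        double sum = 0.0;
        for (int j = 0; j < n; ++j)
            sum += A[static_cast<std::size_t>(i) * n + j] * x[j];
        y[i] = sum;
    }

    std::cout << "ran with " << omp_get_max_threads()
              << " OpenMP threads, y[0] = " << y[0] << "\n";
}

Launching it with OMP_PLACES=cores OMP_PROC_BIND=close keeps one thread per physical core, while OMP_PLACES=threads hands it every logical CPU, which for a compute-dense kernel like this usually gains little, much like the Linpack behaviour described above.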
#135
R-T-B
thesmokingman, post: 4121924, member: 91203
Failed products don't demonstrate mastery...
Oh. My god.

No one was even trying to...

that wasn't even the...

why did I just read? This is worse than the bulldozer core "debate"

Screw it, you are all frog-god food now. So has decreed the giant green one. Blessed be his slime. I wash my hands of this.
#136
BorgOvermind
Quad SMT would help a lot in the server market and help AMD get some ground back, hopefully re-engaging large partners like Dell.

As for the naming, I would have found something more worthy-sounding than those city names.
#137
Valantar
BorgOvermind, post: 4122825, member: 89504
Quad SMT would help a lot in the server market and help AMD get some ground back, hopefully re-engaging large partners like Dell.

As for the naming, I would have found something more worthy-sounding than those city names.
They're internal code names, not marketing names. What do they matter? Whether it's Rome, Milan, Turin or... Cinque Terre? or whatever - they're all EPYC + a 4-digit identifier when they go on sale. The generation is indicated within those four digits, so the code names are never officially used for marketing purposes. That enthusiasts adopt them as shorthand is our problem, not AMD's.
#138
Super XP
This is very interesting indeed. Scheduling this monstrous Zen 3 will have to be PERFECT. Personally, I don't see this coming to desktop CPUs, because the Windows OS would have a nightmare scheduling it. But you never know. AMD is all about innovation and firsts. Can't wait for more official details from AMD to come out.
#139
thesmokingman
Speaking of IBM, they are in hot water for age discrimination. They wholly deny it, but gdamn, they are full of it. Everyone knows once you get old, they cut you. Old techs cost more than young techs.
R-T-B, post: 4122741, member: 41983
Oh. My god.

No one was even trying to...

that wasn't even the...

why did I just read? This is worse than the bulldozer core "debate"

Screw it, you are all frog-god food now. So has decreed the giant green one. Blessed be his slime. I wash my hands of this.
Haha, get a grip, man. Whoever the initial dolt was who started this, he created this context by trolling and stating that AMD is only catching up to Intel with 4-way SMT. Maybe you should read the earlier posts. My point is that being first at something you've done ludicrously badly at isn't something to brag about, in context. It's kinda ironic... considering the world runs on AMD64 and not the crap Intel had.
#140
GoldenX
This is great and all, but AMD, could we get some more love on OpenGL and Vulkan (yes, your own API), please?
#141
Camm
GoldenX, post: 4123227, member: 160319
This is great and all, but AMD, could we get some more love on OpenGL and Vulkan (yes, your own API), please?
OpenGL under Windows has always been a problem, I get that (somehow it's fine under Linux), but what issue do you have with Vulkan?
#142
R-T-B
thesmokingman, post: 4123226, member: 91203
Haha, get a grip, man. Whoever the initial dolt was who started this, he created this context by trolling and stating that AMD is only catching up to Intel with 4-way SMT. Maybe you should read the earlier posts.
You shouldn't take troll posts so seriously, dude. My grip is fine. I followed the context fine. No one else seemed to, and the whole thing left me feeling mentally ill.

The dam just broke on your post, it wasn't just you. Doesn't matter though, the toad is always hungry.
#143
GoldenX
Camm, post: 4123265, member: 110377
OpenGL under Windows has always been a problem, I get that (somehow it's fine under Linux), but what issue do you have with Vulkan?
It's falling behind Intel and Nvidia.
The Linux driver is a lot better for OpenGL, but it's also very unstable.
#144
Camm
GoldenX, post: 4123277, member: 160319
It's falling behind Intel and Nvidia.
The Linux driver is a lot better for OpenGL, but it's also very unstable.
How so? Feature parity is generally fine, and AMD is still generally more performant under Vulkan than Nvidia. I don't mean to be harsh, but it feels like a very odd whinge. As for Linux OpenGL stability, AMDGPU has been much better and more stable than Nouveau or Nvidia's binary driver.
#145
GoldenX
Camm, post: 4123283, member: 110377
How so? Feature parity is generally fine, and AMD is still generally more performant under Vulkan than Nvidia. I don't mean to be harsh, but it feels like a very odd whinge. As for Linux OpenGL stability, AMDGPU has been much better and more stable than Nouveau or Nvidia's binary driver.
I'm a tester for the yuzu emulator, so I use it on my 270X. The OpenGL driver on Windows is stable, but so slow that an Intel IGP is faster than a Navi 5700 XT. The Vulkan Windows driver is "fine" (as fast as Nvidia's OpenGL one, which is great considering the Switch is an Nvidia tablet), but AMD has already said they won't implement some extensions "because it's too much work", and those extensions already work on Intel and Nvidia. The OpenGL Mesa (Linux) driver is faster, a lot faster, but it seems to be stable only on GCN2 and up; GCN1 is just a mess on both radeonsi and amdgpu: it eats RAM, crashes easily, and has geometry glitches everywhere. Haven't tested the RADV Vulkan driver yet.
All the money seems to be on Navi, but it also failed to give us a decent OpenGL driver.
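For anyone wondering how an application finds out whether a driver exposes a given extension, here is a minimal Vulkan C++ sketch (the extension named in it is only an example of the kind emulators lean on, not a claim about what AMD does or doesn't ship):

// Lists which physical devices expose a given Vulkan device extension.
// Build: g++ -std=c++17 check_ext.cpp -lvulkan
#include <vulkan/vulkan.h>
#include <cstring>
#include <iostream>
#include <vector>

int main() {
    // Extension chosen only as an example of the kind emulators depend on.
    const char* wanted = "VK_EXT_transform_feedback";

    VkApplicationInfo app{};
    app.sType = VK_STRUCTURE_TYPE_APPLICATION_INFO;
    app.apiVersion = VK_API_VERSION_1_1;

    VkInstanceCreateInfo ci{};
    ci.sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;
    ci.pApplicationInfo = &app;

    VkInstance instance;
    if (vkCreateInstance(&ci, nullptr, &instance) != VK_SUCCESS) {
        std::cerr << "failed to create Vulkan instance\n";
        return 1;
    }

    uint32_t count = 0;
    vkEnumeratePhysicalDevices(instance, &count, nullptr);
    std::vector<VkPhysicalDevice> gpus(count);
    vkEnumeratePhysicalDevices(instance, &count, gpus.data());

    for (VkPhysicalDevice gpu : gpus) {
        VkPhysicalDeviceProperties props;
        vkGetPhysicalDeviceProperties(gpu, &props);

        uint32_t ext_count = 0;
        vkEnumerateDeviceExtensionProperties(gpu, nullptr, &ext_count, nullptr);
        std::vector<VkExtensionProperties> exts(ext_count);
        vkEnumerateDeviceExtensionProperties(gpu, nullptr, &ext_count, exts.data());

        bool found = false;
        for (const auto& e : exts)
            if (std::strcmp(e.extensionName, wanted) == 0) { found = true; break; }

        std::cout << props.deviceName << ": " << wanted
                  << (found ? " supported" : " NOT supported") << "\n";
    }

    vkDestroyInstance(instance, nullptr);
}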
#146
RichF
kapone32, post: 4121712, member: 181865
I am not sure about the density argument either as the 2600k had 995 million transistors vs the FX 8350 with 1200 Million.
I think I recall someone claiming that Anandtech's die area figure is too high, at least when it comes to Piledriver. I vaguely recall that, according to this person, the actual die size for Piledriver was just below 300 mm². But the gist appears to be that AMD/GF didn't beat Intel on density in the Sandy Bridge era. If the claim that Piledriver was below 300 mm² is true, then Piledriver on GF's 32 nm SOI was pretty close to Intel's Sandy Bridge-E 4C.

#148
Valantar
HeitorGenius, post: 4123443, member: 190823
AMD4 SLOT? :D
?
#149
Super XP
Valantar, post: 4123457, member: 171585
?
2 x ?
#150
Camm
GoldenX, post: 4123286, member: 160319
I'm a tester for yuzu emulator
We've seen this happen with both Dolphin & RPCS3 in the past: emudevs targeting Nvidia in particular and thinking it's broken on AMD's side, whereas it's Nvidia's non-reference implementation of OpenGL that is the cause, and I daresay that would be a huge contributing factor here.