Tuesday, September 24th 2019

AMD Could Release Next Generation EPYC CPUs with Four-Way SMT

AMD has completed the design phase of its "Zen 3" architecture, and rumors about its details are already appearing. This time, Hardwareluxx reports that AMD could build four-way simultaneous multithreading (SMT) into the Zen 3 core to raise performance and boost the parallel processing power of its data center CPUs. Zen 3 server CPUs, codenamed "Milan", are expected to arrive sometime in 2020, bring many architectural improvements, and use TSMC's 7nm+ extreme ultraviolet (EUV) lithography, which brings as much as a 20% increase in transistor density.

Perhaps the biggest change we could see is the addition of four-way SMT, which should allow a CPU to have four virtual threads per core, improving parallel processing power and letting data center users run more virtual machines than before. In theory, four-way SMT boosts throughput by letting a core issue instructions from up to four independent threads, so that execution resources one thread leaves idle can be used by another. Each thread still runs its own instruction stream; the gain comes from keeping the core busier, not from splitting a single operation into pieces. We can expect AMD to implement this feature in whatever way is most practical and brings the best performance.
AMD isn't the first to implement this kind of solution in its processors. IBM has for years been making CPUs based on the POWER ISA that feature four-way or even eight-way SMT, which is one of the key reasons POWER CPUs are so powerful. Nonetheless, we can hope to see more details about Zen 3 core design decisions as we approach 2020 and the launch of Milan CPUs.

Source: Hardwareluxx

159 Comments on AMD Could Release Next Generation EPYC CPUs with Four-Way SMT

#101
Lionheart
Can someone educate me on why we don't hear much about IBM anymore, they're still huge yet always in the background, do they make actual CPU's still or CPU architectures & if so why aren't they in the desktop consumer scene?
#102
Camm
Lionheart, post: 4122043, member: 52641"
Can someone educate me on why we don't hear much about IBM anymore, they're still huge yet always in the background, do they make actual CPU's still or CPU architectures & if so why aren't they in the desktop consumer scene?
You answered your own question. They aren't in the desktop consumer scene as they don't make desktop consumer products.
#103
TranceHead
Lionheart, post: 4122043, member: 52641"
Can someone educate me on why we don't hear much about IBM anymore, they're still huge yet always in the background, do they make actual CPU's still or CPU architectures & if so why aren't they in the desktop consumer scene?
Short version, Lenovo bought their consumer level business structure off them.
#104
londiste
Vya Domus, post: 4121837, member: 169281"
four virtual threads per core
Can we stop this already, it's 2019 for Christ sake, computer architecture isn't that cryptic anymore. There is nothing virtual about them, they are as physical as they can get, you could literally put your finger on the corresponding piece of silicon if you had a chip scaled up.
Virtual is perhaps not the right word but you cannot literally put a finger on the corresponding piece of silicon of an SMT thread because these are all in the same core and not even a separate part of it.

Imsochobo, post: 4121713, member: 66457"
When did Intel:
chiplet architecture.
Infinity fabric.
One chip fits all needs (Entry level desktop to top of the line server)
smt4 (still rumor stage)
CCX design.

Amd's 14NM was way way inferior to Intel's in density, performance and everything and still managed to pretty much match Intel's efficiency.
AMD's (technically GlobalFoundries' and TSMC's) 14nm is not way inferior to Intel's. Density is roughly the same, performance is not far off. Frequency ceiling is quite a bit higher for Intel's 14nm(+/++) but that's about it.

Intel was not first to implement these things but they have dabbled in pretty much everything.
chiplet architecture. - Pentium D
Infinity fabric. - QPI since 2008, now UPI.
One chip fits all needs (Entry level desktop to top of the line server) - This is not an optimal approach for performance or design but a pure cost efficiency decision.
smt4 (still rumor stage) - Xeon Phi
CCX design - What exactly do you mean by CCX design? Separate core complexes on the same die? Dual ringbus designs are not far from it. Pentium D with two glued cores is pretty much the same layout.

FordGT90Concept, post: 4121760, member: 60463"
I highly doubt AMD is going to divorce the core design between Epyc and Ryzen. If they are, it's fine; if not, AMD is pulling another Bulldozer with this one.
It will have zero effect on desktop. They can easily design the cores with number of SMT threads being configurable (both BIOS/UEFI and laser cutting). AMD probably will keep cores and dies the same across both Ryzen and EPYC and extra transistor cost for more SMT threads is not significant. Perhaps more accurately, parts of that will benefit core anyway and parts that do not are small.
#105
Xuper
More transistor density means more heat density, like with L3 cache. It will get too hot. 20% more density doesn't mean 20% more heat; it can mean 50% or even more.
#106
notb
Lionheart, post: 4122043, member: 52641"
Can someone educate me on why we don't hear much about IBM anymore, they're still huge yet always in the background, do they make actual CPU's still or CPU architectures & if so why aren't they in the desktop consumer scene?
"We" who?
IBM is still one of the most talked about IT companies. But they don't make consumer products anymore, so they're out of scope on sites/forums like TPU.

If you would go on a datacenter / cloud / AI / ML / quantum computing website or forum, IBM would appear way more frequently than AMD. Even more than Intel.
#107
Hugh Mungus
I would just like to say:

...

I've got nothing here. I mean, it's not like most people need even more multithreaded performance, especially with 8, 12 or 16 cores, so if this makes it into Ryzen 4000... :roll:
#108
Vya Domus
londiste, post: 4122077, member: 169790"
Virtual is perhaps not the right word but you cannot literally put a finger on the corresponding piece of silicon of an SMT thread because these are all in the same core and not even a separate part of it.
Of course you can, the added logic that is required to process multiple streams of instructions exists physically in silicon. It's not some unexplainable abstract entity, so yes, you can definitely put your finger on it.
#109
londiste
Vya Domus, post: 4122093, member: 169281"
Of course you can, the added logic that is required to process multiple streams of instructions exists physically in silicon. It's not some unexplainable abstract entity, so yes, you can definitely put your finger on it.
Process is all about threading, it is all in the frontend. There is no additional frontend, and adding 4-way SMT is pretty much enlarging the existing pieces to fit more threads. It is mainly about queue sizes; more threads also means more off-core cache is needed for it to be efficient.

"Process" here is all about management. Actual execution units are different anyway. Ryzen has 8-10 execution units in it. Squeezing more threads through to them is all about keeping as many of them occupied as possible. Both AMD and Intel have said recently that they are usually looking at 3-4 of these units being active at one time. The effort right now is to make sure they can feed more data in there.

That is the idea behind SMT. At any time when there are execution units idle, more work can be fed into them. If the current thread does not utilize them, let's use another one. There are tradeoffs to this, as the frontend needs to be more capable, queues have to fit more entries, etc. That costs complexity and die space, stalls are still possible, and SMT's efficiency tends to fall with more threads. In addition to that, SMT itself does not involve adding execution units (although that can still be done in core design regardless of SMT). Right now, both Zen(+/2) and Skylake can pretty much do 2 of any specific operation at once (not all of them but most), but not more.
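
The slot-filling idea described above can be illustrated with a toy model. This is only a sketch under loud assumptions: the core has 8 issue slots per cycle, each thread independently has between 0 and 6 ready operations per cycle (picked so that a lone thread keeps about 3 of 8 slots busy, in line with the 3-4 active units mentioned above), and front-end limits, queue sizes and cache contention are ignored entirely. The function name and all parameters are made up for illustration.

```python
import random


def slot_utilization(num_threads, width=8, max_ready=6, cycles=10_000, seed=0):
    """Toy model: fraction of execution slots doing useful work.

    Assumptions (hypothetical, not taken from any real core):
    - the core can start `width` operations per cycle;
    - each thread has a uniformly random 0..max_ready operations ready;
    - threads are served in a fixed order until the slots run out.
    """
    rng = random.Random(seed)
    used = 0
    for _ in range(cycles):
        free = width
        for _ in range(num_threads):
            ready = rng.randint(0, max_ready)
            issued = min(ready, free)  # a thread cannot use more slots than remain
            used += issued
            free -= issued
    return used / (width * cycles)


if __name__ == "__main__":
    for threads in (1, 2, 4):
        print(f"{threads} thread(s): {slot_utilization(threads):.0%} of slots busy")
```

In this model utilization rises with every added thread, but going from two to four threads adds less than going from one to two, because the slots begin to run out. Real cores would do worse than this sketch suggests, since the threads also compete for the front end and caches.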
#110
1d10t
It is still unclear how AMD will implement 4-way SMT in their future EPYC. Are they taking IBM Power PC clustered SMT, or going with a cascade block like Sun SPARC?
Either way, future Ryzen desktop will likely have same core count, but doubling threads :D
#111
HTC
1d10t, post: 4122099, member: 110464"
It is still unclear how AMD will implement 4-way SMT in their future EPYC. Are they taking IBM Power PC clustered SMT, or going with a cascade block like Sun SPARC?
Either way, future Ryzen desktop will likely have same core count, but doubling threads :D
Doubtful: they'll likely have Epyc with full 4-way SMT, TR with 3-way SMT and desktop with "standard" 2-way SMT. Doing it this way also helps tremendously with segmentation.
#112
ratirt
HTC, post: 4122100, member: 51238"
Doubtful: they'll likely have Epyc with full 4-way SMT, TR with 3-way SMT and desktop with "standard" 2-way SMT. Doing it this way also helps tremendously with segmentation.
Or they will all have 4-way SMT, because the segmentation is already applied and AMD doesn't need more prominent segmentation. Epyc, TR and desktop are already different in feature set.
#113
Valantar
Wow, this thread went off the rails so quickly you would think it was actually four parallelized threads executing on the same hardware.



On a more serious note, hasn't 4-way SMT been in the cards for Zen3 since the first design goals for this architecture were presented? I explicitly remember reading about this a couple of years ago (and thinking "that sounds very server specific"). Unfortunately I can't remember where I read this, but I wouldn't be surprised if it was one of AnandTech's articles (possibly from a Hot Chips presentation or some such?).

Nonetheless, can we please stop arguing about ridiculous semantics, such as where exactly the threshold for "innovation" in the CPU space lies? No, 4-way SMT is nothing new in and of itself, and as stated above, IBM does it in their Power8 arch (and 8-way too), Intel did it in Xeon Phi and Larrabee, and IIRC there are companies working on this for ARM server hardware. So: AMD is not first to do this (think that was IBM?), not the first to do this in a widely distributed chip (IBM again), not the first to do this in x86 (that was Intel), but the first to do this in (what will be) a widely distributed x86-based chip. Does that qualify as innovation? Who knows? And frankly, who cares? It's a new feature in this space regardless of how much or little AMD can run around screaming "FIRST!!!1!!1!one" like a 14-year-old. Server and datacenter customers will love this. Now please stop arguing over meaningless semantics.

As for consumer uses, the questions of the value (and potential of performance loss) are legitimate. SMT inevitably means sharing resources between threads (as scheduling threads that exclusively use different parts of the core is entirely utopian), meaning that one or more threads can and will need to wait for others to finish using the parts of the core that they need next. That means lower ST performance. Then there's the Windows scheduler, which already struggles with unequal cores, as seen with Ryzen 3000 and the widely documented issues of not scheduling demanding tasks to the known fastest core. It will need a rather fundamental revamp for this to be viable for end-user applications at all. Not something that really ought to be an issue for MS, but they'll need to make a serious effort - and they might not want to, as this is the kind of feature they can charge a serious premium for in Windows Server.

My biggest worry is that the focus on this means less focus on architectural IPC improvements (yes, one can argue that better SMT improves IPC, but that's another can of worms) for Zen3. I hope they have enough tricks up their sleeves for another 10% bump or so.
#114
Vya Domus
One doesn't have to argue about SMT and IPC, in terms of percentages SMT brought forward one of the biggest increases in IPC ever, historically speaking.

A two way SMT core can potentially bring 40-50% higher IPC if the conditions are right. Few other features have been this impactful.
#115
phill
This sounds really interesting to me... AMD... what are you cooking up now I wonder :)
#116
1d10t
The way I read it, AMD is going to max out the upcoming Zen 3 on the current lithography, both in clocks and core count. So in that manner, with 4-way SMT and higher clocks as 7nm matures, it would still give them an advantage over the competition :rolleyes:

HTC, post: 4122100, member: 51238"
Doubtful: they'll likely have Epyc with full 4-way SMT, TR with 3-way SMT and desktop with "standard" 2-way SMT. Doing it this way also helps tremendously with segmentation.
You can only switch on and off actually, so 3 way SMT is not possible :D
#117
londiste
Valantar, post: 4122111, member: 171585"
IIRC there are companies working on this for ARM server hardware.
https://www.anandtech.com/show/12694/assessing-cavium-thunderx2-arm-server-reality
There are others but they (now owned by Marvell) should be the most prominent one.

Vya Domus, post: 4122112, member: 169281"
A two way SMT core can potentially bring 40-50% higher IPC if the conditions are right. Few other features have been this impactful.
SMT benefit tends to be in 30-35% range for desktop processors (and in general for current server parts). Not sure about calling this IPC though. Yes, it is the same core but a different thread.
#118
Valantar
londiste, post: 4122127, member: 169790"
https://www.anandtech.com/show/12694/assessing-cavium-thunderx2-arm-server-reality
There are others but they (now owned by Marvell) should be the most prominent one.
Thanks, I knew I had seen it somewhere :)
Vya Domus, post: 4122112, member: 169281"
One doesn't have to argue about SMT and IPC, in terms of percentages SMT brought forward one of the biggest increases in IPC ever, historically speaking.

A two way SMT core can potentially bring 40-50% higher IPC if the conditions are right. Few other features have been this impactful.
londiste, post: 4122127, member: 169790"
SMT benefit tends to be in 30-35% range for desktop processors (and in general for current server parts). Not sure about calling this IPC though. Yes, it is the same core but a different thread.
That's the can of worms I was alluding to. While it is undoubtedly true that the hardware is processing more instructions per clock cycle, it is only doing so by executing multiple discrete processing threads - which, given the high-level similarity to having multiple cores, is generally not seen as "pure IPC" which is generally a measure of single-threaded instructions per clock cycle (precisely to exclude misleading multi-core comparisons). Muddying this further, it would then (theoretically) be possible to "increase IPC" by improving SMT hardware utilization without affecting ST performance whatsoever - could you then call that an IPC increase? There's a reason I called this a can of worms. And the only feasible solution to managing it is to keep the definition as simple as possible, i.e. limited to single-thread performance (regardless of the validity of arguments for including SMT).
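
The two ways of counting can be made concrete with a small calculation. The counter values below are invented purely for illustration (they are not measurements of any real CPU), and `ipc_report` is a hypothetical helper name.

```python
def ipc_report(single_thread_instr, smt_thread_instr, cycles):
    """Compare per-thread and per-core IPC over the same number of core cycles.

    single_thread_instr: instructions retired by one thread running alone
    smt_thread_instr:    list of instructions retired by each SMT thread
    cycles:              core clock cycles in the measurement window
    """
    st_ipc = single_thread_instr / cycles
    per_thread = [n / cycles for n in smt_thread_instr]
    per_core = sum(smt_thread_instr) / cycles
    return {
        "single_thread_ipc": st_ipc,
        "smt_per_thread_ipc": per_thread,
        "smt_per_core_ipc": per_core,
        "core_throughput_gain": per_core / st_ipc - 1,
    }


if __name__ == "__main__":
    # Hypothetical numbers: one thread alone retires 2.0M instructions in 1M
    # cycles; with SMT on, two threads retire 1.4M and 1.3M in the same window.
    report = ipc_report(2_000_000, [1_400_000, 1_300_000], 1_000_000)
    for key, value in report.items():
        print(key, value)
```

With these made-up inputs the core as a whole retires more instructions per cycle while each individual thread retires fewer, which is exactly why it matters whether "IPC" is counted per thread or per core.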
#119
Vya Domus
londiste, post: 4122127, member: 169790"
SMT benefit tends to be in 30-35% range for desktop processors (and in general for current server parts). Not sure about calling this IPC though. Yes, it is the same core but a different thread.
I wrote stuff myself that scales well into the 40% range when SMT is enabled; granted, that is with ideal conditions: no branching, coalesced memory access, etc. 40% is realistic for compute-intensive tasks in servers and desktop. The only reason SMT doesn't scale that well with your average consumer software is that most of the time enough ILP can be extracted without the need for multiple hardware threads, or the bottleneck is somewhere else.

There aren't a million ways to increase IPC; instruction-level parallelism is pretty much the only way to do it in modern CPUs, since the decode/add/multiply/etc. logic has been optimized to death already. SMT does just that: it increases the ILP per core, so there is no reason to say it doesn't increase IPC.

If my software is say 10% faster when SMT is enabled what else could that possibly mean other than the fact that the average IPC has increased ?
#120
Vayra86
Xaled, post: 4121802, member: 158027"
These are so random scores, but yeah most of the time when SMT is off frames are higher.
As core counts are already more than enough for gaming, why isn't AMD making something similar to what is being done in some mobile phones? Half of the cores with high clocks and the others with low clocks for other apps/general use, and no SMT at all? Is it really that hard?
Ehh. What?

When SMT is off frames are higher ONLY flies if the game is the only thing you run - and ONLY in a very tiny subset of actual games. The examples are very, very rare and the gain is very, very minimal. The biggest advantage of no SMT (actually: no HT) is that your CPU might clock a tiny bit higher, and that gives you an edge in game FPS. Another small one, at best. It's the type of min-maxing for the last 2%, best case... and situationally.

Definitely worth the trade off to just keep SMT... as soon as the CPU can spend time using SMT on the same core as the game, there are more real cores available to process the game. @FordGT90Concept that's how it works both ways I would say. And if you're using BOINC... and gaming... is a small sacrifice so problematic for the ability to multi task like that? And... how many people actually do this?

I think HT/SMT have long since proven to be a negligible drawback versus a noticeable win, despite lots of testing to prove otherwise, very little has ever been found and if it was there, it wasn't much at all.

FordGT90Concept, post: 4121784, member: 60463"
Because you have four threads sharing the same underlying execution resources. If one of those is a game, and the others are something like BOINC, only 25%-40% of the execution time is spent on game thread which means fewer frames per second. Server loads, they care about efficiency over response times which is diametrically opposed to what games (consumer in general) needs.
I would say, look at the degree of control you have over an AMD CPU. If it does harm performance, just disable it, and nobody loses anything, right?
#121
Xx Tek Tip xX
This reminds me of the "EXCLUSIVE" Zen 3 has 4-way SMT story from redgamingtech and numerous other crap rumors. I'll believe it when I see it, which probably won't happen because there's next to no chance of 4-way SMT happening.
#122
Smartcom5
notb, post: 4121921, member: 165619"
[quote=ShrimpBrime, post: 4121913, member: 185158"]HT, it's Intel's Hyper Threading© Technology.....
HT is a marketing name. SMT is the idea behind it.
And to not have heard about Xeon Phi is quite an achievement for a "PC enthusiast"
It seems you're really new in this...[/quote]
You're wrong, both of you.
HT is commonly known as the abbreviation for AMD's HyperTransport.
HTT it is what Intel's Hyper-Threading Technology is commonly shortened to.

You are welcome!
#123
notb
Smartcom5, post: 4122459, member: 97031"
You're wrong, both of you.
HT is commonly known as the abbreviation for AMD's HyperTransport.
HTT it is what Intel's Hyper-Threading Technology is commonly shortened to.
https://www.intel.com/content/www/us/en/architecture-and-technology/hyper-threading/hyper-threading-technology.html
"Intel HT Technology" - HT is the official acronym.

I'm not sure I've ever seen anyone use "HTT".

Putting the technological terminology differences aside, I'm not even sure if we agree on the meaning of "commonly"...

#124
efikkan
Vya Domus, post: 4122112, member: 169281"
One doesn't have to argue about SMT and IPC, in terms of percentages SMT brought forward one of the biggest increases in IPC ever, historically speaking.

A two way SMT core can potentially bring 40-50% higher IPC if the conditions are right. Few other features have been this impactful.
As some have pointed out, IPC is single thread. What you probably meant is saturation of the core resources, but it's important to understand that SMT, even in perfect conditions, never exceeds the performance of a single "optimal" thread. It's simply a way to let other threads utilize the resources the first thread doesn't use, scaling towards one "optimal thread".

There are several factors that impact IPC. One way is to add more execution resources (ALUs, FPUs, AGUs etc.), which boosts your peak performance but can leave resources unsaturated. Secondly, there are front-end, latency and cache improvements, which improve the utilization of the execution resources you already have. Since SMT relies on exploiting idle resources of the CPU core for other threads, the ever increasing efficiency of CPU architectures is actually making SMT less and less useful for generic tasks, as efficiency gains in front-end and cache will ultimately consume the "gains" of SMT.
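
A back-of-the-envelope way to see this argument, offered as an idealized sketch rather than a description of any real core: assume one thread alone keeps a fraction u of the core's execution resources busy, and that n SMT threads can at best fill the core completely. Both the model and the name `smt_gain_upper_bound` are assumptions made for illustration; contention for the front end and caches would push real gains lower.

```python
def smt_gain_upper_bound(single_thread_utilization, num_threads):
    """Best-case throughput gain from SMT in an idealized model.

    Assumes each thread alone would use `single_thread_utilization` (0..1]
    of the core and that combined use is capped at 1.0, i.e. one "optimal"
    thread's worth of work.
    """
    u = single_thread_utilization
    if not 0 < u <= 1:
        raise ValueError("utilization must be in (0, 1]")
    combined = min(1.0, num_threads * u)
    return combined / u - 1


if __name__ == "__main__":
    for u in (0.4, 0.6, 0.8, 0.95):
        gain2 = smt_gain_upper_bound(u, 2)
        gain4 = smt_gain_upper_bound(u, 4)
        print(f"u={u:.2f}: 2-way <= {gain2:+.0%}, 4-way <= {gain4:+.0%}")
```

As u approaches 1 the bound approaches zero, which is the sense in which better single-thread utilization eats into what SMT can add.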

SMT was introduced at a time when single core CPUs were mostly idle due to stalls in the CPU pipeline, and the cost of implementing SMT in silicon was minuscule. But these days as the gains of SMT are shrinking, and the security implications of SMT makes the silicon costs ever increasing, it's actually time to drop it, not extend it further with 4-way or even 8-way SMT. Today, SMT only really makes sense for server workloads where latency is irrelevant and total throughput of massive amounts of requests (or work items) is the primary goal. SMT is really a relic of the past, and 2020 is not the year to push it further.

While future gains in CPU performance won't get close to the improvements we saw in the 80s and the 90s, it's important to remember that the reason for "stagnant" single-thread performance over the last ~4+ years is not any theoretical limit in IPC. Obviously we are now at a "clock wall" for the current type of semiconductors, but the primary reason for (Intel's) stagnant CPU selection is the node problems causing two years of delays to Ice Lake (Sunny Cove), which they claim offers 18% IPC gains. Both Intel and AMD have their next 2-3 architectures lined up, and theoretically it is absolutely possible to achieve ~50% better IPC over Skylake by just continuing to add more execution resources, improving cache, reducing latency and improving the front-end.

But even beyond that, single thread performance will not hit a wall any time soon. Quite the opposite, we are now on the verge of the largest single thread gain since the 90s. Since Pentium (1993), x86 CPUs have become increasingly superscalar, which obviously does wonders for peak performance, but also keeps widening the gap between minimum and average vs. peak performance, as the CPU becomes more sensitive to the code to keep the resources fully saturated. As anyone familiar with machine code would know, there are two major causes for this lack of saturation: cache misses and branch mispredictions. Optimizing for cache misses can be done fairly efficiently, but branch mispredictions are harder to deal with. Largely it's about removing bloat, but you will usually still have enough of it left to hold back performance. And in the greater scope of even a function, most branches only have local effects, but the CPU can't know that, so when there is a branch misprediction it has to flush the pipeline, even if some of the calculations may still be "good". This is because a lot of context is lost between your high level code and machine code, and even the best prediction models will only get you so far without getting some extra "help". I know Intel is researching a solution to this problem, where basically you have these dependencies between branches implied in machine code (e.g. this branch only affects this code over here, but not the bigger flow of the program). I believe they call it "threadlets" or something, and it would probably be done by having chains of instructions that are independent of branching in others, sort of like a "thread" that only exists virtually for a few dozen instructions. While this would at least require recompilation of software, it would greatly improve the CPU front-end's ability to reason about true dependencies between calculations, instead of having to assume the pipeline needs to be flushed.
Gains in single threaded performance of 2-3x should not be unreasonable. While what I'm describing here may seem a little out of scope, it's actually not, as this would practically eliminate the need for SMT. But don't expect this to be implemented in shipping products yet; it's still experimental, and I would expect it 5-10 years down the road.
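
The cost of mispredictions can be estimated with the usual textbook approximation: effective cycles per instruction equals the base CPI plus (fraction of instructions that are branches) × (misprediction rate) × (flush penalty in cycles). The sketch below applies it; the function name is hypothetical and every number in the example is a placeholder, not a figure for any specific CPU.

```python
def effective_ipc(base_ipc, branch_fraction, mispredict_rate, flush_penalty_cycles):
    """Estimate IPC after accounting for pipeline flushes on mispredictions.

    Uses the simple additive model:
        CPI_eff = 1 / base_ipc + branch_fraction * mispredict_rate * penalty
    """
    base_cpi = 1.0 / base_ipc
    stall_cpi = branch_fraction * mispredict_rate * flush_penalty_cycles
    return 1.0 / (base_cpi + stall_cpi)


if __name__ == "__main__":
    # Hypothetical workload: IPC of 4 without mispredictions, 20% branches,
    # 16-cycle flush penalty, compared at several misprediction rates.
    for rate in (0.0, 0.01, 0.03, 0.10):
        ipc = effective_ipc(4.0, 0.20, rate, 16)
        print(f"mispredict rate {rate:.0%}: IPC ~ {ipc:.2f}")
```

Even a small misprediction rate takes a visible bite out of a wide core in this model, because every flush throws away many issue slots at once.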

Vya Domus, post: 4122176, member: 169281"
I wrote stuff myself that scales well into the 40% range when SMT is enabled; granted, that is with ideal conditions: no branching, coalesced memory access, etc. 40% is realistic for compute-intensive tasks in servers and desktop. The only reason SMT doesn't scale that well with your average consumer software is that most of the time enough ILP can be extracted without the need for multiple hardware threads, or the bottleneck is somewhere else.
Actually, you got this the wrong way. In ideal conditions, SMT would not be needed at all, the only reason why there are gains from SMT is that threads don't saturate the CPU enough. When you have ideal software as you said, branch and cache optimized, it will saturate the CPU very well.

SMT is mostly useful for server workloads where you have an "endless" supply of "work chunks" that can be done in parallel, very typical for a server running worker threads for Java code or scripts. This is code which can't be cache optimized and is heavily abstracted, so the CPU will more or less constantly stall. This is where 4-way and even 8-way SMT makes sense (like Power CPUs), and even then the execution part of the CPU will be largely idle, the bottleneck will be the front-end and the caches, otherwise you could make a 32-way SMT CPU and scale on.

Vya Domus, post: 4122176, member: 169281"
If my software is say 10% faster when SMT is enabled what else could that possibly mean other than the fact that the average IPC has increased ?
Oh, there can be so many, too much to discuss here. It depends how many threads you spawn, how they are synchronized and of course how your application is "disturbed" by background threads.
#125
theoneandonlymrk
efikkan, post: 4122554, member: 150226"
As some have pointed out, IPC is single thread. What you probably meant is saturation of the core resources, but it's important to understand that SMT, even in perfect conditions, never exceeds the performance of a single "optimal" thread. It's simply a way to let other threads utilize the resources the first thread doesn't use, scaling towards one "optimal thread".

There are several factors that impact IPC. One way is to add more execution resources (ALUs, FPUs, AGUs etc.), which boosts your peak performance but can leave resources unsaturated. Secondly, there are front-end, latency and cache improvements, which improve the utilization of the execution resources you already have. Since SMT relies on exploiting idle resources of the CPU core for other threads, the ever increasing efficiency of CPU architectures is actually making SMT less and less useful for generic tasks, as efficiency gains in front-end and cache will ultimately consume the "gains" of SMT.

SMT was introduced at a time when single core CPUs were mostly idle due to stalls in the CPU pipeline, and the cost of implementing SMT in silicon was minuscule. But these days as the gains of SMT are shrinking, and the security implications of SMT makes the silicon costs ever increasing, it's actually time to drop it, not extend it further with 4-way or even 8-way SMT. Today, SMT only really makes sense for server workloads where latency is irrelevant and total throughput of massive amounts of requests (or work items) is the primary goal. SMT is really a relic of the past, and 2020 is not the year to push it further.

While future gains in CPU performance won't get close to the improvements we saw in the 80s and the 90s, it's important to remember that the reason for "stagnant" single-thread performance over the last ~4+ years is not any theoretical limit in IPC. Obviously we are now at a "clock wall" for the current type of semiconductors, but the primary reason for (Intel's) stagnant CPU selection is the node problems causing two years of delays to Ice Lake (Sunny Cove), which they claim offers 18% IPC gains. Both Intel and AMD have their next 2-3 architectures lined up, and theoretically it is absolutely possible to achieve ~50% better IPC over Skylake by just continuing to add more execution resources, improving cache, reducing latency and improving the front-end.

But even beyond that, single thread performance will not hit a wall any time soon. Quite the opposite, we are now on the verge of the largest single thread gain since the 90s. Since Pentium (1993), x86 CPUs have become increasingly superscalar, which obviously does wonders for peak performance, but also keeps widening the gap between minimum and average vs. peak performance, as the CPU becomes more sensitive to the code to keep the resources fully saturated. As anyone familiar with machine code would know, there are two major causes for this lack of saturation: cache misses and branch mispredictions. Optimizing for cache misses can be done fairly efficiently, but branch mispredictions are harder to deal with. Largely it's about removing bloat, but you will usually still have enough of it left to hold back performance. And in the greater scope of even a function, most branches only have local effects, but the CPU can't know that, so when there is a branch misprediction it has to flush the pipeline, even if some of the calculations may still be "good". This is because a lot of context is lost between your high level code and machine code, and even the best prediction models will only get you so far without getting some extra "help". I know Intel is researching a solution to this problem, where basically you have these dependencies between branches implied in machine code (e.g. this branch only affects this code over here, but not the bigger flow of the program). I believe they call it "threadlets" or something, and it would probably be done by having chains of instructions that are independent of branching in others, sort of like a "thread" that only exists virtually for a few dozen instructions. While this would at least require recompilation of software, it would greatly improve the CPU front-end's ability to reason about true dependencies between calculations, instead of having to assume the pipeline needs to be flushed.
Gains in single threaded performance of 2-3x should not be unreasonable. While what I'm describing here may seem a little out of scope, it's actually not, as this would practically eliminate the need for SMT. But don't expect this to be implemented in shipping products yet; it's still experimental, and I would expect it 5-10 years down the road.


Actually, you got this the wrong way. In ideal conditions, SMT would not be needed at all, the only reason why there are gains from SMT is that threads don't saturate the CPU enough. When you have ideal software as you said, branch and cache optimized, it will saturate the CPU very well.

SMT is mostly useful for server workloads where you have an "endless" supply of "work chunks" that can be done in parallel, very typical for a server running worker threads for Java code or scripts. This is code which can't be cache optimized and is heavily abstracted, so the CPU will more or less constantly stall. This is where 4-way and even 8-way SMT makes sense (like Power CPUs), and even then the execution part of the CPU will be largely idle, the bottleneck will be the front-end and the caches, otherwise you could make a 32-way SMT CPU and scale on.


Oh, there can be so many, too much to discuss here. It depends how many threads you spawn, how they are synchronized and of course how your application is "disturbed" by background threads.
You know, I half agree but disagree with your initial standpoint. This ideal software you speak of, do you have an example? Because I would be surprised. CPUs are not fixed function, they have built-in coprocessors and on-die coprocessors, and modern code is also broken up into micro-ops, so I can't imagine that's an easy bit of code to know how to write, never mind write.