Thursday, August 19th 2021

Intel Thread Director Makes "Alder Lake" Hybrid Architecture Work

Intel, in its Architecture Day presentation, detailed Thread Director, a hardware component present on the "Alder Lake" silicon, which makes the Hybrid architecture of the processor work flawlessly. "Alder Lake-S" is the first desktop processor with two kinds of x86 CPU cores: the larger Performance P-cores and the smaller Efficient E-cores, which work in a setup not unlike big.LITTLE by Arm.

The x86-based "Alder Lake" processor has a much more complex ISA, and the E-cores don't have all of the instruction sets or hardware capabilities that the P-cores do. The two kinds of cores operate in very different performance/Watt bands and are optimized for vastly different workloads. At the same time, sending a workload to the wrong kind of core could not only hurt performance but also cause a crash due to an ISA mismatch. Intel realized that it would take a lot more than mere OS-level awareness to solve the problem, and so created Thread Director.
Put simply, Intel Thread Director is a highly specialized hardware abstraction layer (HAL) that interfaces with the operating system and software on one side, and the two groups of CPU cores on the other. Its job is to analyze a workload and distribute it between the P-core and E-core clusters at a granular (i.e. thread) level. If specific threads of an application don't invoke certain kinds of instructions and are determined to be low-priority, they're dispatched to the E-core cluster. Threads that lose priority are also moved off the P-cores and parked on the E-cores.
The P-cores get priority when a thread requires instructions exclusive to P-cores (such as AVX-512 or DLBoost). Thread Director also works with the OS kernel to discern background tasks from foreground/priority ones. This probably works with a software-side component that's included with the Chipset INF software, if not an exclusive driver. Thread Director ensures that lightweight or low-priority tasks don't needlessly invoke P-cores, and when the system is idling, the processor's power management can probably gate power to P-cores for major power savings (this is assuming Alder Lake features a power-gating technology similar to "Lakefield").
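As a rough illustration of the dispatch policy described above, here is a minimal Python sketch. Thread Director is hardware working together with the OS scheduler, so this is not Intel's implementation; the `Thread` fields, the `pick_cluster` function and the `P_CORE_ONLY` labels are hypothetical names chosen for the example, and the rules simply restate what the article says.

```python
from dataclasses import dataclass, field

# Hypothetical labels for the instruction classes the article calls P-core exclusive.
P_CORE_ONLY = {"AVX-512", "DLBoost"}


@dataclass
class Thread:
    name: str
    foreground: bool  # foreground/priority task vs. background task
    instruction_classes: set = field(default_factory=set)


def pick_cluster(thread: Thread) -> str:
    """Return "P" or "E" following the policy as the article describes it."""
    # Threads that need P-core-exclusive instructions get P-core priority.
    if thread.instruction_classes & P_CORE_ONLY:
        return "P"
    # Low-priority (background) threads are dispatched to, or parked on, E-cores.
    if not thread.foreground:
        return "E"
    # Everything else (foreground work) goes to the P-cores.
    return "P"


threads = [
    Thread("game_render", foreground=True),
    Thread("indexer", foreground=False),
    Thread("ml_inference", foreground=False, instruction_classes={"DLBoost"}),
]
placement = {t.name: pick_cluster(t) for t in threads}
print(placement)
```

The sketch checks instruction requirements before priority, on the assumption that a thread which cannot run on an E-core at all must not be sent there regardless of how unimportant it is.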
Intel will recommend Windows 11 as the optimal OS for "Alder Lake," as it meets Thread Director halfway with OS scheduler awareness of hybrid processor architectures. It remains to be seen, however, whether Thread Director requires Windows 11.

23 Comments on Intel Thread Director Makes "Alder Lake" Hybrid Architecture Work

#1
_Flare
My opinion is, that this is the right and needed move to not completely rely on Microsoft, ... Bulldozer got special handling on Linux, but not really on Windows.
Intel did better with Lakefield but had to learn a similar lesson.

That kind of abstraction seems needed if the common software platforms, aka OSes, aren't capable yet.
Nvidia needed to involve a bunch of blackboxing when they went to the multiple frontend approach in 2008 with the GT200 µarch (GTX 280), so abstraction is not always bad or slow.
#2
ZoneDymo
"which makes the Hybrid architecture of the processor work >>flawlessly<<"

These are big claims, Intel... Also, why does it matter that Windows 11 "meets Thread Director half way"? Is Thread Director so overloaded that it can't actually handle the task on its own? And if so, why is it not made better then? It seems to me that if there is hardware onboard that does this, then Windows 10 should work just fine.
#3
Chrispy_
Well duh, of course Intel needs a hardware scheduler to send instructions to two different sets of cores with incompatible ISAs. They're making it out that they're geniuses for creating this "Thread Director" because they don't trust the OS scheduler to do it.

In reality, the OS scheduler cannot do it, and this is merely a required solution to a problem of Intel's own making (the problem of incompatible ISAs caused because the Tremont based E-cores aren't designed from the ground up to work alongside Raptor Cove P-cores, Intel are just re-using some old Atom architecture that was designed for a very different market and purpose, originally).

big.LITTLE works on Linux because both types of ARM core use the same ISA. There is no problem that needs solving there, and if Intel had designed a smaller E-core that used the same ISA like a proper, built-for-purpose design should, they wouldn't need this additional layer of hardware scheduler Thread Director.
#4
londiste
_Flare said:
Bulldozer got special handling on Linux, but not really on Windows.
As a sidenote, it actually didn't. Linux basically handled each Bulldozer module as 1 physical core and 2 logical cores - that is the same as SMT. AMD tried to get Microsoft and Windows to handle each module as 2 physical cores which was less efficient.
_Flare said:
Intel did better with Lakefield but had to learn a similar lesson.
I bet Lakefield was in large part a test of how the Windows (and to a smaller degree Linux) scheduler handles different cores. The delay from Lakefield to Alder Lake and a hardware(-assisted) solution seems to indicate it did not go all too well, despite the Lakefield-specific scheduler updates that Microsoft made.
#5
ncrs
Chrispy_ said:
In reality, the OS scheduler cannot do it, and this is merely a required solution to a problem of Intel's own making (the problem of incompatible ISAs caused because the Tremont based E-cores aren't designed from the ground up to work alongside Raptor Cove P-cores, Intel are just re-using some old Atom architecture that was designed for a very different market and purpose, originally).
But the ISAs are not incompatible. Intel has disabled AVX-512 in Alder Lake. This makes them equal, as far as we know.
#6
Vayra86
Chrispy_ said:
Well duh, of course Intel needs a hardware scheduler to send instructions to two different sets of cores with incompatible ISAs. They're making it out that they're geniuses for creating this "Thread Director" because they don't trust the OS scheduler to do it.

In reality, the OS scheduler cannot do it, and this is merely a required solution to a problem of Intel's own making (the problem of incompatible ISAs caused because the Tremont based E-cores aren't designed from the ground up to work alongside Raptor Cove P-cores, Intel are just re-using some old Atom architecture that was designed for a very different market and purpose, originally).

big.LITTLE works on Linux because both types of ARM core use the same ISA. There is no problem that needs solving there, and if Intel had designed a smaller E-core that used the same ISA like a proper, built-for-purpose design should, they wouldn't need this additional layer of hardware scheduler Thread Director.
Well, if they want to expand on big.LITTLE, an important consideration is the fact that they CAN actually design the big and the little cores with maximum freedom. Long term that will give them the largest amount of flexibility; nobody can predict how long this idea will remain viable. I mean, this can be a new design win, if the execution is correct and it nets an advantage. It's a new step in variable performance/power/heat.
#7
Chrispy_
ncrs said:
But the ISAs are not incompatible. Intel has disabled AVX-512 in Alder Lake. This makes them equal, as far as we know.
I'm going to literally repeat what the article says.

"The x86-based "Alder Lake" processor has a much more complex ISA, and the E-cores don't have all of the instruction sets or hardware capabilities that the P-cores do."

I guess, if we're arguing over semantics, I am "compatible" with pepperoni pizza, but you cannot give both me and a pepperoni pizza the same instructions and expect the same outcome.
#8
ncrs
Chrispy_ said:
I'm going to literally repeat what the article says.

"The x86-based "Alder Lake" processor has a much more complex ISA, and the E-cores don't have all of the instruction sets or hardware capabilities that the P-cores do."

I guess, if we're arguing over semantics, I am "compatible" with pepperoni pizza, but you cannot give both me and a pepperoni pizza the same instructions and expect the same outcome.
The article also states:
"The P-cores get priority when a thread requires instructions exclusive to P-cores (such as AVX-512 or DLBoost)."
And yet it seems that AVX-512 is not supported at all, which makes either it, or AnandTech's article I quoted invalid. Hence "as far as we know".

Edit: Also Lakefield, Alder Lake's predecessor, used unified architecture by disabling non-compatible parts in P-cores.
#9
Vya Domus
The thing is this has been done before in mobile SoCs. Most of the high end chips have hardware schedulers that keep track of the type of instructions and target different cores based on it, the problem with it is that multi core performance usually scales terribly. Basically most of the instructions end up being executed on one or the other cluster but rarely on both concurrently. There are many reasons for that but the point is in a smartphone this doesn't really matter, on a desktop though ... it's gonna be rough.
#10
Chrispy_
ncrs said:
The article also states:
"The P-cores get priority when a thread requires instructions exclusive to P-cores (such as AVX-512 or DLBoost)."
And yet it seems that AVX-512 is not supported at all, which makes either it, or AnandTech's article I quoted invalid. Hence "as far as we know".

Edit: Also Lakefield, Alder Lake's predecessor, used unified architecture by disabling non-compatible parts in P-cores.
You're confusing priority and compatibility. They mean that instructions sent to P-cores get cache and power budget priority.

AVX-512 is an abortion anyway; AVX2 is all that will survive now having adopted the only bit of AVX-512 worth keeping and no implementation of AVX-512 to date has been successful from a performance/Watt perspective. Not a single tear will be shed for AVX-512's hot and melty death.
#11
ncrs
Chrispy_ said:
You're confusing priority and compatibility. They mean that instructions sent to P-cores get cache and power budget priority.

AVX-512 is an abortion anyway; AVX2 is all that will survive now having adopted the only bit of AVX-512 worth keeping and no implementation of AVX-512 to date has been successful from a performance/Watt perspective. Not a single tear will be shed for AVX-512's hot and melty death.
What you wrote doesn't make sense. The article specifically mentions "instructions exclusive to P-cores". That has nothing to do with cache or power budgets. It's about the ISAs being supposedly incompatible, which is in direct opposition to the AnandTech article and how Lakefield behaves.

Properly utilized AVX-512 has amazing perf/watt gains, but the key issue is properly utilized. You have to know why and how you're using the instructions (short computations don't make sense because of the latency, frequency and power penalties, for example):
#12
Vya Domus
Chrispy_ said:
no implementation of AVX-512 to date has been successful from a performance/Watt perspective.
The wider the SIMD the better the performance/watt and AVX-512 is undoubtedly better in this regard. The problem with it is that it just doesn't matter for commercial software.
ncrs said:
You have to know why and how you're using the instructions (short computations don't make sense because of the latency, frequency and power penalties, for example):
This is the problem and the paradox with wide SIMD on a CPU: only short computations make sense. The problem is neither latency nor frequency nor anything like that; it's cache and memory bandwidth. As soon as you go off die in terms of memory access, the performance drops catastrophically. Even with AVX-512, if you do something as simple as a matrix multiplication you're looking at maybe 5% of the peak FLOP performance, which is horrendous. If you need wide SIMD, use the GPU instead.
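To see how a memory-bound kernel can end up at a small fraction of peak, here is a minimal roofline-model sketch in Python. All three input numbers are invented for illustration and do not describe any real CPU or any measured matrix multiplication; the function name is hypothetical as well.

```python
def attainable_gflops(peak_gflops: float, bandwidth_gbs: float, flops_per_byte: float) -> float:
    """Roofline model: throughput is capped either by compute or by memory traffic."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)


# Made-up example values (assumptions, not measurements).
peak = 1000.0      # peak SIMD throughput in GFLOP/s
bandwidth = 50.0   # off-die memory bandwidth in GB/s
intensity = 1.0    # arithmetic intensity in FLOP per byte moved

achieved = attainable_gflops(peak, bandwidth, intensity)
fraction = achieved / peak
print(f"{achieved:.0f} GFLOP/s attainable, {fraction:.0%} of peak")
```

With these assumed values the memory term (50 GB/s times 1 FLOP/byte) is the limit, so widening the SIMD units raises `peak` without raising `achieved`. A kernel with higher arithmetic intensity, for example a cache-blocked one, would move toward the compute limit instead.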
#13
Chrispy_
Here let me explain the difference in instruction set to you with a Venn diagram:

instructions that run on E-cores can run on P-cores.
instructions that run on P-cores might run on E-cores.
instructions that use P-core exclusive ISA cannot run on E-cores.

How hard is it to understand that E-core's ISA is a subset of P-core's ISA?
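The subset relation in the list above can be written down with plain set operations. This is a toy Python model only: the instruction-set labels are placeholders rather than a real Alder Lake feature list, `can_run` is a hypothetical helper, and whether the two core types actually differ is exactly what this thread disputes.

```python
# Placeholder instruction-set labels (assumptions for illustration only).
e_core_isa = {"base", "ext_a"}
p_core_isa = {"base", "ext_a", "ext_b"}


def can_run(required: set, core_isa: set) -> bool:
    """A thread can run on a core only if the core supports every set it needs."""
    return required <= core_isa


# In this toy model the E-core ISA is a proper subset of the P-core ISA.
is_subset = e_core_isa < p_core_isa

# Anything in the difference is "P-core exclusive" and cannot run on an E-core.
p_exclusive = p_core_isa - e_core_isa

print(is_subset, p_exclusive)
print(can_run({"base"}, e_core_isa), can_run({"ext_b"}, e_core_isa))
```

If the two sets were equal, `p_exclusive` would be empty and every thread could run on either core type, which is the position the other side of the argument takes.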
#14
ncrs
Chrispy_ said:
Here let me explain the difference in instruction set to you with a Venn diagram:

instructions that run on E-cores can run on P-cores.
instructions that run on P-cores might run on E-cores.
instructions that use P-core exclusive ISA cannot run on E-cores.

How hard is it to understand that E-core's ISA is a subset of P-core's ISA?
Because you have no sources stating that they are, in fact, incompatible.

If they were incompatible then it would mean that operating systems that are not aware of this would not work properly on the CPU, and I don't think Intel would allow this. You know, backwards compatibility being their strong point, and all...

Edit: AnandTech actually asked Intel about Windows 10 (at the bottom), and it is able to run on the CPU while not being aware/compatible with Intel Thread Director.
#15
Vya Domus
ncrs said:
If they were incompatible then it would mean that operating systems that are not aware of this would not work properly on the CPU, and I don't think Intel would allow this. You know, backwards compatibility being their strong point, and all...
That's the point of the hardware schedulers. Samsung, in their SoCs for example, made it so that their big cores could only execute ARM64 and all 32-bit instructions would get routed to the other cores, which could run 32-bit instructions. The idea was to save power since 32-bit code was presumably not that demanding and the big cores could not run it as efficiently.
#16
Chrispy_
ncrs said:
Because you have no sources stating that they are, in fact, incompatible.
Are you blind?
THIS ARTICLE YOU'RE REPLYING TO IS THE SOURCE
"The x86-based "Alder Lake" processor has a much more complex ISA, and the E-cores don't have all of the instruction sets or hardware capabilities that the P-cores do"
#17
ncrs
Vya Domus said:
That's the point of the hardware schedulers. Samsung, in their SoCs for example, made it so that their big cores could only execute ARM64 and all 32-bit instructions would get routed to the other cores, which could run 32-bit instructions. The idea was to save power since 32-bit code was presumably not that demanding and the big cores could not run it as efficiently.
And would such SoC be able to run an unmodified OS? I have serious doubts.
I've read nothing that would suggest ITD is a hardware scheduler capable of this. (I know I sound like a broken record) The AnandTech article said:
We asked Intel about where an initial thread will go before the scheduling kicks in. I was told that a thread will initially get scheduled on a P-core unless they are full, then it goes to an E-core until the scheduler determines what the thread needs, then the OS can be guided to upgrade the thread. In power limited scenarios, such as being on battery, a thread may start on the E-core anyway even if the P-cores are free.
So if the P-cores are full, and an E-core gets a load with an instruction it can't handle, it would create a situation that an ITD-unaware OS would not expect. If ITD is capable of autonomously moving a thread/process between E- and P-cores, this again would create a situation most OSes are not designed for. Such a design is a compatibility nightmare.
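The initial placement rule in the AnandTech quote above can be sketched in a few lines of Python. The function and parameter names are hypothetical, and one simplification is an assumption: the quote says a thread "may" start on an E-core when power limited, while this sketch treats that as always happening.

```python
def initial_core(free_p_cores: int, power_limited: bool = False) -> str:
    """Pick where a new thread starts, before any later re-scheduling kicks in."""
    # Assumption: in power-limited scenarios (e.g. on battery) always start on
    # an E-core, even if P-cores are free. The quote only says this "may" happen.
    if power_limited:
        return "E"
    # Otherwise start on a P-core unless they are all busy.
    return "P" if free_p_cores > 0 else "E"


print(initial_core(free_p_cores=2))                      # P-cores available
print(initial_core(free_p_cores=0))                      # P-cores full
print(initial_core(free_p_cores=2, power_limited=True))  # on battery
```

Nothing here looks at which instructions the thread uses, which matches the concern raised in this comment: under this rule a thread can land on an E-core before anything has determined what it needs.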
Chrispy_ said:
Are you blind?
THIS ARTICLE YOU'RE REPLYING TO IS THE SOURCE
"The x86-based "Alder Lake" processor has a much more complex ISA, and the E-cores don't have all of the instruction sets or hardware capabilities that the P-cores do"
It's not. It is an interpretation of PR slides, a poor one at that since it mentions that Alder Lake supports AVX-512 while in fact it doesn't.
#20
Chrispy_
ncrs said:
No, I did not listen to a 2h15m presentation. Do you have a timecode that says that E- and P- cores have differing ISA levels?
Jesus wept.

You won't accept a paraphrased version from someone whose literal job description is to publish summarised press releases direct from the source, but you also won't watch the source either.

I give up. Are you a millennial, perchance?
#21
ncrs
Chrispy_ said:
Jesus wept.

You won't accept a paraphrased version from someone whose literal job description is to publish summarised press releases direct from the source, but you also won't watch the source either.

I give up. Are you a millennial, perchance?
I won't accept a paraphrased version that is in direct opposition to a way more in-depth article at AnandTech.
Will you or will you not provide a direct quote from Intel that E- and P- cores are ISA-incompatible?
#22
Chrispy_
Holy shit. Watch the damn video. It's timestamped and chaptered FFS.

It's the first guy that Raja introduces. If you don't understand the words he uses for about four minutes about how they whittled down Gracemont by stripping parts of the ISA that weren't essential for efficiency I cannot help you. There is no more spoonfeeding.
#23
First Strike
Chrispy_ said:
Well duh, of course Intel needs a hardware scheduler to send instructions to two different sets of cores with incompatible ISAs. They're making it out that they're geniuses for creating this "Thread Director" because they don't trust the OS scheduler to do it.

In reality, the OS scheduler cannot do it, and this is merely a required solution to a problem of Intel's own making (the problem of incompatible ISAs caused because the Tremont based E-cores aren't designed from the ground up to work alongside Raptor Cove P-cores, Intel are just re-using some old Atom architecture that was designed for a very different market and purpose, originally).

big.LITTLE works on Linux because both types of ARM core use the same ISA. There is no problem that needs solving there, and if Intel had designed a smaller E-core that used the same ISA like a proper, built-for-purpose design should, they wouldn't need this additional layer of hardware scheduler Thread Director.
What you are describing is microarchitecturally impossible. You should educate yourself on cpu microarchitectures before continuing a multi-page rampage.

A very simple question. What exactly are the incompatible instructions? If it were a real thing, it should have been dug out of the Linux codebase months earlier.

A CPU only understands an instruction AFTER the decoder stage inside the microarchitecture. Especially for a variable-length ISA such as x86, there is literally no way other than a decoder to analyze the instructions. However, the Thread Director is clearly an uncore component, and Intel mentions nothing about it having decoders, nor about reversing the microarchitectural pipeline (e.g. parking a thread when an incompatible instruction has been decoded and moving it to god-knows-where pipeline stage).

Intel would make much bigger news if they had realized your world-shaking design. Lakefield is still a single ISA.

@btarunr You should really go through your wording again to see if there is more stuff that Intel's marketing paraphrasing manipulated you into believing. Soon the world will cite YOU as the SOURCE of a hetero-ISA architecture.

One good thing about Dr. Ian Cutress is that he actually has a PhD degree in EECS. So his article directly dismissed the false impression of incompatible ISA that Intel tries to sell.

On slides, Intel just said different "mix" of instructions, NOT different instructions. Such wording is intended to confuse the audience into believing they have something more powerful. @btarunr
Copyright © 2004-2021 www.techpowerup.com. All rights reserved.
All trademarks used are properties of their respective owners.