Monday, July 13th 2020

Intel "Alder Lake" CPU Core Segmentation Sketched

Intel's 12th Gen Core "Alder Lake-S" desktop processors in the LGA1700 package could see the desktop debut of Intel's Hybrid Technology, which it introduced with the mobile-segment "Lakefield" processor. Analogous to Arm big.LITTLE, Intel Hybrid Technology is a multi-core processor topology that combines high-performance CPU cores with smaller high-efficiency cores. The small cores keep the PC ticking through the majority of tasks, when the high-performance cores aren't needed and are therefore power-gated. The high-performance cores are woken up only as needed. "Lakefield" combines one "Sunny Cove" high-performance core with four "Tremont" low-power cores. "Alder Lake-S" will take this concept further.

According to Intel slides leaked to the web by HXL (aka @9550pro), the 10 nm-class "Alder Lake-S" silicon will physically feature 8 "Golden Cove" high-performance cores and 8 "Gracemont" low-power cores, along with a Gen12 iGPU that comes in three tiers - GT0 (iGPU disabled), GT1 (some execution units disabled), and GT2 (all execution units enabled). In its top trim with 125 W TDP, "Alder Lake-S" will be a "16-core" processor with 8 each of "Golden Cove" and "Gracemont" cores enabled. There will be 80 W TDP models with the same 8+8 core configuration, which are probably "locked" parts. Lastly, the lower rungs of the product stack will completely lack "small" cores, and be 6+0, with only high-performance cores. A recurring theme with all parts is the GT1 trim of the Gen12 iGPU.
Intel is working on a way to reconcile the feature-set and ISA differences between its "big" and "small" cores. The big "Golden Cove" core supports certain AVX-512 instructions, besides TSX-NI (Transactional Synchronization Extensions, i.e. hardware transactional memory), and FP16 (half-precision floating point). The smaller "Gracemont" core lacks these instruction sets. So whenever the OS schedules code that requires these instructions, the processor will be forced to wake up a "Golden Cove" core, and additional such cores as needed.
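The wake-up rule described above can be pictured as a simple eligibility check: a piece of work may only run on cores whose feature set covers everything it needs. The following is a minimal Python sketch of that idea, not Intel's mechanism, which the leaked slides do not describe. The core names, the feature lists, and the `eligible_cores` helper are all assumptions made up for illustration.

```python
# Illustrative only: hypothetical feature sets, not Intel's actual scheduling logic.
BIG_ONLY = frozenset({"avx512", "tsx", "fp16"})   # assumed big-core-only features
BASELINE = frozenset({"sse4_2", "avx2"})          # assumed features common to all cores

# Hypothetical 2+2 layout, kept small for readability.
CORES = {
    "golden_cove_0": BASELINE | BIG_ONLY,
    "golden_cove_1": BASELINE | BIG_ONLY,
    "gracemont_0": BASELINE,
    "gracemont_1": BASELINE,
}


def eligible_cores(required, cores=CORES):
    """Return the sorted names of cores whose feature set covers `required`."""
    need = frozenset(required)
    return sorted(name for name, feats in cores.items() if need <= feats)
```

With these made-up tables, a task that needs only "avx2" is eligible for all four cores, while a task that needs "avx512" is eligible for the two big cores only, so at least one of them would have to be awake.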

A quick reminder of the LGA1700 socket - this platform could see Intel introducing PCI-Express 5.0 I/O. There's also a possibility of DDR5 unbuffered memory support. The significant increase in pin-count for the mainstream-desktop segment is probably attributable to a Ryzen-like consolidation of platform I/O from the PCH onto the CPU package, along with more CPU-attached PCIe lanes.
Sources: HXL (Twitter), VideoCardz, Zhihu (Forums)

16 Comments on Intel "Alder Lake" CPU Core Segmentation Sketched

#1
Crackong
What is the point of big.LITTLE on Desktop Platform ?
#2
HwGeek
Maybe to offer "More cores" in PR materials to compete with AMD?
#3
biffzinker
Crackong
What is the point of big.LITTLE on Desktop Platform ?
Energy efficiency?
#4
watzupken
biffzinker
Energy efficiency?
This may be true, but it is not a need for desktop processors that pull power from the mains. To me there may be 2 issues here:

(1) Another layer of software optimization is required for switching between high and low performance cores, which may cause issues due to buggy drivers/OS
(2) Consumers are forced to pay Intel extra for the Tremont cores that they don't need
HwGeek
Maybe to offer "More cores" in PR materials to compete with AMD?
I was thinking the same thing as well. Just so that they can advertise as having "up to 16 cores".
#5
londiste
That "available only when the big cores are enabled" sounds suspiciously like it should say when only big cores are enabled.
#6
ncrs
londiste
That "available only when the big cores are enabled" sounds suspiciously like it should say when only big cores are enabled.
That's my interpretation as well. I highly doubt that it will dynamically switch execution to big cores when AVX-512 is used. This would end in applications that try to detect AVX-512 at startup being locked to the big cores forever. Also I'm not sure if there's any OS capable of scheduling processes in an environment like that (different cores having different instruction sets), but I might be wrong ;)
#7
R0H1T
ncrs
That's my interpretation as well. I highly doubt that it will dynamically switch execution to big cores when AVX-512 is used. This would end in applications that try to detect AVX-512 at startup being locked to the big cores forever. Also I'm not sure if there's any OS capable of scheduling processes in an environment like that (different cores having different instruction sets), but I might be wrong ;)
Not sure why that'd be an issue, it really doesn't depend on the OS as much as the application & if Intel try to shoehorn this (AVX512) into something like a Lakefield it'll fail harder than ever!
#8
ncrs
R0H1T
Not sure why that'd be an issue, it really doesn't depend on the OS as much as the application & if Intel try to shoehorn this (AVX512) into something like a Lakefield it'll fail harder than ever!
Well it's the OS' responsibility to schedule processes onto cores. So either the OS becomes aware of differing instruction sets or somehow it passes this responsibility to the hardware. Either way will require OS modification and will make older systems unable to utilize this new CPU fully.
#9
R0H1T
It will, that's a given, & if the application supports it (AVX512) then ideally it ought to load the big core anyway, unless power or thermal constrained. We've had this same debate what, 2 or 3 years back, with the AIDA64 developer(?) & the consensus was similar to what I see in this thread. Of course almost everyone agreed that big.LITTLE won't come to pass on x86, which obviously didn't turn out so well. As for an application being locked to a single core, that's the job of the scheduler & I don't remember anything about instruction sets having a say in how processes & threads run on an(y) OS.
#10
efikkan
I'm not a fan of this hybrid technology. I don't believe it belongs in desktop computers, and will contribute to making scheduling harder, especially with different ISA features on the various cores. It's hard to tell, this could turn out okay, or very bad (like Itanium).
ncrs
Well it's the OS' responsibility to schedule processes onto cores. So either the OS becomes aware of differing instruction sets or somehow it passes this responsibility to the hardware. Either way will require OS modification and will make older systems unable to utilize this new CPU fully.
It's not a problem to query a core to find out all the supported ISA features. I just hope the executables have this flagged as well.
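As a rough illustration of such a per-core query, here is a Python sketch that groups feature flags by logical CPU. It assumes the text layout Linux uses for /proc/cpuinfo on x86 ("processor" and "flags" lines separated by a colon); the function names `features_by_core` and `flags_not_on_every_core` are hypothetical helpers, and nothing here says how a hybrid part would actually report its flags.

```python
def features_by_core(cpuinfo_text):
    """Map each logical CPU number to the set of flags listed for it.

    Assumes Linux-style /proc/cpuinfo text on x86.
    """
    result = {}
    current = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line between CPU entries
        key = key.strip()
        if key == "processor":
            current = int(value)
        elif key == "flags" and current is not None:
            result[current] = set(value.split())
    return result


def flags_not_on_every_core(per_core):
    """Return flags that at least one core reports and at least one core lacks."""
    if not per_core:
        return set()
    sets = list(per_core.values())
    return set.union(*sets) - set.intersection(*sets)
```

Feeding it the contents of /proc/cpuinfo on a Linux machine and getting an empty set back from `flags_not_on_every_core` would mean every logical CPU advertises the same flags; a non-empty set would be the mixed-ISA situation discussed in this thread.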
#11
londiste
The implementation will likely be the same as ARM's big.LITTLE - same ISA features across all cores. This makes scheduling a lot easier. I don't really see Intel having a choice, especially when Lakefield has already shown that scheduling on it (at least on Windows) is tricky and needs refinement.
#12
dragontamer5788
I'm trying to figure out how the CPU decides if a thread / process needs AVX512 or not.

I know that there's "vzeroupper", which helps differentiate between SSE and AVX code. If the upper half of the 256-bit registers has been zeroed by vzeroupper, then Windows knows to only save 128 bits instead of 256 bits per register between context switches. (community.intel.com/t5/Intel-ISA-Extensions/What-is-the-status-of-VZEROUPPER-use/td-p/1098375).

I'd imagine that a similar flag is used for AVX512. Saving 1/2 or 1/4th the registers is certainly a noble goal and probably already implemented in Linux and Windows. I'm not expert enough to know if that's the case for sure... but that'd be my guess for what Intel is going for here.
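To put rough numbers on how much register state is at stake, here is a small Python sketch that adds up the architectural vector registers in 64-bit mode. It deliberately ignores the legacy x87/MXCSR area and the XSAVE header, and it says nothing about whether or when an OS actually skips the upper state; `vector_state_bytes` is a name made up for this example.

```python
def vector_state_bytes(level):
    """Bytes of vector register state a thread can have live at `level`.

    64-bit mode only; x87/MXCSR state and the XSAVE header are left out.
    """
    sizes = {
        "sse": 16 * 16,              # XMM0-15, 128 bits each
        "avx": 16 * 32,              # YMM0-15, 256 bits each
        "avx512": 32 * 64 + 8 * 8,   # ZMM0-31, 512 bits each, plus k0-k7 masks
    }
    return sizes[level]
```

Per register the ratios are indeed 1/2 and 1/4, but because AVX-512 also doubles the register count to 32, the totals come out at 256, 512 and 2112 bytes, so a thread that only touches SSE state has roughly an eighth of the full AVX-512 state to save.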
#13
Ashtr1x
Big Little for desktop, the reason is pretty clear, their x86 uArch innovation stagnated along with their lithography node R&D. Rocket Lake leaks give us hints already - Ring Bus. RKL has odd HT vs Physical Core design, that's the only thing which comes to my mind, they do not have that Ring Bus scaling with their post Skylake Architectures so they are relying on less cores with high clock speed scaling and ST performance at the loss of HT performance, probably to keep them relevant in gaming. But this is going to hit them in the SMT performance again, with Consoles going Zen 2 based CPUs and more people buying high core parts, this is not good at all, AMD's SMT is already very strong, Ryzen 4000 will probably decimate Intel Z400 and Z500, esp the Z500 doesn't have the damn Gen 4 lanes from Chipset. Horrible, since X570 did it 1 year back.

This doesn't have any damned benefit in the Desktop LGA processors, even in the Alienware Area51M series or Clevo P870DM series LGA notebooks nobody gives a fuck about the damn big little like phones, where the li-ion battery power sipping increases rapidly by the higher performance cores in the ARM SoC along with ton of other dedicated modules for RF/GPU/Memory etc. Maybe their Mobile might benefit but still at the loss of powerful cores it's a hogwash, when AMD's BGA processors are beating Intel BGA lineup at perf/efficiency, loss - loss unless the ST performance of 8 physical cores is higher along with those 8 or 4 HT cores (RKL has 4 cores HT disabled as per rumors) .

This requires a lot of OS work AGAIN. AMD's NUMA processors already saw a lack of adoption, and even AMD abandoned them: X399 didn't get support for TR3000, and afaik only Milan moved more parts to the powerful cloud service providers like AWS. Apple probably also decided, with their R&D cash already going into the A series ARM processors, that an OS rewrite would be a huge waste of money when Mac sales are just 10% of their profit; better to spend it on their own x86-ARM translation and A series SoCs, since many users are into ultra thin and light machines and don't care whether it's BGA BS or not.

This doesn't paint a good picture, as Intel doesn't seem to have any confidence in their lineup, and this looks like a temporary band aid again on the LGA1700. I hope AMD doesn't chase this bullshit and stays true to their Desktop performance x86 leadership. TBH this won't make it to Xeon for sure; having cheap-arse crappy cores on the Xeon means server OS / Software / HW changes, and NO ONE wants to do that. Esp when AMD is piledriving and steamrolling with their EPYC and RYZEN CPUs on both Server and Consumer DIY.
#14
JayN
Intel's oneAPI could conceptually be extended to enable requesting a device of a different CPU type as easily as requesting a GPU, NNP or FPGA accelerator.
#15
dragontamer5788
Ashtr1x
AMD's NUMA processors had already seen their lack of adoption even AMD abandoned them,
Note: Zen2 is still NUMA. Its UMA-mode however is just far faster than Zen 1 or Zen+ was. As such, it is acceptable to run Zen2 in its UMA mode.

Netflix still uses Zen2 in its NUMA mode for far faster performance. See this for details: people.freebsd.org/~gallatin/talks/euro2019.pdf

Many programmers are unaware of the benefits of NUMA. So what Zen2 proves is that your UMA-mode needs to be reasonably fast, but not necessarily as fast as your NUMA mode. For the few programmers willing to go the extra mile and NUMA-optimize their code, NUMA will likely remain the fastest way of doing things as chiplets continue to become more common.

----------

Intel also does Sub-NUMA Clustering (SNC) for a similar effect on Skylake / Cascade Lake systems. When you have 16, 32, 64 cores... it turns out that some RAM locations are "closer" than others depending on the core you're using. The reality of NUMA is inevitable as we get more and more cores.

The question is if UMA-emulation (by round-robin distributing the data across all memory controllers) will remain fast enough that we can ignore the difficulties of NUMA in typical workloads. (IE: Zen2). But NUMA is the underlying ground truth of the physics and reality of these chips.
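A toy Python model of that round-robin distribution: consecutive chunks of the address space are dealt out to the memory controllers in turn, so only a fixed share of any core's accesses lands on its nearest controller. The 4096-byte interleave granule and both function names are assumptions for illustration; real hardware may use a different granularity and address hashing.

```python
def controller_for(address, num_controllers, granule=4096):
    """Memory controller serving `address` under round-robin interleaving.

    `granule` (bytes per interleave chunk) is an assumed value.
    """
    return (address // granule) % num_controllers


def local_fraction(local_controller, num_controllers, num_granules, granule=4096):
    """Share of `num_granules` consecutive chunks that land on the local controller."""
    hits = sum(
        1
        for i in range(num_granules)
        if controller_for(i * granule, num_controllers, granule) == local_controller
    )
    return hits / num_granules
```

With 4 controllers, `local_fraction(0, 4, 1024)` comes out at 0.25: three quarters of a linear walk through memory goes to a remote controller, which is the cost UMA-emulation accepts in exchange for not having to place data by hand. NUMA-aware code tries to push that local share towards 1.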
#16
JayN
watzupken
(2) Consumers are forced to pay Intel extra for the Tremont cores that they don't need
I was thinking the same thing as well. Just so that they can advertise as having "up to 16 cores".
Gracemont adds AVX2, I believe. It will be interesting to see if its AVX2 performs comparably to AMD's AVX2. If so, then Intel should be able to match AMD SIMD processing results but with reduced chip area and power vs use of Intel large cores.

Tremont cores were about 1/4 the area of Sunny Cove cores on Lakefield, based on die pictures I've seen posted.