Friday, July 17th 2020

Windows 10 Scheduler Aware of "Lakefield" Hybrid Topologies, Benchmarked

A performance review of the Intel Core i5-L16G7 "Lakefield" hybrid processor (powering a Samsung Galaxy Book S notebook) was recently published by Golem.de. It gives an in-depth look at Intel's ambitious new processor design, which sets in motion the two philosophies Intel will build its future processors on. The first is packaging modularity, enabled by new chip-packaging technologies such as Foveros. The second is hybrid processing, where two sets of CPU cores with vastly different microarchitectures and significantly different performance/Watt curves let the processor respond to different kinds of workloads while keeping power draw low. This concept was commercially proliferated first by Arm, with its big.LITTLE topology that took to the market around 2013. The "Lakefield" i5-L16G7 combines one high-performance "Sunny Cove" CPU core with four smaller "Tremont" cores and a Gen11 iGPU.

The Golem.de report reveals that the Windows 10 thread scheduler is aware of the hybrid multi-core topology of "Lakefield," and that it can classify workloads at a very advanced level so that the right kind of core is in use at any given time. The "Sunny Cove" core is called upon for interactive, heavily serial processing loads. These can be as ordinary as launching applications, opening new tabs in a multi-process web browser, or less-parallelized media encoding. The four "Tremont" cores keep the machine "cruising," handling much of an application's operational workload, and they are also better suited to highly parallelized workloads. This is similar to a hybrid automobile, where the combustion engine provides tractive effort from 0 km/h, while the electric motor sustains a cruising speed.
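To make the division of labor concrete, here is a toy placement rule. It is only a sketch under stated assumptions. The `Workload` fields and the threshold of four threads are hypothetical, and Golem's report does not describe how Windows actually classifies work. The sketch sends bursty, interactive, mostly serial work to the big core and keeps background and highly parallel work on the small cores.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    interactive: bool  # a user is waiting on it (app launch, new browser tab)
    threads: int       # how many threads it can keep busy at once


def pick_cluster(w: Workload) -> str:
    """Toy placement rule for illustration; NOT the real Windows 10 heuristic."""
    if w.threads >= 4:
        return "tremont"     # highly parallel: spread across the four small cores
    if w.interactive:
        return "sunny_cove"  # bursty, latency-sensitive, mostly serial: big core
    return "tremont"         # background "cruising" work stays on the small cores


for w in (Workload("app launch", True, 1),
          Workload("tile render", False, 4),
          Workload("background sync", False, 1)):
    print(f"{w.name}: {pick_cluster(w)}")
```

A real scheduler also weighs run history, power state and priority. The point here is only that placement depends on the kind of work, not just on how much of it there is.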
The Core i5-L16G7 has a rated SDP (scenario driven power) rating of 7 W. The package PL1 value is 7 W, too. Intel also gave the chip a PL2 value of 9.5 W and a Tau value of 28 seconds. Some notebook vendors, however, are expected to set PL1 at 5 W, and raising it to 7 W will only be possible through a UEFI firmware update. Throughout its testing, Golem observed that the "Sunny Cove" core kicks in during interactive workloads that require burst performance from the CPU. The core is typically clocked at 2.50 GHz and occasionally reaches 2.90 GHz. The smaller "Tremont" cores typically run at 1.90 GHz under load and can boost up to 2.70 GHz.
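The effect of PL1, PL2 and Tau, and why a 5 W PL1 matters, can be sketched with a simple model. The model treats package power as an exponentially weighted moving average with time constant Tau. The chip may draw up to PL2 until that average reaches PL1, and then it drops back to PL1. This is a simplification, not Intel's exact firmware algorithm. The 0 W starting average (a cold, idle chip) is also an assumption. Under this model, the stock 7 W PL1 allows roughly 37 s at 9.5 W, while a 5 W PL1 cuts that to about 21 s. Short bursts are therefore unaffected, but longer multi-core runs are.

```python
import math


def boost_seconds(pl1_w, pl2_w, tau_s, start_avg_w=0.0, dt_s=0.01):
    """Seconds the package can draw PL2 before its running-average power reaches PL1.

    Simplified model (an assumption, not Intel's exact firmware algorithm):
    package power is tracked as an exponentially weighted moving average with
    time constant tau_s, and the chip falls back to PL1 once that average hits PL1.
    """
    if pl2_w <= pl1_w:
        return math.inf  # the average never rises above PL1, so boost never ends
    avg, t = start_avg_w, 0.0
    while avg < pl1_w:
        avg += (pl2_w - avg) * dt_s / tau_s  # one discrete EWMA step toward PL2
        t += dt_s
    return t


def boost_seconds_closed_form(pl1_w, pl2_w, tau_s, start_avg_w=0.0):
    """Same model solved analytically: pl2 - (pl2 - start) * exp(-t / tau) = pl1."""
    return tau_s * math.log((pl2_w - start_avg_w) / (pl2_w - pl1_w))


# Intel's PL1 (7 W) versus the 5 W PL1 some vendors are expected to ship.
for pl1 in (7.0, 5.0):
    print(f"PL1 {pl1} W: ~{boost_seconds(pl1, 9.5, 28.0):.1f} s at PL2")
```

The step-by-step loop mirrors how a running average is updated in practice. The closed form serves as a cross-check on the same model.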

Perhaps the biggest dividend of the Windows scheduler's topology awareness is its core rotation policy. By default, the Windows scheduler spreads a single-threaded workload across multiple cores (in sequence). AMD had to work with Microsoft to make Windows aware of the topology of its multi-CCX Ryzen processors, so that workloads aren't spread across two CCXs when they don't have to be. Similarly, with "Lakefield," core rotation is confined to the "Tremont" cores.
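Here is a minimal sketch of why localized rotation helps, assuming a plain round-robin rotation. The actual Windows rotation algorithm is not described in the source, and the core numbering below is hypothetical. The sketch counts how often a single thread migrates between clusters. On Lakefield, such a migration means moving to a core with a different microarchitecture. On a multi-CCX Ryzen, it means moving away from the L3 slice the thread was using.

```python
from itertools import cycle, islice

# Hypothetical Lakefield-style numbering: core 0 is the "Sunny Cove" core,
# cores 1-4 are the "Tremont" cores.
CLUSTER_OF = {0: "sunny_cove", 1: "tremont", 2: "tremont", 3: "tremont", 4: "tremont"}


def rotate(cores, slices):
    """Cores a single thread visits over `slices` time slices (plain round robin)."""
    return list(islice(cycle(cores), slices))


def cross_cluster_moves(schedule):
    """Count migrations that move the thread from one cluster to another."""
    return sum(CLUSTER_OF[a] != CLUSTER_OF[b] for a, b in zip(schedule, schedule[1:]))


# Topology-unaware: rotate over every core in the package.
naive = rotate(sorted(CLUSTER_OF), 10)
# Topology-aware: rotate only within the "Tremont" cluster.
local = rotate([c for c, cl in CLUSTER_OF.items() if cl == "tremont"], 10)

print(naive, cross_cluster_moves(naive))
print(local, cross_cluster_moves(local))
```

The naive rotation keeps bouncing the thread onto and off the big core. The localized one never leaves the "Tremont" cluster.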

Golem outlined the performance relationship between the single "Sunny Cove" core and the four "Tremont" cores. A single "Sunny Cove" core is anywhere from 25% to 65% faster than a single "Tremont" core. On the other hand, the block of four "Tremont" cores together offers about 2x the performance of the single "Sunny Cove" core. This gives the two core blocks very different performance and power characteristics.
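The two figures can be reconciled with a little arithmetic. Write $S$ for the throughput of one "Sunny Cove" core and $T$ for one "Tremont" core. For illustration, assume the single-core gap still holds per core when all four "Tremont" cores are busy. The multi-core scaling efficiency $\eta$ of the "Tremont" block then follows directly:

```latex
\[
\frac{S}{T} \in [1.25,\ 1.65], \qquad 4T_{\text{observed}} \approx 2S
\]
\[
\eta = \frac{2S}{4T} = \frac{1}{2}\cdot\frac{S}{T} \in \left[\frac{1.25}{2},\ \frac{1.65}{2}\right] = [0.625,\ 0.825]
\]
```

In other words, the four small cores appear to deliver only about 63 to 83 percent of what their single-core speed would suggest. That is consistent with the lower all-core clocks Golem reported, although the source does not break the gap down.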
The Core i5-L16G7 tests consistently ahead of the Qualcomm Snapdragon 8cx 8-core (4 big+4LITTLE) processor that has the same 7 W TDP. Golem ran the processor through 25 tests, comparing it with the Core i7-1065G7 "Ice Lake-U" (15 W), the Core i5-10210U "Comet Lake-U" (15 W), and the Pentium Silver N5000, a "Gemini Lake" SoC built only from low-power "Goldmont Plus" cores. Raising the power limits appears to increase the performance of the i5-L16G7 by 40% to 60%.
Much of what Intel learns from "Lakefield" will be implemented in future client-segment architectures such as "Meteor Lake," which will combine larger hybrid CPU core arrays to achieve high core counts. The i5-L16G7 allows notebook designers to make ultra-portable devices within the power envelope of a Snapdragon chip, but with the benefits of x86.

Find more benchmark results and commentary in the source link below.
Source: Golem.de

35 Comments on Windows 10 Scheduler Aware of "Lakefield" Hybrid Topologies, Benchmarked

#1
Crackong
In short: i7 prices + ARM performance?
#2
fynxer
"...first by Arm, with its big.LITTLE topology that took to the market around 2013."

Standing ovation to Intel for coming seven years late to the party.

Intel i5-L16G7 SoC at up to 7W
Consumer price: $281
Note: this is the consumer price, NOT the manufacturer purchase price, which is much lower.

Practically the whole computer is on this one chip, including the memory and I/O (not the SSD though).

BUT you still have to pay $1000-1400 for a low performance computer with this CPU.
#3
Betty (Kung Pow)
Gratz to M$ for being way faster at supporting this little quirk of Intel's than they were at supporting a completely new release by AMD with Zen...
#4
londiste
The Core i5-L16G7 tests consistently ahead of the Qualcomm Snapdragon 8cx 8-core (4 big+4LITTLE) processor that has the same 7 W TDP.
Not really. If a task is well threaded, the 8cx with its 8 cores is simply bigger. In Golem's story, some of this may be down to Lakefield's 5W limit, but that is a limited effect.
#5
Chrispy_
Am I missing something obvious here?

Ryzen 4800U - 8C/16T has a base clock of 1.8GHz at 10W. That's 8 full-fat cores with HT compared to Intel's single full-fat core and four Atoms at 7W.

Is that not on the charts because Intel would need to quadruple the scale just to fit it on the page?
#6
londiste
Chrispy_
Am I missing something obvious here?

Ryzen 4800U - 8C/16T has a base clock of 1.8GHz at 10W. That's 8 full-fat cores with HT compared to Intel's single full-fat core and four Atoms at 7W.

Is that not on the charts because Intel would need to quadruple the scale just to fit it on the page?
Yes. The 4800U has a 1.8GHz spec frequency at 15W. 8-core 4000U-series Ryzens are already quite power-starved at stock. 15W is three times 5W. What makes it worse at lower power limits is that the uncore is a small but constant power load, leaving relatively less for the cores.
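londiste's point about the uncore can be put in numbers with a few lines of Python. The 2 W constant uncore draw is a hypothetical figure chosen only for illustration, since neither post gives one. Under that assumption, the package limit falls 3x from 15 W to 5 W, but the per-core budget falls more than 4x. A fixed overhead therefore eats a growing share of the budget as the limit drops.

```python
def per_core_budget_w(package_w, uncore_w, cores):
    """Watts left for each core after a fixed uncore draw is subtracted."""
    return max(package_w - uncore_w, 0.0) / cores


UNCORE_W = 2.0  # hypothetical constant uncore draw, for illustration only
for package_w in (15.0, 10.0, 5.0):
    per_core = per_core_budget_w(package_w, UNCORE_W, cores=8)
    print(f"{package_w:4.1f} W package -> {per_core:.3f} W per core")
```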
#7
cucker tarlson
multi - 0.7x perf of snap 8cx at 0.7x power
single - 1.25x perf of snap 8cx at 0.7x power

looks like very good perf increase for single core loads without hurting multi
#8
FreedomEclipse
~Technological Technocrat~
More importantly... can these 'little' cores be overclocked? :pimp:
#9
ppn
Why on earth is Core4 stuck at 200 MHz during the rendering test? And how is this a good thing? This scheduler is clueless.
#10
iO
ppn
Why on earth is Core4 stuck at 200 MHz during the rendering test? And how is this a good thing? This scheduler is clueless.
The Sunny Cove core is only meant for "interactive and responsive" tasks like starting applications and other bursty workloads. Rendering isn't interactive, and the four Tremont cores are more energy efficient at sustained and multithreaded loads.
#11
theoneandonlymrk
If this was out seven years ago I would be impressed but it's too late for this SKU to be a game changer.
I'm not sure the 12 core follow up will be any better either.
#12
Chrispy_
londiste
Yes. The 4800U has a 1.8GHz spec frequency at 15W. 8-core 4000U-series Ryzens are already quite power-starved at stock. 15W is three times 5W. What makes it worse at lower power limits is that the uncore is a small but constant power load, leaving relatively less for the cores.
You're saying 5W and 15W, and I'm not sure why.

I'm saying 4800U is rated down to 10W (AMD's official spec for 1.8GHz base clock with a cTDP) and the article is saying 7W typical and 9.5W boost for these Intel Hybrids. Sure, 10W is still bigger than 7W, but it's not the power gulf that you're trying to make out it is.
#13
_Flare
Golem's editor wrote:
One SNC core is 25 to 67 percent faster than one TNT core; on the other hand, four TNT cores offer roughly 2x the performance of one SNC core.
According to Intel's marketing, Sunny Cove should offer an average of 1.18x the IPC of Skylake.
Tremont IPC vs Sunny Cove IPC:

worst case: 1 / 1.67 ≈ 60%
best case: 1 / 1.25 = 80%

This chart shows an average IPC gain of 24% from Sandy Bridge to Skylake.

IPC compared to Skylake:
Broadwell 97%
Haswell 94%
Ivy Bridge 85%
Sandy Bridge 81%

Which Core µArch is Tremont's IPC similar to?
Let's do the math:
Skylake to Sunny Cove = x1.18
Sunny Cove vs Tremont varies from x1.67 to x1.25
which leads to:
Worst case: 1.18 / 1.67 ≈ 0.71, i.e. Tremont ≈ 71% of Skylake IPC
Best case: 1.18 / 1.25 ≈ 0.94, i.e. Tremont ≈ 94% of Skylake IPC
Averaged between the best and worst cases, Tremont should perform at 82.5% of Skylake's IPC, which is between Sandy Bridge and Ivy Bridge.
But the range spans from below Sandy Bridge up to Haswell.
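The arithmetic in _Flare's post can be reproduced with a short script. It takes the post's inputs as given: Intel's 1.18x marketing figure, Golem's 25 to 67 percent single-core gap, and the relative IPC list. Like the post, it assumes the whole Sunny Cove lead is IPC at equal clocks. Given the different clock speeds the article reports for the two core types, that assumption is only approximate.

```python
# Relative IPC figures exactly as given in the post (Skylake = 1.00).
IPC_VS_SKYLAKE = {
    "Sandy Bridge": 0.81,
    "Ivy Bridge": 0.85,
    "Haswell": 0.94,
    "Broadwell": 0.97,
    "Skylake": 1.00,
}
SUNNY_COVE_VS_SKYLAKE = 1.18  # Intel marketing figure quoted in the post


def tremont_vs_skylake(sunny_cove_lead):
    """Tremont IPC as a fraction of Skylake, given Sunny Cove's lead over Tremont.

    Like the post, this assumes the whole lead is IPC (both cores at equal clocks).
    """
    return SUNNY_COVE_VS_SKYLAKE / sunny_cove_lead


def nearest_core(ratio):
    """Name of the listed core whose IPC is closest to `ratio`."""
    return min(IPC_VS_SKYLAKE, key=lambda core: abs(IPC_VS_SKYLAKE[core] - ratio))


worst = tremont_vs_skylake(1.67)  # Sunny Cove 67% faster
best = tremont_vs_skylake(1.25)   # Sunny Cove 25% faster
average = (worst + best) / 2
for label, r in (("worst", worst), ("best", best), ("average", average)):
    print(f"{label}: {r:.1%} of Skylake IPC, nearest {nearest_core(r)}")
```

The unrounded average comes out at about 82.5%, matching the post. The nearest listed core is Sandy Bridge, with Ivy Bridge the next step up.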
#14
dragontamer5788
Betty (Kung Pow)
Gratz to M$ for being way faster at supporting this little quirk of Intel's than they were at supporting a completely new release by AMD with Zen...
Heterogeneous compute was implemented with ARM big.LITTLE architecture. Since this "Lakefield" is basically Intel's implementation of that architecture, Microsoft probably didn't have to do anything but port over big.LITTLE code to x86.

AMD Zen's CCX architecture is very new. I'm not sure if any other architecture has "localized" L3 cache to 4 physical cores with very slow (slower than DDR4) transfers between L3 clusters. I can very well see Microsoft having to start from scratch to implement a good scheduler on AMD Zen's architecture.
#15
Assimilator
Finally, tests that are actually useful because they're against a competitor part designed for the same low-power scenario!

And the results are very impressive considering this is a first-gen attempt. At only ~71% of the 8cx's power (5W vs 7W), Lakefield is faster in all non-synthetic benchmarks, despite having a 3-core deficit.

In short, Windows-on-ARM just died, along with Qualcomm's ambitions for ultra-portables in the x86 space. Conversely, Microsoft now has a reason to try their hand at Windows Phone again.

This is the most exciting and most important thing to happen in the CPU industry since Ryzen.
Betty (Kung Pow)
Gratz to M$ for being way faster at supporting this little quirk of Intel's than they were at supporting a completely new release by AMD with Zen...
Yeah, it's pretty fast to support something when you've already added the code to support a long time ago. Funny how that works.
Chrispy_
Sure, 10W is still bigger than 7W, but it's not the power gulf that you're trying to make out it is.
Apparently you're incapable of doing basic math... 10W is ~42% more power than 7W and 100% more than 5W, which are massive proportions at this level. Especially when talking about devices with incredibly limited passive cooling capabilities.
#16
yeeeeman
Crackong
In short: i7 prices + ARM performance?
Same price as the ARM version.
dragontamer5788
Heterogeneous compute was implemented with ARM big.LITTLE architecture. Since this "Lakefield" is basically Intel's implementation of that architecture, Microsoft probably didn't have to do anything but port over big.LITTLE code to x86.

AMD Zen's CCX architecture is very new. I'm not sure if any other architecture has "localized" L3 cache to 4 physical cores with very slow (slower than DDR4) transfers between L3 clusters. I can very well see Microsoft having to start from scratch to implement a good scheduler on AMD Zen's architecture.
Don't be stupid. The big bulk of the work is done by AMD/Intel. Microsoft just integrated the required changes and does validation. So the slow transition for Zen is AMD's fault.
#17
londiste
Chrispy_
You're saying 5W and 15W, and I'm not sure why.

I'm saying 4800U is rated down to 10W (AMD's official spec for 1.8GHz base clock with a cTDP) and the article is saying 7W typical and 9.5W boost for these Intel Hybrids. Sure, 10W is still bigger than 7W, but it's not the power gulf that you're trying to make out it is.
Golem's article says their tested thing had 5W PL1 and 9.5W PL2 with 28sec Tau. This does not affect bursty stuff but does affect multicore tests that last longer.
The 4800U is rated 1.8GHz at 15W. At 10W, the frequencies will drop, and probably quite sharply. Renoir's power limits (similar to Matisse's) seem to be more akin to what Intel does, with 35% over the stated TDP (effectively, provided temperatures allow it) being common, at least in the H models and the lower U models I have seen reviews of.

There is a pretty big difference here.
#18
dragontamer5788
yeeeeman
Don't be stupid. The big bulk of the work is done by AMD/Intel. Microsoft just integrated the required changes and does validation. So the slow transition for Zen is AMD's fault.
www.microsoft.com/en-us/research/wp-content/uploads/2016/02/samehe-icac13.processorhetrogeneity.pdf
www.microsoft.com/en-us/research/wp-content/uploads/2012/05/main.pdf
docs.microsoft.com/en-us/windows-hardware/customize/power-settings/static-configuration-options-for-heterogeneous-power-scheduling

Microsoft has literally been working on heterogeneous schedulers for the last decade. I'm pretty sure this stuff was implemented as far back as Windows 8 (in some initial form, and probably optimized over the years as big.LITTLE architectures came out).
#19
yeeeeman
I really do miss the old times when review outlets were less biased and fanboyism was not even a topic. I remember AT articles written by Anand that were just as detailed for a shitty VIA CPU as for the best Intel or AMD CPU.
Nowadays, people are full of hate. If Intel is having a rough time, everyone wishes for its death. If Intel then comes along and tries to create an interesting product, everyone says it is shit and boring. If AMD had come out with something similar, everyone would be praising it for how good it is. Dear people, remove the hate/love and try to see objectively what each company is doing. AMD has good products, I agree. But AMD also has bad products, very bad actually. Likewise for Intel, for Nvidia, etc. So let's learn to judge each situation separately and not bring the same stupid hate to every product launch.
This product is a very interesting piece of design, silicon, power delivery, uArch, software, firmware, etc. It has some specific constraints that require a lot of effort, for example the very small package, like ARM SoCs. Stop being so ignorant and try to read every single piece of news with a clear mind. Intel will still launch great products in the future, so don't get bogged down in this 14nm+++++ hate. They didn't want to push it for so long, but making mistakes in the fab business is very costly and, as we saw, a multi-year affair to fix. We should appreciate what AMD has done, but stop being so polarized and also appreciate what the competition does, because it might just be a good product.
londiste
Golem's article says their tested thing had 5W PL1 and 9.5W PL2 with 28sec Tau. This does not affect bursty stuff but does affect multicore tests that last longer.
The 4800U is rated 1.8GHz at 15W. At 10W, the frequencies will drop, and probably quite sharply. Renoir's power limits (similar to Matisse's) seem to be more akin to what Intel does, with 35% over the stated TDP being common, at least in the H models and the lower U models I have seen reviews of.

There is a pretty big difference here.
Yeah, another big difference is that they are in two different leagues altogether in terms of package sizes, power requirements, minimum PCB size, etc. Intel can also drop their i7 15W CPUs to 4.5W (which they have actually done for a very long time), but Lakefield is not about beating benchmarks; it is about packaging, big.LITTLE in the x86 space, die stacking, etc.
#20
londiste
yeeeeman
Yeah, another big difference is that they are in two different leagues altogether in terms of package sizes, power requirements, minimum PCB size, etc. Intel can also drop their i7 15W CPUs to 4.5W (which they have actually done for a very long time), but Lakefield is not about beating benchmarks; it is about packaging, big.LITTLE in the x86 space, die stacking, etc.
Yup.
- Lakefield is reportedly 82mm^2 but probably gets added cost from new-ish packaging stuff. 8cx is ~112mm^2, Renoir is in the neighborhood of 150mm^2.
- PCB minimum size is a question of target market. Intel wants Lakefield to compete with the high end of ARM and has engineered and packaged Lakefield to do that. Renoir is a mobile chip but is simply not aimed that small or low.
- Intel has been doing ULV for a long while and the results have not been very good. Couple cores at 1GHz on 4.5W last I checked... meh. Atom is pretty competitive with these as well.

While the lines have been and are blurred, there are different target segments: 2-3W, ~5W, <10W, 15W, and 35-45W seem to be the main ones in mobile, each with different requirements in addition to the power limit.
#21
yeeeeman
Renoir can't fit in the same package as this. Renoir also needs external memory and external MOSFETs instead of a PMIC, which all adds up to far more board space compared to the tiny Lakefield package. Also, I think it can be clearly seen that Lakefield is very bottlenecked by bad process, because while in ST workloads it does very well and beats Snapdragon and N5000 and even keeps up with 1065G7, in multicore, it doesn't have enough power headroom to push the frequencies high enough. Bake this chip on the same process the 8cx and Renoir are made on, and we'll talk again about the results.
Bottom line is that process is a key element in this industry. I've said many times that AMD's current success is in no small part thanks to TSMC's process advantage. Heck, ARM's success nowadays is in big part thanks to TSMC being better. So Intel really needs to get back to full speed on their process, or just license TSMC or whatever.
#22
londiste
yeeeeman
Also, I think it can be clearly seen that Lakefield is very bottlenecked by bad process, because while in ST workloads it does very well and beats Snapdragon and N5000 and even keeps up with 1065G7, in multicore, it doesn't have enough power headroom to push the frequencies high enough.
Tremont cores are probably not designed to run at high frequencies but you are right about process power not allowing more anyway. Single-core runs on Ice Lake core that does well enough given the power allowed to it, keeping up with 1065G7 is not surprising in that aspect :)
#23
yeeeeman
As far as I understand, the TNT cores run at 1.7-1.9GHz at full load on all cores, which is a tad low, given that they also have an IPC handicap vs Sunny Cove. 2-2.5GHz would bring them in line with the SD 8cx, but I am guessing the 10nm used for this is not the one in Tiger Lake, which is supposedly much improved.
#24
InVasMani
The one key difference between what AMD has done with Precision Boost and chiplets and what Intel is doing here with big.LITTLE is that Intel isn't just combining multiple chips together; it is mixing different chip types that target either performance or efficiency. Used intelligently, that provides various benefits, like longer-lasting turbo boost, higher peak turbo boost, or simply higher IPC for a given workload from one core type or the other at the same frequency. It's good to see Intel doing something a little different. I've long felt a big.LITTLE approach will yield some of the easiest general improvements, which could then be scaled and intermixed a bit to suit certain consumers' requirements. I still really feel FPGAs could be the best all-around solution, apart from combining a variety of ASICs to specifically maximize and prioritize a handful of an individual's use cases. Eventually this will be one of the few low-hanging fruits left to leverage, so it has to happen for both Intel and AMD, not to mention Nvidia on the GPU side. Without a breakthrough on the manufacturing side or quantum computers really taking hold, this is how it is going to be moving forward, one way or another.
#25
Chrispy_
Assimilator
Apparently you're incapable of doing basic math... 10W is ~42% more power than 7W and 100% more than 5W, which are massive proportions at this level.
That's just insulting; you and I both know I've posted complex calculations on these forums before, and as a chartered engineer, even a bad one, basic math skills are a given.

He's simply trying to take the worst/worst scenario to exaggerate his point by looking only at the extreme options in order to falsely prop up his argument.

The article clearly states it's 7W, not 5W.
"The Core i5-L16G7 has a rated SDP (scenario driven power) rating of 7 W"

AMD's official cTDP values for a 4800U are 10-25W, and those official figures guarantee the 1.8GHz base clock, assuming adequate cooling is provided. If your 4800U doesn't achieve 1.8GHz at 10W, RMA it, because it's out of spec and therefore faulty.

10W compared to 7W may still be a sizeable 42% increase, but it's not the 200% increase he's trying to make it out to be. I'd certainly wager that a 10W 4800U is more than 42% faster than this 7W Core i5-L16G7....