
Intel 15th-Generation Arrow Lake-S Could Abandon Hyper-Threading Technology

Joined
Feb 3, 2017
Messages
3,510 (1.32/day)
Processor R5 5600X
Motherboard ASUS ROG STRIX B550-I GAMING
Cooling Alpenföhn Black Ridge
Memory 2*16GB DDR4-2666 VLP @3800
Video Card(s) EVGA Geforce RTX 3080 XC3
Storage 1TB Samsung 970 Pro, 2TB Intel 660p
Display(s) ASUS PG279Q, Eizo EV2736W
Case Dan Cases A4-SFX
Power Supply Corsair SF600
Mouse Corsair Ironclaw Wireless RGB
Keyboard Corsair K60
VR HMD HTC Vive
But there hasn't really been a concept of heterogeneous cores in the same package until now, and Windows Server has definitely encountered some of the same pains in core scheduling that Linux has. Everything I've read so far indicates that the issues desktop Windows has encountered around scheduling over heterogeneous cores are related to high-performance/low-latency applications like games, not the line-of-business stuff that you'd find running on servers.
ARM's big.LITTLE and its successor? Linux has had to deal with such scheduling questions for a while now. So has macOS due to Apple's M-series SoCs. Several approaches have been taken to implement scheduling for those. Btw, with both 3D V-Cache chiplets and Zen4c cores AMD seems to be heading in the same direction - not as extreme a difference, but still cores with different profiles. So Microsoft and Intel and probably AMD better put their heads together and figure out what works :)
 
Joined
Oct 28, 2012
Messages
1,159 (0.27/day)
Processor AMD Ryzen 3700x
Motherboard asus ROG Strix B-350I Gaming
Cooling Deepcool LS520 SE
Memory crucial ballistix 32Gb DDR4
Video Card(s) RTX 3070 FE
Storage WD sn550 1To/WD ssd sata 1To /WD black sn750 1To/Seagate 2To/WD book 4 To back-up
Display(s) LG GL850
Case Dan A4 H2O
Audio Device(s) sennheiser HD58X
Power Supply Corsair SF600
Mouse MX master 3
Keyboard Master Key Mx
Software win 11 pro
Lots of background tasks and editing a photo with complex filters while watching a 4K movie on another screen is another example that will load up 32 threads.
You must be using an old GPU or a really exotic codec if watching a 4K movie stresses your CPU. Consuming 4K content should be handled by the video decoder.
Editing raw/lossless video is the only time it should be acceptable to give up CPU resources to video playback.
 

Aquinus

Resident Wat-man
Joined
Jan 28, 2012
Messages
13,147 (2.92/day)
Location
Concord, NH, USA
System Name Apollo
Processor Intel Core i9 9880H
Motherboard Some proprietary Apple thing.
Memory 64GB DDR4-2667
Video Card(s) AMD Radeon Pro 5600M, 8GB HBM2
Storage 1TB Apple NVMe, 4TB External
Display(s) Laptop @ 3072x1920 + 2x LG 5k Ultrafine TB3 displays
Case MacBook Pro (16", 2019)
Audio Device(s) AirPods Pro, Sennheiser HD 380s w/ FIIO Alpen 2, or Logitech 2.1 Speakers
Power Supply 96w Power Adapter
Mouse Logitech MX Master 3
Keyboard Logitech G915, GL Clicky
Software MacOS 12.1
The Windows Server scheduler seems to work a bit differently than desktop Windows variants.
It's just a boolean setting that hasn't changed for decades that mucks with the "fairness" of each scheduled thread - that is, how important your hungriest task is and how much more or less time it should be allocated because it's hungry for CPU cycles. Windows is not like Linux, where you have multiple different CPU schedulers, all with their own tradeoffs, and APIs you can implement should you feel so inclined to try and roll your own. Believe it or not, a lot of performance can be had with a well-tuned scheduler. Limiting context switching can do wonders for cache locality and hit ratios, so a "smart" scheduler can reap a lot of benefits when done right, or cause massive performance regressions when done wrong. So to be honest, MS probably just doesn't want to screw with it because the NT kernel wasn't built for it in the same way the Linux kernel was.

and probably AMD better put their heads together and figure out what works
AMD probably understands that CPU scheduling is hard and having all the same kind of cores not only makes it easier to schedule, but easier to manufacture.
 
Joined
Apr 24, 2020
Messages
2,573 (1.73/day)
AMD probably understands that CPU scheduling is hard and having all the same kind of cores not only makes it easier to schedule, but easier to manufacture.

If big.LITTLE ends up winning, AMD could easily switch its cores to that technology and benefit after the Linux and Windows kernels have been optimized for it. It means that AMD will necessarily be later to market, but strategically choosing where to be a leader and where to be a follower is the job of the CEO / CTO.

Given that the software is open source, and that schedulers for big.LITTLE remain poor in practice (even if they theoretically can be fixed), there's a lot of sense in waiting for the future algorithms to be implemented, rather than creating a big.LITTLE chip prematurely.

-----------

It's not even clear if big.LITTLE will be a better plan than SMT (2 threads per core, or even IBM-style 4 threads per core or 8 threads per core). SMT, as you point out, is symmetrical. All cores can be treated the same and equivalent, which grossly eases any scheduling algorithm.

big.LITTLE remains a big deal however, because the nature of multithreaded vs single-threaded applications naturally lines up with big cores vs little cores. Long-running background tasks tend to need to be low-latency, and a full core running at very low power, like a LITTLE core, is the ideal processor for them. Bursty, high-speed work like video games, number crunching, and user-interface spiffiness relies upon big cores, however. Apple M1 and Android UIs need surprising amounts of compute to remain responsive in all scenarios. But how do we write an operating-system scheduler that can automatically determine which threads or processes should go on which cores?
 
Joined
Feb 18, 2005
Messages
5,344 (0.76/day)
Location
Ikenai borderline!
System Name Firelance.
Processor Threadripper 3960X
Motherboard ROG Strix TRX40-E Gaming
Cooling IceGem 360 + 6x Arctic Cooling P12
Memory 8x 16GB Patriot Viper DDR4-3200 CL16
Video Card(s) MSI GeForce RTX 4060 Ti Ventus 2X OC
Storage 2TB WD SN850X (boot), 4TB Crucial P3 (data)
Display(s) 3x AOC Q32E2N (32" 2560x1440 75Hz)
Case Enthoo Pro II Server Edition (Closed Panel) + 6 fans
Power Supply Fractal Design Ion+ 2 Platinum 760W
Mouse Logitech G602
Keyboard Logitech G613
Software Windows 10 Professional x64
But how do we write an operating-system scheduler that can automatically determine which threads or processes should go on which cores?
One would hope and expect that Microsoft would be able to figure this out, but given how apathetic towards desktop they've become since Azure took off, one would probably be mistaken.

OTOH, there's an equally compelling argument that Intel should be providing a "scheduling driver" for their CPUs that replaces the default one used by Windows. How much work it would be to update the NT kernel to support that level of flexibility is unknown, but given how long that kernel has been around and how much tech debt it's built up, I'd say "a lot". Which leads back to my first point about Microsoft not caring anymore.
 
Joined
Apr 24, 2020
Messages
2,573 (1.73/day)
One would hope and expect that Microsoft would be able to figure this out, but given how apathetic towards desktop they've become since Azure took off, one would probably be mistaken.

OTOH, there's an equally compelling argument that Intel should be providing a "scheduling driver" for their CPUs that replaces the default one used by Windows. How much work it would be to update the NT kernel to support that level of flexibility is unknown, but given how long that kernel has been around and how much tech debt it's built up, I'd say "a lot". Which leads back to my first point about Microsoft not caring anymore.

Honestly, I think the solution is "don't automatically figure it out". Instead, defer to the programmer.

Programmers in Linux and Windows land both have access to "Core Affinity" flags. (https://learn.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-setprocessaffinitymask) and (https://man7.org/linux/man-pages/man3/pthread_setaffinity_np.3.html). Modern programmers likely will have to pay attention to affinity more often. Those without affinity will default to LITTLE cores (which are more plentiful, and will encourage programmers to pick big-cores when they need it).
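For example, a minimal sketch of the Linux side ( pthread_setaffinity_np; the choice of logical CPU 0 is just a placeholder for illustration ) could look something like this:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *worker( void *arg )
{
    /* ... actual work ... */
    return NULL;
}

int main( void )
{
    pthread_t t;
    pthread_create( &t, NULL, worker, NULL );

    cpu_set_t set;
    CPU_ZERO( &set );
    CPU_SET( 0, &set );   /* placeholder: pin the worker to logical CPU 0 */

    /* pthread_setaffinity_np returns an error number ( not errno ) on failure */
    int rc = pthread_setaffinity_np( t, sizeof( set ), &set );
    if( rc != 0 )
        fprintf( stderr, "pthread_setaffinity_np failed: %d\n", rc );

    pthread_join( t, NULL );
    return 0;
}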

Or... something like that. On the other hand, maybe this would just cause a bunch of programmers to copy/paste code to always use big cores, and that's also counter-productive. Hmmmmmmmmm. It's almost an economics game, the OS-writers don't really control the application-programmers. (E.g. Windows developers can't control the video-game programmers behind Call of Duty or whatever.) But a system needs to be set up so that the resources of the computer are optimized.

-------

Given how things are right now, a scheduler is needed. But I'm not 100% convinced that a scheduler is the right solution overall.
 
Joined
Jan 2, 2019
Messages
63 (0.03/day)
Location
Calgary, Canada
>>...Losing Hyper-Threading could significantly impact Arrow Lake's multi-threaded application performance...
>>...
>>... Estimates suggest HT provides a 10-15% speedup across heavily-threaded workloads by enabling logical cores...

No and No for both cases. Period.

I regret to see that speculation regarding the performance of HTT-enabled processing continues. Unfortunately, there is still a misunderstanding of how HTT actually works!

Please take a look at a Video Technical Report:

Intel Hyper Threading Technology and Linpack Benchmark ( VTR-015 )

which I published in May 2019. Take a look at Slide 19 and Slide 20 ( performance data and graphs for LINPACK tests ).

There are also performance data for matrix multiplication algorithms, Intel MKL vs. Strassen for Single-precision ( 24-bit ) and Double-precision ( 53-bit ).

I'd like to repeat that Peak Processing Power of HTT-enabled applications is achieved when only one, and only one, of the two Logical Processors of a Physical Core is used.
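As a rough illustration only: a minimal Windows sketch that spawns one FP-heavy worker per Physical Core and pins each one to the first Logical Processor of that core. It assumes 2-way SMT, fewer than 64 Logical Processors, and the common enumeration where Logical Processors 2k and 2k+1 are siblings of core k - please verify the actual mapping with CPUID or the OS topology APIs before relying on it:

#include <windows.h>
#include <thread>
#include <vector>

static void FpuHeavyWork()
{
    volatile double x = 1.0;
    for( int i = 0; i < 100000000; i++ )
        x = x * 1.0000001 + 0.0000001;
}

int main()
{
    SYSTEM_INFO si = { 0 };
    ::GetSystemInfo( &si );

    // Assumption: every core has exactly 2 Logical Processors ( 2-way SMT )
    unsigned int uiLogical  = si.dwNumberOfProcessors;
    unsigned int uiPhysical = uiLogical / 2;

    std::vector< std::thread > workers;
    for( unsigned int uiCore = 0; uiCore < uiPhysical; uiCore++ )
    {
        workers.emplace_back( [uiCore]()
        {
            // Pin to Logical Processor ( 2 * core ), leaving its SMT sibling idle
            ::SetThreadAffinityMask( ::GetCurrentThread(), ( DWORD_PTR )1 << ( 2 * uiCore ) );
            FpuHeavyWork();
        } );
    }
    for( auto &w : workers )
        w.join();
    return 0;
}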
 
Joined
Jun 10, 2014
Messages
2,907 (0.80/day)
Processor AMD Ryzen 9 5900X ||| Intel Core i7-3930K
Motherboard ASUS ProArt B550-CREATOR ||| Asus P9X79 WS
Cooling Noctua NH-U14S ||| Be Quiet Pure Rock
Memory Crucial 2 x 16 GB 3200 MHz ||| Corsair 8 x 8 GB 1333 MHz
Video Card(s) MSI GTX 1060 3GB ||| MSI GTX 680 4GB
Storage Samsung 970 PRO 512 GB + 1 TB ||| Intel 545s 512 GB + 256 GB
Display(s) Asus ROG Swift PG278QR 27" ||| Eizo EV2416W 24"
Case Fractal Design Define 7 XL x 2
Audio Device(s) Cambridge Audio DacMagic Plus
Power Supply Seasonic Focus PX-850 x 2
Mouse Razer Abyssus
Keyboard CM Storm QuickFire XT
Software Ubuntu
Honestly, getting rid of HT seems weird. It doesn’t really have any major downsides and leaving any performance on the table is unlike Intel. Unless they are THAT sure that the new approach will compensate and more?
Back when HT was introduced with the Pentium 4, it was a relatively "cheap" way, in terms of added complexity and die space, to harness a lot of otherwise wasted clock cycles as "free" performance.
As CPU frontends have become vastly more efficient, the relative potential has decreased. Along with ever more complex and wide CPU designs, security concerns and added complexity have made HT ever more costly to implement and maintain. It has long been overdue for a replacement, or for being dropped outright. These are development resources and die space which could be better spent elsewhere.

Both Intel and AMD use HT/SMT for a reason still. Tons of software is optimized with this in mind.
Only indirectly, in terms of how many threads are spawned, etc.

Especially people with quad and hexa core chips should enable it for sure. Will make up for the lack of real cores.
Depends a lot on the workloads. SMT(HT) does wonders for some, not for others, and can sometimes introduce a lot of latency too.

We have to remember that a new design without HT wouldn't be the same as turning HT off on an old design. This would mean Intel could have prioritized a lot of resources for other features, either a replacement or other design considerations. So unless they screwed up*, there will highly likely be new benefits from dropping HT.

*) With large overhauls there is a higher risk of unforeseen problems leading to delays, or even to disabled features.
The main design of Arrow Lake is complete, so going without HT would maybe come with a later architecture.
Assuming Arrow Lake launches late this summer or in the fall, the entire design was completed by summer 2023 (tape-out), and the main design long before that.
So I think it's safe to assume that Intel and their trusted partners know which features are coming. ;)

Programmers in Linux and Windows land both have access to "Core Affinity" flags.<snip>
A little question on the side;
Do you have experience with the best approach to query each thread to find out various capabilities (SMT, big or little, and any future differences)? Is the CPUID instruction the preferred choice, or OS APIs?
 
Joined
Apr 24, 2020
Messages
2,573 (1.73/day)
A little question on the side;
Do you have experience with the best approach to query each thread to find out various capabilities (SMT, big or little, and any future differences)? Is the CPUID instruction the preferred choice, or OS APIs?

I have no experience, but I have knowledge... reading how others have solved this problem...

I know that there are various NUMA APIs to query the capabilities of sets of cores. https://learn.microsoft.com/en-us/windows/win32/procthread/numa-support . I expect Linux to have similar APIs, though named differently of course. The main issue is that NUMA is about memory differences, not core differences, so the focus of NUMA APIs is closer to malloc/free. (Yes, some bits of memory are closer to Core#1 or Core#50... but NUMA has a focus on memory.) With big.LITTLE, there are newer APIs that handle the different cores, but I don't know them quite as well.
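For example, on Windows the processor topology API reports per-core SMT and an "efficiency class". A minimal sketch of walking it ( assuming Windows 10 or later, where PROCESSOR_RELATIONSHIP carries the EfficiencyClass field ) might look like:

#include <windows.h>
#include <vector>
#include <cstdio>

int main()
{
    // First call just tells us how big the buffer needs to be
    DWORD dwLen = 0;
    ::GetLogicalProcessorInformationEx( RelationProcessorCore, nullptr, &dwLen );

    std::vector< BYTE > buffer( dwLen );
    if( !::GetLogicalProcessorInformationEx( RelationProcessorCore,
            ( PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX )buffer.data(), &dwLen ) )
        return 1;

    for( DWORD dwOffset = 0; dwOffset < dwLen; )
    {
        auto pInfo = ( PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX )( buffer.data() + dwOffset );
        // LTP_PC_SMT means this physical core exposes more than one logical processor.
        // EfficiencyClass is 0 for the least performant class ( E-cores on Intel hybrid parts );
        // higher values mean more performant cores. On homogeneous CPUs it is 0 everywhere.
        printf( "Core: SMT=%d EfficiencyClass=%u\n",
                ( pInfo->Processor.Flags & LTP_PC_SMT ) ? 1 : 0,
                ( unsigned int )pInfo->Processor.EfficiencyClass );
        dwOffset += pInfo->Size;
    }
    return 0;
}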

For HPC, a common pattern I've seen is to have a startup-benchmark routine, where you attempt different strategies and perform a bit of self-optimizing / self-parameterization. So in practice, even the highest-performance programs ignore a lot of this, just collect practical data, and then self-tune around the problem.
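A toy sketch of that startup-benchmark idea ( the workload and the candidate thread counts are just placeholders, not from any real HPC code ):

#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

// Placeholder workload: a fixed amount of total work, split across N threads
static void WorkChunks( int iChunks )
{
    volatile double x = 1.0;
    for( int c = 0; c < iChunks; c++ )
        for( int i = 0; i < 1000000; i++ )
            x *= 1.000001;
}

static double RunWithThreads( unsigned int uiThreads, int iTotalChunks )
{
    auto t0 = std::chrono::steady_clock::now();
    std::vector< std::thread > pool;
    for( unsigned int i = 0; i < uiThreads; i++ )
        pool.emplace_back( WorkChunks, iTotalChunks / ( int )uiThreads );   // remainder ignored in this toy
    for( auto &t : pool )
        t.join();
    return std::chrono::duration< double >( std::chrono::steady_clock::now() - t0 ).count();
}

int main()
{
    unsigned int uiHw = std::thread::hardware_concurrency();
    // Candidate strategies: "half the logical CPUs" ( roughly physical cores ) vs "all logical CPUs"
    unsigned int uiCandidates[] = { uiHw / 2 ? uiHw / 2 : 1, uiHw ? uiHw : 1 };

    unsigned int uiBest = uiCandidates[ 0 ];
    double dBestTime = 1.0e300;
    for( unsigned int uiN : uiCandidates )
    {
        double dTime = RunWithThreads( uiN, 256 );
        printf( "%u threads: %.3f s\n", uiN, dTime );
        if( dTime < dBestTime ) { dBestTime = dTime; uiBest = uiN; }
    }
    printf( "Self-tuned thread count: %u\n", uiBest );
    return 0;
}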
 
Joined
Jan 2, 2019
Messages
63 (0.03/day)
Location
Calgary, Canada
Do you have experience with the best approach to query each thread to find out various capabilities (SMT, big or little, and any future differences)? Is the CPUID instruction the preferred choice, or OS APIs?

The CPUID instruction provides more details about processor features. OS APIs ( depending on the Operating System ) are recommended for Process or Thread affinity control and priorities.
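For example, a minimal sketch of the CPUID route for detecting an Intel hybrid processor and the type of the core the current thread is running on ( MSVC __cpuidex intrinsic; leaf 0x07 EDX bit 15 is the Hybrid flag and leaf 0x1A EAX[31:24] is the Core Type, per Intel's documentation ):

#include <intrin.h>
#include <cstdio>

int main()
{
    int regs[ 4 ] = { 0 };   // EAX, EBX, ECX, EDX

    __cpuidex( regs, 0x07, 0 );
    bool bHybrid = ( regs[ 3 ] >> 15 ) & 1;   // CPUID.07H:EDX[15] - Hybrid flag
    printf( "Hybrid CPU: %s\n", bHybrid ? "yes" : "no" );

    if( bHybrid )
    {
        // Note: the result applies to the Logical Processor this thread is
        // currently running on, so pin the thread first if that matters.
        __cpuidex( regs, 0x1A, 0 );
        unsigned int uiCoreType = ( ( unsigned int )regs[ 0 ] >> 24 ) & 0xFF;   // CPUID.1AH:EAX[31:24]
        printf( "This Logical Processor is a %s core\n",
                uiCoreType == 0x40 ? "P ( Core )" :
                uiCoreType == 0x20 ? "E ( Atom )" : "unknown" );
    }
    return 0;
}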

Here is a piece of test-code I used for Windows ( any version that supports Get/Set ThreadAffinity API ) in order to control thread affinity:
...
SYSTEM_INFO si = { 0 };
::GetSystemInfo( &si );

RTBOOL bRc = RTFALSE;
RThandle hProcess = RTnull;
RThandle hThread = RTnull;
RTulong dwProcessMask = 0;
RTulong dwSystemMask = 0;
RTulong dwThreadAM = 0;
RTulong dwThreadAMPrev = 0;
RTulong dwThread1PrefferedCPU = 0;
DWORD dwErrorCode = 0;

hProcess = SysGetCurrentProcess();
hThread = SysGetCurrentThread();

bRc = ::GetProcessAffinityMask( hProcess, ( PDWORD_PTR )&dwProcessMask, ( PDWORD_PTR )&dwSystemMask );

RTint iCpuNum = ( 8 - 1 );
RTint iThreadAffinityMask = _RUN_ON_CPU_08;   // Default Logical CPU 07 at the beginning of Verification
                                              // Take into account that Logical CPUs are numbered from 0
dwThreadAMPrev = ::SetThreadAffinityMask( hThread, iThreadAffinityMask );
SysSleep( 0 );
dwErrorCode = SysGetLastError();
CrtPrintf( RTU("\t\tSwitched to Logical CPU%d - Previous Thread AM: %3d - Error Code: %3d\n"),
           iCpuNum, dwThreadAMPrev, dwErrorCode );
for( RTuint i = 0; i < ( ( RTuint )16777216 * 128 ); i += 1 )
{
    volatile RTfloat fX = 32.0f;
    fX = ( RTfloat )i * ( fX * 2 ) * ( fX * 4 ) * ( fX * 8 );
}
SysSleep( 5000 );

iCpuNum = 0;

for( iThreadAffinityMask = 1; iThreadAffinityMask < 256; iThreadAffinityMask *= 2 )
{
    iCpuNum++;
    dwThreadAMPrev = ::SetThreadAffinityMask( hThread, iThreadAffinityMask );
    SysSleep( 0 );
    dwErrorCode = SysGetLastError();
    CrtPrintf( RTU("\t\tSwitched to Logical CPU%d - Previous Thread AM: %3d - Error Code: %3d - Thread Affinity: %3d\n"),
               ( iCpuNum - 1 ), dwThreadAMPrev, dwErrorCode, iThreadAffinityMask );
    for( RTuint i = 0; i < ( ( RTuint )16777216 * 128 ); i += 1 )
    {
        volatile RTfloat fX = 32.0f;
        fX = ( RTfloat )i * ( fX * 2 ) * ( fX * 4 ) * ( fX * 8 );
    }
    SysSleep( 5000 );
}
...

Do you have experience with the best approach to query each thread to find out various capabilities (SMT, big or little, and any future differences)? Is the CPUID instruction the preferred choice, or OS APIs?

It is very easy to get Time Stamp Counter value for a Logical CPU using RDTSC instruction:
...
// Test-Case 3 - Retrieving RDTSC values for Logical CPUs
{
    CrtPrintf( RTU("\n\tTest-Case 3 - Retrieving RDTSC values for Logical CPUs - 1\n") );

    RTBOOL bRc = RTFALSE;
    RThandle hProcess = RTnull;
    RThandle hThread = RTnull;
    RTulong dwProcessMask = 0;
    RTulong dwSystemMask = 0;
    RTulong dwThreadAM = 0;
    RTulong dwThreadAMPrev1 = 0;
    RTulong dwThreadAMPrev2 = 0;
    RTulong dwThread1PrefferedCPU = 0;

    ClockV cvRdtscCPU1 = { 0 };   // RDTSC Value for Logical CPU1
    ClockV cvRdtscCPU2 = { 0 };   // RDTSC Value for Logical CPU2

    while( RTtrue )
    {
        hProcess = SysGetCurrentProcess();
        hThread = SysGetCurrentThread();

        bRc = ::GetProcessAffinityMask( hProcess, ( PDWORD_PTR )&dwProcessMask, ( PDWORD_PTR )&dwSystemMask );
        if( bRc == RTFALSE )
        {
            CrtPrintf( RTU("\t\tError: [ GetProcessAffinityMask ] failed\n") );
            break;
        }

        bRc = SysSetPriorityClass( hProcess, REALTIME_PRIORITY_CLASS );
        if( bRc == RTFALSE )
        {
            CrtPrintf( RTU("\t\tError: [ SetPriorityClass ] failed\n") );
            break;
        }
        bRc = SysSetThreadPriority( hThread, THREAD_PRIORITY_TIME_CRITICAL );
        if( bRc == RTFALSE )
        {
            CrtPrintf( RTU("\t\tError: [ SetThreadPriority ] failed\n") );
            break;
        }

        dwThreadAMPrev1 = ::SetThreadAffinityMask( hThread, _RUN_ON_CPU_01 );
        SysSleep( 0 );
        cvRdtscCPU1.uiClockV = __rdtsc();

        dwThreadAMPrev1 = ::SetThreadAffinityMask( hThread, _RUN_ON_CPU_01 );
        // dwThreadAMPrev2 = ::SetThreadAffinityMask( hThread, _RUN_ON_CPU_02 );
        SysSleep( 0 );
        cvRdtscCPU2.uiClockV = __rdtsc();

        SysSetPriorityClass( hProcess, NORMAL_PRIORITY_CLASS );
        SysSetThreadPriority( hThread, THREAD_PRIORITY_NORMAL );

        CrtPrintf( RTU("\t\tRDTSC for Logical CPU1 : %.0f\n"), ( RTfloat )cvRdtscCPU1.uiClockV );
        CrtPrintf( RTU("\t\tRDTSC for Logical CPU2 : %.0f\n"), ( RTfloat )cvRdtscCPU2.uiClockV );
        CrtPrintf( RTU("\t\tRDTSC Difference: %.0f ( RDTSC2 - RDTSC1 )\n"),
                   ( RTfloat )( cvRdtscCPU2.uiClockV - cvRdtscCPU1.uiClockV ) );
        CrtPrintf( RTU("\t\tdwThreadAMPrev1 : %3d ( Processing Error if 0 )\n"), dwThreadAMPrev1 );
        CrtPrintf( RTU("\t\tdwThreadAMPrev2 : %3d ( Processing Error if 0 )\n"), dwThreadAMPrev2 );

        break;
    }
}
...
 
Joined
Jan 2, 2019
Messages
63 (0.03/day)
Location
Calgary, Canada
Why sleep so long?
...
Sleep( 5000 ); // 5 seconds
...
It is used to see a Logical Processor switch in Windows Task Manager. Please take a look at:

Time Stamp Counters of Logical CPUs on a Multi-Core Computer System with Windows 7 ( VTR-184 )

in order to see how Logical Processors are switched during real time test processing.

I'd like to mention one more thing, and it is very important: call Sleep( 0 ) after the affinity mask is changed, because a couple of hundred nanoseconds are needed to make a real physical switch, for example from Logical Processor 1 to Logical Processor 2.
 
Joined
Jun 10, 2014
Messages
2,907 (0.80/day)
Processor AMD Ryzen 9 5900X ||| Intel Core i7-3930K
Motherboard ASUS ProArt B550-CREATOR ||| Asus P9X79 WS
Cooling Noctua NH-U14S ||| Be Quiet Pure Rock
Memory Crucial 2 x 16 GB 3200 MHz ||| Corsair 8 x 8 GB 1333 MHz
Video Card(s) MSI GTX 1060 3GB ||| MSI GTX 680 4GB
Storage Samsung 970 PRO 512 GB + 1 TB ||| Intel 545s 512 GB + 256 GB
Display(s) Asus ROG Swift PG278QR 27" ||| Eizo EV2416W 24"
Case Fractal Design Define 7 XL x 2
Audio Device(s) Cambridge Audio DacMagic Plus
Power Supply Seasonic Focus PX-850 x 2
Mouse Razer Abyssus
Keyboard CM Storm QuickFire XT
Software Ubuntu
For HPC, a common pattern I've seen is to have a startup-benchmark routine, where you attempt different strategies and perform a bit of self-optimizing / self-parameterization. So in practice, even the highest-performance programs ignore a lot of this, just collect practical data, and then self-tune around the problem.
I can see that working well for large batch applications, especially if they aren't latency sensitive, but for games or certain desktop applications, synchronization can quickly cause a lot of latency.

If CPUs continue to become more and more diverse, having generic and reliable ways to determine capabilities will be necessary to scale well across generations, as well as across anything from low-end CPUs to high-end or multi-CPU systems. Certain software could be very sensitive to this, and it could ultimately impact end users' purchasing choices.

For HPC users running configurable or even custom software, I can imagine even manual calibration would be desirable; they usually don't run on 1000 different hardware configurations. :)

The CPUID instruction provides more details about processor features. OS APIs ( depending on the Operating System ) are recommended for Process or Thread affinity control and priorities.

Here is a piece of test-code I used for Windows ( any version that supports Get/Set ThreadAffinity API ) in order to control threads affinity:
<snip>
Thanks. I've saved that for later. :)
I haven't had time to dive into handling "hybrid" CPU designs yet, but I probably will have to eventually.
P.S. you might want to throw a spoiler tag around that code. ;)
 
Joined
Sep 1, 2020
Messages
2,061 (1.51/day)
Location
Bulgaria
Hmm, interesting video, especially the part with the multi-core test. It is clear to see why the performance is not equal to the sum of the IPC of all cores. Overlay?
 
Joined
Mar 18, 2023
Messages
618 (1.43/day)
System Name Never trust a socket with less than 2000 pins
No this is exactly for that reason, security on (shared) or private servers is a major issue. No one gives two hoots about HT on your gaming rig o_O

Until CPU vulnerabilities can be triggered from Javascript in a web browser. I don't agree that ignoring CPU vulnerabilities even on gaming rigs is a long-term option. Unless you truly have a dedicated gaming machine and another computer for all serious use.

It's not even clear if big.LITTLE will be a better plan than SMT (2 threads per core, or even IBM-style 4 threads per core or 8 threads per core). SMT, as you point out, is symmetrical. All cores can be treated the same and equivalent, which grossly eases any scheduling algorithm.

That is not true when the number of threads/processes to schedule is more than one but less than the number of real cores.

Imagine you have 2 threads munching around on a hyperthreaded CPU. It is imperative that you put them onto different real cores.

Honestly, I think the solution is "don't automatically figure it out". Instead, defer to the programmer.

Programmers in Linux and Windows land both have access to "Core Affinity" flags. (https://learn.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-setprocessaffinitymask) and (https://man7.org/linux/man-pages/man3/pthread_setaffinity_np.3.html). Modern programmers likely will have to pay attention to affinity more often. Those without affinity will default to LITTLE cores (which are more plentiful, and will encourage programmers to pick big-cores when they need it).

Or... something like that. On the other hand, maybe this would just cause a bunch of programmers to copy/paste code to always use big cores, and that's also counter-productive. Hmmmmmmmmm. It's almost an economics game, the OS-writers don't really control the application-programmers.

That scheme breaks down if you have more than one application on the machine, or even just multiple instances of the same program. Each instance can grab all the premium CPU resources for itself.

When there are conflicting workloads there needs to be a separate authority to distribute compute resources, including access to premium cores. That is the OS scheduler.

I have personally experimented with deliberate core placement (in a server-class application) and I always gave up because the OS scheduler did a better job as far as overall system throughput is concerned.
 
Joined
Apr 24, 2020
Messages
2,573 (1.73/day)
That is not true when the number of threads/processes to schedule is more than one but less than the number of real cores.

Imagine you have 2 threads munching around on a hyperthreaded CPU. It is imperative that you put them onto different real cores.

Fair point. SMT / Hyperthreads aren't perfectly symmetrical. (Logical core vs physical core matters.) But it's still easier than big.LITTLE. So maybe I was a bit too "absolute" with my earlier language, but the gist is still correct. big.LITTLE is much harder to schedule (and is much worse in practice than in theory because of this difficulty). But it seems like a solvable problem nonetheless (just not solved yet, today).
 
Joined
Mar 18, 2023
Messages
618 (1.43/day)
System Name Never trust a socket with less than 2000 pins
Does HT impact power consumption in any meaningful way? If it does, maybe that is Intel's goal. Substitute HT with ecores to reduce power consumption

I don't think the issue is power, the issue is silicon (die) space. The machinery for HT on a P-core is probably as big as or bigger than an entire E-core. HT brings you 15-20%, another E-core brings you 50% of a P-core.
 
Joined
Apr 24, 2020
Messages
2,573 (1.73/day)
I don't think the issue is power, the issue is silicon (die) space. The machinery for HT on a P-core is probably as big as or bigger than an entire E-core. HT brings you 15-20%, another E-core brings you 50% of a P-core.

I severely doubt that. HT recycles resources like no tomorrow. The "biggest" portion of the HT is just the decoder.

But the register files, reorder-buffer, L1 / L2 data-cache, vector-units, pipelines, and more are all shared / recycled. That's more than 80% of the core. The remaining 20% (decoder, branch prediction, L1-code cache) is still needed even for single-thread-per-core applications, but obviously needs to be bigger for a Hyperthread.
 
Joined
Mar 18, 2023
Messages
618 (1.43/day)
System Name Never trust a socket with less than 2000 pins
Fair point. SMT / Hyperthreads aren't perfectly symmetrical. (Logical core vs physical core matters.) But it's still easier than big.LITTLE. So maybe I was a bit too "absolute" with my earlier language, but the gist is still correct. big.LITTLE is much harder to schedule (and is much worse in practice than in theory because of this difficulty). But it seems like a solvable problem nonetheless (just not solved yet, today).

Complete agreement.

Part of the problem here is that Windows, which supposedly has the most advanced integration with Intel's "Thread Director", is not open source. We can see the Linux scheduler, but they have been fiddling with it so much just in the last 2 months that it is hard to see what is going on.

I severely doubt that. HT recycles resources like no tomorrow. The "biggest" portion of the HT is just the decoder.

But the register files, reorder-buffer, L1 / L2 data-cache, vector-units, pipelines, and more are all shared / recycled. That's more than 80% of the core. The remaining 20% (decoder, branch prediction, L1-code cache) is still needed even for single-thread-per-core applications, but obviously needs to be bigger for a Hyperthread.

Yeah, but how much bigger is a current P-core than a current E-core? I thought it was a factor of 10. So the math might still work out in favor of E-cores instead of HT, die-space wise.
 
Joined
Apr 24, 2020
Messages
2,573 (1.73/day)
Yeah, but how much bigger is a current P-core than a current E-core? I thought it was a factor of 10. So the math might still work out in favor of E-cores instead of HT, die-space wise.


( attached die-shot image: 1705959880526.png )


I'd estimate 5x E-cores vs 1x P-core, just spitballing by looking at this image. A lot of this is because 768 kB of L2 cache per E-core (3 MB shared between 4x E-cores) is just naturally going to be smaller than 2 MB per P-core.

You're right that this is quite a bit smaller than I thought, though not quite the 1-to-10 ratio you initially assumed.
 
Joined
Jun 10, 2014
Messages
2,907 (0.80/day)
Processor AMD Ryzen 9 5900X ||| Intel Core i7-3930K
Motherboard ASUS ProArt B550-CREATOR ||| Asus P9X79 WS
Cooling Noctua NH-U14S ||| Be Quiet Pure Rock
Memory Crucial 2 x 16 GB 3200 MHz ||| Corsair 8 x 8 GB 1333 MHz
Video Card(s) MSI GTX 1060 3GB ||| MSI GTX 680 4GB
Storage Samsung 970 PRO 512 GB + 1 TB ||| Intel 545s 512 GB + 256 GB
Display(s) Asus ROG Swift PG278QR 27" ||| Eizo EV2416W 24"
Case Fractal Design Define 7 XL x 2
Audio Device(s) Cambridge Audio DacMagic Plus
Power Supply Seasonic Focus PX-850 x 2
Mouse Razer Abyssus
Keyboard CM Storm QuickFire XT
Software Ubuntu
That scheme breaks down if you have more than one application on the machine, or even just multiple instances of the same program. Each instance can grab all the premium CPU resources for itself.
<snip>
That's absolutely a fair point, and is one of my primary concerns too.
That's why I think it's mainly useful to know the number of different classes of resources, like P-cores and E-cores, and whether these have SMT or not. And for non-x86, whether SMT is 2-way, 4-way or 8-way, or if there are more exotic core configurations (aren't there ARM designs with three different core types?). Assuming all "threads" are equal can result in sub-optimal performance in synchronous workloads.

As we all know, no piece of code will scale perfectly under all circumstances, but at least it will be useful to have some kind of feature detection so an application/game doesn't completely "sabotage" itself if Intel or AMD releases a new "unusual" P-core/E-core mix. :)
 
Joined
Jan 2, 2019
Messages
63 (0.03/day)
Location
Calgary, Canada
Imagine you have 2 threads munching around on a hyperthreaded CPU. It is imperative that you put them onto different real cores.

That is correct. A software developer should take into account what processor units could be used for processing.

It means that in the case of HPC and floating-point-arithmetic-based processing, the Floating Point Unit ( FPU ) needs to be used by just one thread (!). This is because there is just one FPU in a core, and it is shared between the Logical Processors.

For example, for Intel Xeon Phi processors with 64 cores and 4 hardware threads per core ( 256 Logical Processors ), only one thread per core needs to be used to achieve Peak Processing Power. I've verified that rule on an Intel Xeon Phi Processor 7210, and here are its specs:

http://ark.intel.com/products/94033/Intel-Xeon-Phi-Processor-7210-16GB-1_30-GHz-64-core
Intel Xeon Phi Processor 7210 ( 16GB, 1.30 GHz, 64 core )
Processor name: Intel Xeon Phi 7210
Packages ( sockets ): 1
Cores: 64
Processors ( CPUs ): 256
Cores per package: 64
Threads per core: 4
Peak Processing Power: 2.662 TFLOPS, calculated as follows: 1.30 GHz * 64 cores * ( 512-bit / 32-bit ) * 2. Note: Single-Precision ( 23-bit mantissa ) data type

Hmm, interesting video, especially the part with the multi-core test. It is clear to see why the performance is not equal to the sum of the IPC of all cores. Overlay?

For a quad-core processor with two hardware threads per core, a bar in the Windows Task Manager usually reaches ~98-99% when one hardware thread per core is used.

When thread affinity control is not used, the total sum may not be equal to 100% because of the non-deterministic nature of non-real-time operating systems.
 
Joined
Dec 29, 2021
Messages
61 (0.07/day)
Location
Colorado
Processor Ryzen 7 7800X3D
Motherboard Asrock x670E Steel Legend
Cooling Arctic Liquid Freezr II 420mm
Memory 64GB G.Skill DDR5 CAS30 fruity LED RAM
Video Card(s) Nvidia RTX 4080 (Gigabyte)
Storage 2x Samsung 980 Pros, 3x spinning rust disks for ~20TB total storage
Display(s) 2x Asus 27" 1440p 165hz IPS monitors
Case Thermaltake Level 20XT E-ATX
Audio Device(s) Onboard
Power Supply Super Flower Leadex VII 1000w
Mouse Logitech g502
Keyboard Logitech g915
Software Windows 11 Insider Preview
Gamer Meld doesn't understand what multithreading means.


Edit: skip to 5 min for the full facepalm. Dude renamed the video after folks started correcting him in the comments.
 
Joined
Jan 22, 2024
Messages
86 (0.70/day)
Processor 7800X3D
Cooling Thermalright Peerless Assasin 120
Memory 32GB at 6000/30
Video Card(s) 7900 XT soon to be replaced by 4080 Super
Storage WD Black SN850X 4TB
Display(s) 1440p 360 Hz IPS + 34" Ultrawide 3440x1440 165 Hz IPS ... Hopefully going OLED this year
:rolleyes: Just go back to Intel, no one cares. You are arguing about a 2% difference in performance that no one really notices most of the time.
You seem to care :D

Hi,
MS doesn't give a crap about desktop; the only thing they care about is the mobile world, as it aligns with OneDrive storage/....
Desktops are a thorn in their backside; they wish they would go away, and they'd use a lower carbon footprint as the reason why lol
What are you talking about? Their main income is literally from enterprise and desktop; with no focus on Windows desktop they would lose the entire reason for companies to go for both Azure and Office 365 subs.

Microsoft sits at like 99% market share in the enterprise sector.
 
Joined
May 13, 2010
Messages
5,763 (1.12/day)
System Name RemixedBeast-NX
Processor Intel Xeon E5-2690 @ 2.9Ghz (8C/16T)
Motherboard Dell Inc. 08HPGT (CPU 1)
Cooling Dell Standard
Memory 24GB ECC
Video Card(s) Gigabyte Nvidia RTX2060 6GB
Storage 2TB Samsung 860 EVO SSD//2TB WD Black HDD
Display(s) Samsung SyncMaster P2350 23in @ 1920x1080 + Dell E2013H 20 in @1600x900
Case Dell Precision T3600 Chassis
Audio Device(s) Beyerdynamic DT770 Pro 80 // Fiio E7 Amp/DAC
Power Supply 630w Dell T3600 PSU
Mouse Logitech G700s/G502
Keyboard Logitech K740
Software Linux Mint 20
Benchmark Scores Network: APs: Cisco Meraki MR32, Ubiquiti Unifi AP-AC-LR and Lite Router/Sw:Meraki MX64 MS220-8P
Nah, disabling of HT will be paraded as the great security upgrade, and AMD cited as the unsafe company that still uses the inherently unsafe tech (although many of the security risks were Intel specific).
Safe for their wallet.

You seem to care :D


What are you talking about? Their main income is literally from enterprise and desktop; with no focus on Windows desktop they would lose the entire reason for companies to go for both Azure and Office 365 subs.

Microsoft sits at like 99% market share in the enterprise sector.
Nope... Linux dominates servers
 