
Intel 15th-Generation Arrow Lake-S Could Abandon Hyper-Threading Technology

Joined
Feb 3, 2017
Messages
3,510 (1.32/day)
Processor R5 5600X
Motherboard ASUS ROG STRIX B550-I GAMING
Cooling Alpenföhn Black Ridge
Memory 2*16GB DDR4-2666 VLP @3800
Video Card(s) EVGA Geforce RTX 3080 XC3
Storage 1TB Samsung 970 Pro, 2TB Intel 660p
Display(s) ASUS PG279Q, Eizo EV2736W
Case Dan Cases A4-SFX
Power Supply Corsair SF600
Mouse Corsair Ironclaw Wireless RGB
Keyboard Corsair K60
VR HMD HTC Vive
But there hasn't really been a concept of heterogeneous cores in the same package until now, and Windows Server has definitely encountered some of the same pains in core scheduling that Linux has. Everything I've read so far indicates that the issues desktop Windows has encountered around scheduling over heterogeneous cores are related to high-performance/low-latency applications like games, not the line-of-business stuff that you'd find running on servers.
ARM's big.LITTLE and its successor? Linux has had to deal with such scheduling questions for a while now. So has macOS due to Apple's M-series SoCs. Several approaches have been taken to implement scheduling for those. Btw, with both 3D V-Cache chiplets and Zen4c cores AMD seems to be heading in the same direction - not as extreme a difference, but still cores with different profiles. So Microsoft and Intel and probably AMD better put their heads together and figure out what works :)
 
Joined
Oct 28, 2012
Messages
1,159 (0.27/day)
Processor AMD Ryzen 3700x
Motherboard asus ROG Strix B-350I Gaming
Cooling Deepcool LS520 SE
Memory crucial ballistix 32Gb DDR4
Video Card(s) RTX 3070 FE
Storage WD sn550 1To/WD ssd sata 1To /WD black sn750 1To/Seagate 2To/WD book 4 To back-up
Display(s) LG GL850
Case Dan A4 H2O
Audio Device(s) sennheiser HD58X
Power Supply Corsair SF600
Mouse MX master 3
Keyboard Master Key Mx
Software win 11 pro
Lots of background tasks and editing a photo with complex filters while watching a 4K movie on another screen is another example that will load up 32 threads.
You must be using an old GPU or a really exotic codec if watching a 4K movie stresses your CPU. Consuming 4K content should be handled by the video decoder.
Editing raw/lossless video is the only time it should be acceptable to give up CPU resources to video playback.
 

Aquinus

Resident Wat-man
Joined
Jan 28, 2012
Messages
13,147 (2.92/day)
Location
Concord, NH, USA
System Name Apollo
Processor Intel Core i9 9880H
Motherboard Some proprietary Apple thing.
Memory 64GB DDR4-2667
Video Card(s) AMD Radeon Pro 5600M, 8GB HBM2
Storage 1TB Apple NVMe, 4TB External
Display(s) Laptop @ 3072x1920 + 2x LG 5k Ultrafine TB3 displays
Case MacBook Pro (16", 2019)
Audio Device(s) AirPods Pro, Sennheiser HD 380s w/ FIIO Alpen 2, or Logitech 2.1 Speakers
Power Supply 96w Power Adapter
Mouse Logitech MX Master 3
Keyboard Logitech G915, GL Clicky
Software MacOS 12.1
The Windows Server scheduler seems to work a bit differently than desktop Windows variants.
It's just a boolean setting that hasn't changed for decades that mucks with the "fairness" of each scheduled thread - that is, how important your hungriest task is and how much more or less time it should be allocated because it's hungry for CPU cycles. Windows is not like Linux, where you have multiple different CPU schedulers, all with their own tradeoffs, and APIs you can implement should you feel so inclined to try and roll your own. Believe it or not, a lot of performance can be had with a well-tuned scheduler. Limiting context switching can do wonders for cache locality and hit ratios, so a "smart" scheduler can reap a lot of benefits when done right, or cause massive performance regressions when done wrong. So to be honest, MS probably just doesn't want to screw with it because the NT kernel wasn't built for it in the same way the Linux kernel was.

and probably AMD better put their heads together and figure out what works
AMD probably understands that CPU scheduling is hard and having all the same kind of cores not only makes it easier to schedule, but easier to manufacture.
 
Joined
Apr 24, 2020
Messages
2,573 (1.73/day)
AMD probably understands that CPU scheduling is hard and having all the same kind of cores not only makes it easier to schedule, but easier to manufacture.

If big.LITTLE ends up winning, AMD could easily switch its cores to that technology and benefit after the Linux and Windows kernels have been optimized for it. It means that AMD will necessarily be later to market, but strategically choosing where to be a leader and where to be a follower is the job of the CEO / CTO.

Given that the software is open source, and that schedulers for big.LITTLE remain poor in practice (even if they theoretically can be fixed), there's a lot of sense in waiting for the future algorithms to be implemented, rather than creating a big.LITTLE chip prematurely.

-----------

It's not even clear if big.LITTLE will be a better plan than SMT (2 threads per core, or even IBM-style 4 threads per core or 8 threads per core). SMT, as you point out, is symmetrical. All cores can be treated the same and equivalent, which grossly eases any scheduling algorithm.

big.LITTLE remains a big deal however, because the nature of multithreaded vs single-threaded applications naturally lines up with big cores vs little cores. Long-running background tasks tend to need to be low-latency, and a full core running at very low power, like a LITTLE core, is the ideal processor for them. Bursty, high-speed work like video games, number crunching, and user-interface spiffiness relies upon big cores, however. Apple M1 and Android UIs need surprising amounts of compute to remain responsive in all scenarios. But how do we write an operating-system scheduler that can automatically determine which threads or processes should go on which cores?
 
Joined
Feb 18, 2005
Messages
5,344 (0.76/day)
Location
Ikenai borderline!
System Name Firelance.
Processor Threadripper 3960X
Motherboard ROG Strix TRX40-E Gaming
Cooling IceGem 360 + 6x Arctic Cooling P12
Memory 8x 16GB Patriot Viper DDR4-3200 CL16
Video Card(s) MSI GeForce RTX 4060 Ti Ventus 2X OC
Storage 2TB WD SN850X (boot), 4TB Crucial P3 (data)
Display(s) 3x AOC Q32E2N (32" 2560x1440 75Hz)
Case Enthoo Pro II Server Edition (Closed Panel) + 6 fans
Power Supply Fractal Design Ion+ 2 Platinum 760W
Mouse Logitech G602
Keyboard Logitech G613
Software Windows 10 Professional x64
But how do we write an operating-system scheduler that can automatically determine which threads or processes should go on which cores?
One would hope and expect that Microsoft would be able to figure this out, but given how apathetic towards desktop they've become since Azure took off, one would probably be mistaken.

OTOH, there's an equally compelling argument that Intel should be providing a "scheduling driver" for their CPUs that replaces the default one used by Windows. How much work it would be to update the NT kernel to support that level of flexibility is unknown, but given how long that kernel has been around and how much tech debt it's built up, I'd say "a lot". Which leads back to my first point about Microsoft not caring anymore.
 
Joined
Apr 24, 2020
Messages
2,573 (1.73/day)
One would hope and expect that Microsoft would be able to figure this out, but given how apathetic towards desktop they've become since Azure took off, one would probably be mistaken.

OTOH, there's an equally compelling argument that Intel should be providing a "scheduling driver" for their CPUs that replaces the default one used by Windows. How much work it would be to update the NT kernel to support that level of flexibility is unknown, but given how long that kernel has been around and how much tech debt it's built up, I'd say "a lot". Which leads back to my first point about Microsoft not caring anymore.

Honestly, I think the solution is "don't automatically figure it out". Instead, defer to the programmer.

Programmers in Linux and Windows land both have access to "Core Affinity" flags. (https://learn.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-setprocessaffinitymask) and (https://man7.org/linux/man-pages/man3/pthread_setaffinity_np.3.html). Modern programmers likely will have to pay attention to affinity more often. Those without affinity will default to LITTLE cores (which are more plentiful, and will encourage programmers to pick big-cores when they need it).
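For example, a minimal sketch of the Linux side ( pthread_setaffinity_np; the choice of logical CPU 0 is just a placeholder for illustration ) could look something like this:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *worker( void *arg )
{
    /* ... actual work ... */
    return NULL;
}

int main( void )
{
    pthread_t t;
    pthread_create( &t, NULL, worker, NULL );

    cpu_set_t set;
    CPU_ZERO( &set );
    CPU_SET( 0, &set );   /* placeholder: pin the worker to logical CPU 0 */

    /* pthread_setaffinity_np returns an error number ( not errno ) on failure */
    int rc = pthread_setaffinity_np( t, sizeof( set ), &set );
    if( rc != 0 )
        fprintf( stderr, "pthread_setaffinity_np failed: %d\n", rc );

    pthread_join( t, NULL );
    return 0;
}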

Or... something like that. On the other hand, maybe this would just cause a bunch of programmers to copy/paste code to always use big cores, and that's also counter-productive. Hmmmmmmmmm. It's almost an economics game, the OS-writers don't really control the application-programmers. (E.g. Windows developers can't control the video-game programmers behind Call of Duty or whatever.) But a system needs to be set up so that the resources of the computer are optimized.

-------

Given how things are right now, a scheduler is needed. But I'm not 100% convinced that a scheduler is the right solution overall.
 
Joined
Jan 2, 2019
Messages
63 (0.03/day)
Location
Calgary, Canada
>>...Losing Hyper-Threading could significantly impact Arrow Lake's multi-threaded application performance...
>>...
>>... Estimates suggest HT provides a 10-15% speedup across heavily-threaded workloads by enabling logical cores...

No and No for both cases. Period.

I regret to see that speculation regarding the performance of HTT-enabled processing continues. Unfortunately, there is still a misunderstanding of how HTT actually works!

Please take a look at a Video Technical Report:

Intel Hyper Threading Technology and Linpack Benchmark ( VTR-015 )

which I published in May 2019. Take a look at Slide 19 and Slide 20 ( performance data and graphs for LINPACK tests ).

There are also performance data for matrix multiplication algorithms, Intel MKL vs. Strassen for Single-precision ( 24-bit ) and Double-precision ( 53-bit ).

I'd like to repeat that Peak Processing Power of HTT-enabled applications is achieved when only one, and only one, of the two Logical Processors of a Physical Core is used.
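As a rough illustration only: a minimal Windows sketch that spawns one FP-heavy worker per Physical Core and pins each one to the first Logical Processor of that core. It assumes 2-way SMT, fewer than 64 Logical Processors, and the common enumeration where Logical Processors 2k and 2k+1 are siblings of core k - please verify the actual mapping with CPUID or the OS topology APIs before relying on it:

#include <windows.h>
#include <thread>
#include <vector>

static void FpuHeavyWork()
{
    volatile double x = 1.0;
    for( int i = 0; i < 100000000; i++ )
        x = x * 1.0000001 + 0.0000001;
}

int main()
{
    SYSTEM_INFO si = { 0 };
    ::GetSystemInfo( &si );

    // Assumption: every core has exactly 2 Logical Processors ( 2-way SMT )
    unsigned int uiLogical  = si.dwNumberOfProcessors;
    unsigned int uiPhysical = uiLogical / 2;

    std::vector< std::thread > workers;
    for( unsigned int uiCore = 0; uiCore < uiPhysical; uiCore++ )
    {
        workers.emplace_back( [uiCore]()
        {
            // Pin to Logical Processor ( 2 * core ), leaving its SMT sibling idle
            ::SetThreadAffinityMask( ::GetCurrentThread(), ( DWORD_PTR )1 << ( 2 * uiCore ) );
            FpuHeavyWork();
        } );
    }
    for( auto &w : workers )
        w.join();
    return 0;
}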
 
Joined
Jun 10, 2014
Messages
2,907 (0.80/day)
Processor AMD Ryzen 9 5900X ||| Intel Core i7-3930K
Motherboard ASUS ProArt B550-CREATOR ||| Asus P9X79 WS
Cooling Noctua NH-U14S ||| Be Quiet Pure Rock
Memory Crucial 2 x 16 GB 3200 MHz ||| Corsair 8 x 8 GB 1333 MHz
Video Card(s) MSI GTX 1060 3GB ||| MSI GTX 680 4GB
Storage Samsung 970 PRO 512 GB + 1 TB ||| Intel 545s 512 GB + 256 GB
Display(s) Asus ROG Swift PG278QR 27" ||| Eizo EV2416W 24"
Case Fractal Design Define 7 XL x 2
Audio Device(s) Cambridge Audio DacMagic Plus
Power Supply Seasonic Focus PX-850 x 2
Mouse Razer Abyssus
Keyboard CM Storm QuickFire XT
Software Ubuntu
Honestly, getting rid of HT seems weird. It doesn’t really have any major downsides and leaving any performance on the table is unlike Intel. Unless they are THAT sure that the new approach will compensate and more?
Back when HT was introduced with the Pentium 4, it was a relatively "cheap" way, in terms of added complexity and die space, to harness a lot of otherwise wasted clock cycles as "free" performance.
As CPU frontends have become vastly more efficient, the relative potential has decreased. Along with ever more complex and wide CPU designs, security concerns and added complexity have made HT ever more costly to implement and maintain. It has long been overdue for a replacement, or for being dropped outright. These are development resources and die space which could be better spent elsewhere.

Both Intel and AMD use HT/SMT for a reason still. Tons of software is optimized with this in mind.
Only indirectly, in terms of how many threads are spawned, etc.

Especially people with quad and hexa core chips should enable it for sure. Will make up for the lack of real cores.
Depends a lot on the workloads. SMT(HT) does wonders for some, not for others, and can sometimes introduce a lot of latency too.

We have to remember that a new design without HT wouldn't be the same as turning HT off on an old design. This would mean Intel could have prioritized a lot of resources for other features, either a replacement or other design considerations. So unless they screwed up*, there will highly likely be new benefits from dropping HT.

*) With large overhauls there is a higher risk of unforeseen problems leading to delays, or even to disabled features.
The main design of Arrow Lake is complete, so going without HT would maybe come with a later architecture.
Assuming Arrow Lake launches late this summer or in the fall, the entire design was completed by summer 2023 (tape-out), and the main design long before that.
So I think it's safe to assume that Intel and their trusted partners know which features are coming. ;)

Programmers in Linux and Windows land both have access to "Core Affinity" flags.<snip>
A little question on the side;
Do you have experience with the best approach to query each thread to find out various capabilities (SMT, big or little, and any future differences)? Is the CPUID instruction the preferred choice, or OS APIs?
 
Joined
Apr 24, 2020
Messages
2,573 (1.73/day)
A little question on the side;
Do you have experience with the best approach to query each thread to find out various capabilities (SMT, big or little, and any future differences)? Is the CPUID instruction the preferred choice, or OS APIs?

I have no experience, but I have knowledge... reading how others have solved this problem...

I know that there are various NUMA APIs to query the capabilities of sets of cores. https://learn.microsoft.com/en-us/windows/win32/procthread/numa-support . I expect Linux to have similar APIs, though named differently of course. The main issue is that NUMA is about memory differences, not core differences, so the focus of NUMA APIs is closer to malloc/free. (Yes, some bits of memory are closer to Core#1 or Core#50... but NUMA has a focus on memory.) With big.LITTLE, there are newer APIs that handle the different cores, but I don't know them quite as well.
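For example, on Windows the processor topology API reports per-core SMT and an "efficiency class". A minimal sketch of walking it ( assuming Windows 10 or later, where PROCESSOR_RELATIONSHIP carries the EfficiencyClass field ) might look like:

#include <windows.h>
#include <vector>
#include <cstdio>

int main()
{
    // First call just tells us how big the buffer needs to be
    DWORD dwLen = 0;
    ::GetLogicalProcessorInformationEx( RelationProcessorCore, nullptr, &dwLen );

    std::vector< BYTE > buffer( dwLen );
    if( !::GetLogicalProcessorInformationEx( RelationProcessorCore,
            ( PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX )buffer.data(), &dwLen ) )
        return 1;

    for( DWORD dwOffset = 0; dwOffset < dwLen; )
    {
        auto pInfo = ( PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX )( buffer.data() + dwOffset );
        // LTP_PC_SMT means this physical core exposes more than one logical processor.
        // EfficiencyClass is 0 for the least performant class ( E-cores on Intel hybrid parts );
        // higher values mean more performant cores. On homogeneous CPUs it is 0 everywhere.
        printf( "Core: SMT=%d EfficiencyClass=%u\n",
                ( pInfo->Processor.Flags & LTP_PC_SMT ) ? 1 : 0,
                ( unsigned int )pInfo->Processor.EfficiencyClass );
        dwOffset += pInfo->Size;
    }
    return 0;
}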

For HPC, a common pattern I've seen is to have a startup-benchmark routine, where you attempt different strategies and perform a bit of self-optimizing / self-parameterization. So in practice, even the highest-performance programs ignore a lot of this, just collect practical data, and then self-tune around the problem.
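A toy sketch of that startup-benchmark idea ( the workload and the candidate thread counts are just placeholders, not from any real HPC code ):

#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

// Placeholder workload: a fixed amount of total work, split across N threads
static void WorkChunks( int iChunks )
{
    volatile double x = 1.0;
    for( int c = 0; c < iChunks; c++ )
        for( int i = 0; i < 1000000; i++ )
            x *= 1.000001;
}

static double RunWithThreads( unsigned int uiThreads, int iTotalChunks )
{
    auto t0 = std::chrono::steady_clock::now();
    std::vector< std::thread > pool;
    for( unsigned int i = 0; i < uiThreads; i++ )
        pool.emplace_back( WorkChunks, iTotalChunks / ( int )uiThreads );   // remainder ignored in this toy
    for( auto &t : pool )
        t.join();
    return std::chrono::duration< double >( std::chrono::steady_clock::now() - t0 ).count();
}

int main()
{
    unsigned int uiHw = std::thread::hardware_concurrency();
    // Candidate strategies: "half the logical CPUs" ( roughly physical cores ) vs "all logical CPUs"
    unsigned int uiCandidates[] = { uiHw / 2 ? uiHw / 2 : 1, uiHw ? uiHw : 1 };

    unsigned int uiBest = uiCandidates[ 0 ];
    double dBestTime = 1.0e300;
    for( unsigned int uiN : uiCandidates )
    {
        double dTime = RunWithThreads( uiN, 256 );
        printf( "%u threads: %.3f s\n", uiN, dTime );
        if( dTime < dBestTime ) { dBestTime = dTime; uiBest = uiN; }
    }
    printf( "Self-tuned thread count: %u\n", uiBest );
    return 0;
}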
 
Joined
Jan 2, 2019
Messages
63 (0.03/day)
Location
Calgary, Canada
Do you have experience with the best approach to query each thread to find out various capabilities (SMT, big or little, and any future differences)? Is the CPUID instruction the preferred choice, or OS APIs?

The CPUID instruction provides more details about processor features. OS APIs ( depending on the Operating System ) are recommended for Process or Thread affinity control and priorities.
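For example, a minimal sketch of the CPUID route for detecting an Intel hybrid processor and the type of the core the current thread is running on ( MSVC __cpuidex intrinsic; leaf 0x07 EDX bit 15 is the Hybrid flag and leaf 0x1A EAX[31:24] is the Core Type, per Intel's documentation ):

#include <intrin.h>
#include <cstdio>

int main()
{
    int regs[ 4 ] = { 0 };   // EAX, EBX, ECX, EDX

    __cpuidex( regs, 0x07, 0 );
    bool bHybrid = ( regs[ 3 ] >> 15 ) & 1;   // CPUID.07H:EDX[15] - Hybrid flag
    printf( "Hybrid CPU: %s\n", bHybrid ? "yes" : "no" );

    if( bHybrid )
    {
        // Note: the result applies to the Logical Processor this thread is
        // currently running on, so pin the thread first if that matters.
        __cpuidex( regs, 0x1A, 0 );
        unsigned int uiCoreType = ( ( unsigned int )regs[ 0 ] >> 24 ) & 0xFF;   // CPUID.1AH:EAX[31:24]
        printf( "This Logical Processor is a %s core\n",
                uiCoreType == 0x40 ? "P ( Core )" :
                uiCoreType == 0x20 ? "E ( Atom )" : "unknown" );
    }
    return 0;
}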

Here is a piece of test-code I used for Windows ( any version that supports Get/Set ThreadAffinity API ) in order to control thread affinity:
...
SYSTEM_INFO si = { 0 };
::GetSystemInfo( &si );

RTBOOL bRc = RTFALSE;
RThandle hProcess = RTnull;
RThandle hThread = RTnull;
RTulong dwProcessMask = 0;
RTulong dwSystemMask = 0;
RTulong dwThreadAM = 0;
RTulong dwThreadAMPrev = 0;
RTulong dwThread1PrefferedCPU = 0;
DWORD dwErrorCode = 0;

hProcess = SysGetCurrentProcess();
hThread = SysGetCurrentThread();

bRc = ::GetProcessAffinityMask( hProcess, ( PDWORD_PTR )&dwProcessMask, ( PDWORD_PTR )&dwSystemMask );

RTint iCpuNum = ( 8 - 1 );
RTint iThreadAffinityMask = _RUN_ON_CPU_08;   // Default Logical CPU 07 at the beginning of Verification
                                              // Take into account that Logical CPUs are numbered from 0
dwThreadAMPrev = ::SetThreadAffinityMask( hThread, iThreadAffinityMask );
SysSleep( 0 );
dwErrorCode = SysGetLastError();
CrtPrintf( RTU("\t\tSwitched to Logical CPU%d - Previous Thread AM: %3d - Error Code: %3d\n"),
           iCpuNum, dwThreadAMPrev, dwErrorCode );
for( RTuint i = 0; i < ( ( RTuint )16777216 * 128 ); i += 1 )
{
    volatile RTfloat fX = 32.0f;
    fX = ( RTfloat )i * ( fX * 2 ) * ( fX * 4 ) * ( fX * 8 );
}
SysSleep( 5000 );

iCpuNum = 0;

for( iThreadAffinityMask = 1; iThreadAffinityMask < 256; iThreadAffinityMask *= 2 )
{
    iCpuNum++;
    dwThreadAMPrev = ::SetThreadAffinityMask( hThread, iThreadAffinityMask );
    SysSleep( 0 );
    dwErrorCode = SysGetLastError();
    CrtPrintf( RTU("\t\tSwitched to Logical CPU%d - Previous Thread AM: %3d - Error Code: %3d - Thread Affinity: %3d\n"),
               ( iCpuNum - 1 ), dwThreadAMPrev, dwErrorCode, iThreadAffinityMask );
    for( RTuint i = 0; i < ( ( RTuint )16777216 * 128 ); i += 1 )
    {
        volatile RTfloat fX = 32.0f;
        fX = ( RTfloat )i * ( fX * 2 ) * ( fX * 4 ) * ( fX * 8 );
    }
    SysSleep( 5000 );
}
...

Do you have experience with the best approach to query each thread to find out various capabilities (SMT, big or little, and any future differences)? Is the CPUID instruction the preferred choice, or OS APIs?

It is very easy to get Time Stamp Counter value for a Logical CPU using RDTSC instruction:
...
// Test-Case 3 - Retrieving RDTSC values for Logical CPUs
{
    CrtPrintf( RTU("\n\tTest-Case 3 - Retrieving RDTSC values for Logical CPUs - 1\n") );

    RTBOOL bRc = RTFALSE;
    RThandle hProcess = RTnull;
    RThandle hThread = RTnull;
    RTulong dwProcessMask = 0;
    RTulong dwSystemMask = 0;
    RTulong dwThreadAM = 0;
    RTulong dwThreadAMPrev1 = 0;
    RTulong dwThreadAMPrev2 = 0;
    RTulong dwThread1PrefferedCPU = 0;

    ClockV cvRdtscCPU1 = { 0 };   // RDTSC Value for Logical CPU1
    ClockV cvRdtscCPU2 = { 0 };   // RDTSC Value for Logical CPU2

    while( RTtrue )
    {
        hProcess = SysGetCurrentProcess();
        hThread = SysGetCurrentThread();

        bRc = ::GetProcessAffinityMask( hProcess, ( PDWORD_PTR )&dwProcessMask, ( PDWORD_PTR )&dwSystemMask );
        if( bRc == RTFALSE )
        {
            CrtPrintf( RTU("\t\tError: [ GetProcessAffinityMask ] failed\n") );
            break;
        }

        bRc = SysSetPriorityClass( hProcess, REALTIME_PRIORITY_CLASS );
        if( bRc == RTFALSE )
        {
            CrtPrintf( RTU("\t\tError: [ SetPriorityClass ] failed\n") );
            break;
        }
        bRc = SysSetThreadPriority( hThread, THREAD_PRIORITY_TIME_CRITICAL );
        if( bRc == RTFALSE )
        {
            CrtPrintf( RTU("\t\tError: [ SetThreadPriority ] failed\n") );
            break;
        }

        dwThreadAMPrev1 = ::SetThreadAffinityMask( hThread, _RUN_ON_CPU_01 );
        SysSleep( 0 );
        cvRdtscCPU1.uiClockV = __rdtsc();

        dwThreadAMPrev1 = ::SetThreadAffinityMask( hThread, _RUN_ON_CPU_01 );
        // dwThreadAMPrev2 = ::SetThreadAffinityMask( hThread, _RUN_ON_CPU_02 );
        SysSleep( 0 );
        cvRdtscCPU2.uiClockV = __rdtsc();

        SysSetPriorityClass( hProcess, NORMAL_PRIORITY_CLASS );
        SysSetThreadPriority( hThread, THREAD_PRIORITY_NORMAL );

        CrtPrintf( RTU("\t\tRDTSC for Logical CPU1 : %.0f\n"), ( RTfloat )cvRdtscCPU1.uiClockV );
        CrtPrintf( RTU("\t\tRDTSC for Logical CPU2 : %.0f\n"), ( RTfloat )cvRdtscCPU2.uiClockV );
        CrtPrintf( RTU("\t\tRDTSC Difference: %.0f ( RDTSC2 - RDTSC1 )\n"),
                   ( RTfloat )( cvRdtscCPU2.uiClockV - cvRdtscCPU1.uiClockV ) );
        CrtPrintf( RTU("\t\tdwThreadAMPrev1 : %3d ( Processing Error if 0 )\n"), dwThreadAMPrev1 );
        CrtPrintf( RTU("\t\tdwThreadAMPrev2 : %3d ( Processing Error if 0 )\n"), dwThreadAMPrev2 );

        break;
    }
}
...
 
Joined
Jan 2, 2019
Messages
63 (0.03/day)
Location
Calgary, Canada
Why sleep so long?
...
Sleep( 5000 ); // 5 seconds
...
It is used to see a Logical Processor switch in Windows Task Manager. Please take a look at:

Time Stamp Counters of Logical CPUs on a Multi-Core Computer System with Windows 7 ( VTR-184 )

in order to see how Logical Processors are switched during real time test processing.

I'd like to mention one more thing, and it is very important: call Sleep( 0 ) after the affinity mask is changed, because a couple of hundred nanoseconds are needed to make a real physical switch, for example from Logical Processor 1 to Logical Processor 2.
 
Joined
Jun 10, 2014
Messages
2,907 (0.80/day)
Processor AMD Ryzen 9 5900X ||| Intel Core i7-3930K
Motherboard ASUS ProArt B550-CREATOR ||| Asus P9X79 WS
Cooling Noctua NH-U14S ||| Be Quiet Pure Rock
Memory Crucial 2 x 16 GB 3200 MHz ||| Corsair 8 x 8 GB 1333 MHz
Video Card(s) MSI GTX 1060 3GB ||| MSI GTX 680 4GB
Storage Samsung 970 PRO 512 GB + 1 TB ||| Intel 545s 512 GB + 256 GB
Display(s) Asus ROG Swift PG278QR 27" ||| Eizo EV2416W 24"
Case Fractal Design Define 7 XL x 2
Audio Device(s) Cambridge Audio DacMagic Plus
Power Supply Seasonic Focus PX-850 x 2
Mouse Razer Abyssus
Keyboard CM Storm QuickFire XT
Software Ubuntu
For HPC, a common pattern I've seen is to have a startup-benchmark routine, where you attempt different strategies and perform a bit of self-optimizing / self-parameterization. So in practice, even the highest-performance programs ignore a lot of this, just collect practical data, and then self-tune around the problem.
I can see that working well for large batch applications, especially if they aren't latency sensitive, but for games or certain desktop applications, synchronization can quickly cause a lot of latency.

If CPUs continue to become more and more diverse, having generic and reliable ways to determine capabilities will be necessary to scale well across generations, as well as across anything from low-end CPUs to high-end or multi-CPU systems. Certain software could be very sensitive to this, and it could ultimately impact end users' purchasing choices.

For HPC users running configurable or even custom software, I can imagine even manual calibration would be desirable; they usually don't run on 1000 different hardware configurations. :)

The CPUID instruction provides more details about processor features. OS APIs ( depending on the Operating System ) are recommended for Process or Thread affinity control and priorities.

Here is a piece of test-code I used for Windows ( any version that supports Get/Set ThreadAffinity API ) in order to control threads affinity:
<snip>
Thanks. I've saved that for later. :)
I haven't had time to dive into handling "hybrid" CPU designs yet, but I probably will have to eventually.
P.S. you might want to throw a spoiler tag around that code. ;)
 
Joined
Sep 1, 2020
Messages
2,061 (1.51/day)
Location
Bulgaria
Hmm, interesting video, especially the part with the multi-core test. It is clear to see why the performance is not equal to the sum of the IPC of all cores. Overlay?
 
Joined
Mar 18, 2023
Messages
618 (1.43/day)
System Name Never trust a socket with less than 2000 pins
No this is exactly for that reason, security on (shared) or private servers is a major issue. No one gives two hoots about HT on your gaming rig o_O

Until CPU vulnerabilities can be triggered from Javascript in a web browser. I don't agree that ignoring CPU vulnerabilities even on gaming rigs is a long-term option. Unless you truly have a dedicated gaming machine and another computer for all serious use.

It's not even clear if big.LITTLE will be a better plan than SMT (2 threads per core, or even IBM-style 4 threads per core or 8 threads per core). SMT, as you point out, is symmetrical. All cores can be treated the same and equivalent, which grossly eases any scheduling algorithm.

That is not true when the number of threads/processes to schedule is more than one but less than the number of real cores.

Imagine you have 2 threads munching around on a hyperthreaded CPU. It is imperative that you put them onto different real cores.

Honestly, I think the solution is "don't automatically figure it out". Instead, defer to the programmer.

Programmers in Linux and Windows land both have access to "Core Affinity" flags. (https://learn.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-setprocessaffinitymask) and (https://man7.org/linux/man-pages/man3/pthread_setaffinity_np.3.html). Modern programmers likely will have to pay attention to affinity more often. Those without affinity will default to LITTLE cores (which are more plentiful, and will encourage programmers to pick big-cores when they need it).

Or... something like that. On the other hand, maybe this would just cause a bunch of programmers to copy/paste code to always use big cores, and that's also counter-productive. Hmmmmmmmmm. It's almost an economics game, the OS-writers don't really control the application-programmers.

That scheme breaks down if you have more than one application on the machine, or even just multiple instances of the same program. Each instance can grab all the premium CPU resources for itself.

When there are conflicting workloads there needs to be a separate authority to distribute compute resources, including access to premium cores. That is the OS scheduler.

I have personally experimented with deliberate core placement (in a server-class application) and I always gave up because the OS scheduler did a better job as far as overall system throughput is concerned.
 
Joined
Apr 24, 2020
Messages
2,573 (1.73/day)
That is not true when the number of threads/processes to schedule is more than one but less than the number of real cores.

Imagine you have 2 threads munching around on a hyperthreaded CPU. It is imperative that you put them onto different real cores.

Fair point. SMT / Hyperthreads aren't perfectly symmetrical. (Logical core vs physical core matters.) But it's still easier than big.LITTLE. So maybe I was a bit too "absolute" with my earlier language, but the gist is still correct. big.LITTLE is much harder to schedule (and is much worse in practice than in theory because of this difficulty). But it seems like a solvable problem nonetheless (just not solved yet, today).
 
Joined
Mar 18, 2023
Messages
618 (1.43/day)
System Name Never trust a socket with less than 2000 pins
Does HT impact power consumption in any meaningful way? If it does, maybe that is Intel's goal. Substitute HT with ecores to reduce power consumption

I don't think the issue is power, the issue is silicon (die) space. The machinery for HT on a P-core is probably as big as or bigger than an entire E-core. HT brings you 15-20%, another E-core brings you 50% of a P-core.
 
Joined
Apr 24, 2020
Messages
2,573 (1.73/day)
I don't think the issue is power, the issue is silicon (die) space. The machinery for HT on a P-core is probably as big as or bigger than an entire E-core. HT brings you 15-20%, another E-core brings you 50% of a P-core.

I severely doubt that. HT recycles resources like no tomorrow. The "biggest" portion of the HT is just the decoder.

But the register files, reorder-buffer, L1 / L2 data-cache, vector-units, pipelines, and more are all shared / recycled. That's more than 80% of the core. The remaining 20% (decoder, branch prediction, L1-code cache) is still needed even for single-thread-per-core applications, but obviously needs to be bigger for a Hyperthread.
 
Joined
Mar 18, 2023
Messages
618 (1.43/day)
System Name Never trust a socket with less than 2000 pins
Fair point. SMT / Hyperthreads aren't perfectly symmetrical. (Logical core vs physical core matters.) But it's still easier than big.LITTLE. So maybe I was a bit too "absolute" with my earlier language, but the gist is still correct. big.LITTLE is much harder to schedule (and is much worse in practice than in theory because of this difficulty). But it seems like a solvable problem nonetheless (just not solved yet, today).

Complete agreement.

Part of the problem here is that Windows, which supposedly has the most advanced integration with Intel's "Thread Director", is not open source. We can see the Linux scheduler, but they have been fiddling with it so much just in the last 2 months that it is hard to see what is going on.

I severely doubt that. HT recycles resources like no tomorrow. The "biggest" portion of the HT is just the decoder.

But the register files, reorder-buffer, L1 / L2 data-cache, vector-units, pipelines, and more are all shared / recycled. That's more than 80% of the core. The remaining 20% (decoder, branch prediction, L1-code cache) is still needed even for single-thread-per-core applications, but obviously needs to be bigger for a Hyperthread.

Yeah, but how much bigger is a current P-core than a current E-core? I thought it was a factor of 10. So the math might still work out in favor of E-cores instead of HT, die-space wise.
 
Joined
Apr 24, 2020
Messages
2,573 (1.73/day)
Yeah, but how much bigger is a current P-core than a current E-core? I thought it was a factor of 10. So the math might still work out in favor of E-cores instead of HT, die-space wise.


( attached die-shot image: 1705959880526.png )


I'd estimate 5x E-cores vs 1x P-core, just spitballing by looking at this image. A lot of this is because 768 kB of L2 cache per E-core (3 MB shared between 4x E-cores) is just naturally going to be smaller than 2 MB per P-core.

You're right that this is quite a bit smaller than I thought, though not quite the 1-to-10 ratio you initially assumed.
 
Joined
Jun 10, 2014
Messages
2,907 (0.80/day)
Processor AMD Ryzen 9 5900X ||| Intel Core i7-3930K
Motherboard ASUS ProArt B550-CREATOR ||| Asus P9X79 WS
Cooling Noctua NH-U14S ||| Be Quiet Pure Rock
Memory Crucial 2 x 16 GB 3200 MHz ||| Corsair 8 x 8 GB 1333 MHz
Video Card(s) MSI GTX 1060 3GB ||| MSI GTX 680 4GB
Storage Samsung 970 PRO 512 GB + 1 TB ||| Intel 545s 512 GB + 256 GB
Display(s) Asus ROG Swift PG278QR 27" ||| Eizo EV2416W 24"
Case Fractal Design Define 7 XL x 2
Audio Device(s) Cambridge Audio DacMagic Plus
Power Supply Seasonic Focus PX-850 x 2
Mouse Razer Abyssus
Keyboard CM Storm QuickFire XT
Software Ubuntu
That scheme breaks down if you have more than one application on the machine, or even just multiple instances of the same program. Each instance can grab all the premium CPU resources for itself.
<snip>
That's absolutely a fair point, and is one of my primary concerns too.
That's why I think it's mainly useful to know the number of different classes of resources, like P-cores and E-cores, and whether these have SMT or not. And for non-x86, whether SMT is 2-way, 4-way or 8-way, or if there are more exotic core configurations (aren't there ARM designs with three different core types?). Assuming all "threads" are equal can result in sub-optimal performance in synchronous workloads.

As we all know, no piece of code will scale perfectly under all circumstances, but at least it will be useful to have some kind of feature detection so an application/game doesn't completely "sabotage" itself if Intel or AMD releases a new "unusual" P-core/E-core mix. :)
 
Joined
Jan 2, 2019
Messages
63 (0.03/day)
Location
Calgary, Canada
Imagine you have 2 threads munching around on a hyperthreaded CPU. It is imperative that you put them onto different real cores.

That is correct. A software developer should take into account what processor units could be used for processing.

It means that in the case of HPC and floating-point-arithmetic-based processing, the Floating Point Unit ( FPU ) needs to be used by just one thread (!). This is because there is just one FPU in a core, and it is shared between the Logical Processors.

For example, for Intel Xeon Phi processors with 64 cores and 4 hardware threads per core ( 256 Logical Processors ), only one thread per core needs to be used to achieve Peak Processing Power. I've verified that rule on an Intel Xeon Phi Processor 7210, and here are its specs:

http://ark.intel.com/products/94033/Intel-Xeon-Phi-Processor-7210-16GB-1_30-GHz-64-core
Intel Xeon Phi Processor 7210 ( 16GB, 1.30 GHz, 64 core )
Processor name: Intel Xeon Phi 7210
Packages ( sockets ): 1
Cores: 64
Processors ( CPUs ): 256
Cores per package: 64
Threads per core: 4
Peak Processing Power: 2.662 TFLOPS, calculated as follows: 1.30 GHz * 64 cores * ( 512-bit / 32-bit ) * 2. Note: Single-Precision ( 23-bit mantissa ) data type

Hmm, interesting video, especially the part with the multi-core test. It is clear to see why the performance is not equal to the sum of the IPC of all cores. Overlay?

For a quad-core processor with two hardware threads per core, a bar in the Windows Task Manager usually reaches ~98-99% when one hardware thread per core is used.

When thread affinity control is not used, the total sum may not be equal to 100% because of the non-deterministic nature of non-real-time operating systems.
 
Joined
Dec 29, 2021
Messages
61 (0.07/day)
Location
Colorado
Processor Ryzen 7 7800X3D
Motherboard Asrock x670E Steel Legend
Cooling Arctic Liquid Freezr II 420mm
Memory 64GB G.Skill DDR5 CAS30 fruity LED RAM
Video Card(s) Nvidia RTX 4080 (Gigabyte)
Storage 2x Samsung 980 Pros, 3x spinning rust disks for ~20TB total storage
Display(s) 2x Asus 27" 1440p 165hz IPS monitors
Case Thermaltake Level 20XT E-ATX
Audio Device(s) Onboard
Power Supply Super Flower Leadex VII 1000w
Mouse Logitech g502
Keyboard Logitech g915
Software Windows 11 Insider Preview
Gamer Meld doesn't understand what multithreading means.


Edit: skip to 5 min for the full facepalm. Dude renamed the video after folks started correcting him in the comments.
 
Joined
Jan 22, 2024
Messages
86 (0.70/day)
Processor 7800X3D
Cooling Thermalright Peerless Assasin 120
Memory 32GB at 6000/30
Video Card(s) 7900 XT soon to be replaced by 4080 Super
Storage WD Black SN850X 4TB
Display(s) 1440p 360 Hz IPS + 34" Ultrawide 3440x1440 165 Hz IPS ... Hopefully going OLED this year
:rolleyes: Just go back to Intel, no one cares. You are arguing about a 2% difference in performance that no one really notices most of the time.
You seem to care :D

Hi,
MS doesn't give a crap about desktop; the only thing they care about is the mobile world, as it aligns with OneDrive storage/....
Desktops are a thorn in their backside; they wish they would go away, and they'd use a lower carbon footprint as the reason why lol
What are you talking about? Their main income is literally from enterprise and desktop; with no focus on Windows desktop they would lose the entire reason for companies to go for both Azure and Office 365 subs.

Microsoft sits at like 99% market share in the enterprise sector.
 
Joined
May 13, 2010
Messages
5,763 (1.12/day)
System Name RemixedBeast-NX
Processor Intel Xeon E5-2690 @ 2.9Ghz (8C/16T)
Motherboard Dell Inc. 08HPGT (CPU 1)
Cooling Dell Standard
Memory 24GB ECC
Video Card(s) Gigabyte Nvidia RTX2060 6GB
Storage 2TB Samsung 860 EVO SSD//2TB WD Black HDD
Display(s) Samsung SyncMaster P2350 23in @ 1920x1080 + Dell E2013H 20 in @1600x900
Case Dell Precision T3600 Chassis
Audio Device(s) Beyerdynamic DT770 Pro 80 // Fiio E7 Amp/DAC
Power Supply 630w Dell T3600 PSU
Mouse Logitech G700s/G502
Keyboard Logitech K740
Software Linux Mint 20
Benchmark Scores Network: APs: Cisco Meraki MR32, Ubiquiti Unifi AP-AC-LR and Lite Router/Sw:Meraki MX64 MS220-8P
Nah, disabling of HT will be paraded as the great security upgrade, and AMD cited as the unsafe company that still uses the inherently unsafe tech (although many of the security risks were Intel specific).
Safe for their wallet.

You seem to care :D


What are you talking about? Their main income is literally from enterprise and desktop; with no focus on Windows desktop they would lose the entire reason for companies to go for both Azure and Office 365 subs.

Microsoft sits at like 99% market share in the enterprise sector.
Nope... Linux dominates servers
 