Monday, January 22nd 2024

Intel 15th-Generation Arrow Lake-S Could Abandon Hyper-Threading Technology

The leaked Intel document we reported on a few days ago covered the Arrow Lake-S platform and some implementation details. However, there was an interesting catch in the file: it indicates that the upcoming 15th-Generation Arrow Lake desktop CPUs could lack Hyper-Threading (HT) support. The technical memo lists Arrow Lake's expected eight performance cores without any additional threads enabled via SMT, which aligns with previous rumors of Hyper-Threading's removal. Losing Hyper-Threading could significantly impact Arrow Lake's multi-threaded application performance versus its Raptor Lake predecessors. Estimates suggest HT provides a 10-15% speedup across heavily-threaded workloads by enabling logical cores. For gaming, however, disabling HT has negligible impact and can even boost FPS in some titles, so Arrow Lake may still hit Intel's rumored 30% gaming performance targets through architectural improvements alone.

However, a replacement for traditional HT is likely to come in the form of Rentable Units. This new approach is a response to the adoption of a hybrid core architecture, which has seen a growing number of applications leverage low-power E-cores for better performance and efficiency. Rentable Units are described as a more efficient pseudo-multi-threaded solution that splits the first thread of incoming instructions into two partitions and assigns them to different cores based on complexity. Rentable Units will reportedly use timers and counters to measure P-core and E-core utilization and send parts of the thread to each core for processing. This inherently requires larger caches, and Arrow Lake is rumored to carry 3 MB of L2 cache per core. Arrow Lake is also noted to support faster DDR5-6400 memory. Between higher clocks, more E-cores, and various core architecture updates, raw multi-threaded throughput may not change much even without Hyper-Threading.
Source: 3DCenter.org

100 Comments on Intel 15th-Generation Arrow Lake-S Could Abandon Hyper-Threading Technology

#51
dyonoctis
Daven: Lots of background tasks and editing a photo with complex filters while watching a 4K movie on another screen is another example that will load up 32 threads.
You must be using an old GPU or a really exotic codec if watching a 4K movie stresses your CPU. Consuming 4K content should be handled by the video decoder.
Editing raw/lossless video is the only time when it should be acceptable to give up CPU resources to video playback.
#52
Aquinus
Resident Wat-man
londiste: Windows Server's scheduler seems to work a bit differently than the desktop Windows variants.
It's just a boolean setting that hasn't changed for decades and that mucks with the "fairness" of each scheduled thread: that is, how important your hungriest task is and how much more or less time it should be allocated because it's hungry for CPU cycles. Windows is not like Linux, where you have multiple different CPU schedulers, each with their own tradeoffs, and APIs you can implement should you feel so inclined to try and roll your own. Believe it or not, a lot of performance can be had with a well-tuned scheduler. Limiting context switching can do wonders for cache locality and hit ratios, so a "smart" scheduler can reap a lot of benefits when done right, or cause massive performance regressions when done wrong. So to be honest, MS probably just doesn't want to screw with it, because the NT kernel wasn't built for it in the same way the Linux kernel was.
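To make the Linux side concrete, here is a minimal sketch of picking a per-process scheduling policy there (assuming glibc's sched_setscheduler wrapper; SCHED_BATCH is just one example alongside SCHED_OTHER, SCHED_FIFO, SCHED_IDLE and friends):

#include <sched.h>      // sched_setscheduler, SCHED_BATCH
#include <cstdio>       // std::perror, std::printf

int main()
{
    // Hint to the kernel that this process is a CPU-bound batch job,
    // so the scheduler may trade a little latency for throughput.
    sched_param sp = {};    // sched_priority must be 0 for SCHED_BATCH
    if( sched_setscheduler( 0 /* this process */, SCHED_BATCH, &sp ) != 0 )
    {
        std::perror( "sched_setscheduler" );
        return 1;
    }
    std::printf( "Now running under SCHED_BATCH\n" );
    // ... CPU-bound work here ...
    return 0;
}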
londiste: and probably AMD better put their heads together and figure out what works
AMD probably understands that CPU scheduling is hard and having all the same kind of cores not only makes it easier to schedule, but easier to manufacture.
#53
dragontamer5788
Aquinus: AMD probably understands that CPU scheduling is hard and having all the same kind of cores not only makes it easier to schedule, but easier to manufacture.
If big.LITTLE ends up winning, AMD could easily switch its cores to that technology and benefit after the Linux and Windows kernels have been optimized for it. It means that AMD will necessarily be later to market, but strategically choosing where to be a leader and where to be a follower is the job of the CEO / CTO.

Given that the software is open source, and that schedulers for big.LITTLE remain poor in practice (even if they can theoretically be fixed), there's a lot of sense in waiting for the future algorithms to be implemented, rather than creating a big.LITTLE chip prematurely.

-----------

It's not even clear whether big.LITTLE will be a better plan than SMT (2 threads per core, or even IBM-style 4 or 8 threads per core). SMT, as you point out, is symmetrical. All cores can be treated as equivalent, which greatly eases any scheduling algorithm.

big.LITTLE remains a big deal, however, because the split between multithreaded and single-threaded applications naturally lines up with big cores vs. little cores. Long-running background tasks tend to be latency-tolerant, and a full core running at very low power, like a LITTLE core, is the ideal processor for them. Short bursts of high-speed work, video games, number crunching, user-interface snappiness, rely on big cores, however. Apple M1 and Android UIs need surprising amounts of compute to remain responsive in all scenarios. But how do we write an operating-system scheduler that can automatically determine which threads or processes should go on which cores?
#54
Assimilator
dragontamer5788: But how do we write an operating-system scheduler that can automatically determine which threads or processes should go on which cores?
One would hope and expect that Microsoft would be able to figure this out, but given how apathetic towards desktop they've become since Azure took off, one would probably be mistaken.

OTOH, there's an equally compelling argument that Intel should be providing a "scheduling driver" for their CPUs that replaces the default one used by Windows. How much work it would be to update the NT kernel to support that level of flexibility is unknown, but given how long that kernel has been around and how much tech debt it has built up, I'd say "a lot". Which leads back to my first point about Microsoft not caring anymore.
#55
dragontamer5788
Assimilator: One would hope and expect that Microsoft would be able to figure this out, but given how apathetic towards desktop they've become since Azure took off, one would probably be mistaken.

OTOH, there's an equally compelling argument that Intel should be providing a "scheduling driver" for their CPUs that replaces the default one used by Windows. How much work it would be to update the NT kernel to support that level of flexibility is unknown, but given how long that kernel has been around and how much tech debt it has built up, I'd say "a lot". Which leads back to my first point about Microsoft not caring anymore.
Honestly, I think the solution is "don't automatically figure it out". Instead, defer to the programmer.

Programmers in Linux and Windows land both have access to "Core Affinity" flags (learn.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-setprocessaffinitymask and man7.org/linux/man-pages/man3/pthread_setaffinity_np.3.html). Modern programmers will likely have to pay attention to affinity more often. Threads without an affinity set would default to LITTLE cores (which are more plentiful, and which would encourage programmers to pick big cores when they need them).
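As an aside, a minimal sketch of the Linux flavor of this, pinning the calling thread to one logical CPU with pthread_setaffinity_np (a non-portable GNU extension, hence the _np suffix; logical CPU 2 is just an arbitrary example):

#define _GNU_SOURCE
#include <pthread.h>    // pthread_setaffinity_np, pthread_self
#include <sched.h>      // cpu_set_t, CPU_ZERO, CPU_SET
#include <cstdio>

int main()
{
    cpu_set_t set;
    CPU_ZERO( &set );
    CPU_SET( 2, &set );     // allow this thread to run only on logical CPU 2

    int rc = pthread_setaffinity_np( pthread_self(), sizeof( set ), &set );
    if( rc != 0 )
    {
        std::fprintf( stderr, "pthread_setaffinity_np failed: %d\n", rc );
        return 1;
    }
    std::printf( "Pinned to logical CPU 2\n" );
    // ... work that should stay on that core ...
    return 0;
}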

Or... something like that. On the other hand, maybe this would just cause a bunch of programmers to copy/paste code to always use big cores, and that's also counter-productive. Hmmmmmmmmm. It's almost an economics game; the OS writers don't really control the application programmers. (Ex: Windows developers are unable to control the video-game programmers of Call of Duty or whatever.) But a system needs to be set up so that the resources of the computer are optimized.

-------

Given how things are right now, a scheduler is needed. But I'm not 100% convinced that a scheduler is the right solution overall.
#56
ScaLibBDP
>>...Losing Hyper-Threading could significantly impact Arrow Lake's multi-threaded application performance...
>>...
>>... Estimates suggest HT provides a 10-15% speedup across heavily-threaded workloads by enabling logical cores...

No and No for both cases. Period.

I regret to see that speculation regarding the performance of HTT-enabled processing continues. Unfortunately, there is still a misunderstanding of how HTT actually works!

Please take a look at a Video Technical Report:

Intel Hyper Threading Technology and Linpack Benchmark ( VTR-015 )

which I published in May 2019. Take a look at Slides 19 and 20 ( performance data and graphs for LINPACK tests ).

There are also performance data for matrix multiplication algorithms, Intel MKL vs. Strassen for Single-precision ( 24-bit ) and Double-precision ( 53-bit ).

I'd like to repeat that the Peak Processing Power of HTT-enabled applications is achieved when only one, and Only One, of the two Logical Processors of a Physical Core is used.
#57
efikkan
Onasi: Honestly, getting rid of HT seems weird. It doesn't really have any major downsides, and leaving any performance on the table is unlike Intel. Unless they are THAT sure that the new approach will compensate and more?
Back when HT was introduced with the Pentium 4, it was relatively "cheap" in terms of added complexity and die space, and it harnessed a lot of clock cycles of "free" performance.
As CPU frontends have become vastly more efficient, the relative potential has decreased. Along with ever more complex and wide CPU designs, security concerns and added complexity have made HT ever more costly to implement and maintain. It has long been overdue for a replacement, or for being dropped outright. These are development resources and die space which could be better spent.
Nekajo: Both Intel and AMD still use HT/SMT for a reason. Tons of software is optimized with this in mind.
Only indirectly, in terms of how many threads are spawned, etc.
Nekajo: Especially people with quad- and hexa-core chips should enable it for sure. It will make up for the lack of real cores.
Depends a lot on the workload. SMT (HT) does wonders for some, not for others, and can sometimes introduce a lot of latency too.

We have to remember that a new design without HT wouldn't be the same as turning HT off on an old design. It would mean Intel could have prioritized a lot of resources on other features, either a replacement or other design considerations. So unless they screwed up*, there will highly likely be new benefits from dropping HT.

*) With large overhauls there is a higher risk of unforeseen problems leading to delays, or even to disabled features.
TumbleGeorge: The main design of Arrow Lake is complete, so going without HT will maybe come with a next architecture.
Assuming Arrow Lake launches late this summer or fall, the entire design was completed by summer 2023 (tape-out), and the main design long before that.
So I think it's safe to assume that Intel and their trusted partners know which features are coming. ;)
dragontamer5788: Programmers in Linux and Windows land both have access to "Core Affinity" flags.<snip>
A little question on the side:
Do you have experience with the best approach to query each thread to find out various capabilities (SMT, big or little, and any future differences)? Is the CPUID instruction the preferred choice, or OS APIs?
#58
dragontamer5788
efikkan: A little question on the side:
Do you have experience with the best approach to query each thread to find out various capabilities (SMT, big or little, and any future differences)? Is the CPUID instruction the preferred choice, or OS APIs?
I have no experience, but I have knowledge... reading how others have solved this problem...

I know that there is various NUMA code to query the capabilities of sets of cores: learn.microsoft.com/en-us/windows/win32/procthread/numa-support . I expect Linux to have similar APIs, though named differently of course. The main issue is that NUMA is about memory differences, not core differences, so the focus of the NUMA APIs is closer to malloc/free. (Yes, some bits of memory are closer to Core#1 or Core#50... but NUMA has a focus on memory.) With big.LITTLE, there are newer APIs that handle the different cores, but I don't know them quite as well.
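On the Windows side, a minimal sketch of what such a query can look like, using GetLogicalProcessorInformationEx with RelationProcessorCore (to my understanding the EfficiencyClass field and the LTP_PC_SMT flag are how hybrid cores and SMT are reported, but verify the details against the documentation):

#include <windows.h>
#include <cstdio>
#include <vector>

int main()
{
    // First call asks how large the buffer must be; second call returns one record per physical core.
    DWORD len = 0;
    GetLogicalProcessorInformationEx( RelationProcessorCore, nullptr, &len );
    std::vector<char> buffer( len );
    if( !GetLogicalProcessorInformationEx( RelationProcessorCore,
            reinterpret_cast<PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX>( buffer.data() ), &len ) )
    {
        std::printf( "GetLogicalProcessorInformationEx failed: %lu\n", GetLastError() );
        return 1;
    }

    int core = 0;
    for( DWORD offset = 0; offset < len; )
    {
        auto info = reinterpret_cast<PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX>( buffer.data() + offset );
        // EfficiencyClass: higher means a "bigger" core on hybrid parts (0 everywhere on homogeneous CPUs).
        // Flags == LTP_PC_SMT: this physical core exposes more than one logical processor.
        std::printf( "Core %2d: EfficiencyClass=%u SMT=%s\n",
                     core++,
                     static_cast<unsigned>( info->Processor.EfficiencyClass ),
                     ( info->Processor.Flags == LTP_PC_SMT ) ? "yes" : "no" );
        offset += info->Size;
    }
    return 0;
}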

For HPC, a common pattern I've seen is to have a startup benchmark routine, where you attempt different strategies and perform a bit of self-optimization / self-parameterization. So in practice, even the highest-performance programs ignore a lot of this, just collect practical data, and then self-tune around the problem.
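A toy version of that startup self-tuning pattern, purely illustrative (it times one arbitrary kernel at a few thread counts and keeps the fastest; a real HPC code would calibrate its actual hot loops and probably pin threads too):

#include <chrono>
#include <cstdio>
#include <functional>
#include <thread>
#include <vector>

// The "kernel" being tuned: scale-and-sum a chunk of a big array.
static void kernel_chunk( const std::vector<double> &data, size_t begin, size_t end, double &out )
{
    double sum = 0.0;
    for( size_t i = begin; i < end; ++i ) sum += data[i] * 1.000001;
    out = sum;
}

// Run the kernel split across N threads and return the elapsed wall time in seconds.
static double run_with_threads( const std::vector<double> &data, unsigned threads )
{
    std::vector<std::thread> pool;
    std::vector<double> partial( threads, 0.0 );
    size_t chunk = data.size() / threads;
    auto t0 = std::chrono::steady_clock::now();
    for( unsigned t = 0; t < threads; ++t )
    {
        size_t begin = t * chunk;
        size_t end = ( t == threads - 1 ) ? data.size() : begin + chunk;
        pool.emplace_back( kernel_chunk, std::cref( data ), begin, end, std::ref( partial[t] ) );
    }
    for( auto &th : pool ) th.join();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>( t1 - t0 ).count();
}

int main()
{
    std::vector<double> data( 1 << 24, 1.0 );
    unsigned hw = std::thread::hardware_concurrency();
    if( hw == 0 ) hw = 1;
    unsigned best = 1;
    double best_time = 1e30;
    // Startup calibration: try 1, hw/2 and hw threads, keep whichever is fastest.
    for( unsigned candidate : { 1u, hw / 2 > 0 ? hw / 2 : 1u, hw } )
    {
        double t = run_with_threads( data, candidate );
        std::printf( "%u threads: %.3f s\n", candidate, t );
        if( t < best_time ) { best_time = t; best = candidate; }
    }
    std::printf( "Self-tuned thread count: %u\n", best );
    return 0;
}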
#59
ScaLibBDP
efikkan: Do you have experience with the best approach to query each thread to find out various capabilities (SMT, big or little, and any future differences)? Is the CPUID instruction the preferred choice, or OS APIs?
The CPUID instruction provides more details about processor features. OS APIs ( depending on the Operating System ) are recommended for Process or Thread affinity control and priorities.
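For the CPUID side, here is a minimal sketch of detecting a hybrid part and the current core's type (this assumes MSVC's __cpuidex intrinsic; the leaf numbers, leaf 7 EDX bit 15 for the "Hybrid" flag and leaf 0x1A for the core type, are from Intel's documentation as I recall it, so verify against the SDM before relying on them):

#include <intrin.h>     // __cpuidex
#include <cstdio>

int main()
{
    int regs[4] = { 0 };    // EAX, EBX, ECX, EDX

    // CPUID leaf 7, sub-leaf 0: EDX bit 15 is the "Hybrid" flag on Intel parts.
    __cpuidex( regs, 7, 0 );
    bool hybrid = ( regs[3] & ( 1 << 15 ) ) != 0;
    std::printf( "Hybrid CPU: %s\n", hybrid ? "yes" : "no" );

    if( hybrid )
    {
        // CPUID leaf 0x1A: EAX bits 31:24 give the core type of the *current* logical CPU
        // ( 0x20 = Atom / E-core, 0x40 = Core / P-core ). This must be executed while
        // pinned to the logical CPU you are asking about, e.g. with SetThreadAffinityMask.
        __cpuidex( regs, 0x1A, 0 );
        unsigned type = ( static_cast<unsigned>( regs[0] ) >> 24 ) & 0xFF;
        std::printf( "This logical CPU is a %s core ( type 0x%02X )\n",
                     ( type == 0x40 ) ? "P" : ( type == 0x20 ) ? "E" : "unknown", type );
    }
    return 0;
}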

Here is a piece of test code I used on Windows ( any version that supports the thread affinity APIs ) in order to control thread affinity:
...
SYSTEM_INFO si = { 0 };
::GetSystemInfo( &si );

RTBOOL bRc = RTFALSE;
RThandle hProcess = RTnull;
RThandle hThread = RTnull;
RTulong dwProcessMask = 0;
RTulong dwSystemMask = 0;
RTulong dwThreadAM = 0;
RTulong dwThreadAMPrev = 0;
RTulong dwThread1PrefferedCPU = 0;
DWORD dwErrorCode = 0;

hProcess = SysGetCurrentProcess();
hThread = SysGetCurrentThread();

bRc = ::GetProcessAffinityMask( hProcess, ( PDWORD_PTR )&dwProcessMask, ( PDWORD_PTR )&dwSystemMask );

RTint iCpuNum = ( 8 - 1 );
RTint iThreadAffinityMask = _RUN_ON_CPU_08;     // Default Logical CPU 07 at the beginning of Verification
                                                // Take into account that Logical CPUs are numbered from 0
dwThreadAMPrev = ::SetThreadAffinityMask( hThread, iThreadAffinityMask );
SysSleep( 0 );
dwErrorCode = SysGetLastError();
CrtPrintf( RTU("\t\tSwitched to Logical CPU%d - Previous Thread AM: %3d - Error Code: %3d\n"),
           iCpuNum, dwThreadAMPrev, dwErrorCode );
// Busy loop to keep the thread running on the selected Logical CPU for a while
for( RTuint i = 0; i < ( ( RTuint )16777216 * 128 ); i += 1 )
{
    volatile RTfloat fX = 32.0f;
    fX = ( RTfloat )i * ( fX * 2 ) * ( fX * 4 ) * ( fX * 8 );
}
SysSleep( 5000 );

iCpuNum = 0;

// Walk through Logical CPUs 0..7 by shifting the affinity mask one bit at a time
for( iThreadAffinityMask = 1; iThreadAffinityMask < 256; iThreadAffinityMask *= 2 )
{
    iCpuNum++;
    dwThreadAMPrev = ::SetThreadAffinityMask( hThread, iThreadAffinityMask );
    SysSleep( 0 );
    dwErrorCode = SysGetLastError();
    CrtPrintf( RTU("\t\tSwitched to Logical CPU%d - Previous Thread AM: %3d - Error Code: %3d - Thread Affinity: %3d\n"),
               ( iCpuNum - 1 ), dwThreadAMPrev, dwErrorCode, iThreadAffinityMask );
    // Busy loop on the newly selected Logical CPU
    for( RTuint i = 0; i < ( ( RTuint )16777216 * 128 ); i += 1 )
    {
        volatile RTfloat fX = 32.0f;
        fX = ( RTfloat )i * ( fX * 2 ) * ( fX * 4 ) * ( fX * 8 );
    }
    SysSleep( 5000 );
}
...
efikkan: Do you have experience with the best approach to query each thread to find out various capabilities (SMT, big or little, and any future differences)? Is the CPUID instruction the preferred choice, or OS APIs?
It is very easy to get the Time Stamp Counter value for a Logical CPU using the RDTSC instruction:
...
// Test-Case 3 - Retrieving RDTSC values for Logical CPUs
{
    CrtPrintf( RTU("\n\tTest-Case 3 - Retrieving RDTSC values for Logical CPUs - 1\n") );

    RTBOOL bRc = RTFALSE;
    RThandle hProcess = RTnull;
    RThandle hThread = RTnull;
    RTulong dwProcessMask = 0;
    RTulong dwSystemMask = 0;
    RTulong dwThreadAM = 0;
    RTulong dwThreadAMPrev1 = 0;
    RTulong dwThreadAMPrev2 = 0;
    RTulong dwThread1PrefferedCPU = 0;

    ClockV cvRdtscCPU1 = { 0 };     // RDTSC Value for Logical CPU1
    ClockV cvRdtscCPU2 = { 0 };     // RDTSC Value for Logical CPU2

    while( RTtrue )
    {
        hProcess = SysGetCurrentProcess();
        hThread = SysGetCurrentThread();

        bRc = ::GetProcessAffinityMask( hProcess, ( PDWORD_PTR )&dwProcessMask, ( PDWORD_PTR )&dwSystemMask );
        if( bRc == RTFALSE )
        {
            CrtPrintf( RTU("\t\tError: [ GetProcessAffinityMask ] failed\n") );
            break;
        }

        // Raise process and thread priorities to minimize preemption between the two RDTSC reads
        bRc = SysSetPriorityClass( hProcess, REALTIME_PRIORITY_CLASS );
        if( bRc == RTFALSE )
        {
            CrtPrintf( RTU("\t\tError: [ SetPriorityClass ] failed\n") );
            break;
        }
        bRc = SysSetThreadPriority( hThread, THREAD_PRIORITY_TIME_CRITICAL );
        if( bRc == RTFALSE )
        {
            CrtPrintf( RTU("\t\tError: [ SetThreadPriority ] failed\n") );
            break;
        }

        dwThreadAMPrev1 = ::SetThreadAffinityMask( hThread, _RUN_ON_CPU_01 );
        SysSleep( 0 );
        cvRdtscCPU1.uiClockV = __rdtsc();

        // The second switch below still targets _RUN_ON_CPU_01;
        // uncomment the _RUN_ON_CPU_02 line to compare two different Logical CPUs.
        dwThreadAMPrev1 = ::SetThreadAffinityMask( hThread, _RUN_ON_CPU_01 );
        // dwThreadAMPrev2 = ::SetThreadAffinityMask( hThread, _RUN_ON_CPU_02 );
        SysSleep( 0 );
        cvRdtscCPU2.uiClockV = __rdtsc();

        // Restore normal priorities before printing the results
        SysSetPriorityClass( hProcess, NORMAL_PRIORITY_CLASS );
        SysSetThreadPriority( hThread, THREAD_PRIORITY_NORMAL );

        CrtPrintf( RTU("\t\tRDTSC for Logical CPU1 : %.0f\n"), ( RTfloat )cvRdtscCPU1.uiClockV );
        CrtPrintf( RTU("\t\tRDTSC for Logical CPU2 : %.0f\n"), ( RTfloat )cvRdtscCPU2.uiClockV );
        CrtPrintf( RTU("\t\tRDTSC Difference: %.0f ( RDTSC2 - RDTSC1 )\n"),
                   ( RTfloat )( cvRdtscCPU2.uiClockV - cvRdtscCPU1.uiClockV ) );
        CrtPrintf( RTU("\t\tdwThreadAMPrev1 : %3d ( Processing Error if 0 )\n"), dwThreadAMPrev1 );
        CrtPrintf( RTU("\t\tdwThreadAMPrev2 : %3d ( Processing Error if 0 )\n"), dwThreadAMPrev2 );

        break;
    }
}
...
#61
ScaLibBDP
TumbleGeorge: Why sleep so long?
...
Sleep( 5000 ); // 5 seconds
...
It is used to see a Logical Processor switch in Windows Task Manager. Please take a look at:

Time Stamp Counters of Logical CPUs on a Multi-Core Computer System with Windows 7 ( VTR-184 )

in order to see how Logical Processors are switched during real time test processing.

I'd like to mention one more thing, and it is Very Important: call Sleep( 0 ) after the switch is done, because a couple of hundred nanoseconds are needed to make a real physical switch, for example, from Logical Processor 1 to Logical Processor 2.
#62
efikkan
dragontamer5788: For HPC, a common pattern I've seen is to have a startup benchmark routine, where you attempt different strategies and perform a bit of self-optimization / self-parameterization. So in practice, even the highest-performance programs ignore a lot of this, just collect practical data, and then self-tune around the problem.
I can see that working well for large batch applications, especially if they aren't latency-sensitive, but for games or certain desktop applications, synchronization can quickly cause a lot of latency.

If CPUs continue to become more and more diverse, having generic and reliable ways to determine capabilities would be necessary to scale well across generations, as well as across anything from low-end CPUs to high-end or multi-CPU systems. Certain software could be very sensitive to this, and it could ultimately impact end users' purchasing choices.

For HPC users running configurable or even custom software, I can imagine even manual calibration would be desirable; they usually don't run on 1000 different hardware configurations. :)
ScaLibBDP: The CPUID instruction provides more details about processor features. OS APIs ( depending on the Operating System ) are recommended for Process or Thread affinity control and priorities.

Here is a piece of test code I used on Windows ( any version that supports the thread affinity APIs ) in order to control thread affinity:
<snip>
Thanks. I've saved that for later. :)
I haven't had time to dive into handling "hybrid" CPU designs yet, but probably I have to eventually.
P.S. you might want to throw a spoiler tag around that code. ;)
#63
TumbleGeorge
Hmm, interesting video, especially the part with the multi-core test. It is clear to see why the performance is not equal to the sum of the IPC of all cores. Overlay?
#64
unwind-protect
R0H1T: No, this is exactly for that reason; security on shared or private servers is a major issue. No one gives two hoots about HT on your gaming rig o_O
Until CPU vulnerabilities can be triggered from JavaScript in a web browser. I don't agree that ignoring CPU vulnerabilities even on gaming rigs is a long-term option, unless you truly have a dedicated gaming machine and another computer for all serious use.
dragontamer5788: It's not even clear whether big.LITTLE will be a better plan than SMT (2 threads per core, or even IBM-style 4 or 8 threads per core). SMT, as you point out, is symmetrical. All cores can be treated as equivalent, which greatly eases any scheduling algorithm.
That is not true when the number of threads/processes to schedule is greater than one but less than the number of real cores.

Imagine you have 2 threads munching around on a hyperthreaded CPU. It is imperative that you put them onto different real cores.
dragontamer5788: Honestly, I think the solution is "don't automatically figure it out". Instead, defer to the programmer.

Programmers in Linux and Windows land both have access to "Core Affinity" flags (learn.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-setprocessaffinitymask and man7.org/linux/man-pages/man3/pthread_setaffinity_np.3.html). Modern programmers will likely have to pay attention to affinity more often. Threads without an affinity set would default to LITTLE cores (which are more plentiful, and which would encourage programmers to pick big cores when they need them).

Or... something like that. On the other hand, maybe this would just cause a bunch of programmers to copy/paste code to always use big cores, and that's also counter-productive. Hmmmmmmmmm. It's almost an economics game; the OS writers don't really control the application programmers.
That scheme breaks down if you have more than one application on the machine, even if you just have multiple instances of the same program. Each instance could grab all the premium CPU resources.

When there are conflicting workloads there needs to be a separate authority to distribute compute resources, including access to premium cores. That is the OS scheduler.

I have personally experimented with deliberate core placement (in a server-class application) and I always gave up because the OS scheduler did a better job as far as overall system throughput is concerned.
#65
dragontamer5788
unwind-protect: That is not true when the number of threads/processes to schedule is greater than one but less than the number of real cores.

Imagine you have 2 threads munching around on a hyperthreaded CPU. It is imperative that you put them onto different real cores.
Fair point. SMT / Hyper-Threads aren't perfectly symmetrical (logical core vs. physical core matters), but it's still easier than big.LITTLE. So maybe I was a bit too "absolute" with my earlier language, but the gist is still correct: big.LITTLE is much harder to schedule (and is much worse in practice than in theory because of this difficulty). But it seems like a solvable problem nonetheless (just not solved yet, today).
#66
unwind-protect
ratirt: Does HT impact power consumption in any meaningful way? If it does, maybe that is Intel's goal: substitute HT with E-cores to reduce power consumption.
I don't think the issue is power; the issue is silicon (die) space. The machinery for HT on a P-core is probably as big as or bigger than an entire E-core. HT brings you 15-20%, while another E-core brings you 50% of a P-core.
#67
dragontamer5788
unwind-protect: I don't think the issue is power; the issue is silicon (die) space. The machinery for HT on a P-core is probably as big as or bigger than an entire E-core. HT brings you 15-20%, while another E-core brings you 50% of a P-core.
I severely doubt that. HT recycles resources like there's no tomorrow. The "biggest" portion of HT is just the decoder.

But the register files, reorder buffer, L1 / L2 data caches, vector units, pipelines, and more are all shared / recycled. That's more than 80% of the core. The remaining 20% (decoder, branch prediction, L1 instruction cache) is still needed even for single-thread-per-core applications, but obviously needs to be bigger for a Hyper-Thread.
#68
unwind-protect
dragontamer5788: Fair point. SMT / Hyper-Threads aren't perfectly symmetrical (logical core vs. physical core matters), but it's still easier than big.LITTLE. So maybe I was a bit too "absolute" with my earlier language, but the gist is still correct: big.LITTLE is much harder to schedule (and is much worse in practice than in theory because of this difficulty). But it seems like a solvable problem nonetheless (just not solved yet, today).
Complete agreement.

Part of the problem here is that Windows, which supposedly has the most advanced integration with Intel's "Thread Director", is not open source. We can see the Linux scheduler, but they have been fiddling with it so much in just the last two months that it is hard to see what is going on.
dragontamer5788: I severely doubt that. HT recycles resources like there's no tomorrow. The "biggest" portion of HT is just the decoder.

But the register files, reorder buffer, L1 / L2 data caches, vector units, pipelines, and more are all shared / recycled. That's more than 80% of the core. The remaining 20% (decoder, branch prediction, L1 instruction cache) is still needed even for single-thread-per-core applications, but obviously needs to be bigger for a Hyper-Thread.
Yeah, but how much bigger is a current P-core than a current E-core? I thought it was a factor of 10. So the math might still work out in favor of E-cores instead of HT, die-space wise.
#69
dragontamer5788
unwind-protect: Yeah, but how much bigger is a current P-core than a current E-core? I thought it was a factor of 10. So the math might still work out in favor of E-cores instead of HT, die-space wise.
www.semianalysis.com/p/meteor-lake-die-shot-and-architecture



I'd estimate 5x E-cores vs. 1x P-core, just spitballing by looking at this die shot. A lot of this is because 768 kB of L2 cache per E-core (3 MB shared between 4 E-cores) is just naturally going to be smaller than 2 MB per P-core.

You're right that this is a magnitude smaller than I thought, though not quite the 1-to-10 ratio you initially assumed.
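Spelling out the back-of-the-envelope math with the numbers used in this thread (a ~5:1 P-core to E-core area ratio, HT worth ~15-20% of a P-core, an E-core worth ~50% of a P-core; all rough figures, not measurements):

Assume the area of 1 P-core ~= the area of 5 E-cores.

Under your premise ( HT machinery ~= 1 E-core of area ~= 0.2 P-core ):
    HT:      ~0.15-0.20 P-core of extra throughput per 0.2 P-core of area
    E-core:  ~0.50      P-core of extra throughput per 0.2 P-core of area
    -> E-cores deliver roughly 2.5-3x more throughput per unit of die area than HT.

Under my premise ( HT machinery is only a few percent of the core ):
    HT:      ~0.15-0.20 P-core of extra throughput per ~0.05 P-core of area
    -> the comparison flips and HT becomes the cheaper throughput.

So the disagreement is really about how much area the HT machinery costs, not about the throughput numbers.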
#70
efikkan
unwind-protect: That scheme breaks down if you have more than one application on the machine, even if you just have multiple instances of the same program. Each instance could grab all the premium CPU resources.
<snip>
That's absolutely a fair point, and it is one of my primary concerns too.
That's why I think it's mainly useful to know the number of different classes of resources, like P-cores and E-cores, and whether these have SMT or not; and for non-x86, whether SMT is 2-way, 4-way or 8-way, or if there are more exotic core configurations (aren't there ARM designs with three different core types?). Assuming all "threads" are equal can result in sub-optimal performance in synchronous workloads.

As we all know, no piece of code will scale perfectly under all circumstances, but at least it would be useful to have some kind of feature detection so an application/game doesn't completely "sabotage" itself if Intel or AMD releases a new "unusual" P-core/E-core mix. :)
#71
ScaLibBDP
unwind-protect: Imagine you have 2 threads munching around on a hyperthreaded CPU. It is imperative that you put them onto different real cores.
That is correct. A software developer should take into account what processor units could be used for processing.

It means that, in the case of HPC and floating-point-arithmetic-based processing, the Floating Point Unit ( FPU ) needs to be used by just one thread (!). This is because there is just one FPU in a core, and it is shared between the Logical Processors. A sketch of that one-thread-per-core rule follows after the specs below.

For example, for Intel Xeon Phi processors with 64 cores and 4 hardware threads per core ( 256 Logical Processors ), only one thread per core needs to be used to achieve Peak Processing Power. I verified that rule on an Intel Xeon Phi Processor 7210, and here are its specs:

ark.intel.com/products/94033/Intel-Xeon-Phi-Processor-7210-16GB-1_30-GHz-64-core
Intel Xeon Phi Processor 7210 ( 16GB, 1.30 GHz, 64 core )
Processor name: Intel Xeon Phi 7210
Packages ( sockets ): 1
Cores: 64
Processors ( CPUs ): 256
Cores per package: 64
Threads per core: 4
Peak Processing Power: 2.662 TFLOPS, calculated as follows: 1.30 GHz * 64 cores * ( 512-bit / 32-bit ) * 2 / Note: Single-Precision ( 23-bit ) data type
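Here is the sketch referred to above: spawn one worker per physical core and pin each with SetThreadAffinityMask, under the assumption of 2-way SMT with sibling logical CPUs numbered adjacently (common on Windows, but not guaranteed; use a topology query, as discussed earlier in the thread, if you need to be sure):

#include <windows.h>
#include <cstdio>
#include <thread>
#include <vector>

static void worker( int physical_core )
{
    // Assumption: logical CPUs 2n and 2n+1 are the two SMT siblings of physical core n,
    // so pinning to bit ( 2 * physical_core ) leaves the sibling idle and the FPU uncontended.
    DWORD_PTR mask = static_cast<DWORD_PTR>( 1 ) << ( 2 * physical_core );
    if( SetThreadAffinityMask( GetCurrentThread(), mask ) == 0 )
    {
        std::printf( "SetThreadAffinityMask failed for core %d\n", physical_core );
        return;
    }
    // ... floating-point heavy work for this core goes here ...
}

int main()
{
    const int physical_cores = 4;   // example value; query the topology for the real count
    std::vector<std::thread> pool;
    for( int c = 0; c < physical_cores; ++c )
        pool.emplace_back( worker, c );
    for( auto &t : pool )
        t.join();
    return 0;
}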
TumbleGeorge: Hmm, interesting video, especially the part with the multi-core test. It is clear to see why the performance is not equal to the sum of the IPC of all cores. Overlay?
For a quad-core processor with two hardware threads per core, a bar in the Windows Task Manager usually reaches ~98%-99% when one hardware thread per core is used.

When thread affinity control is Not used, the total sum could Not be equal to 100% because of the Non-Deterministic nature of Non-Real-Time Operating Systems.
#72
Beermotor
Gamer Meld doesn't understand what multithreading means.


Edit: skip to 5 min for the full facepalm. Dude renamed the video after folks started correcting him in the comments.
#73
Nekajo
Redwoodz: :rolleyes: Just go back to Intel, no one cares. You are arguing about a 2% difference in performance that no one really notices most of the time.
You seem to care :D
ThrashZone: Hi,
MS doesn't give a crap about desktop; the only thing they care about is the mobile world, since it aligns with OneDrive storage/....
Desktops are a thorn in their backside; they wish desktops would go away, and use a lower carbon footprint as the reason why, lol
What are you talking about? Their main income is literally from enterprise and desktop. With no focus on the Windows desktop they would lose the entire reason for companies to go for both Azure and Office 365 subs.

Microsoft sits at something like 99% market share in the enterprise sector
#74
remixedcat
Bwaze: Nah, the disabling of HT will be paraded as a great security upgrade, and AMD cited as the unsafe company that still uses the inherently unsafe tech (although many of the security risks were Intel-specific).
Safe for their wallet
Nekajo: You seem to care :D


What are you talking about? Their main income is literally from enterprise and desktop. With no focus on the Windows desktop they would lose the entire reason for companies to go for both Azure and Office 365 subs.

Microsoft sits at something like 99% market share in the enterprise sector
Nope... Linux dominates servers
#75
Nekajo
remixedcat: Safe for their wallet


Nope... Linux dominates servers
For desktop? No

Linux dominates servers, yeah, but on the desktop Linux is at a sub-1% market share; even macOS is much higher, but Windows dominates with 90-95% or so

I use Arch Linux and Debian for my own servers, but for desktop I'd not be touching it for sure