• Welcome to TechPowerUp Forums, Guest! Please check out our forum guidelines for info related to our community.
  • The forums have been upgraded with support for dark mode. By default it will follow the setting on your system/browser. You may override it by scrolling to the end of the page and clicking the gears icon.

Intel "Sapphire Rapids," "Alder Lake" and "Tremont" Feature CLDEMOTE Instruction

btarunr

Editor & Senior Moderator
Staff member
Joined
Oct 9, 2007
Messages
47,696 (7.42/day)
Location
Dublin, Ireland
System Name RBMK-1000
Processor AMD Ryzen 7 5700G
Motherboard Gigabyte B550 AORUS Elite V2
Cooling DeepCool Gammax L240 V2
Memory 2x 16GB DDR4-3200
Video Card(s) Galax RTX 4070 Ti EX
Storage Samsung 990 1TB
Display(s) BenQ 1440p 60 Hz 27-inch
Case Corsair Carbide 100R
Audio Device(s) ASUS SupremeFX S1220A
Power Supply Cooler Master MWE Gold 650W
Mouse ASUS ROG Strix Impact
Keyboard Gamdias Hermes E2
Software Windows 11 Pro
Intel's three upcoming processor microarchitectures, namely the next-generation Xeon "Sapphire Rapids," Core "Alder Lake," and low-power "Tremont" cores found in Atom, Pentium Silver, Celeron, and even Core Hybrid processors, will feature a new instruction set that aims to speed up processor cache performance, called CLDEMOTE "cache line demote." This is a means for the operating system to tell a processor core that a specific content of a cache (a cache line), isn't needed to loiter around in a lower cache level (closer to the core), and can be demoted to a higher cache level (away from the core); though not flushed back to the main memory.

There are a handful benefits to what CLDEMOTE does. Firstly, it frees up lower cache levels such as L1 and L2, which are smaller in size and dedicated to a CPU core, by pushing cache lines to the last-level cache (usually L3). Secondly, it enables rapid load movements between cores by pushing cache lines to L3, which is shared between multiple cores; so it could be picked up by a neighboring core. Dr. John McCalpin from UT Austin wrote a detailed article on CLDEMOTE.



View at TechPowerUp Main Site
 
Alder Lake it is then.
 
This is great and I thank Intel for including the innovation, but as we know from the slow adoption of AVX512 it is dependent on developers to support these instructions.
 
This is great and I thank Intel for including the innovation, but as we know from the slow adoption of AVX512 it is dependent on developers to support these instructions.
this is a means for the operating system to tell a processor core that a specific content of a cache (a cache line), isn't needed to loiter around in a lower cache level (closer to the core), and can be demoted to a higher cache level (away from the core); though not flushed back to the main memory.
 
This is great and I thank Intel for including the innovation, but as we know from the slow adoption of AVX512 it is dependent on developers to support these instructions.
This is always a problem; it takes several years before a significant portion of the users have hardware support, and this may require multiple versions of software for backwards compatibility.

Unfortunately most software is lagging >15 years behind in terms of ISA. Your OS (Windows or Linux), most libraries and applications are all compiled for x86-64/SSE2. Only a small selection of demanding applications use more modern features. This is unfortunate since we are missing out on a lot of free performance. The Linux distribution "Intel Clear Linux" demonstrates this, where some core libraries and applications are optimized for modern ISA features, and gets a good portion of performance improvements.

this is a means for the operating system to tell a processor core that a specific content of a cache (a cache line), isn't needed to loiter around in a lower cache level (closer to the core), and can be demoted to a higher cache level (away from the core); though not flushed back to the main memory.
Really?
The OS scheduler lives on a completely different time scale (ms), while data residing in CPU caches usually stay there for nanoseconds to microseconds. I know the article mentions this, but I don't understand where it gets it from.

This sounds like something that can help synchronizing data between cores, yet probably have a very limited use case.
 
this is a means for the operating system to tell a processor core that a specific content of a cache (a cache line), isn't needed to loiter around in a lower cache level (closer to the core), and can be demoted to a higher cache level (away from the core); though not flushed back to the main memory.

Sweet copy paste.

If this was managed dynamically at an OS level it would be hideously slow, if it was managed in silicon it would cause some strange performance inconsistencies with no way to turn it off.

It will most likely need to be part of the OS Kernel, similar to the Intel Clear Linux @efikkan mentions. It will be useful in fringe circumstances until very widespread hardware adoption.
 
This is always a problem; it takes several years before a significant portion of the users have hardware support, and this may require multiple versions of software for backwards compatibility.

Unfortunately most software is lagging >15 years behind in terms of ISA. Your OS (Windows or Linux), most libraries and applications are all compiled for x86-64/SSE2. Only a small selection of demanding applications use more modern features. This is unfortunate since we are missing out on a lot of free performance. The Linux distribution "Intel Clear Linux" demonstrates this, where some core libraries and applications are optimized for modern ISA features, and gets a good portion of performance improvements.


Really?
The OS scheduler lives on a completely different time scale (ms), while data residing in CPU caches usually stay there for nanoseconds to microseconds. I know the article mentions this, but I don't understand where it gets it from.

This sounds like something that can help synchronizing data between cores, yet probably have a very limited use case.

I think that's a good thing. Original x86-64 and SSE2 patents are in the verge of expiring. Supporting AVX for regular applications further locks in x86 control of the market - specifically Intel's. Sure, we now have AMD in the running again, but fortunes can quickly change, and we can have another lost decade, lead by Intel. Best to break Intel's dominance in every possible aspect while it's still feasible.
 
Incoming mitigation in 3...2...1....
 
I think that's a good thing. Original x86-64 and SSE2 patents are in the verge of expiring. Supporting AVX for regular applications further locks in x86 control of the market - specifically Intel's. Sure, we now have AMD in the running again, but fortunes can quickly change, and we can have another lost decade, lead by Intel. Best to break Intel's dominance in every possible aspect while it's still feasible.
AMD have full access to AVX features, and have even made major contributions to it such as FMA.

Intel is partly to blame for the AVX adoption rate though, as Celeron and Pentium models are still lacking support.

As hardware is progressing, but software fails to keep up, we are loosing out on more and more performance potential. I wish that as Microsoft are phasing out 32-bit Windows they would transition to having two 64-bit versions of Windows; a "current" version(e.g. Haswell/Bulldozer or Skylake/Zen ISA level) and a "legacy" version (x86-64/SSE2). Just recompiling the kernel and core libraries would easily yield ~5-10% of free performance, and much more in certain cases by writing optimized code using intrinsics. Such improvements would benefit pretty much all applications, and would help achieving better energy efficiency. For a company which spends thousands of developers on all kinds of gimmicks, this would be a much smarter initiative.
 
So L3 cache is faster than L1 or L2 when another core is working on it?
It makes sense I guess, but I'd never thought of it like that before....
 
So L3 cache is faster than L1 or L2 when another core is working on it?
It makes sense I guess, but I'd never thought of it like that before....
In a way, as L3 is accessible from other cores.
The memory/cache system inside CPUs are much more sophisticated than you think. When the CPU intends to read or write to an address, it fetches a whole cache line (64 bytes) from memory into L2, even if it's only going to write a single byte. There are various locks in place in case cores needs to write to addresses within the same cache line, so I guess if a core can free this lock earlier, it can help in some cases. Otherwise it needs to wait until it's discarded.
 
A core reads and writes to the cache millions of times a second, I can't see how this wouldn't come with a massive overhead. Not only that but you don't really get to know what exactly happens to these cache lines so how exactly you'd even go about to implement this I don't know.
 
maybe it's just alder lake specific thing for big-small configuration ?
 
yesteryear's performance
XD that was funny!

What i meant was that Alder Lake platform (LGA 1700) is gonna be the Intel platform worth buying!

And of course anyone looking for a new PC or simply an upgrade should and must consider price-performance value of the products he/she plans to purchase. That's like the number one rule, no two ways about it.

Right now you can have a 3900X for 10700K price, i mean it's crazy the reality we live in today.
And being only 7% slower at 1080p in games according to
, the 3900X is the clear choice due to its higher core count.
 
maybe it's just alder lake specific thing for big-small configuration ?
It will also be supported by server CPUs such as Sapphire Rapids, so I guess not. What makes you think it does?


For those wanting to learn more about CPU caches and how it impacts software, I highly recommend watching this lecture. Watch at least ~32:47-43:00, but watching the whole thing is well worth an hour of your life. I recommend watching it if you're either interested in how CPU caches work in principle, or is a programmer. The principles explained here are essential for any performant code, and is a subject only becoming more relevant as CPUs are getting more powerful but more reliant on efficient cache usage.

The video explains why e.g. false sharing is a huge problem for scaling with multiple cores, and also states that there is no fix for it yet, but this new CLDEMOTE instruction might just help with some of these cases. Actual sharing have a similar behavior as well, and while that will always have a significant cost tied to it, this new instruction might help reducing that cost and improve scaling.

What's interesting to me is whether this new instruction is targeted only for specific edge cases, or if this is something that helps a lot of common cases of unnecessary stalls in the CPU.

And of course anyone looking for a new PC or simply an upgrade should and must consider price-performance value of the products he/she plans to purchase. That's like the number one rule, no two ways about it.

Right now you can have a 3900X for 10700K price, i mean it's crazy the reality we live in today.
And being only 7% slower at 1080p in games according to
, the 3900X is the clear choice due to its higher core count.
That makes no sense. What matters is performance in the workloads relevant to the buyer, not that it has higher core count.

If 3% lower performance in 1440p (and 7% in 1080p but who cares) matters is up to the end user. If you buy a $800 graphics card, then I could argue that $24 of that is "wasted" due to the CPU. The i7-10700K and i9-10900K models are quite a bit snappier in Photoshop and Premiere, on top of office work and web browsing. These are things which may be highly relevant to some buyers, and much more important than the core count.

When considering prices, the buyer must always look at their local prices. A much bigger problem for Comet Lake is availability. I have yet to see any of the K-models in stock, and many stores expect delivery in August. Of course, I can't know for sure if some are shipping or not. But if availability is practically nonexistent globally, then it's pretty much dead on arrival. It doesn't matter how good it is if you can't buy it.
 
The 10700K TPU review indicates that in CPU tests
, the 3900X is 6.6% faster in applications than 10700K, 7% slower at 1080p in games and the performance per dollar (TPU review says the Intel Core i7-10700K retails for around $400) is 1% higher for the 10700K assuming the $400 price tag.


In Photoshop and Office the difference is less than 100 milliseconds. (58.1 milliseconds for Photoshop, 73.4 milliseconds for Word, 86.8 milliseconds for PowerPoint and 27.8 milliseconds for Excel)
And Premiere shows that the difference is less than 10%. (6%)

In web browser performance the difference is less than 10%. (8% for Google Octane and 6% for WebXPRT)
Actually in Mozilla Kraken, the 3900X is quite a bit snappier.


A CPU that has more threads enables more multitasking, although in order to do that, applications in your daily usage must support multithreading which modern apps support by default. And so 3900X is the better option because not only it has comparable performance and price compared to 10700K, it also has, you guessed it, higher thread count which allows you to do more with your PC.
As for the comparable price globally, for example in Germany, you can purchase the 10700K for 429 euros and the 3900X for 419 euros. The purchasing decision is up to the buyer of course. Here are the links: (June 7th, 2020)

At the end of the day, everyone has their own preferences. We're not gonna force anything (or any idea) on anyone. Intel, AMD and NVIDIA are just brands, what gives them value is the performance they bring to the market. Intel enabled HT from i9 all the way to i3 with 10th Gen Comet Lake processors, an example of bringing performance to the market.
 
Back
Top