
Intel "Sapphire Rapids," "Alder Lake" and "Tremont" Feature CLDEMOTE Instruction

btarunr

Editor & Senior Moderator
Staff member
Joined
Oct 9, 2007
Messages
46,349 (7.68/day)
Location
Hyderabad, India
System Name RBMK-1000
Processor AMD Ryzen 7 5700G
Motherboard ASUS ROG Strix B450-E Gaming
Cooling DeepCool Gammax L240 V2
Memory 2x 8GB G.Skill Sniper X
Video Card(s) Palit GeForce RTX 2080 SUPER GameRock
Storage Western Digital Black NVMe 512GB
Display(s) BenQ 1440p 60 Hz 27-inch
Case Corsair Carbide 100R
Audio Device(s) ASUS SupremeFX S1220A
Power Supply Cooler Master MWE Gold 650W
Mouse ASUS ROG Strix Impact
Keyboard Gamdias Hermes E2
Software Windows 11 Pro
Intel's three upcoming processor microarchitectures, namely the next-generation Xeon "Sapphire Rapids," Core "Alder Lake," and low-power "Tremont" cores found in Atom, Pentium Silver, Celeron, and even Core Hybrid processors, will feature a new instruction that aims to speed up processor cache performance, called CLDEMOTE ("cache line demote"). This is a means for the operating system to tell a processor core that a specific content of a cache (a cache line), isn't needed to loiter around in a lower cache level (closer to the core), and can be demoted to a higher cache level (away from the core); though not flushed back to the main memory.

There are a handful of benefits to what CLDEMOTE does. Firstly, it frees up lower cache levels such as L1 and L2, which are smaller in size and dedicated to a CPU core, by pushing cache lines to the last-level cache (usually L3). Secondly, it enables rapid load movements between cores by pushing cache lines to L3, which is shared between multiple cores, so they can be picked up by a neighboring core. Dr. John McCalpin from UT Austin wrote a detailed article on CLDEMOTE.
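
Whether a given CPU has the instruction can be checked at run time. The following is a minimal sketch in C, assuming a GCC or Clang toolchain (it relies on their `<cpuid.h>` helper) and assuming the feature flag sits in CPUID leaf 7, sub-leaf 0, ECX bit 25. The function name `cpu_has_cldemote` is made up for illustration.

```c
#include <stdbool.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang helper for the CPUID instruction */
#endif

/* Assumption: CPUID leaf 7, sub-leaf 0 reports CLDEMOTE in ECX bit 25. */
#define CLDEMOTE_ECX_BIT 25u

/* Hypothetical helper: true if the CPU advertises CLDEMOTE. */
bool cpu_has_cldemote(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    /* __get_cpuid_count returns 0 when the requested leaf is unavailable. */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;

    return ((ecx >> CLDEMOTE_ECX_BIT) & 1u) != 0;
#else
    return false; /* not an x86 build, so the instruction cannot exist */
#endif
}
```

On a CPU where this returns true, GCC and Clang should expose the instruction as the `_cldemote` intrinsic in `<immintrin.h>` when building with `-mcldemote`. It is only a hint, so the hardware is free to ignore it.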



 
Joined
Apr 15, 2020
Messages
109 (0.07/day)
System Name Old friend
Processor 3550 Ivy Bridge x 39.0 Multiplier
Memory 2x8GB 2400 RipjawsX
Video Card(s) 970 Maxwell STRIX-GTX970-DC2OC-4GD5
Alder Lake it is then.
 
Joined
Sep 26, 2006
Messages
464 (0.07/day)
This is great and I thank Intel for including the innovation, but as we know from the slow adoption of AVX512 it is dependent on developers to support these instructions.
 
Joined
Aug 6, 2017
Messages
7,412 (3.03/day)
Location
Poland
System Name Purple rain
Processor 10.5 thousand 4.2G 1.1v
Motherboard Zee 490 Aorus Elite
Cooling Noctua D15S
Memory 16GB 4133 CL16-16-16-31 Viper Steel
Video Card(s) RTX 2070 Super Gaming X Trio
Storage SU900 128,8200Pro 1TB,850 Pro 512+256+256,860 Evo 500,XPG950 480, Skyhawk 2TB
Display(s) Acer XB241YU+Dell S2716DG
Case P600S Silent w. Alpenfohn wing boost 3 ARGBT+ fans
Audio Device(s) K612 Pro w. FiiO E10k DAC,W830BT wireless
Power Supply Superflower Leadex Gold 850W
Mouse G903 lightspeed+powerplay,G403 wireless + Steelseries DeX + Roccat rest
Keyboard HyperX Alloy SilverSpeed (w.HyperX wrist rest),Razer Deathstalker
Software Windows 10
Benchmark Scores A LOT
This is great and I thank Intel for including the innovation, but as we know from the slow adoption of AVX512 it is dependent on developers to support these instructions.
this is a means for the operating system to tell a processor core that a specific content of a cache (a cache line), isn't needed to loiter around in a lower cache level (closer to the core), and can be demoted to a higher cache level (away from the core); though not flushed back to the main memory.
 
Joined
Jun 10, 2014
Messages
2,900 (0.81/day)
Processor AMD Ryzen 9 5900X ||| Intel Core i7-3930K
Motherboard ASUS ProArt B550-CREATOR ||| Asus P9X79 WS
Cooling Noctua NH-U14S ||| Be Quiet Pure Rock
Memory Crucial 2 x 16 GB 3200 MHz ||| Corsair 8 x 8 GB 1333 MHz
Video Card(s) MSI GTX 1060 3GB ||| MSI GTX 680 4GB
Storage Samsung 970 PRO 512 GB + 1 TB ||| Intel 545s 512 GB + 256 GB
Display(s) Asus ROG Swift PG278QR 27" ||| Eizo EV2416W 24"
Case Fractal Design Define 7 XL x 2
Audio Device(s) Cambridge Audio DacMagic Plus
Power Supply Seasonic Focus PX-850 x 2
Mouse Razer Abyssus
Keyboard CM Storm QuickFire XT
Software Ubuntu
This is great and I thank Intel for including the innovation, but as we know from the slow adoption of AVX512 it is dependent on developers to support these instructions.
This is always a problem; it takes several years before a significant portion of the users have hardware support, and this may require multiple versions of software for backwards compatibility.

Unfortunately, most software is lagging >15 years behind in terms of ISA. Your OS (Windows or Linux), most libraries and applications are all compiled for x86-64/SSE2. Only a small selection of demanding applications use more modern features. This is unfortunate since we are missing out on a lot of free performance. The Linux distribution "Intel Clear Linux" demonstrates this: some core libraries and applications are optimized for modern ISA features and get a good portion of the performance improvements.
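
One common way to ship "multiple versions" inside a single binary is run-time dispatch: build a baseline routine and a newer-ISA routine, then pick one after asking the CPU what it supports. Below is a rough sketch in C, assuming GCC or Clang on x86 (it uses their `__builtin_cpu_supports` builtin and `target` attribute). The function names are hypothetical, and AVX2 is just an example feature level.

```c
#include <stddef.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define X86_DISPATCH 1
#else
#define X86_DISPATCH 0
#endif

typedef long (*sum_fn)(const int *values, size_t count);

/* Baseline build: uses only what the compiler targets by default. */
static long sum_baseline(const int *values, size_t count)
{
    long total = 0;
    for (size_t i = 0; i < count; i++)
        total += values[i];
    return total;
}

#if X86_DISPATCH
/* Same source, but the compiler is allowed to use AVX2 for this copy. */
__attribute__((target("avx2")))
static long sum_avx2(const int *values, size_t count)
{
    long total = 0;
    for (size_t i = 0; i < count; i++)
        total += values[i];
    return total;
}
#endif

/* Pick the best routine the running CPU can execute. */
sum_fn pick_sum(void)
{
#if X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
        return sum_avx2;
#endif
    return sum_baseline;
}

/* Convenience wrapper; real code would cache the chosen pointer. */
long sum_ints(const int *values, size_t count)
{
    return pick_sum()(values, count);
}
```

Both routines return the same result. The only difference is which instructions the compiler may emit, so the program still runs on older CPUs.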

this is a means for the operating system to tell a processor core that a specific content of a cache (a cache line), isn't needed to loiter around in a lower cache level (closer to the core), and can be demoted to a higher cache level (away from the core); though not flushed back to the main memory.
Really?
The OS scheduler lives on a completely different time scale (ms), while data residing in CPU caches usually stays there for nanoseconds to microseconds. I know the article mentions this, but I don't understand where it gets it from.

This sounds like something that can help synchronize data between cores, yet it probably has a very limited use case.
 
Joined
Sep 26, 2006
Messages
464 (0.07/day)
this is a means for the operating system to tell a processor core that a specific content of a cache (a cache line), isn't needed to loiter around in a lower cache level (closer to the core), and can be demoted to a higher cache level (away from the core); though not flushed back to the main memory.

Sweet copy paste.

If this were managed dynamically at the OS level, it would be hideously slow; if it were managed in silicon, it would cause some strange performance inconsistencies with no way to turn it off.

It will most likely need to be part of the OS Kernel, similar to the Intel Clear Linux @efikkan mentions. It will be useful in fringe circumstances until very widespread hardware adoption.
 
Joined
Jan 31, 2011
Messages
238 (0.05/day)
Processor 3700X
Motherboard X570 TUF Plus
Cooling U12
Memory 32GB 3600MHz
Video Card(s) eVGA GTX970
Storage 512GB 970 Pro
Case CM 500L vertical
This is always a problem; it takes several years before a significant portion of the users have hardware support, and this may require multiple versions of software for backwards compatibility.

Unfortunately, most software is lagging >15 years behind in terms of ISA. Your OS (Windows or Linux), most libraries and applications are all compiled for x86-64/SSE2. Only a small selection of demanding applications use more modern features. This is unfortunate since we are missing out on a lot of free performance. The Linux distribution "Intel Clear Linux" demonstrates this: some core libraries and applications are optimized for modern ISA features and get a good portion of the performance improvements.


Really?
The OS scheduler lives on a completely different time scale (ms), while data residing in CPU caches usually stays there for nanoseconds to microseconds. I know the article mentions this, but I don't understand where it gets it from.

This sounds like something that can help synchronize data between cores, yet it probably has a very limited use case.

I think that's a good thing. The original x86-64 and SSE2 patents are on the verge of expiring. Supporting AVX for regular applications further locks in x86 control of the market, specifically Intel's. Sure, we now have AMD in the running again, but fortunes can quickly change, and we could have another lost decade led by Intel. Best to break Intel's dominance in every possible aspect while it's still feasible.
 
Joined
Oct 22, 2014
Messages
13,210 (3.81/day)
Location
Sunshine Coast
System Name Black Box
Processor Intel Xeon E3-1260L v5
Motherboard MSI E3 KRAIT Gaming v5
Cooling Tt tower + 120mm Tt fan
Memory G.Skill 16GB 3600 C18
Video Card(s) Asus GTX 970 Mini
Storage Kingston A2000 512Gb NVME
Display(s) AOC 24" Freesync 1m.s. 75Hz
Case Corsair 450D High Air Flow.
Audio Device(s) No need.
Power Supply FSP Aurum 650W
Mouse Yes
Keyboard Of course
Software W10 Pro 64 bit
Incoming mitigation in 3...2...1....
 
Joined
Jun 10, 2014
Messages
2,900 (0.81/day)
Processor AMD Ryzen 9 5900X ||| Intel Core i7-3930K
Motherboard ASUS ProArt B550-CREATOR ||| Asus P9X79 WS
Cooling Noctua NH-U14S ||| Be Quiet Pure Rock
Memory Crucial 2 x 16 GB 3200 MHz ||| Corsair 8 x 8 GB 1333 MHz
Video Card(s) MSI GTX 1060 3GB ||| MSI GTX 680 4GB
Storage Samsung 970 PRO 512 GB + 1 TB ||| Intel 545s 512 GB + 256 GB
Display(s) Asus ROG Swift PG278QR 27" ||| Eizo EV2416W 24"
Case Fractal Design Define 7 XL x 2
Audio Device(s) Cambridge Audio DacMagic Plus
Power Supply Seasonic Focus PX-850 x 2
Mouse Razer Abyssus
Keyboard CM Storm QuickFire XT
Software Ubuntu
I think that's a good thing. The original x86-64 and SSE2 patents are on the verge of expiring. Supporting AVX for regular applications further locks in x86 control of the market, specifically Intel's. Sure, we now have AMD in the running again, but fortunes can quickly change, and we could have another lost decade led by Intel. Best to break Intel's dominance in every possible aspect while it's still feasible.
AMD have full access to AVX features, and have even made major contributions to it such as FMA.

Intel is partly to blame for the AVX adoption rate though, as Celeron and Pentium models are still lacking support.

As hardware progresses but software fails to keep up, we are losing out on more and more performance potential. I wish that as Microsoft phases out 32-bit Windows, they would transition to having two 64-bit versions of Windows: a "current" version (e.g. Haswell/Bulldozer or Skylake/Zen ISA level) and a "legacy" version (x86-64/SSE2). Just recompiling the kernel and core libraries would easily yield ~5-10% of free performance, and much more in certain cases by writing optimized code using intrinsics. Such improvements would benefit pretty much all applications, and would help achieve better energy efficiency. For a company which spends thousands of developers on all kinds of gimmicks, this would be a much smarter initiative.
 
Joined
Sep 15, 2007
Messages
3,944 (0.65/day)
Location
Police/Nanny State of America
Processor OCed 5800X3D
Motherboard Asucks C6H
Cooling Air
Memory 32GB
Video Card(s) OCed 6800XT
Storage NVMees
Display(s) 32" Dull curved 1440
Case Freebie glass idk
Audio Device(s) Sennheiser
Power Supply Don't even remember
Joined
Feb 20, 2019
Messages
7,275 (3.86/day)
System Name Bragging Rights
Processor Atom Z3735F 1.33GHz
Motherboard It has no markings but it's green
Cooling No, it's a 2.2W processor
Memory 2GB DDR3L-1333
Video Card(s) Gen7 Intel HD (4EU @ 311MHz)
Storage 32GB eMMC and 128GB Sandisk Extreme U3
Display(s) 10" IPS 1280x800 60Hz
Case Veddha T2
Audio Device(s) Apparently, yes
Power Supply Samsung 18W 5V fast-charger
Mouse MX Anywhere 2
Keyboard Logitech MX Keys (not Cherry MX at all)
VR HMD Samsung Oddyssey, not that I'd plug it into this though....
Software W10 21H1, barely
Benchmark Scores I once clocked a Celeron-300A to 564MHz on an Abit BE6 and it scored over 9000.
So L3 cache is faster than L1 or L2 when another core is working on it?
It makes sense I guess, but I'd never thought of it like that before....
 
Joined
Jun 10, 2014
Messages
2,900 (0.81/day)
Processor AMD Ryzen 9 5900X ||| Intel Core i7-3930K
Motherboard ASUS ProArt B550-CREATOR ||| Asus P9X79 WS
Cooling Noctua NH-U14S ||| Be Quiet Pure Rock
Memory Crucial 2 x 16 GB 3200 MHz ||| Corsair 8 x 8 GB 1333 MHz
Video Card(s) MSI GTX 1060 3GB ||| MSI GTX 680 4GB
Storage Samsung 970 PRO 512 GB + 1 TB ||| Intel 545s 512 GB + 256 GB
Display(s) Asus ROG Swift PG278QR 27" ||| Eizo EV2416W 24"
Case Fractal Design Define 7 XL x 2
Audio Device(s) Cambridge Audio DacMagic Plus
Power Supply Seasonic Focus PX-850 x 2
Mouse Razer Abyssus
Keyboard CM Storm QuickFire XT
Software Ubuntu
So L3 cache is faster than L1 or L2 when another core is working on it?
It makes sense I guess, but I'd never thought of it like that before....
In a way, as L3 is accessible from other cores.
The memory/cache system inside CPUs is much more sophisticated than you think. When the CPU intends to read or write to an address, it fetches a whole cache line (64 bytes) from memory into L2, even if it's only going to write a single byte. There are various locks in place in case cores need to write to addresses within the same cache line, so I guess if a core can free this lock earlier, it can help in some cases. Otherwise, the other core needs to wait until the line is discarded.
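
To make the "whole cache line" point concrete, here is a small sketch in C that works out which line an address falls into. It assumes the 64-byte line size mentioned above, and the helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines, as on current mainstream x86 CPUs. */
#define CACHE_LINE_SIZE 64u

/* Index of the cache line that contains the given address. */
uintptr_t cache_line_of(const void *address)
{
    return (uintptr_t)address / CACHE_LINE_SIZE;
}

/* True if both addresses live in the same cache line. */
bool same_cache_line(const void *first, const void *second)
{
    return cache_line_of(first) == cache_line_of(second);
}
```

With a buffer aligned to 64 bytes, bytes 0 and 63 share a line while byte 64 starts the next one, so two cores writing to bytes 0 and 63 contend for the same line even though they never touch the same byte.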
 
Joined
Jan 8, 2017
Messages
8,926 (3.36/day)
System Name Good enough
Processor AMD Ryzen R9 7900 - Alphacool Eisblock XPX Aurora Edge
Motherboard ASRock B650 Pro RS
Cooling 2x 360mm NexXxoS ST30 X-Flow, 1x 360mm NexXxoS ST30, 1x 240mm NexXxoS ST30
Memory 32GB - FURY Beast RGB 5600 Mhz
Video Card(s) Sapphire RX 7900 XT - Alphacool Eisblock Aurora
Storage 1x Kingston KC3000 1TB 1x Kingston A2000 1TB, 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) LG UltraGear 32GN650-B + 4K Samsung TV
Case Phanteks NV7
Power Supply GPS-750C
A core reads and writes to the cache millions of times a second, so I can't see how this wouldn't come with a massive overhead. Not only that, but you don't really get to know what exactly happens to these cache lines, so I don't know how you'd even go about implementing this.
 
Joined
Aug 6, 2017
Messages
7,412 (3.03/day)
Location
Poland
System Name Purple rain
Processor 10.5 thousand 4.2G 1.1v
Motherboard Zee 490 Aorus Elite
Cooling Noctua D15S
Memory 16GB 4133 CL16-16-16-31 Viper Steel
Video Card(s) RTX 2070 Super Gaming X Trio
Storage SU900 128,8200Pro 1TB,850 Pro 512+256+256,860 Evo 500,XPG950 480, Skyhawk 2TB
Display(s) Acer XB241YU+Dell S2716DG
Case P600S Silent w. Alpenfohn wing boost 3 ARGBT+ fans
Audio Device(s) K612 Pro w. FiiO E10k DAC,W830BT wireless
Power Supply Superflower Leadex Gold 850W
Mouse G903 lightspeed+powerplay,G403 wireless + Steelseries DeX + Roccat rest
Keyboard HyperX Alloy SilverSpeed (w.HyperX wrist rest),Razer Deathstalker
Software Windows 10
Benchmark Scores A LOT
maybe it's just alder lake specific thing for big-small configuration ?
 
Joined
Apr 15, 2020
Messages
109 (0.07/day)
System Name Old friend
Processor 3550 Ivy Bridge x 39.0 Multiplier
Memory 2x8GB 2400 RipjawsX
Video Card(s) 970 Maxwell STRIX-GTX970-DC2OC-4GD5
yesteryear's performance
XD that was funny!

What I meant was that the Alder Lake platform (LGA 1700) is gonna be the Intel platform worth buying!

And of course anyone looking for a new PC or simply an upgrade should and must consider price-performance value of the products he/she plans to purchase. That's like the number one rule, no two ways about it.

Right now you can have a 3900X for the 10700K's price. I mean, it's crazy, the reality we live in today.
And being only 7% slower at 1080p in games according to
, the 3900X is the clear choice due to its higher core count.
 
Joined
Jun 10, 2014
Messages
2,900 (0.81/day)
Processor AMD Ryzen 9 5900X ||| Intel Core i7-3930K
Motherboard ASUS ProArt B550-CREATOR ||| Asus P9X79 WS
Cooling Noctua NH-U14S ||| Be Quiet Pure Rock
Memory Crucial 2 x 16 GB 3200 MHz ||| Corsair 8 x 8 GB 1333 MHz
Video Card(s) MSI GTX 1060 3GB ||| MSI GTX 680 4GB
Storage Samsung 970 PRO 512 GB + 1 TB ||| Intel 545s 512 GB + 256 GB
Display(s) Asus ROG Swift PG278QR 27" ||| Eizo EV2416W 24"
Case Fractal Design Define 7 XL x 2
Audio Device(s) Cambridge Audio DacMagic Plus
Power Supply Seasonic Focus PX-850 x 2
Mouse Razer Abyssus
Keyboard CM Storm QuickFire XT
Software Ubuntu
maybe it's just alder lake specific thing for big-small configuration ?
It will also be supported by server CPUs such as Sapphire Rapids, so I guess not. What makes you think it does?


For those wanting to learn more about CPU caches and how they impact software, I highly recommend watching this lecture. Watch at least ~32:47-43:00, but watching the whole thing is well worth an hour of your life. I recommend watching it if you're either interested in how CPU caches work in principle, or are a programmer. The principles explained here are essential for any performant code, and the subject is only becoming more relevant as CPUs are getting more powerful but more reliant on efficient cache usage.

The video explains why e.g. false sharing is a huge problem for scaling with multiple cores, and also states that there is no fix for it yet, but this new CLDEMOTE instruction might just help with some of these cases. Actual sharing has similar behavior as well, and while that will always have a significant cost tied to it, this new instruction might help reduce that cost and improve scaling.
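
For reference, the usual software-side workaround for false sharing today is data layout rather than any instruction: keep per-thread data on separate cache lines. This has nothing to do with CLDEMOTE itself. A minimal sketch in C11, assuming 64-byte lines, with made-up struct names:

```c
#include <stdalign.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: 64-byte cache lines. */
#define CACHE_LINE_SIZE 64

/* Two counters updated by two different threads, packed together. */
struct counters_packed {
    long a;
    long b;
};

/* Same counters, each forced onto its own cache line. */
struct counters_padded {
    alignas(CACHE_LINE_SIZE) long a;
    alignas(CACHE_LINE_SIZE) long b;
};

/* True if two member offsets fall in the same line, assuming the
 * struct itself starts on a cache-line boundary. */
bool offsets_share_line(size_t first, size_t second)
{
    return first / CACHE_LINE_SIZE == second / CACHE_LINE_SIZE;
}
```

In the packed layout both counters sit in one line, so every write by one thread invalidates the other thread's copy. The padded layout trades 128 bytes of memory for independent lines.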

What's interesting to me is whether this new instruction is targeted only for specific edge cases, or if this is something that helps a lot of common cases of unnecessary stalls in the CPU.

And of course anyone looking for a new PC or simply an upgrade should and must consider price-performance value of the products he/she plans to purchase. That's like the number one rule, no two ways about it.

Right now you can have a 3900X for the 10700K's price. I mean, it's crazy, the reality we live in today.
And being only 7% slower at 1080p in games according to
, the 3900X is the clear choice due to its higher core count.
That makes no sense. What matters is performance in the workloads relevant to the buyer, not that it has higher core count.

Whether 3% lower performance in 1440p (and 7% in 1080p, but who cares) matters is up to the end user. If you buy an $800 graphics card, then I could argue that $24 of that is "wasted" due to the CPU. The i7-10700K and i9-10900K models are quite a bit snappier in Photoshop and Premiere, on top of office work and web browsing. These are things which may be highly relevant to some buyers, and much more important than the core count.

When considering prices, the buyer must always look at their local prices. A much bigger problem for Comet Lake is availability. I have yet to see any of the K-models in stock, and many stores expect delivery in August. Of course, I can't know for sure if some are shipping or not. But if availability is practically nonexistent globally, then it's pretty much dead on arrival. It doesn't matter how good it is if you can't buy it.
 
Joined
Apr 15, 2020
Messages
109 (0.07/day)
System Name Old friend
Processor 3550 Ivy Bridge x 39.0 Multiplier
Memory 2x8GB 2400 RipjawsX
Video Card(s) 970 Maxwell STRIX-GTX970-DC2OC-4GD5
The 10700K TPU review indicates that in CPU tests, the 3900X is 6.6% faster in applications than the 10700K and 7% slower at 1080p in games, and the performance per dollar is 1% higher for the 10700K, assuming the $400 price tag (the TPU review says the Intel Core i7-10700K retails for around $400).


In Photoshop and Office the difference is less than 100 milliseconds. (58.1 milliseconds for Photoshop, 73.4 milliseconds for Word, 86.8 milliseconds for PowerPoint and 27.8 milliseconds for Excel)
And Premiere shows that the difference is less than 10%. (6%)

In web browser performance the difference is less than 10%. (8% for Google Octane and 6% for WebXPRT)
Actually in Mozilla Kraken, the 3900X is quite a bit snappier.


A CPU that has more threads enables more multitasking, although in order to do that, the applications in your daily usage must support multithreading, which modern apps do by default. And so the 3900X is the better option because not only does it have comparable performance and price compared to the 10700K, it also has, you guessed it, a higher thread count, which allows you to do more with your PC.
As for the comparable price globally, for example in Germany, you can purchase the 10700K for 429 euros and the 3900X for 419 euros. The purchasing decision is up to the buyer of course. Here are the links: (June 7th, 2020)

At the end of the day, everyone has their own preferences. We're not gonna force anything (or any idea) on anyone. Intel, AMD and NVIDIA are just brands, what gives them value is the performance they bring to the market. Intel enabled HT from i9 all the way to i3 with 10th Gen Comet Lake processors, an example of bringing performance to the market.
 