Intel "Sapphire Rapids," "Alder Lake" and "Tremont" Feature CLDEMOTE Instruction

btarunr · Jun 5, 2020

Intel's three upcoming processor microarchitectures, namely the next-generation Xeon "Sapphire Rapids," Core "Alder Lake," and the low-power "Tremont" cores found in Atom, Pentium Silver, Celeron, and even Core Hybrid processors, will feature a new instruction, CLDEMOTE ("cache line demote"), that aims to improve processor cache performance. It is a means for the operating system to tell a processor core that a specific cache line doesn't need to loiter in a lower cache level (closer to the core) and can be demoted to a higher cache level (farther from the core), without being flushed back to main memory.

There are a handful of benefits to what CLDEMOTE does. Firstly, it frees up lower cache levels such as L1 and L2, which are smaller and dedicated to a CPU core, by pushing cache lines to the last-level cache (usually L3). Secondly, it lets data move quickly between cores: a line pushed to L3, which is shared between multiple cores, can be picked up by a neighboring core. Dr. John McCalpin from UT Austin wrote a detailed article on CLDEMOTE.
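As a rough illustration of how software might issue the hint, here is a minimal C sketch. It assumes a GCC or Clang build with `-mcldemote`, which exposes the `_cldemote()` intrinsic; without that flag the helper compiles to nothing. The names `demote_line` and `publish` and the 64-byte line size are assumptions made for the example, not anything Intel specifies. CLDEMOTE is only a hint, so the hardware is free to ignore it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__CLDEMOTE__)
#include <immintrin.h>   /* provides _cldemote() when built with -mcldemote */
#endif

#define CACHE_LINE 64    /* assumed line size, typical for current x86 parts */

/* Hint that the line containing p may move to a more distant cache level.
   Compiles to nothing when CLDEMOTE was not enabled at build time. */
static inline void demote_line(void *p)
{
#if defined(__CLDEMOTE__)
    _cldemote(p);
#else
    (void)p;
#endif
}

/* Hypothetical producer: fill a buffer, then hint every cache line it
   touched so a consumer on another core may find the data in shared cache.
   Returns the number of lines that were hinted. */
size_t publish(uint8_t *buf, size_t len, uint8_t value)
{
    assert(buf != NULL);
    memset(buf, value, len);
    if (len == 0)
        return 0;

    /* Walk line-aligned addresses so an unaligned buffer is fully covered. */
    uintptr_t line = (uintptr_t)buf & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t)buf + len;
    size_t hinted = 0;
    for (; line < end; line += CACHE_LINE) {
        demote_line((void *)line);
        hinted++;
    }
    return hinted;
}
```

The hint changes where the data sits, not what it contains, so the buffer reads back the same whether or not the instruction is emitted.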


Sunny and 75 · Jun 5, 2020

Alder Lake it is then.

Nephilim666 · Jun 5, 2020

This is great and I thank Intel for including the innovation, but as we know from the slow adoption of AVX512 it is dependent on developers to support these instructions.

cucker tarlson · Jun 5, 2020

Nephilim666 said:
This is great and I thank Intel for including the innovation, but as we know from the slow adoption of AVX512 it is dependent on developers to support these instructions.

this is a means for the operating system to tell a processor core that a specific content of a cache (a cache line), isn't needed to loiter around in a lower cache level (closer to the core), and can be demoted to a higher cache level (away from the core); though not flushed back to the main memory.

efikkan · Jun 5, 2020

Nephilim666 said:
This is great and I thank Intel for including the innovation, but as we know from the slow adoption of AVX512 it is dependent on developers to support these instructions.

This is always a problem; it takes several years before a significant portion of the users have hardware support, and this may require multiple versions of software for backwards compatibility.

Unfortunately, most software is lagging >15 years behind in terms of ISA. Your OS (Windows or Linux) and most libraries and applications are all compiled for x86-64/SSE2. Only a small selection of demanding applications use more modern features. This is unfortunate, since we are missing out on a lot of free performance. The Linux distribution "Intel Clear Linux" demonstrates this: some of its core libraries and applications are optimized for modern ISA features and see a good performance improvement.
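One way around shipping multiple builds is to pick a code path at run time. A minimal sketch, assuming GCC or Clang on x86-64 (the `target` attribute and `__builtin_cpu_supports` are compiler extensions); the function names are made up for the example, and on other platforms it simply falls back to the baseline version:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Baseline build: only assumes whatever the compiler targets by default. */
static uint64_t sum_baseline(const uint32_t *v, size_t n)
{
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

#if defined(__GNUC__) && defined(__x86_64__)
/* Same source, but the compiler is allowed to use AVX2 in this copy. */
static __attribute__((target("avx2"))) uint64_t sum_avx2(const uint32_t *v, size_t n)
{
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}
#endif

typedef uint64_t (*sum_fn)(const uint32_t *, size_t);

/* Choose an implementation based on what the running CPU reports. */
sum_fn pick_sum(void)
{
#if defined(__GNUC__) && defined(__x86_64__)
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
        return sum_avx2;
#endif
    return sum_baseline;
}

uint64_t sum_u32(const uint32_t *v, size_t n)
{
    sum_fn f = pick_sum();
    assert(f != NULL);
    return f(v, n);
}
```

Both paths return the same result; only the instructions used differ. A real program would cache the chosen pointer instead of probing on every call.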

cucker tarlson said:
this is a means for the operating system to tell a processor core that a specific content of a cache (a cache line), isn't needed to loiter around in a lower cache level (closer to the core), and can be demoted to a higher cache level (away from the core); though not flushed back to the main memory.

Really?
The OS scheduler lives on a completely different time scale (ms), while data residing in CPU caches usually stays there for nanoseconds to microseconds. I know the article mentions this, but I don't understand where it gets it from.

This sounds like something that can help synchronize data between cores, yet it probably has a very limited use case.

Nephilim666 · Jun 5, 2020

cucker tarlson said:
this is a means for the operating system to tell a processor core that a specific content of a cache (a cache line), isn't needed to loiter around in a lower cache level (closer to the core), and can be demoted to a higher cache level (away from the core); though not flushed back to the main memory.

Sweet copy paste.

If this was managed dynamically at an OS level it would be hideously slow; if it was managed in silicon it would cause some strange performance inconsistencies with no way to turn it off.

It will most likely need to be part of the OS kernel, similar to the Intel Clear Linux @efikkan mentions. It will be useful in fringe circumstances until very widespread hardware adoption.

jeremyshaw · Jun 5, 2020

efikkan said:
This is always a problem; it takes several years before a significant portion of the users have hardware support, and this may require multiple versions of software for backwards compatibility.

Unfortunately, most software is lagging >15 years behind in terms of ISA. Your OS (Windows or Linux) and most libraries and applications are all compiled for x86-64/SSE2. Only a small selection of demanding applications use more modern features. This is unfortunate, since we are missing out on a lot of free performance. The Linux distribution "Intel Clear Linux" demonstrates this: some of its core libraries and applications are optimized for modern ISA features and see a good performance improvement.

Really?
The OS scheduler lives on a completely different time scale (ms), while data residing in CPU caches usually stays there for nanoseconds to microseconds. I know the article mentions this, but I don't understand where it gets it from.

This sounds like something that can help synchronize data between cores, yet it probably has a very limited use case.

I think that's a good thing. Original x86-64 and SSE2 patents are on the verge of expiring. Supporting AVX for regular applications further locks in x86 control of the market, specifically Intel's. Sure, we now have AMD in the running again, but fortunes can quickly change, and we can have another lost decade, led by Intel. Best to break Intel's dominance in every possible aspect while it's still feasible.

Caring1 · Jun 5, 2020

Incoming mitigation in 3...2...1....

efikkan · Jun 5, 2020

jeremyshaw said:
I think that's a good thing. Original x86-64 and SSE2 patents are on the verge of expiring. Supporting AVX for regular applications further locks in x86 control of the market, specifically Intel's. Sure, we now have AMD in the running again, but fortunes can quickly change, and we can have another lost decade, led by Intel. Best to break Intel's dominance in every possible aspect while it's still feasible.

AMD have full access to AVX features, and have even made major contributions to it such as FMA.

Intel is partly to blame for the AVX adoption rate though, as Celeron and Pentium models are still lacking support.

As hardware is progressing but software fails to keep up, we are losing out on more and more performance potential. I wish that as Microsoft are phasing out 32-bit Windows they would transition to having two 64-bit versions of Windows: a "current" version (e.g. Haswell/Bulldozer or Skylake/Zen ISA level) and a "legacy" version (x86-64/SSE2). Just recompiling the kernel and core libraries would easily yield ~5-10% of free performance, and much more in certain cases by writing optimized code using intrinsics. Such improvements would benefit pretty much all applications, and would help achieve better energy efficiency. For a company which puts thousands of developers on all kinds of gimmicks, this would be a much smarter initiative.

TheGuruStud · Jun 5, 2020

Adc7dTPU said:
Alder Lake it is then.

Have fun with yesteryear's performance lol

Chrispy_ · Jun 5, 2020

So L3 cache is faster than L1 or L2 when another core is working on it?
It makes sense I guess, but I'd never thought of it like that before....

efikkan · Jun 5, 2020

Chrispy_ said:
So L3 cache is faster than L1 or L2 when another core is working on it?
It makes sense I guess, but I'd never thought of it like that before....

In a way, as L3 is accessible from other cores.
The memory/cache system inside CPUs is much more sophisticated than you think. When the CPU intends to read or write to an address, it fetches a whole cache line (64 bytes) from memory into L2, even if it's only going to write a single byte. There are various locks in place in case cores need to write to addresses within the same cache line, so I guess if a core can free this lock earlier, it can help in some cases. Otherwise it needs to wait until the line is discarded.
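To make the "whole cache line" part concrete, here is a small C sketch of the address arithmetic, assuming 64-byte lines as above. The helper names are made up for the example:

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE 64u   /* assumed line size in bytes */

/* The masking trick below only works for power-of-two line sizes. */
static_assert((CACHE_LINE & (CACHE_LINE - 1)) == 0, "line size must be a power of two");

/* Start address of the cache line that contains addr. */
uintptr_t line_base(uintptr_t addr)
{
    return addr & ~(uintptr_t)(CACHE_LINE - 1);
}

/* Nonzero if both addresses fall inside the same cache line. */
int same_line(uintptr_t a, uintptr_t b)
{
    return line_base(a) == line_base(b);
}
```

So a one-byte write to 0x103F touches the line that starts at 0x1000, while 0x1040 already belongs to the next line.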

Vya Domus · Jun 5, 2020

A core reads and writes to the cache millions of times a second; I can't see how this wouldn't come with a massive overhead. Not only that, but you don't really get to know what exactly happens to these cache lines, so I don't know how you'd even go about implementing this.

cucker tarlson · Jun 5, 2020

Maybe it's just an Alder Lake-specific thing for the big-small configuration?

Sunny and 75 · Jun 6, 2020

TheGuruStud said:
yesteryear's performance

XD that was funny!

What I meant was that the Alder Lake platform (LGA 1700) is gonna be the Intel platform worth buying!

And of course anyone looking for a new PC or simply an upgrade should and must consider price-performance value of the products he/she plans to purchase. That's like the number one rule, no two ways about it.

Right now you can have a 3900X for 10700K price; I mean, it's crazy the reality we live in today.
And being only 7% slower at 1080p in games according to the TPU review (Intel Core i7-10700K Review - Unlocked and Loaded, www.techpowerup.com), the 3900X is the clear choice due to its higher core count.

efikkan · Jun 6, 2020

cucker tarlson said:
Maybe it's just an Alder Lake-specific thing for the big-small configuration?

It will also be supported by server CPUs such as Sapphire Rapids, so I guess not. What makes you think it is?

For those wanting to learn more about CPU caches and how they impact software, I highly recommend watching this lecture. Watch at least ~32:47-43:00, but watching the whole thing is well worth an hour of your life. I recommend watching it if you're either interested in how CPU caches work in principle, or are a programmer. The principles explained here are essential for any performant code, and this is a subject only becoming more relevant as CPUs are getting more powerful but more reliant on efficient cache usage.

The video explains why e.g. false sharing is a huge problem for scaling with multiple cores, and also states that there is no fix for it yet, but this new CLDEMOTE instruction might just help with some of these cases. Actual sharing has a similar behavior as well, and while that will always have a significant cost tied to it, this new instruction might help reduce that cost and improve scaling.

What's interesting to me is whether this new instruction is targeted only at specific edge cases, or if this is something that helps a lot of common cases of unnecessary stalls in the CPU.
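For anyone unfamiliar with false sharing: the usual software workaround today is a layout change, not a new instruction. A minimal C11 sketch, assuming 64-byte cache lines (the struct names are just for illustration, and this does not use CLDEMOTE at all):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumed line size in bytes */

/* Two counters updated by different threads. They sit 8 bytes apart, so
   they can share one cache line and the cores keep stealing it from
   each other even though they never touch the same variable. */
struct counters_packed {
    uint64_t a;
    uint64_t b;
};

/* Same counters, each forced onto its own cache line. */
struct counters_padded {
    _Alignas(CACHE_LINE) uint64_t a;
    _Alignas(CACHE_LINE) uint64_t b;
};

/* Compile-time check that the padded layout really separates them. */
static_assert(offsetof(struct counters_padded, b) -
              offsetof(struct counters_padded, a) >= CACHE_LINE,
              "counters must not share a cache line");
```

The cost is memory: the padded struct takes 128 bytes instead of 16, which is why you only do this for data that is actually contended.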

Adc7dTPU said:
And of course anyone looking for a new PC or simply an upgrade should and must consider price-performance value of the products he/she plans to purchase. That's like the number one rule, no two ways about it.

Right now you can have a 3900X for 10700K price; I mean, it's crazy the reality we live in today.
And being only 7% slower at 1080p in games according to the TPU review (Intel Core i7-10700K Review - Unlocked and Loaded, www.techpowerup.com), the 3900X is the clear choice due to its higher core count.

That makes no sense. What matters is performance in the workloads relevant to the buyer, not that it has higher core count.

Whether 3% lower performance at 1440p (and 7% at 1080p, but who cares) matters is up to the end user. If you buy a $800 graphics card, then I could argue that $24 of that is "wasted" due to the CPU. The i7-10700K and i9-10900K models are quite a bit snappier in Photoshop and Premiere, on top of office work and web browsing. These are things which may be highly relevant to some buyers, and much more important than the core count.

When considering prices, the buyer must always look at their local prices. A much bigger problem for Comet Lake is availability. I have yet to see any of the K-models in stock, and many stores expect delivery in August. Of course, I can't know for sure if some are shipping or not. But if availability is practically nonexistent globally, then it's pretty much dead on arrival. It doesn't matter how good it is if you can't buy it.

Sunny and 75 · Jun 7, 2020

The 10700K TPU review (Intel Core i7-10700K Review - Unlocked and Loaded, www.techpowerup.com) indicates that in CPU tests the 3900X is 6.6% faster in applications than the 10700K and 7% slower at 1080p in games, and that performance per dollar is 1% higher for the 10700K, assuming the $400 price tag (the review says the Intel Core i7-10700K retails for around $400).

In Photoshop and Office the difference is less than 100 milliseconds (58.1 milliseconds for Photoshop, 73.4 milliseconds for Word, 86.8 milliseconds for PowerPoint and 27.8 milliseconds for Excel).
And Premiere shows that the difference is less than 10% (6%).

In web browser performance the difference is less than 10% (8% for Google Octane and 6% for WebXPRT).
Actually, in Mozilla Kraken the 3900X is quite a bit snappier.

A CPU with more threads enables more multitasking, although for that the applications in your daily usage must support multithreading, which modern apps do by default. So the 3900X is the better option because not only does it have comparable performance and price to the 10700K, it also has, you guessed it, a higher thread count, which allows you to do more with your PC.
As for the comparable price globally, in Germany, for example, you can purchase the 10700K for 429 euros and the 3900X for 419 euros. The purchasing decision is up to the buyer of course. Here are the links (June 7th, 2020):

Intel Core i7-10700K, 8C/16T, 3.80-5.10GHz, boxed without cooler (BX8070110700K) | price comparison, Geizhals Deutschland (geizhals.de)

AMD Ryzen 9 3900X, 12C/24T, 3.80-4.60GHz, boxed | price comparison, Geizhals Deutschland (geizhals.de)

At the end of the day, everyone has their own preferences. We're not gonna force anything (or any idea) on anyone. Intel, AMD and NVIDIA are just brands, what gives them value is the performance they bring to the market. Intel enabled HT from i9 all the way to i3 with 10th Gen Comet Lake processors, an example of bringing performance to the market.
