
Alleged Intel Sapphire Rapids Xeon Processor Image Leaks, Dual-Die Madness Showcased

Tech00

New Member
Joined
Dec 13, 2020
Messages
6 (0.00/day)
The good news is that many workloads are essentially vectorized in nature, but unfortunately the coding practices taught in school these days tell people to ignore the real problem (the data) and instead focus on building a complex "architecture" to hide and scatter the data and state all over the place, making SIMD (or any decent performance really) virtually impossible.

Vectorizing the data is always a good start, but it may not be enough. Whenever you see a loop iterating over a dense array, there is some theoretical potential there, but the solution may not be obvious. Often, the solution is to restructure the data so that similar data is grouped together (the data-oriented approach) rather than using the typical "world modelling"; the potential for SIMD then often becomes obvious. But as we know, a compiler can never do this; the developer still has to do the groundwork.
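To make the contrast concrete, here is a minimal sketch (assumed names, a toy particle update) of the "world modelling" layout versus the data-oriented layout the post describes. Only the grouped layout exposes the hot loop as one dense operation a SIMD unit can chew through:

```python
import numpy as np

# "World modelling" layout: an array of heterogeneous objects (AoS).
# The position data is scattered among unrelated fields, so neither a
# compiler nor NumPy can vectorize across particles.
particles_aos = [
    {"x": float(i), "vx": 0.5, "name": f"p{i}"} for i in range(1024)
]

def step_aos(particles, dt):
    # Scalar loop: one dict lookup and one add per particle.
    for p in particles:
        p["x"] += p["vx"] * dt

# Data-oriented layout: similar data grouped together (SoA).
# The same update is now one dense, trivially vectorizable operation.
xs = np.array([p["x"] for p in particles_aos])
vxs = np.array([p["vx"] for p in particles_aos])

def step_soa(xs, vxs, dt):
    xs += vxs * dt  # maps directly onto SIMD adds/FMAs over a dense array

step_aos(particles_aos, 0.1)
step_soa(xs, vxs, 0.1)
# Both layouts compute the same result; only the SoA form exposes it to SIMD.
assert np.allclose(xs, [p["x"] for p in particles_aos])
```

The point is exactly the one made above: this restructuring is a design decision about the data, not something a compiler can retrofit onto the object-at-a-time layout.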


I haven't had time to look into the details of AMX yet, but my gut feeling tells me that any 2D data should have some potential here, like images, video, etc.
Agree!
In my experience, performance does not come from a 10%-20% IPC improvement executing the same stupid sequence of instructions slightly faster, shaving a fifth of a clock cycle off here and there, or even from adding 50% more cores - that's not where it is.
Performance comes from:
1) understanding the nature of the problem you are trying to solve -> your options for organizing data and which algorithms to use. Vectorizing the problem quickly becomes the best way to get a stepwise improvement, e.g. 2x or more. Vectorizing is a more efficient way to parallelize than just adding more general cores (having more independent uncoordinated cores fighting for the same memory resources) - more cores work best when they work on unrelated tasks (different data), while vectorization works best to parallelize one task over one (potentially huge) data set and get that task done fastest.
2) understanding the instruction set of the processor you are targeting -> focusing on the instructions that can get the job done fastest (currently that is AVX-512)
3) understanding caching and memory access, and how data should be organized and accessed to maximize data throughput and feed the CPU optimally

If you do this with a Skylake or Cascade Lake AVX-512 capable CPU (e.g. Cascade Lake-X: Core i9-10980XE with 2 FMAs per core), the bottleneck is NOT the CPU but RAM (and the memory hierarchy). Focus shifts to how you read and write data as fast as possible, maximizing use of the different cache levels by doing as many ops as possible on the data while it's still hot in cache before you move on to the next subset of data (sized to fit the caches).
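The "do as many ops as possible while the data is hot" idea is essentially loop fusion plus cache blocking. A minimal sketch (the 256 KiB block size is an illustrative stand-in for an L2 size, not a measured value):

```python
import numpy as np

def multi_pass(data):
    # Three separate passes: each one streams the whole array through the
    # caches, re-fetching everything from RAM on large inputs.
    a = data * 2.0
    b = a + 1.0
    return np.sqrt(b)

def blocked_fused(data, block_elems=256 * 1024 // 8):
    # One pass: all three ops applied per cache-sized block of float64s
    # while that block is still hot, before moving on to the next block.
    out = np.empty_like(data)
    for start in range(0, len(data), block_elems):
        blk = data[start:start + block_elems]
        out[start:start + block_elems] = np.sqrt(blk * 2.0 + 1.0)
    return out

data = np.linspace(0.0, 100.0, 1_000_000)
assert np.allclose(multi_pass(data), blocked_fused(data))
```

Same arithmetic, same result; the blocked/fused version simply touches main memory once per element instead of three times, which is what matters when the memory subsystem, not the FMA units, is the bottleneck.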

Designing the data structures and algorithms around optimal data flow becomes the way you get maximum performance.

The point is that a Cascade Lake-X Core i9-10980XE is not the bottleneck, the memory subsystem is - you just can't feed this beast fast enough - it chews through everything you can throw at it, quickly utilizing the full memory bandwidth - and that is as good as it gets! Assuming you vectorize data and algorithms :)

I'll try to make the point again using the 10980XE as an example: due to the superscalar nature of the 10980XE, where each core can reorder (from independent instructions in the pipeline) and execute instructions in parallel (10 parallel but differently capable execution ports), each core has incredible parallelism as it is. BUT most importantly for what I'm talking about here, a 10980XE has 36 vector FMAs/ALUs (2 per core). Each core effectively has two 16-float-wide (or 16 x 32-bit-integer-wide) vector ALUs/FMAs that can perform 2 SIMD instructions in parallel (if programmed correctly, and that is the big if). With proper programming you effectively have 36 FMA or ALU vector cores (depending on float or integer).
From a pure number-crunching perspective that makes the 10980XE a 36-core, 16-float-wide number-crunching CPU (not 18)!
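The arithmetic behind that claim can be written out explicitly. This is a back-of-envelope sketch only: it assumes full dual-FMA ZMM throughput and ignores AVX-512 clock offsets and the memory bottleneck discussed above; the 3.0 GHz all-core clock is a hypothetical figure for illustration:

```python
# Back-of-envelope peak throughput for an 18-core, dual-FMA AVX-512 CPU
# such as the i9-10980XE.
cores = 18
fma_units_per_core = 2          # two 512-bit FMA ports per core
lanes = 512 // 32               # 16 single-precision floats per ZMM register
flops_per_fma = 2               # an FMA is a multiply plus an add

flops_per_cycle = cores * fma_units_per_core * lanes * flops_per_fma
print(flops_per_cycle)          # 1152 FLOPs per clock cycle, chip-wide

# At a (hypothetical) sustained 3.0 GHz all-core AVX-512 clock:
peak_tflops = flops_per_cycle * 3.0e9 / 1e12
print(round(peak_tflops, 2))    # ~3.46 single-precision TFLOPS
```

That 1152 FLOPs/cycle figure is exactly why the post argues the memory subsystem, not the cores, becomes the limit: no realistic DRAM setup can stream operands at that rate.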

So why aren't we seeing these gains reflected in most of today's apps (given that AVX-512 has been on the market for 3 years now, since the Core i9-7980XE, we should see more, right)?
Answer: developers have simply not taken the time to vectorize their code. Why is that? A few reasons I can think of:
1) They are under too much budget pressure, so they cannot go back and redesign their inefficient sequential code (commercial dev houses). This is inexcusably poor management! Why? Well, if there is a business case for faster CPUs (and that's the whole model for AMD and Intel, so yes, there is) then there is also a business case for faster s/w.
If this is the case, dev houses are basically relying on the 10%-20% gen-over-gen IPC improvement to continue, giving users minuscule, imperceptible performance increases gen over gen. BTW, it is really hard to shave another fraction of a clock cycle off each instruction when you have already been doing that gen over gen for decades.
2) Incompetent s/w developers (no, a Python class does not make you a Computer Scientist)
3) Lazy developers - and NO, a compiler will NOT compensate for laziness
4) Also Intel's fault - they could have spent time developing efficient API layers that use the new instructions the best way (encapsulating some of the most common algorithms and data structures in vectorized form), making them broadly available for free and "selling" them to the dev houses (I mean educating them, not making money from selling the APIs) - they did some of this, but sorry, MKL (Intel Math Kernel Library) is not enough.

This problem will be even bigger with AMX, because if AVX-512 requires a s/w redesign then AMX is even more of a redesign: you really need to "vectorize/matricize" and put data into vectors and/or matrices to get the maximum performance and throughput out of Sapphire Rapids. It will be even harder for a compiler to do this (compared to AVX-512) - not going to happen!
In summary, it will be down to the individual programmer to get 2x-4x performance out of the next gen of AVX-512 / AMX Intel CPUs. If not, we will just get another 10%-20% IPC improvement as usual and have lots of wasted silicon in AVX-512 and AMX.
The bigger point I'm making is that design and programming have to change to "vectorize/matricize" data and algorithms by default, NOT as some sort of obscure afterthought.
 
Last edited:

Tech00

New Member
Joined
Dec 13, 2020
Messages
6 (0.00/day)
Are CPUs using math tricks to shortcut calculations? Or is this possible only for analog systems like the human brain?
Modern CPUs use some math tricks, but not the ones you are referring to.

Instead of doing the full calculation the "proper" way (like the ones we learn in school):
1) for certain operations (e.g. square roots, divisions) the CPU can use a table lookup to get a close-enough answer and then use an algorithm (like a few iterations of the Newton-Raphson method) to get to the final result with enough significant digits. This is faster (potentially much faster) but requires more silicon. All of it is done in silicon (tables, Newton-Raphson).
2) there are specific instructions that do a sloppier (but good enough and very, very fast) job at lower precision, e.g. VRCP14PS, which computes approximate reciprocals of packed float values with an error of +/- 2^-14, i.e. it individually inverts each of the 16 floats in a vector with a throughput of 2 clock cycles - that is fast! = 8 float reciprocals per clock! Good luck trying to get that with any other CPU! This can be very useful when you do not need very high precision and 2^-14 will do just fine.
3) using various combinations of Taylor polynomial expansions, table lookups and the Newton-Raphson method to calculate transcendental functions like e^x, log x etc. fast (depending on the precision you want, different combinations are used) - all in silicon and very fast!
BTW, all of these techniques can be and are also used in s/w by the end programmer, in hand-tuned (assembly-coded) libraries (and sometimes even by the compiler!) to refine results that do not have good enough precision, in the fastest way possible.
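The "rough estimate + Newton-Raphson refinement" pattern can be sketched in a few lines of scalar Python. The rough estimate here (reciprocal of the nearest power of two) is my own crude stand-in for what an instruction like VRCP14PS produces in silicon, not how the hardware actually computes it:

```python
import math

def rough_reciprocal(a):
    # Stand-in for a low-precision hardware estimate: the reciprocal of the
    # nearest power of two, accurate only to within ~41%.
    return 2.0 ** -round(math.log2(a))

def refined_reciprocal(a, steps=4):
    # Newton-Raphson on f(x) = 1/x - a gives the update x <- x * (2 - a*x),
    # which needs only multiplies and subtracts (no division at all).
    # Each step roughly doubles the number of correct bits.
    x = rough_reciprocal(a)
    for _ in range(steps):
        x = x * (2.0 - a * x)
    return x

print(rough_reciprocal(3.0))    # 0.25 (crude guess)
print(refined_reciprocal(3.0))  # ≈ 1/3 after refinement
```

This is why the trick is a win in hardware and software alike: a tiny table (or bit trick) plus a handful of FMAs replaces a slow division, and you choose the number of refinement steps to match the precision you actually need.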

Modern superscalar CPUs also use other tricks:
Each core keeps a window of instructions in flight and sees what can be executed in parallel, i.e. sequential instructions can be executed in parallel if
1) there are enough free suitable ports (say there are two free ALU ports) AND
2) they are independent (one instruction does not depend on the result of a previous instruction that has not finished yet)
Say:
a = c*b;
f = c*e;
d = a*e;
The 3rd instruction (d = a*e) depends on the 1st having finished, BUT the 2nd instruction is independent of the 1st, so the 1st and 2nd can be executed in parallel if there are two free ALU ports.

Reordering:
Take the same example above but swap the last two rows
a = c*b;
d = a*e;
f = c*e;
If you do this sequentially, the 2nd instruction waits for the 1st to finish, BUT a modern core can see that there is a 3rd instruction coming that is independent of 1 and 2, so it will reorder the pipeline and execute 3 ahead of 2 while 2 is waiting. This is called out-of-order execution, and it is a great speedup - it also helps the programmer, who does not need to think about reordering instructions at the micro level (if they are close together). Compilers also help by doing this over a bigger window of instructions, and can even unroll loops to decouple dependencies, all helping the programmer.
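The loop-unrolling transformation mentioned at the end is worth seeing in source form. A naive reduction is one long dependency chain; unrolling with independent accumulators is exactly what lets a superscalar core (or a compiler) overlap the operations. In Python this only illustrates the shape of the transform, not the speedup:

```python
def dot_naive(xs, ys):
    acc = 0.0
    for x, y in zip(xs, ys):
        acc += x * y          # every add depends on the previous add
    return acc

def dot_unrolled(xs, ys):
    # Two independent accumulators decouple the chain: the acc0 and acc1
    # updates can issue in parallel on two free FMA/ALU ports.
    acc0 = acc1 = 0.0
    for i in range(0, len(xs) - 1, 2):
        acc0 += xs[i] * ys[i]
        acc1 += xs[i + 1] * ys[i + 1]
    if len(xs) % 2:           # handle a leftover element for odd lengths
        acc0 += xs[-1] * ys[-1]
    return acc0 + acc1

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [5.0, 4.0, 3.0, 2.0, 1.0]
print(dot_naive(xs, ys), dot_unrolled(xs, ys))  # both 35.0
```

In compiled code you would use four or eight accumulators to cover the latency of the FMA units; two is enough to show the idea.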

Loop cache (where the core keeps a whole inner loop in a dedicated loop cache) so it does not need to go out to lower, slower levels of cache. If you can fit the entire inner loop in the loop cache it is extremely fast - I think some implementations of Prime95 are able to do this with hand-coded assembly. This way the core has full visibility of what's going on end to end and can make very good choices about reordering, predicting branches and keeping all ports as utilized as possible.

There are many more tricks of course, like read ahead, branch prediction etc.
 
Last edited:
Joined
Jun 10, 2014
Messages
2,902 (0.80/day)
Processor AMD Ryzen 9 5900X ||| Intel Core i7-3930K
Motherboard ASUS ProArt B550-CREATOR ||| Asus P9X79 WS
Cooling Noctua NH-U14S ||| Be Quiet Pure Rock
Memory Crucial 2 x 16 GB 3200 MHz ||| Corsair 8 x 8 GB 1333 MHz
Video Card(s) MSI GTX 1060 3GB ||| MSI GTX 680 4GB
Storage Samsung 970 PRO 512 GB + 1 TB ||| Intel 545s 512 GB + 256 GB
Display(s) Asus ROG Swift PG278QR 27" ||| Eizo EV2416W 24"
Case Fractal Design Define 7 XL x 2
Audio Device(s) Cambridge Audio DacMagic Plus
Power Supply Seasonic Focus PX-850 x 2
Mouse Razer Abyssus
Keyboard CM Storm QuickFire XT
Software Ubuntu
In my experience, performance does not come from a 10%-20% IPC improvement executing the same stupid sequence of instructions slightly faster, shaving a fifth of a clock cycle off here and there, or even from adding 50% more cores - that's not where it is.
This depends on whether you are talking as a prospective buyer looking to buy the right piece of hardware, or if you're talking as a developer looking to write a piece of software.

I disagree about the importance of IPC. While SIMD is certainly very important for any heavy workload, most improvements that increase IPC also improve SIMD throughput: larger instruction windows, better prefetching, larger caches, larger cache bandwidth, larger register files, etc. If we are to get even more throughput and potentially more AVX units in the future, all of these things need to continue scaling to keep them fed.
Plus, these things help practically all code, which is why IPC improvements generally improve "everything", make systems more responsive, etc. IPC also helps heavily multithreaded workloads scale even further, so much higher IPC is something everyone should want.

Vectorizing is a more efficient way to parallelize than just adding more general cores (having more independent uncoordinated cores fighting for the same memory resources) - more cores work best when they work on unrelated tasks (different data), while vectorization works best to parallelize one task over one (potentially huge) data set and get that task done fastest.
Yes, many struggle to understand this.
Multiple cores are better for independent work chunks, so parallelization on a high level.
SIMD is better for repeating logic over parallel data, so parallelization on a low level.
Many applications use both - 7-Zip, WinRAR, video encoders, Blender, etc. - to get maximum performance. One approach doesn't solve everything optimally. Engineering is mostly about using the right tool for the job, not using one tool for everything.
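The two levels can be sketched together: threads split the work into independent chunks (high-level parallelism), while each chunk is processed as one dense vectorized operation (low-level, SIMD-friendly). This is a minimal sketch using NumPy and a thread pool; the chunk count of 8 is arbitrary:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # NumPy evaluates this as tight loops over contiguous data - the part
    # a SIMD unit is good at (and NumPy releases the GIL while doing it).
    return np.sum(chunk * chunk)

data = np.arange(1_000_000, dtype=np.float64)
chunks = np.array_split(data, 8)          # one independent chunk per worker

with ThreadPoolExecutor(max_workers=8) as pool:
    partials = list(pool.map(process_chunk, chunks))

total = sum(partials)
print(np.isclose(total, np.sum(data * data)))  # True: same result, both levels used
```

Cores handle the unrelated chunks; vectorization handles the repeated arithmetic inside each chunk. That division of labor is exactly the "right tool for the job" point above.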

If you do this with a Skylake or Cascade Lake AVX-512 capable CPU (e.g. Cascade Lake-X: Core i9-10980XE with 2 FMAs per core), the bottleneck is NOT the CPU but RAM (and the memory hierarchy).
Yes, the reason Skylake-X added more L2 cache and redesigned the L3 cache is the increased throughput of AVX-512. AVX-512 is a beast, and the memory subsystem probably still struggles to keep it saturated.

So why aren't we seeing these gains reflected in most of today's apps (given that AVX-512 has been on the market for 3 years now, since the Core i9-7980XE, we should see more, right)?

Answer: developers have simply not taken the time to vectorize their code. Why is that? A few reasons I can think of:
Well, beyond Intel's slow rollout of AVX-512, this probably comes down to the generally slow adoption of all SIMD. It seems like AVX2 is only now finally getting traction, just when we should be focusing on AVX-512.
Software is always slow at adoption. There is, to my knowledge, no substantial client software using AVX-512 yet, but once something like Photoshop, Blender, etc. adopts it, it will suddenly become a must for prosumers. Luckily, once software is vectorized, converting it to a new AVX version isn't hard.

To make things worse, Intel has been omitting AVX/FMA support from their Celeron and Pentium CPUs, which also affects which ISA features developers will prioritize.

4) Also Intel's fault - they could have spent time developing efficient API layers that use the new instructions the best way (encapsulating some of the most common algorithms and data structures in vectorized form), making them broadly available for free and "selling" them to the dev houses (I mean educating them, not making money from selling the APIs) - they did some of this, but sorry, MKL (Intel Math Kernel Library) is not enough.
I disagree here.
APIs and complete algorithms are generally not useful for the adoption of AVX in software (beyond some purer math workloads for academic purposes, which is where MKL is often used).
What developers need is something better than the intrinsic macros we have right now, something which enables us to write clean readable C code while getting optimal AVX-code after compilation.
The only code I want them to optimize for us would be the C standard library and existing libraries and OS core, which is something they kind of already have done for the "Intel Clear Linux" distro, but very little of this has been adopted by the respective software projects.


Reordering:
Take the same example above but swap the last two rows
a = c*b;
d = a*e;
f = c*e;
If you do this sequentially, the 2nd instruction waits for the 1st to finish, BUT a modern core can see that there is a 3rd instruction coming that is independent of 1 and 2, so it will reorder the pipeline and execute 3 ahead of 2 while 2 is waiting. This is called out-of-order execution, and it is a great speedup - it also helps the programmer, who does not need to think about reordering instructions at the micro level (if they are close together). Compilers also help by doing this over a bigger window of instructions, and can even unroll loops to decouple dependencies, all helping the programmer.
Today's high-performance x86 CPUs are huge superscalar out-of-order machines. You know, Itanium (EPIC) tried to solve this with an in-order design and explicitly parallel instructions, and failed miserably at it.
From the looks of it, Sapphire Rapids will continue the current trend and be significantly larger still than Sunny Cove.
Just looking at the instruction window, the trend has been:
Nehalem: 128, Sandy Bridge: 168 (+31%), Haswell: 192 (+14%), Skylake: 224 (+17%), Sunny Cove: 352 (+57%), Golden Cove: 600(?) (+70%)
There will probably be further improvements in micro-op cache, load/stores, decoding and possibly more execution ports, to squeeze out as much parallelization as possible.
 

HansRapad

New Member
Joined
Dec 4, 2020
Messages
23 (0.02/day)
Irrelevant when AMD's EPYC is beating anything Intel has to offer in the data centre and server market. The upgrade cycle is a long one, and Intel has entrenched itself in that market after years of domination. I believe Intel owns over 90% of this market. If AMD can capture even 10% or 15%, that is more than enough to keep AMD pumping out faster, more efficient and highly innovative processors. A slight increase in share is a huge win for the company - that's how much Intel controls that market. But it's slowly turning in AMD's favour, and the industry now sees AMD as a compelling and proven alternative. Things are going to look quite interesting 2-5 years from now.


AMD's Zen :D caught Intel with its knickers in a knot o_O And they haven't recovered, so you need to ask yourself why they created a job position named "Marketing Strategist", i.e. damage control :kookoo:
The last time I witnessed Intel doing weird things because they were behind in innovation and technology was back in the legendary Athlon 64 days, when AMD took the price/performance crown.

I don't see companies other than cloud computing or IT companies switching anytime soon.


No one wants to invest in revalidating production software end to end on AMD hardware, especially when the business processes are so complicated.
 
Joined
Mar 23, 2005
Messages
4,061 (0.58/day)
Location
Ancient Greece, Acropolis (Time Lord)
System Name RiseZEN Gaming PC
Processor AMD Ryzen 7 5800X @ Auto
Motherboard Asus ROG Strix X570-E Gaming ATX Motherboard
Cooling Corsair H115i Elite Capellix AIO, 280mm Radiator, Dual RGB 140mm ML Series PWM Fans
Memory G.Skill TridentZ 64GB (4 x 16GB) DDR4 3200
Video Card(s) ASUS DUAL RX 6700 XT DUAL-RX6700XT-12G
Storage Corsair Force MP500 480GB M.2 & MP510 480GB M.2 - 2 x WD_BLACK 1TB SN850X NVMe 1TB
Display(s) ASUS ROG Strix 34” XG349C 180Hz 1440p + Asus ROG 27" MG278Q 144Hz WQHD 1440p
Case Corsair Obsidian Series 450D Gaming Case
Audio Device(s) SteelSeries 5Hv2 w/ Sound Blaster Z SE
Power Supply Corsair RM750x Power Supply
Mouse Razer Death-Adder + Viper 8K HZ Ambidextrous Gaming Mouse - Ergonomic Left Hand Edition
Keyboard Logitech G910 Orion Spectrum RGB Gaming Keyboard
Software Windows 11 Pro - 64-Bit Edition
Benchmark Scores I'm the Doctor, Doctor Who. The Definition of Gaming is PC Gaming...
I don't see companies other than cloud computing or IT companies switching anytime soon.


No one wants to invest in revalidating production software end to end on AMD hardware, especially when the business processes are so complicated.
They will if the price is right. But I do see smaller companies moving towards AMD hardware. Worldwide revenues for servers running AMD CPUs were up 112.4% year over year. Will they topple Intel in this market? Probably not, but they will continue to eat away at it, slowly but surely.
 

HansRapad

New Member
Joined
Dec 4, 2020
Messages
23 (0.02/day)
They will if the price is right. But I do see smaller companies moving towards AMD hardware. Worldwide revenues for servers running AMD CPUs were up 112.4% year over year. Will they topple Intel in this market? Probably not, but they will continue to eat away at it, slowly but surely.

Sorry, I don't think you get it - it was never about price, it was about the complexity of the custom software they use. Big companies will pay any price, and do not want to deal with the risk involved in switching platforms just to save some money on server infrastructure.

Smaller companies have smaller business processes and use less complex software; in fact, they may even use cloud servers instead. Big companies? Nope.

To decide whether switching to the AMD platform is worth it, you need to answer several questions:
  • How large is the scope involved?
  • How much will it cost to revalidate the end-to-end business applications involved?
  • How much downtime do we expect during the transition?
  • How long is the transition period?
  • What is the mitigation plan for issues during the transition?
  • How much is the development timeline impacted by adapting to this new requirement?
  • What is the transition strategy?
  • What are the risks involved?
  • How much will be lost during the transition period?

It isn't as simple as "oh, this hardware is cheaper and faster, let's retest all of our software for AMD hardware, done and clear, happy ending". No, not at all: they need to retest everything - the integration of all software, all modules, all databases, everything. And if we are talking about non-IT companies (banks, FMCG, etc.), they don't want to deal with this.
 
Joined
Mar 21, 2016
Messages
2,198 (0.74/day)
I'm not disagreeing with what you say about some of these big companies - that's certainly valid and true - but at the same time, big companies have switched and/or will switch if the reasons are compelling enough. It's definitely a case-by-case basis. Just look at CUDA and how it has evolved over time: it pretty much created a new market segment that mostly didn't exist at the scope it does today, and it has grown substantially in scale over the last decade or so.
 
Joined
Mar 23, 2005
Messages
4,061 (0.58/day)
Location
Ancient Greece, Acropolis (Time Lord)
System Name RiseZEN Gaming PC
Processor AMD Ryzen 7 5800X @ Auto
Motherboard Asus ROG Strix X570-E Gaming ATX Motherboard
Cooling Corsair H115i Elite Capellix AIO, 280mm Radiator, Dual RGB 140mm ML Series PWM Fans
Memory G.Skill TridentZ 64GB (4 x 16GB) DDR4 3200
Video Card(s) ASUS DUAL RX 6700 XT DUAL-RX6700XT-12G
Storage Corsair Force MP500 480GB M.2 & MP510 480GB M.2 - 2 x WD_BLACK 1TB SN850X NVMe 1TB
Display(s) ASUS ROG Strix 34” XG349C 180Hz 1440p + Asus ROG 27" MG278Q 144Hz WQHD 1440p
Case Corsair Obsidian Series 450D Gaming Case
Audio Device(s) SteelSeries 5Hv2 w/ Sound Blaster Z SE
Power Supply Corsair RM750x Power Supply
Mouse Razer Death-Adder + Viper 8K HZ Ambidextrous Gaming Mouse - Ergonomic Left Hand Edition
Keyboard Logitech G910 Orion Spectrum RGB Gaming Keyboard
Software Windows 11 Pro - 64-Bit Edition
Benchmark Scores I'm the Doctor, Doctor Who. The Definition of Gaming is PC Gaming...
Sorry, I don't think you get it - it was never about price, it was about the complexity of the custom software they use. Big companies will pay any price, and do not want to deal with the risk involved in switching platforms just to save some money on server infrastructure.

Smaller companies have smaller business processes and use less complex software; in fact, they may even use cloud servers instead. Big companies? Nope.

To decide whether switching to the AMD platform is worth it, you need to answer several questions:
  • How large is the scope involved?
  • How much will it cost to revalidate the end-to-end business applications involved?
  • How much downtime do we expect during the transition?
  • How long is the transition period?
  • What is the mitigation plan for issues during the transition?
  • How much is the development timeline impacted by adapting to this new requirement?
  • What is the transition strategy?
  • What are the risks involved?
  • How much will be lost during the transition period?

It isn't as simple as "oh, this hardware is cheaper and faster, let's retest all of our software for AMD hardware, done and clear, happy ending". No, not at all: they need to retest everything - the integration of all software, all modules, all databases, everything. And if we are talking about non-IT companies (banks, FMCG, etc.), they don't want to deal with this.
I understand exactly what you are saying, which is why I said smaller companies tend to lean towards going with AMD - a cheaper, more power-efficient and faster option. I am fully aware of the complexities involved with rewriting custom software and how many issues and risks are involved in switching platforms. But it's not just about the savings.

My entire point was that AMD is increasing market share in a space that is more than 95% controlled by Intel. AMD is in fact gaining share - a slow and painful process - even with larger companies that will concede to taking risks on complex custom software platforms. I mean, for AMD to be up even about 1% in this space is huge. But it's a very slow climb, because AMD can't possibly secure enough capacity to feed this industry anyway. Will they ever overtake Intel in this space? Absolutely not, IMO. lol
 
Joined
Jan 6, 2013
Messages
349 (0.08/day)
Still not quite competitive enough with Milan for a 10 nm product, and Zen 3 EPYC is just months away. Ain't looking good, especially since it's rumored that AMD will stay at 64-core configurations. In other words, they think these are no threat.
How can you tell? What do you know about the Golden Cove core inside Sapphire Rapids? FYI, Zen 3 and Willow Cove are 5-10% apart in IPC, so if Golden Cove brings a 20% improvement then those 9 fewer cores aren't that important anymore. Try to keep an open mind about this.
 