Friday, April 9th 2021

Intel's Upcoming Sapphire Rapids Server Processors to Feature up to 56 Cores with HBM Memory

Intel has just launched its Ice Lake-SP lineup of Xeon Scalable processors, featuring the new Sunny Cove CPU core design. Built on the 10 nm node, these processors represent Intel's first 10 nm product shipping for the enterprise market. However, another 10 nm enterprise product is already on the way: Intel is preparing the Sapphire Rapids generation of Xeon processors, and today we get to see more details about it. Thanks to an anonymous tip received by VideoCardz, we have a few more details, such as core counts, memory configurations, and connectivity options, and Sapphire Rapids is shaping up to be a very competitive platform. Do note that the slide is a bit older; however, it still contains useful information.

The lineup will top out at 56 cores with 112 threads, with that flagship carrying a TDP of 350 W, notably higher than its predecessors. Perhaps the most interesting part of the slide is the memory department. The new platform will debut the DDR5 standard, bringing higher capacities at higher speeds. Alongside the new memory standard, the chiplet design of Sapphire Rapids will bring HBM2E memory to CPUs, with up to 64 GB of it per socket/processor. The PCIe 5.0 standard will also be present, with 80 lanes, accompanied by four Intel UPI 2.0 links. Intel is also expected to extend the x86-64 ISA with AMX/TMUL extensions for faster INT8 and BFloat16 processing.
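For context on what AMX/TMUL's BFloat16 support means: bfloat16 keeps float32's sign bit and 8-bit exponent but truncates the mantissa to 7 bits, so converting is essentially dropping the low 16 bits of a float32. A minimal sketch in plain Python (illustrative only; the function names are ours, and real hardware typically rounds to nearest rather than truncating):

```python
import struct

def float32_to_bfloat16_bits(x: float) -> int:
    """Pack x as an IEEE float32 and keep only the upper 16 bits
    (sign + 8-bit exponent + 7-bit mantissa): truncated bfloat16."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def bfloat16_bits_to_float32(b: int) -> float:
    """Widen a 16-bit bfloat16 pattern back to float32 by zero-filling
    the 16 dropped mantissa bits."""
    (x,) = struct.unpack("<f", struct.pack("<I", b << 16))
    return x

# 3.140625 needs only a 7-bit mantissa, so it survives the round trip
# exactly; values with longer mantissas lose their low bits.
print(bfloat16_bits_to_float32(float32_to_bfloat16_bits(3.140625)))
```

The appeal for machine-learning workloads is that bfloat16 halves memory traffic while keeping float32's dynamic range, which is why AMX targets it alongside INT8.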
Source: VideoCardz
Add your own comment

29 Comments on Intel's Upcoming Sapphire Rapids Server Processors to Feature up to 56 Cores with HBM Memory

#1
dj-electric
There's no doubt that the jump to Sapphire Rapids (10nmSF, migration to PCIe 5 and DDR5, with HBM integrated into some chips) is a monumental jump in server capabilities.
I'm hoping consumer platforms will get most of it trickled down. The idea of HEDT for Intel died the moment first-gen Threadripper came out, pretty much.

Who knows.
#2
Chrispy_
350W.

10nm still looks broken to me, although there's no mention of clockspeeds at that power draw, so maybe they've ironed out enough kinks to be competitive.

These are datacenter parts. Performance per watt is king there, and AMD already has Zen 3 Epyc with 64C/128T and 2P support at 225 W. Embarrassingly cheap compared to Xeon Platinums, too.
#3
Unregistered
Judging by how Intel measures TDP, I'm already extremely worried about that 350 W.

You can do it, you can reach 64 cores, someday, preferably at lower power consumption too.
#4
watzupken
Chrispy_350W.

10nm still looks broken to me, although there's no mention of clockspeeds at that power draw, so maybe they've ironed out enough kinks to be competitive.

These are datacenter parts. Performance per watt is king there, and AMD already has Zen 3 Epyc with 64C/128T and 2P support at 225 W. Embarrassingly cheap compared to Xeon Platinums, too.
To be honest, I have always felt that Intel may have made considerable compromises to get 10 nm out the door, so the 10 nm we are seeing in the market now is probably worse than what was originally planned. Looking at the SuperFin used for Tiger Lake, on the surface it seems to have improved, allowing for higher clock speeds, but it also sounds like it's going down the path of 14 nm, i.e. feed it more power to gain higher clock speeds. So with Sapphire Rapids coming in at a rumoured TDP of 350 W, it likely shows that SuperFin is really not that super. AMD aside, I feel ARM will be the more potent competitor for Intel when it comes to data center CPUs.
AlexaJudging by how Intel measures TDP I'm already extremely worried of that 350W.
This is very true. Maybe that's why they need to immerse them in liquid to cool them now, instead of using traditional heatsinks and fans.
#5
HalfAHertz
Chrispy_350W.

10nm still looks broken to me, although there's no mention of clockspeeds at that power draw, so maybe they've ironed out enough kinks to be competitive.

These are datacenter parts. Performance per watt is king there, and AMD already has Zen 3 Epyc with 64C/128T and 2P support at 225 W. Embarrassingly cheap compared to Xeon Platinums, too.
It's that high because these are 99% two dies glued together with a DMI link/Foveros. It's just unreasonable to put 40 fat x86 cores on a single die, so Intel's only option is to make smaller dies and link them. Since they don't have an energy-efficient interconnect, you end up needing a ton of power.
#6
Chrispy_
@watzupken

I just spotted that AnandTech has reviewed 3rd Gen Xeon, and whilst it's not a turd, it's not looking good either. To say it's better than the dumpster fire of previous-gen Xeons is a veiled insult. Maybe they're approaching AMD's older Rome (Zen 2) Epyc lineup in performance and efficiency, so they're now only two years behind.

It looks like 10nm is still not good, but it's come far enough that it's now worth using over Skylake 14nm from 2017.

@HalfAHertz
This is Intel's first stab at glue in a long time and they're not very good at it. Reminds me of the power/efficiency scaling of Core2Quad :D
#7
dragontamer5788
It's my understanding that HBM2 does NOT have enough capacity to be useful for a typical datacenter workload.

I'm guessing that these are going to be intended for supercomputers, as a competitor to the A64Fx, and not as a typical datacenter CPU.
#8
TumbleGeorge
Hmm, what is this: Intel Optane 300 series (Crow Pass), up to 2.6X random access?
#9
Vya Domus
dragontamer5788It's my understanding that HBM2 does NOT have enough capacity to be useful for a typical datacenter workload.
It could just be used as another level of cache.
#10
ncrs
Chrispy_This is Intel's first stab at glue in a long time and they're not very good at it. Reminds me of the power/efficiency scaling of Core2Quad :D
The 9000 series of Xeon Scalable (Cascade Lake) used glue as well and is fairly recent.

The problem with those processors was that they only came from one source: Intel-built servers. The highest model, with 56 cores, required liquid cooling since it had a TDP of 400 W.
#11
dragontamer5788
Vya DomusIt could just be used as another level of cache.
Cache normally has lower latency and higher bandwidth.

HBM2 has maybe slightly higher latency than DDR4/DDR5, and higher bandwidth. That makes it complicated to reason about. There will be many cases where HBM2 slows down code (any latency-critical code) while speeding up others (bandwidth-critical code). As far as I'm aware, no one has ever figured out a way to predict ahead of time whether code is latency-bound or bandwidth-bound.

-------

One of the Xeon Phi models had HMC (a competitor to HBM) plus DDR4, and look how well that worked out. It's hard as hell to optimize programs on those kinds of split-memory systems. There were all sorts of issues figuring out whether or not something should be in the HMC or in the DDR4.

software.intel.com/content/www/us/en/develop/articles/intel-xeon-phi-processor-7200-family-memory-management-optimizations.html



It's not easy. Just give the Xeon Phi 7200 documents a look, and then imagine trying to write code against that.
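The latency-bound vs. bandwidth-bound distinction above can be sketched in miniature. This is plain, purely illustrative Python (the names are made up for this example; real memory behavior only shows up in compiled code with real caches in play): in the pointer chase, every load depends on the previous one, so latencies serialize, while the streaming sum's accesses are independent and can overlap:

```python
import random

N = 100_000

def streaming_sum(xs):
    """Bandwidth-style workload: independent, sequential accesses.
    Hardware can prefetch and overlap these, so throughput dominates."""
    total = 0
    for x in xs:
        total += x
    return total

def pointer_chase(next_idx, start, steps):
    """Latency-style workload: each index depends on the previous load
    (a random pointer chase), so memory latencies add up serially."""
    i = start
    for _ in range(steps):
        i = next_idx[i]  # the next load cannot issue until this one returns
    return i

data = list(range(N))
random.seed(0)
perm = data[:]
random.shuffle(perm)  # a random permutation gives worst-case locality
```

Two programs touching the same amount of memory can thus behave completely differently on HBM: the streaming case loves the extra bandwidth, while the chase only ever sees the latency.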
#12
AsRock
TPU addict
Chrispy_350W.

10nm still looks broken to me, although there's no mention of clockspeeds at that power draw, so maybe they've ironed out enough kinks to be competitive.

These are datacenter parts. Performance per watt is king there, and AMD already has Zen 3 Epyc with 64C/128T and 2P support at 225 W. Embarrassingly cheap compared to Xeon Platinums, too.
+++
#13
Vya Domus
dragontamer5788Cache normally has lower latency and higher bandwidth.

HBM2 has maybe slightly higher latency than DDR4/DDR5, and higher bandwidth. That makes it complicated to reason about. There will be many cases where HBM2 slows down code (any latency-critical code) while speeding up others (bandwidth-critical code). As far as I'm aware, no one has ever figured out a way to predict ahead of time whether code is latency-bound or bandwidth-bound.

-------

One of the Xeon Phi models had HMC (a competitor to HBM) plus DDR4, and look how well that worked out. It's hard as hell to optimize programs on those kinds of split-memory systems. There were all sorts of issues figuring out whether or not something should be in the HMC or in the DDR4.

software.intel.com/content/www/us/en/develop/articles/intel-xeon-phi-processor-7200-family-memory-management-optimizations.html

It's not easy. Just give the Xeon Phi 7200 documents a look, and then imagine trying to write code against that.
AMD did figure out a solution with HBCC, without having to resort to some kind of split-memory arrangement, and it was completely transparent to the application. It didn't really make sense back then on a GPU running graphics workloads, simply because it wasn't really required, but on a CPU processing a data set of several dozen or hundreds of GB it might. Caches are never guaranteed to provide a definitive improvement.

The thing is, if whatever you wrote is highly multithreaded and data-independent, latency typically doesn't matter.
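A transparent tiered arrangement like that can be sketched as a toy model. This is hypothetical Python with made-up names, not how HBCC (a page-based hardware mechanism) actually works: hot entries live in a small fast tier, and least-recently-used ones spill to a large slow tier, invisibly to the caller:

```python
from collections import OrderedDict

class TwoTierStore:
    """Toy model of a small fast tier (HBM-like) backed by a large slow
    tier (DDR-like). Hot keys stay fast; LRU keys are evicted to slow."""

    def __init__(self, fast_capacity):
        self.fast = OrderedDict()  # insertion/access order tracks recency
        self.slow = {}
        self.cap = fast_capacity

    def read(self, key):
        if key in self.fast:
            self.fast.move_to_end(key)  # mark as most recently used
            return self.fast[key]
        val = self.slow.pop(key)        # "miss": fetch from the slow tier
        self._insert_fast(key, val)     # and promote it
        return val

    def write(self, key, val):
        self.fast.pop(key, None)
        self.slow.pop(key, None)
        self._insert_fast(key, val)

    def _insert_fast(self, key, val):
        self.fast[key] = val
        if len(self.fast) > self.cap:
            old_key, old_val = self.fast.popitem(last=False)  # evict LRU
            self.slow[old_key] = old_val
```

For example, with `fast_capacity=2`, writing keys 0 through 3 leaves 2 and 3 in the fast tier and silently parks 0 and 1 in the slow one; a later read of 0 fetches and re-promotes it without the caller ever knowing which tier served it.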
#14
dragontamer5788
Vya DomusThe thing is, if whatever you wrote is highly multithreaded and data-independent, latency typically doesn't matter.
And how much enterprise crap is single-threaded Python crunching a Django template and then piping it to a single-threaded home-grown Bash script?

Lol. Single-threaded, highly dependent data being passed around, turning XML into protobufs, back into XML, and then converted into JSON just to pass 50 bytes of text around. Yeaaaahhhhhhhh. Better companies may have good programmers who write more intelligent code. But seriously: most code out there (even "enterprise" code) is absolutely unoptimized single-threaded crap.
#15
Steevo
dragontamer5788Cache normally has lower latency and higher bandwidth.

HBM2 has maybe slightly higher latency than DDR4/DDR5, and higher bandwidth. That makes it complicated to reason about. There will be many cases where HBM2 slows down code (any latency-critical code) while speeding up others (bandwidth-critical code). As far as I'm aware, no one has ever figured out a way to predict ahead of time whether code is latency-bound or bandwidth-bound.

-------

One of the Xeon Phi models had HMC (a competitor to HBM) plus DDR4, and look how well that worked out. It's hard as hell to optimize programs on those kinds of split-memory systems. There were all sorts of issues figuring out whether or not something should be in the HMC or in the DDR4.

software.intel.com/content/www/us/en/develop/articles/intel-xeon-phi-processor-7200-family-memory-management-optimizations.html

It's not easy. Just give the Xeon Phi 7200 documents a look, and then imagine trying to write code against that.
Uhhhh whut

Every console and every optimized program is coded and tuned for the performance of the hardware being used; an additional level of cache that is that close and finely tuned will boost performance compared to system memory, and can be tuned further once the specifics are known.

This is why the consoles can achieve higher performance than x86 running generic code written to fit all configurations.

Even if it nets a 10 percent performance boost, Intel needs it to stay competitive.
#16
tabascosauz
Chrispy_350W.

10nm still looks broken to me, although there's no mention of clockspeeds at that power draw, so maybe they've ironed out enough kinks to be competitive.

These are datacenter parts. Performance per watt is king there, and AMD already has Zen 3 Epyc with 64C/128T and 2P support at 225 W. Embarrassingly cheap compared to Xeon Platinums, too.
To be fair, I don't think Intel has general efficiency in mind when they say that 10 has "improved". They just mean that it's changed enough to behave much better at higher clocks. 10nm probably won't ever not be "broken" by efficiency standards so long as it's paired with Sunny Cove derivatives.
  • Second-gen 10nm in Sunny Cove clocked poorly at the top end, which placed 4.0GHz generally out of reach.
  • Third-gen 10SF in Willow Cove only improved that frequency scaling at the top end, relevant only really for consumer parts.
  • 10ESF for Willow Cove might improve by the same small margin efficiency-wise as 10SF, just increasing the clock envelope further.
The process hasn't changed much in the 2.0-3.5GHz range, which is where Xeons reside. Neither has the arch changed much. So add more cores (and more chiplets) like Intel is doing here, and power's just gonna go up - no surprises there.
#17
Chrispy_
tabascosauzTo be fair, I don't think Intel has general efficiency in mind when they say that 10 has "improved". They just mean that it's changed enough to behave much better at higher clocks. 10nm probably won't ever not be "broken" by efficiency standards so long as it's paired with Sunny Cove derivatives.
  • Second-gen 10nm in Sunny Cove clocked poorly at the top end, which placed 4.0GHz generally out of reach.
  • Third-gen 10SF in Willow Cove only improved that frequency scaling at the top end, relevant only really for consumer parts.
  • 10ESF for Willow Cove might improve by the same small margin efficiency-wise as 10SF, just increasing the clock envelope further.
The process hasn't changed much in the 2.0-3.5GHz range, which is where Xeons reside. Neither has the arch changed much. So add more cores (and more chiplets) like Intel is doing here, and power's just gonna go up - no surprises there.
That's true, I guess. There are (expensive) niche-use Xeons/Epyc that are built for low core counts, high clockspeed, and full cache - but yeah Servers generally run at 2.x GHz.

I was foolishly thinking this might be a sign that 10 nm desktop parts are on the way in 2021, but actually, low-clock server parts can be viable whilst the process node is still wholly useless for consumer products at 4 GHz+.

I actually wish Intel would go back to making fanless CPUs - the sub 5W Core-M range was great for ultraportables and they ran at 0.8-2.0GHz which would be plenty for a general-purpose laptop. My experience with Ice Lake 10nm laptops was that they cooked themselves when boosting but otherwise ran very efficiently. What if Intel made a 4C/8T CPU with a 5W envelope and max clockspeed of, say, 1.6GHz? I'd buy that.
#18
TheoneandonlyMrK
dragontamer5788Cache normally has lower latency and higher bandwidth.

HBM2 has maybe slightly higher latency than DDR4/DDR5, and higher bandwidth. That makes it complicated to reason about. There will be many cases where HBM2 slows down code (any latency-critical code) while speeding up others (bandwidth-critical code). As far as I'm aware, no one has ever figured out a way to predict ahead of time whether code is latency-bound or bandwidth-bound.

-------

One of the Xeon Phi models had HMC (a competitor to HBM) plus DDR4, and look how well that worked out. It's hard as hell to optimize programs on those kinds of split-memory systems. There were all sorts of issues figuring out whether or not something should be in the HMC or in the DDR4.

software.intel.com/content/www/us/en/develop/articles/intel-xeon-phi-processor-7200-family-memory-management-optimizations.html

It's not easy. Just give the Xeon Phi 7200 documents a look, and then imagine trying to write code against that.
I think you're being harsh on HBM latency; the chips are significantly closer, and the latency wasn't that far off anyway.
People are touting HBM near the CPU to help AI and ML applications specifically.
I would imagine the plan is to have tiered memory, personally: not to use it as an L4 cache, but rather as a special pool to use as required, depending on the application.

Given Intel always has a wide array of SKUs, it's relatively easy to imagine that HBM won't be on every SKU, since some applications might not require it.
#19
Aquinus
Resident Wat-man
dragontamer5788HBM2 has maybe slightly higher latency than DDR4/DDR5, and higher bandwidth. That makes it complicated to reason about. There will be many cases where HBM2 slows down code (any latency-critical code) while speeding up others (bandwidth-critical code). As far as I'm aware, no one has ever figured out a way to predict ahead of time whether code is latency-bound or bandwidth-bound.
That might not matter if the HBM is used as something like an eviction cache or to hold on to cache state from a context switch. For example, if you can quickly reload cache for a context switch, you could get faster performance than having to wait for the cache to get repopulated from the new task. Or even if you're processing a task that was getting processed on another core. Using HBM to preserve cache context would be a huge benefit for systems that are doing a lot of tasks in parallel, like with application servers.
#20
Vya Domus
dragontamer5788And how much enterprise crap is single-threaded Python crunching a Django template and then piping it to a single-threaded home-grown Bash script?

Lol. Single-threaded, highly dependent data being passed around, turning XML into protobufs, back into XML, and then converted into JSON just to pass 50 bytes of text around. Yeaaaahhhhhhhh. Better companies may have good programmers who write more intelligent code. But seriously: most code out there (even "enterprise" code) is absolutely unoptimized single-threaded crap.
Despite that, these companies race each other to create platforms that offer as many cores per node as possible, because that matters more. Who cares if your single-threaded Python crap runs 5% slower if you can get 30% more instances running per socket, for example?

And whatever gets written in that way is probably not something performance critical anyway.
#21
qcmadness
tabascosauzTo be fair, I don't think Intel has general efficiency in mind when they say that 10 has "improved". They just mean that it's changed enough to behave much better at higher clocks. 10nm probably won't ever not be "broken" by efficiency standards so long as it's paired with Sunny Cove derivatives.
  • Second-gen 10nm in Sunny Cove clocked poorly at the top end, which placed 4.0GHz generally out of reach.
  • Third-gen 10SF in Willow Cove only improved that frequency scaling at the top end, relevant only really for consumer parts.
  • 10ESF for Willow Cove might improve by the same small margin efficiency-wise as 10SF, just increasing the clock envelope further.
The process hasn't changed much in the 2.0-3.5GHz range, which is where Xeons reside. Neither has the arch changed much. So add more cores (and more chiplets) like Intel is doing here, and power's just gonna go up - no surprises there.
At least Willow Cove increased IPC by 15-20%.
#22
dragontamer5788
TheoneandonlyMrKI think you're being harsh on HBM latency; the chips are significantly closer, and the latency wasn't that far off anyway.
HBM and DDR4/DDR5 latency are mainly governed by the physics of DRAM. Being forced to precharge the sense amps and activate the row (RAS) before finally being able to CAS the data over is just a lot of latency. And those steps must be done on any DRAM, be it HBM, DDR4/DDR5, or GDDR6X. It's a big reason why SRAM caches exist and why DRAM caches (even on-chip eDRAM caches) remain unpopular.

In fact, most DDR4/HBM/GDDR6X "latency improvements" aren't actually improving latency at all. They're just hiding latency behind more and more parallel requests. See DDR4 bank groups. They're up to 32 parallel bank groups (and therefore 32 parallel PRE-RAS-CAS sequences) per chip in DDR4, IIRC, and that will only go up.

Whenever we start seeing eDRAM/DRAM as a caching layer (e.g. Xbox 360, Xeon Phi 7200), we suddenly get an influx of frustrated programmers who have to deal with the reality of the architecture. You're simply not going to get much better than the ~50 ns or so of latency on DRAM. Too many steps need to be done per request at the fundamental physics layer.

SRAM (typical L3 / L2 / L1 caches) just doesn't have to deal with any of those steps at all. So adding those caches and/or making them bigger is simple and obvious.
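That ~50 ns floor falls straight out of the timing numbers. A back-of-the-envelope check (illustrative figures for a common DDR4-3200 CL22 JEDEC bin; memory-controller and queuing overhead come on top of the DRAM itself):

```python
def row_miss_latency_ns(t_rp, t_rcd, cl, data_rate_mt_s):
    """Worst-case access to a closed bank: precharge (tRP) + row activate
    (tRCD) + column read (CL). Timings are in command-clock cycles, and
    the command clock runs at half the transfer rate, so one cycle lasts
    2000 / data_rate ns."""
    cycle_ns = 2000.0 / data_rate_mt_s
    return (t_rp + t_rcd + cl) * cycle_ns

# DDR4-3200 CL22-22-22: (22 + 22 + 22) * 0.625 ns = 41.25 ns inside the
# DRAM, landing around ~50 ns once the controller round trip is added.
print(row_miss_latency_ns(22, 22, 22, 3200))
```

The same arithmetic for, say, DDR5-4800 CL40-39-39 gives about 49 ns: the raw latency floor barely moves across generations, which is the point above.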
#23
Aquinus
Resident Wat-man
dragontamer5788SRAM (typical L3 / L2 / L1 caches) just doesn't have to deal with any of those steps at all. So adding those caches and/or making them bigger is simple and obvious.
Except they make the die bigger and require more power for the same capacity. It's simpler, but it's also more expensive in several regards.
#24
TheoneandonlyMrK
dragontamer5788HBM and DDR4/DDR5 latency are mainly governed by the physics of DRAM. Being forced to precharge the sense amps and activate the row (RAS) before finally being able to CAS the data over is just a lot of latency. And those steps must be done on any DRAM, be it HBM, DDR4/DDR5, or GDDR6X. It's a big reason why SRAM caches exist and why DRAM caches (even on-chip eDRAM caches) remain unpopular.

In fact, most DDR4/HBM/GDDR6X "latency improvements" aren't actually improving latency at all. They're just hiding latency behind more and more parallel requests. See DDR4 bank groups. They're up to 32 parallel bank groups (and therefore 32 parallel PRE-RAS-CAS sequences) per chip in DDR4, IIRC, and that will only go up.

Whenever we start seeing eDRAM/DRAM as a caching layer (e.g. Xbox 360, Xeon Phi 7200), we suddenly get an influx of frustrated programmers who have to deal with the reality of the architecture. You're simply not going to get much better than the ~50 ns or so of latency on DRAM. Too many steps need to be done per request at the fundamental physics layer.

SRAM (typical L3/L2/L1 caches) just doesn't have to deal with any of those steps at all. So adding those caches and/or making them bigger is simple and obvious.
Look, the server industry these chips are aimed at can afford the effort, or they won't buy; and oneAPI isn't shit. There are many variables, and more yet to be disclosed, that could influence the viability of these chips. And I read about all of that way back too, when they were released; I'm aware of how the hardware works and is made, or I wouldn't be commenting on it.
#25
JayN
Any idea if the HBM will be stacked on top? They connected HBM off to the sides with EMIB in the Xe-HPC package, but I haven't seen anything similar in the leaked delidded Sapphire Rapids photos.

Also... the Data Streaming Accelerator is an interesting addition. Its spec says it supports Optane. There is also an operation for flushing processor cache.

I wonder if the DSA is being used to maintain the processor cache coherency with accelerator memory when CXL bias ownership is flipped.