
Intel's Upcoming Sapphire Rapids Server Processors to Feature up to 56 Cores with HBM Memory

AleksandarK

News Editor
Staff member
Joined
Aug 19, 2017
Messages
2,189 (0.91/day)
Intel has just launched its Ice Lake-SP lineup of Xeon Scalable processors, featuring the new Sunny Cove CPU core design. Built on the 10 nm node, these processors are Intel's first shipping 10 nm product designed for enterprise. However, another 10 nm product is on the way for enterprise users. Intel is already preparing the Sapphire Rapids generation of Xeon processors, and today we get to see more details about it. Thanks to an anonymous tip that VideoCardz received, we have a few more details such as core count, memory configurations, and connectivity options, and Sapphire Rapids is shaping up to be a very competitive platform. Do note that the slide is a bit older; however, it contains useful information.

The lineup will top out at 56 cores with 112 threads, with this processor carrying a TDP of 350 Watts, notably higher than its predecessors. Perhaps one of the most interesting notes from the slide concerns memory. The new platform will debut the DDR5 standard, bringing higher capacities and higher speeds. Along with the new memory standard, the chiplet design of Sapphire Rapids will bring HBM2E memory to CPUs, with up to 64 GB of it per socket. The PCIe 5.0 standard will also be present with 80 lanes, accompanied by four Intel UPI 2.0 links. Intel is also expected to extend the x86-64 instruction set here with AMX/TMUL extensions for better INT8 and BFloat16 processing.


 
Joined
Aug 13, 2010
Messages
5,379 (1.08/day)
There's no doubt that the jump to Sapphire Rapids, with 10nmSF, the migration to PCIe 5 and DDR5, and HBM integrated into some chips, is a monumental jump in server capabilities.
I'm hoping most of it will trickle down to consumer platforms. The idea of HEDT for Intel pretty much died the moment first-gen Threadripper came out.

Who knows.
 
Joined
Feb 20, 2019
Messages
7,188 (3.86/day)
System Name Bragging Rights
Processor Atom Z3735F 1.33GHz
Motherboard It has no markings but it's green
Cooling No, it's a 2.2W processor
Memory 2GB DDR3L-1333
Video Card(s) Gen7 Intel HD (4EU @ 311MHz)
Storage 32GB eMMC and 128GB Sandisk Extreme U3
Display(s) 10" IPS 1280x800 60Hz
Case Veddha T2
Audio Device(s) Apparently, yes
Power Supply Samsung 18W 5V fast-charger
Mouse MX Anywhere 2
Keyboard Logitech MX Keys (not Cherry MX at all)
VR HMD Samsung Oddyssey, not that I'd plug it into this though....
Software W10 21H1, barely
Benchmark Scores I once clocked a Celeron-300A to 564MHz on an Abit BE6 and it scored over 9000.
350W.

10nm still looks broken to me, although there's no mention of clockspeeds at that power draw, so maybe they've ironed out enough kinks to be competitive.

These are datacenter parts. Performance per Watt is king there and AMD already has Zen3 Epyc with 64C/128T and 2P support at 225W. Embarrassingly cheap compared to Xeon Platinums, too.
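
For a rough sense of scale, here is the watts-per-core arithmetic using only the figures quoted in this thread (56C at a rumoured 350W versus 64C at 225W). It's a back-of-the-envelope sketch: TDP is a vendor rating rather than measured draw, and it says nothing about clocks or per-core performance.

```python
# Back-of-the-envelope TDP per core, using only figures quoted in this thread.
# TDP is a vendor rating, not measured power draw, so treat this as a rough guide.

def watts_per_core(tdp_watts: float, cores: int) -> float:
    """Return the rated TDP divided evenly across all cores."""
    return tdp_watts / cores

parts = {
    "Sapphire Rapids (rumoured)": (350, 56),
    "Zen3 Epyc (as quoted above)": (225, 64),
}

for name, (tdp, cores) in parts.items():
    print(f"{name}: {watts_per_core(tdp, cores):.2f} W/core")
```

By this crude measure the rumoured part budgets a bit under twice the power per core, which only matters if the extra power doesn't buy matching performance.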
 
Deleted member 205776

Guest
Judging by how Intel measures TDP I'm already extremely worried about that 350W.

You can do it, you can reach 64 cores, someday, preferably at less power consumption too.
 
Joined
Mar 28, 2020
Messages
1,632 (1.12/day)
350W.

10nm still looks broken to me, although there's no mention of clockspeeds at that power draw, so maybe they've ironed out enough kinks to be competitive.

These are datacenter parts. Performance per Watt is king there and AMD already has Zen3 Epyc with 64C/128T and 2P support at 225W. Embarrassingly cheap compared to Xeon Platinums, too.
To be honest, I always feel that Intel may have made considerable compromises to get 10nm out the door, so the 10nm we are seeing in the market now is probably worse than what was originally planned. Looking at the SuperFin used for Tiger Lake, on the surface it seems to have improved, allowing for higher clock speeds, but it also sounds like it's going down the path of 14nm, i.e. feed it more power to gain higher clock speed. So with Sapphire Rapids coming in with a rumoured TDP of 350W, it likely shows that SuperFin is really not that super. AMD aside, I feel ARM will be the more potent competitor for Intel when it comes to data center CPUs.

Judging by how Intel measures TDP I'm already extremely worried about that 350W.
This is very true. Maybe that's why they now need liquid immersion for cooling, instead of traditional heatsink fans.
 
Joined
May 4, 2009
Messages
1,970 (0.36/day)
Location
Bulgaria
System Name penguin
Processor R7 5700G
Motherboard Asrock B450M Pro4
Cooling Some CM tower cooler that will fit my case
Memory 4 x 8GB Kingston HyperX Fury 2666MHz
Video Card(s) IGP
Storage ADATA SU800 512GB
Display(s) 27' LG
Case Zalman
Audio Device(s) stock
Power Supply Seasonic SS-620GM
Software win10
350W.

10nm still looks broken to me, although there's no mention of clockspeeds at that power draw, so maybe they've ironed out enough kinks to be competitive.

These are datacenter parts. Performance per Watt is king there and AMD already has Zen3 Epyc with 64C/128T and 2P support at 225W. Embarrassingly cheap compared to Xeon Platinums, too.
It's that high because these are almost certainly (99%) two dies glued together with a DMI link/Foveros. It's just unreasonable to put 40 fat x86 cores on a single die, so Intel's only option is to make smaller dies and link them. Since they don't have an energy-efficient interconnect, you end up needing a ton of power.
 
Joined
Feb 20, 2019
Messages
7,188 (3.86/day)
@watzupken

I just spotted that Anandtech have reviewed 3rd Gen Xeon and whilst it's not a turd, yeah it's not looking good either. To say it's better than the dumpster fire of previous-gen Xeons is a veiled insult. Maybe they're approaching AMD's older Rome (Zen2) Epyc lineup in performance and efficiency, so they're now only 2 years behind.

It looks like 10nm is still not good, but it's come far enough that it's now worth using over Skylake 14nm from 2017.

@HalfAHertz
This is Intel's first stab at glue in a long time and they're not very good at it. Reminds me of the power/efficiency scaling of Core2Quad :D
 
Joined
Apr 24, 2020
Messages
2,517 (1.75/day)
It's my understanding that HBM2 does NOT have enough capacity to be useful to a typical datacenter workload.

I'm guessing that these are going to be intended for supercomputers, as a competitor to the A64Fx, and not as a typical datacenter CPU.
 
Joined
Sep 1, 2020
Messages
2,015 (1.55/day)
Location
Bulgaria
Hmm, what is this: Intel Optane 300 series "Crow Pass", up to 2.6X random access?
 
Joined
Jan 8, 2017
Messages
8,860 (3.36/day)
System Name Good enough
Processor AMD Ryzen R9 7900 - Alphacool Eisblock XPX Aurora Edge
Motherboard ASRock B650 Pro RS
Cooling 2x 360mm NexXxoS ST30 X-Flow, 1x 360mm NexXxoS ST30, 1x 240mm NexXxoS ST30
Memory 32GB - FURY Beast RGB 5600 Mhz
Video Card(s) Sapphire RX 7900 XT - Alphacool Eisblock Aurora
Storage 1x Kingston KC3000 1TB 1x Kingston A2000 1TB, 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) LG UltraGear 32GN650-B + 4K Samsung TV
Case Phanteks NV7
Power Supply GPS-750C
It's my understanding that HBM2 does NOT have enough capacity to be useful to a typical datacenter workload.
It could just be used as another level of cache.
 
Joined
Jun 29, 2018
Messages
444 (0.21/day)
This is Intel's first stab at glue in a long time and they're not very good at it. Reminds me of the power/efficiency scaling of Core2Quad :D
The 9000 series of Xeon Scalable (Cascade Lake) is using glue as well and is fairly recent:


The problem with those processors was that they only came from one source - Intel-built servers. The highest model having 56 cores required liquid cooling since it had a TDP of 400W:
 
Joined
Apr 24, 2020
Messages
2,517 (1.75/day)
It could just be used as another level of cache.

Cache normally has lower latency and higher bandwidth.

HBM2 has perhaps slightly higher latency than DDR4 / DDR5, but higher bandwidth. That makes it complicated to reason about. There will be many cases where HBM2 slows code down (anything latency-critical) while speeding up other cases (bandwidth-critical code). As far as I'm aware, no one has ever figured out a way to predict ahead of time whether code is latency bound or bandwidth bound.
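
To make the two cases concrete, here's a toy model of a dependent pointer chase versus a streaming read. It's only a sketch: the latency and bandwidth numbers are hypothetical placeholders picked to match "slightly more latency, a lot more bandwidth", not measurements of any real part, and real code usually sits somewhere between the two extremes.

```python
# Toy model: a dependent pointer chase vs. a streaming read on two memory types.
# All memory figures below are hypothetical placeholders, not measurements.

from dataclasses import dataclass


@dataclass
class Memory:
    name: str
    latency_ns: float     # time for one dependent access
    bandwidth_gbs: float  # sustained bandwidth in GB/s (1 GB/s == 1 byte/ns)


def pointer_chase_ns(mem: Memory, hops: int) -> float:
    """Each hop must wait for the previous one, so latency dominates."""
    return hops * mem.latency_ns


def stream_ns(mem: Memory, n_bytes: int) -> float:
    """Independent accesses overlap, so bandwidth dominates."""
    return n_bytes / mem.bandwidth_gbs


ddr = Memory("DDR (assumed)", latency_ns=90.0, bandwidth_gbs=200.0)
hbm = Memory("HBM (assumed)", latency_ns=110.0, bandwidth_gbs=800.0)

for mem in (ddr, hbm):
    chase_ms = pointer_chase_ns(mem, hops=1_000_000) / 1e6
    stream_ms = stream_ns(mem, n_bytes=1_000_000_000) / 1e6
    print(f"{mem.name}: chase {chase_ms:.1f} ms, stream {stream_ms:.2f} ms")
```

With these made-up numbers the chase gets slower on HBM while the stream gets faster, which is exactly the split that makes it hard to call ahead of time.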

-------

One of the Xeon Phi models had HMC (a competitor to HBM) + DDR4. And look how well that was figured out. It's hard as hell to optimize programs on those kinds of split-memory systems. There were all sorts of issues figuring out whether something should be in the HMC or on the DDR4.


1617989893575.png


It's not easy. Just give the Xeon Phi 7200 documents a look, and then imagine trying to write code against that.
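
To give a flavour of the placement problem, here's a minimal greedy sketch: rank buffers by traffic per GB and fill the fast pool first. Everything in it is made up for illustration (buffer names, sizes, traffic figures, and the `place_buffers` helper); the 64 GB capacity is just borrowed from the HBM2E figure in the article. It is not how any real allocator or the Xeon Phi tooling works.

```python
# Greedy placement sketch for a two-tier memory system (fast pool + DDR).
# Buffer names, sizes and traffic figures are hypothetical, for illustration only.

def place_buffers(buffers, fast_capacity_gb):
    """buffers: list of (name, size_gb, traffic_gb_per_s) tuples.

    Places the buffers with the most traffic per GB into the fast pool first;
    anything that no longer fits falls back to DDR.
    """
    placement = {}
    free_gb = fast_capacity_gb
    ranked = sorted(buffers, key=lambda b: b[2] / b[1], reverse=True)
    for name, size_gb, _traffic in ranked:
        if size_gb <= free_gb:
            placement[name] = "fast"
            free_gb -= size_gb
        else:
            placement[name] = "ddr"
    return placement


buffers = [
    ("lookup_table", 8, 40),   # 5 GB/s of traffic per GB
    ("matrix_a", 48, 480),     # 10 GB/s per GB
    ("matrix_b", 48, 192),     # 4 GB/s per GB
    ("log_ring", 4, 1),        # 0.25 GB/s per GB
]

print(place_buffers(buffers, fast_capacity_gb=64))
```

Even this toy version needs per-buffer traffic numbers you rarely have up front, and it ignores latency sensitivity and access patterns that change at runtime, which is where the real pain starts.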
 

AsRock

TPU addict
Joined
Jun 23, 2007
Messages
18,851 (3.08/day)
Location
UK\USA
Processor AMD 3900X \ AMD 7700X
Motherboard ASRock AM4 X570 Pro 4 \ ASUS X670Xe TUF
Cooling D15
Memory Patriot 2x16GB PVS432G320C6K \ G.Skill Flare X5 F5-6000J3238F 2x16GB
Video Card(s) eVga GTX1060 SSC \ XFX RX 6950XT RX-695XATBD9
Storage Sammy 860, MX500, Sabrent Rocket 4 Sammy Evo 980 \ 1xSabrent Rocket 4+, Sammy 2x990 Pro
Display(s) Samsung 1080P \ LG 43UN700
Case Fractal Design Pop Air 2x140mm fans from Torrent \ Fractal Design Torrent 2 SilverStone FHP141x2
Audio Device(s) Yamaha RX-V677 \ Yamaha CX-830+Yamaha MX-630 Infinity RS4000\Paradigm P Studio 20, Blue Yeti
Power Supply Seasonic Prime TX-750 \ Corsair RM1000X Shift
Mouse Steelseries Sensei wireless \ Steelseries Sensei wireless
Keyboard Logitech K120 \ Wooting Two HE
Benchmark Scores Meh benchmarks.
350W.

10nm still looks broken to me, although there's no mention of clockspeeds at that power draw, so maybe they've ironed out enough kinks to be competitive.

These are datacenter parts. Performance per Watt is king there and AMD already has Zen3 Epyc with 64C/128T and 2P support at 225W. Embarrassingly cheap compared to Xeon Platinums, too.

+++
 
Joined
Jan 8, 2017
Messages
8,860 (3.36/day)
Cache normally has lower latency and higher bandwidth.

HBM2 has perhaps slightly higher latency than DDR4 / DDR5, but higher bandwidth. That makes it complicated to reason about. There will be many cases where HBM2 slows code down (anything latency-critical) while speeding up other cases (bandwidth-critical code). As far as I'm aware, no one has ever figured out a way to predict ahead of time whether code is latency bound or bandwidth bound.

-------

One of the Xeon Phi models had HMC (a competitor to HBM) + DDR4. And look how well that was figured out. It's hard as hell to optimize programs on those kinds of split-memory systems. There were all sorts of issues figuring out whether something should be in the HMC or on the DDR4.


View attachment 195983

It's not easy. Just give the Xeon Phi 7200 documents a look, and then imagine trying to write code against that.

AMD did figure out a solution with HBCC without having to resort to some kind of split memory arrangement, and it was completely transparent to the application. It didn't really make sense back then on a GPU running graphics workloads, simply because it wasn't really required, but on a CPU processing a data set of several dozen or hundreds of GB it might. Caches are never guaranteed to provide a definitive improvement.

The thing is, if whatever you wrote is highly multithreaded and data-independent, latency typically doesn't matter.
 
Joined
Apr 24, 2020
Messages
2,517 (1.75/day)
The thing is, if whatever you wrote is highly multithreaded and data-independent, latency typically doesn't matter.

And how much enterprise crap is single-threaded Python crunching a Django template and then piping it to a single-threaded home-grown Bash script?

Lol. Single-threaded, highly-dependent data being passed around, turned from XML into protobufs, back into XML, and then converted into JSON just to pass 50 bytes of text around. Yeaaaahhhhhhhh. Better companies may have good programmers who write more intelligent code. But seriously: most code out there (even "enterprise" code) is absolute unoptimized single-threaded crap.
 
Joined
Nov 4, 2005
Messages
11,654 (1.73/day)
System Name Compy 386
Processor 7800X3D
Motherboard Asus
Cooling Air for now.....
Memory 64 GB DDR5 6400Mhz
Video Card(s) 7900XTX 310 Merc
Storage Samsung 990 2TB, 2 SP 2TB SSDs and over 10TB spinning
Display(s) 56" Samsung 4K HDR
Audio Device(s) ATI HDMI
Mouse Logitech MX518
Keyboard Razer
Software A lot.
Benchmark Scores Its fast. Enough.
Cache normally has lower latency and higher bandwidth.

HBM2 has perhaps slightly higher latency than DDR4 / DDR5, but higher bandwidth. That makes it complicated to reason about. There will be many cases where HBM2 slows code down (anything latency-critical) while speeding up other cases (bandwidth-critical code). As far as I'm aware, no one has ever figured out a way to predict ahead of time whether code is latency bound or bandwidth bound.

-------

One of the Xeon Phi models had HMC (a competitor to HBM) + DDR4. And look how well that was figured out. It's hard as hell to optimize programs on those kinds of split-memory systems. There were all sorts of issues figuring out whether something should be in the HMC or on the DDR4.


View attachment 195983

It's not easy. Just give the Xeon Phi 7200 documents a look, and then imagine trying to write code against that.
Uhhhh whut

Every program built and optimized for a console is coded and tuned for the performance of the hardware being used. An additional level of cache that is that close and finely tuned will boost performance compared to system memory, and can be further tuned when the specifics are known.

This is why consoles can have higher actual performance than x86 running generic code meant to fit all configurations.

Even if it nets a 10 percent performance boost, Intel needs it to stay competitive.
 

tabascosauz

Moderator
Supporter
Staff member
Joined
Jun 24, 2015
Messages
7,457 (2.33/day)
Location
Western Canada
System Name ab┃ob
Processor 7800X3D┃5800X3D
Motherboard B650E PG-ITX┃B550-I Strix
Cooling PA120+T30┃AXP120x67
Memory 64GB 6000CL30┃32GB 3600CL14
Video Card(s) RTX 4070 Ti Eagle┃RTX A2000
Storage 8TB of SSDs┃1TB SN550
Display(s) 43" QN90B / 32" M32Q / 27" S2721DGF
Case Caselabs S3┃Lone Industries L5
Power Supply Corsair HX1000┃HDPlex
350W.

10nm still looks broken to me, although there's no mention of clockspeeds at that power draw, so maybe they've ironed out enough kinks to be competitive.

These are datacenter parts. Performance per Watt is king there and AMD already has Zen3 Epyc with 64C/128T and 2P support at 225W. Embarrassingly cheap compared to Xeon Platinums, too.

To be fair, I don't think Intel has general efficiency in mind when they say that 10 has "improved". They just mean that it's changed enough to behave much better at higher clocks. 10nm probably won't ever not be "broken" by efficiency standards so long as it's paired with Sunny Cove derivatives.
  • Second-gen 10nm in Sunny Cove clocked poorly at the top end, which placed 4.0GHz generally out of reach.
  • Third-gen 10SF in Willow Cove only improved that frequency scaling at the top end, relevant only really for consumer parts.
  • 10ESF for Golden Cove might improve by the same small margin efficiency-wise as 10SF, just increasing the clock envelope further.
The process hasn't changed much in the 2.0-3.5GHz range, which is where Xeons reside. Neither has the arch changed much. So add more cores (and more chiplets) like Intel is doing here, and power's just gonna go up - no surprises there.
 
Joined
Feb 20, 2019
Messages
7,188 (3.86/day)
To be fair, I don't think Intel has general efficiency in mind when they say that 10 has "improved". They just mean that it's changed enough to behave much better at higher clocks. 10nm probably won't ever not be "broken" by efficiency standards so long as it's paired with Sunny Cove derivatives.
  • Second-gen 10nm in Sunny Cove clocked poorly at the top end, which placed 4.0GHz generally out of reach.
  • Third-gen 10SF in Willow Cove only improved that frequency scaling at the top end, relevant only really for consumer parts.
  • 10ESF for Golden Cove might improve by the same small margin efficiency-wise as 10SF, just increasing the clock envelope further.
The process hasn't changed much in the 2.0-3.5GHz range, which is where Xeons reside. Neither has the arch changed much. So add more cores (and more chiplets) like Intel is doing here, and power's just gonna go up - no surprises there.
That's true, I guess. There are (expensive) niche-use Xeons/Epycs that are built for low core counts, high clockspeed, and full cache - but yeah, servers generally run at 2.x GHz.

I was foolishly thinking this might be a sign that 10nm desktop parts are on the way in 2021, but actually, low-clock server parts can be viable whilst the process node is still wholly useless for consumer products at 4GHz+.

I actually wish Intel would go back to making fanless CPUs - the sub 5W Core-M range was great for ultraportables and they ran at 0.8-2.0GHz which would be plenty for a general-purpose laptop. My experience with Ice Lake 10nm laptops was that they cooked themselves when boosting but otherwise ran very efficiently. What if Intel made a 4C/8T CPU with a 5W envelope and max clockspeed of, say, 1.6GHz? I'd buy that.
 
Joined
Mar 10, 2010
Messages
11,878 (2.31/day)
Location
Manchester uk
System Name RyzenGtEvo/ Asus strix scar II
Processor Amd R5 5900X/ Intel 8750H
Motherboard Crosshair hero8 impact/Asus
Cooling 360EK extreme rad+ 360$EK slim all push, cpu ek suprim Gpu full cover all EK
Memory Corsair Vengeance Rgb pro 3600cas14 16Gb in four sticks./16Gb/16GB
Video Card(s) Powercolour RX7900XT Reference/Rtx 2060
Storage Silicon power 2TB nvme/8Tb external/1Tb samsung Evo nvme 2Tb sata ssd/1Tb nvme
Display(s) Samsung UAE28"850R 4k freesync.dell shiter
Case Lianli 011 dynamic/strix scar2
Audio Device(s) Xfi creative 7.1 on board ,Yamaha dts av setup, corsair void pro headset
Power Supply corsair 1200Hxi/Asus stock
Mouse Roccat Kova/ Logitech G wireless
Keyboard Roccat Aimo 120
VR HMD Oculus rift
Software Win 10 Pro
Benchmark Scores 8726 vega 3dmark timespy/ laptop Timespy 6506
Cache normally has lower latency and higher bandwidth.

HBM2 has perhaps slightly higher latency than DDR4 / DDR5, but higher bandwidth. That makes it complicated to reason about. There will be many cases where HBM2 slows code down (anything latency-critical) while speeding up other cases (bandwidth-critical code). As far as I'm aware, no one has ever figured out a way to predict ahead of time whether code is latency bound or bandwidth bound.

-------

One of the Xeon Phi models had HMC (a competitor to HBM) + DDR4. And look how well that was figured out. It's hard as hell to optimize programs on those kinds of split-memory systems. There were all sorts of issues figuring out whether something should be in the HMC or on the DDR4.


View attachment 195983

It's not easy. Just give the Xeon Phi 7200 documents a look, and then imagine trying to write code against that.
I think you're being harsh on HBM latency; the chips are significantly closer and the latency wasn't that far out anyway.
People are touting HBM near the CPU to help AI and ML applications specifically.
Personally, I would imagine the plan is to have tiered memory: not to use it as an L4 cache, but rather as a special pool to use as required depending on the application.

Given Intel always has a wide array of SKUs, it's relatively easy to imagine that HBM won't be on every SKU, since some applications might not require it.
 

Aquinus

Resident Wat-man
Joined
Jan 28, 2012
Messages
13,147 (2.96/day)
Location
Concord, NH, USA
System Name Apollo
Processor Intel Core i9 9880H
Motherboard Some proprietary Apple thing.
Memory 64GB DDR4-2667
Video Card(s) AMD Radeon Pro 5600M, 8GB HBM2
Storage 1TB Apple NVMe, 4TB External
Display(s) Laptop @ 3072x1920 + 2x LG 5k Ultrafine TB3 displays
Case MacBook Pro (16", 2019)
Audio Device(s) AirPods Pro, Sennheiser HD 380s w/ FIIO Alpen 2, or Logitech 2.1 Speakers
Power Supply 96w Power Adapter
Mouse Logitech MX Master 3
Keyboard Logitech G915, GL Clicky
Software MacOS 12.1
HBM2 has perhaps slightly higher latency than DDR4 / DDR5, but higher bandwidth. That makes it complicated to reason about. There will be many cases where HBM2 slows code down (anything latency-critical) while speeding up other cases (bandwidth-critical code). As far as I'm aware, no one has ever figured out a way to predict ahead of time whether code is latency bound or bandwidth bound.
That might not matter if the HBM is used as something like an eviction cache or to hold on to cache state from a context switch. For example, if you can quickly reload cache for a context switch, you could get faster performance than having to wait for the cache to get repopulated from the new task. Or even if you're processing a task that was getting processed on another core. Using HBM to preserve cache context would be a huge benefit for systems that are doing a lot of tasks in parallel, like with application servers.
 
Joined
Jan 8, 2017
Messages
8,860 (3.36/day)
And how much enterprise crap is single-threaded Python crunching a Django template and then piping it to a single-threaded home-grown Bash script?

Lol. Single-threaded, highly-dependent data being passed around, turned from XML into protobufs, back into XML, and then converted into JSON just to pass 50 bytes of text around. Yeaaaahhhhhhhh. Better companies may have good programmers who write more intelligent code. But seriously: most code out there (even "enterprise" code) is absolute unoptimized single-threaded crap.
Despite that, these companies race each other to create platforms that offer as many cores per node as possible, because that matters more. Who cares if your single-threaded Python crap runs 5% slower if you can get 30% more instances running per socket, for example?
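
The arithmetic behind that example, as a quick sketch. The 5% and 30% are just the illustrative figures from the sentence above, and the model assumes instances don't contend with each other for memory or cache, which is optimistic.

```python
# Per-socket throughput when each instance gets slower but more instances fit.
# Assumes instances scale independently (no shared-resource contention).

def relative_throughput(instance_gain: float, per_instance_slowdown: float) -> float:
    """Return socket throughput relative to the baseline (1.0 == unchanged)."""
    return (1 + instance_gain) * (1 - per_instance_slowdown)


print(f"{relative_throughput(0.30, 0.05):.3f}x baseline throughput")
```

So roughly 1.235x the work per socket, despite every individual instance being slower.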

And whatever gets written in that way is probably not something performance critical anyway.
 
Joined
Mar 6, 2018
Messages
115 (0.05/day)
To be fair, I don't think Intel has general efficiency in mind when they say that 10 has "improved". They just mean that it's changed enough to behave much better at higher clocks. 10nm probably won't ever not be "broken" by efficiency standards so long as it's paired with Sunny Cove derivatives.
  • Second-gen 10nm in Sunny Cove clocked poorly at the top end, which placed 4.0GHz generally out of reach.
  • Third-gen 10SF in Willow Cove only improved that frequency scaling at the top end, relevant only really for consumer parts.
  • 10ESF for Golden Cove might improve by the same small margin efficiency-wise as 10SF, just increasing the clock envelope further.
The process hasn't changed much in the 2.0-3.5GHz range, which is where Xeons reside. Neither has the arch changed much. So add more cores (and more chiplets) like Intel is doing here, and power's just gonna go up - no surprises there.
At least Willow Cove has its IPC increased by 15%-20%.
 
Joined
Apr 24, 2020
Messages
2,517 (1.75/day)
I think you're being harsh on HBM latency; the chips are significantly closer and the latency wasn't that far out anyway.

HBM and DDR4 / DDR5 latency are mainly determined by the physics of DRAM. Being forced to precharge the sense amps and transfer data with RAS before finally being able to CAS the data over is just a lot of latency. And those steps must be done on any DRAM, be it HBM, DDR4 / DDR5, or GDDR6X. It's a big reason why SRAM caches exist and why DRAM caches (even on-chip eDRAM caches) remain unpopular.

In fact, most DDR4 / HBM / GDDR6X "latency improvements" aren't actually improving latency at all. They're just hiding latency behind more and more parallel requests. See DDR4 bank groups: a DDR4 chip has up to 16 banks arranged in 4 bank groups (and therefore up to 16 parallel PRE-RAS-CAS sequences), DDR5 goes up to 32 banks, and that will only go up.
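
Little's law gives a feel for how much parallelism that takes: bytes in flight = bandwidth x latency. A quick sketch, where the 64-byte cache line, the 50 ns latency and both bandwidth figures are assumed round numbers rather than specs of any particular part:

```python
# Little's law: concurrency needed = bandwidth * latency / request size.
# Bandwidth, latency and line size below are assumed round numbers.

def requests_in_flight(bandwidth_gbs: float, latency_ns: float, line_bytes: int = 64) -> float:
    """Cache-line requests that must be outstanding to sustain the bandwidth.

    1 GB/s equals 1 byte/ns, so bandwidth_gbs * latency_ns is bytes in flight.
    """
    return bandwidth_gbs * latency_ns / line_bytes


for label, bandwidth in (("DDR-class (assumed 200 GB/s)", 200.0),
                         ("HBM-class (assumed 800 GB/s)", 800.0)):
    needed = requests_in_flight(bandwidth, latency_ns=50.0)
    print(f"{label}: about {needed:.0f} requests in flight")
```

Quadruple the bandwidth at the same latency and you need four times as many independent requests outstanding, which single-threaded dependent code can't supply.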

Whenever we start seeing eDRAM / DRAM as a caching layer (ex: Xbox 360, Xeon Phi 7200), we suddenly get an influx of frustrated programmers who have to deal with the reality of the architecture. You're simply not going to get much better than the ~50ns or so of latency on DRAM. Too many steps need to be done per request at the fundamental physics layer.
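
As a rough illustration of where that ballpark comes from, here's the PRE + RAS + CAS sum for a row miss. The DDR4-3200 22-22-22 timings are an assumed example module, and this covers only the in-DRAM part; the memory controller and on-chip fabric add more on top.

```python
# Row-miss latency inside the DRAM: precharge (tRP) + activate (tRCD) + CAS (CL).
# The DDR4-3200 22-22-22 timings are an assumed example, not a specific product.

def row_miss_latency_ns(data_rate_mts: float, t_rp: int, t_rcd: int, cl: int) -> float:
    """Sum the three timing steps (given in clock cycles) and convert to ns."""
    clock_mhz = data_rate_mts / 2   # DDR transfers twice per clock
    cycle_ns = 1000 / clock_mhz     # one clock period in nanoseconds
    return (t_rp + t_rcd + cl) * cycle_ns


print(f"{row_miss_latency_ns(3200, t_rp=22, t_rcd=22, cl=22):.2f} ns")
```

Faster data rates shorten the clock period, but the cycle counts usually go up with them, so the total in nanoseconds barely moves from one generation to the next.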

SRAM (typical L3 / L2 / L1 caches) just doesn't have to deal with any of those steps at all. So adding those caches and/or making them bigger is simple and obvious.
 

Aquinus

Resident Wat-man
Joined
Jan 28, 2012
Messages
13,147 (2.96/day)
Location
Concord, NH, USA
SRAM (typical L3 / L2 / L1 caches) just doesn't have to deal with any of those steps at all. So adding those caches and/or making them bigger is simple and obvious.
Except they make the die bigger and require more power for the same capacity. It's simpler, but it's also more expensive in several regards.
 
Joined
Mar 10, 2010
Messages
11,878 (2.31/day)
Location
Manchester uk
HBM and DDR4 / DDR5 latency are mainly determined by the physics of DRAM. Being forced to precharge the sense amps and transfer data with RAS before finally being able to CAS the data over is just a lot of latency. And those steps must be done on any DRAM, be it HBM, DDR4 / DDR5, or GDDR6X. It's a big reason why SRAM caches exist and why DRAM caches (even on-chip eDRAM caches) remain unpopular.

In fact, most DDR4 / HBM / GDDR6X "latency improvements" aren't actually improving latency at all. They're just hiding latency behind more and more parallel requests. See DDR4 bank groups: a DDR4 chip has up to 16 banks arranged in 4 bank groups (and therefore up to 16 parallel PRE-RAS-CAS sequences), DDR5 goes up to 32 banks, and that will only go up.

Whenever we start seeing eDRAM / DRAM as a caching layer (ex: Xbox 360, Xeon Phi 7200), we suddenly get an influx of frustrated programmers who have to deal with the reality of the architecture. You're simply not going to get much better than the ~50ns or so of latency on DRAM. Too many steps need to be done per request at the fundamental physics layer.

SRAM (typical L3 / L2 / L1 caches) just doesn't have to deal with any of those steps at all. So adding those caches and/or making them bigger is simple and obvious.
Look, the server industry these chips are aimed at can afford the effort, or they won't buy, or oneAPI isn't shit. There are many variables, and more yet to be disclosed, that could influence the viability of these chips. I read about all of that way back when they were released, too; I'm aware how the hardware works and is made, or I wouldn't be commenting on it.
 