
Intel's Upcoming Sapphire Rapids Server Processors to Feature up to 56 Cores with HBM Memory

AleksandarK

Staff member
Joined
Aug 19, 2017
Messages
1,057 (0.77/day)
Intel has just launched its Ice Lake-SP lineup of Xeon Scalable processors, featuring the new Sunny Cove CPU core design. Built on the 10 nm node, these processors represent Intel's first 10 nm shipping product designed for the enterprise. However, another 10 nm product is coming for enterprise users: Intel is already preparing the Sapphire Rapids generation of Xeon processors, and today we get to see more details about it. Thanks to an anonymous tip received by VideoCardz, we have a few more details, such as core counts, memory configurations, and connectivity options, and Sapphire Rapids is shaping up to be a very competitive platform. Do note that the slide is a bit older; however, it contains useful information.

The lineup will top out at 56 cores and 112 threads, with that flagship carrying a TDP of 350 Watts, notably higher than its predecessors. Perhaps the most interesting note from the slide concerns memory. The new platform will debut the DDR5 standard, bringing higher capacities at higher speeds. Along with the new memory standard, the chiplet design of Sapphire Rapids will bring HBM2E memory to CPUs, with up to 64 GB of it per socket/processor. The PCIe 5.0 standard will also be present with 80 lanes, accompanied by four Intel UPI 2.0 links. Intel is also expected to extend x86-64 here with the AMX/TMUL extensions for better INT8 and BFloat16 processing.
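For a sense of what AMX/TMUL means in practice: the tile units perform a matrix multiply-accumulate over small tiles, with INT8 or BFloat16 inputs accumulated at higher precision. Below is a minimal plain-C sketch of the BF16 variant's semantics; it is reference code only, not Intel's intrinsics API, and the helper names are illustrative.

[CODE=c]
/* Reference semantics of a BF16 tile multiply-accumulate (illustrative only):
 * BF16 inputs, FP32 accumulation. AMX/TMUL performs this on whole tiles
 * in hardware; the names and layout here are hypothetical. */
#include <stdint.h>
#include <string.h>

/* BF16 is the top 16 bits of an IEEE-754 single-precision float. */
static float bf16_to_f32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* C[m][n] += sum_k A[m][k] * B[k][n], accumulated in FP32. */
void bf16_tile_macc(int M, int N, int K,
                    const uint16_t *A, const uint16_t *B, float *C) {
    for (int m = 0; m < M; m++)
        for (int n = 0; n < N; n++) {
            float acc = C[m * N + n];
            for (int k = 0; k < K; k++)
                acc += bf16_to_f32(A[m * K + k]) * bf16_to_f32(B[k * N + n]);
            C[m * N + n] = acc;
        }
}
[/CODE]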


 
Joined
Aug 13, 2010
Messages
4,807 (1.22/day)
There's no doubt that the jump to Sapphire Rapids (10nm SuperFin, the migration to PCIe 5.0 and DDR5, plus HBM integrated into some chips) is a monumental jump in server capabilities.
I'm hoping most of it trickles down to consumer platforms. The idea of HEDT for Intel pretty much died the moment first-gen Threadripper came out.

Who knows.
 
Joined
Feb 20, 2019
Messages
2,466 (3.03/day)
System Name Flavour of the month. I roll through hardware like it's not even mine (it often isn't).
Processor 3900X, 5800X, 2700U
Motherboard Aorus X570 Elite, B550 DS3H
Cooling Alphacool CPU+GPU soft-tubing loop (Laing D5 360mm+140mm), AMD Wraith Prism
Memory 32GB Patriot 3600CL17, 32GB Corsair LPX 3200CL16, 16GB HyperX 2400CL14
Video Card(s) 2070S, 5700XT, Vega10
Storage 1TB WD S100G, 2TB Adata SX8200 Pro, 1TB MX500, 500GB Hynix 2242 bastard thing, 16TB of rust + backup
Display(s) Dell SG3220 165Hz VA, Samsung 65" Q9FN 120Hz VA
Case NZXT H440NE, Silverstone GD04 (almost nothing original left inside, thanks 3D printer!)
Audio Device(s) CA DacMagic+ with Presonus Eris E5, Yamaha RX-V683 with Q Acoustics 3000-series, Sony MDR-1A
Power Supply BeQuiet StraightPower E9 680W, Corsair RM550, and a 45W Lenovo DC power brick, I guess.
Mouse G303, MX Anywhere 2, Another MX Anywhere 2.
Keyboard CM QuickFire Stealth (Cherry MX Brown), Logitech MX Keys (not Cherry MX at all)
Software W10
Benchmark Scores I once clocked a Celeron-300A to 564MHz on an Abit BE6 and it scored over 9000.
350W.

10nm still looks broken to me, although there's no mention of clockspeeds at that power draw, so maybe they've ironed out enough kinks to be competitive.

These are datacenter parts. Performance per watt is king there, and AMD already has Zen3 Epyc with 64C/128T and 2P support at 225W. Embarrassingly cheap compared to Xeon Platinums, too.
 
Joined
Dec 24, 2020
Messages
1,020 (7.23/day)
Processor AMD Ryzen 9 3900X
Motherboard ASUS ROG Strix B550-F
Cooling Noctua NH-U12A @ 1000 RPM
Memory 4x 8 GB G.Skill Trident Z Neo 3600 16-16-16-36
Video Card(s) MSI RTX 3070 Gaming X Trio
Storage 1x Samsung 980 PRO 500 GB | 1x SanDisk X400 512 GB | 2x Crucial MX500 1 TB
Display(s) 1440p 144Hz ASUS TUF VG27AQ | 1080p 72Hz LG 22MP68VQ-P
Case be quiet! Pure Base 500DX Black | 3x Silent Wings 3 140mm PWM High-Speed @ 900 RPM
Audio Device(s) AAF DCH Optimus Sound - Legacy
Power Supply Seasonic Prime PX-750 80+ Platinum Fully Modular
Mouse ASUS ROG Chakram
Keyboard ASUS ROG Strix Flare Cherry MX Red RGB
Software Windows 10 Pro 20H2
Judging by how Intel measures TDP, I'm already extremely worried about that 350W.

You can do it, you can reach 64 cores someday, preferably at lower power consumption too.
 
Joined
Mar 28, 2020
Messages
751 (1.82/day)
350W.

10nm still looks broken to me, although there's no mention of clockspeeds at that power draw, so maybe they've ironed out enough kinks to be competitive.

These are datacenter parts. Performance per watt is king there, and AMD already has Zen3 Epyc with 64C/128T and 2P support at 225W. Embarrassingly cheap compared to Xeon Platinums, too.
To be honest, I always feel that Intel may have made considerable compromises to get 10nm out the door, so the 10nm we are seeing in the market now is probably worse than what was originally planned. Looking at the SuperFin used for Tiger Lake, on the surface it seems to have improved, allowing for higher clock speeds, but it also sounds like it's going down the path of 14nm, i.e. feed it more power to gain higher clock speeds. So with Sapphire Rapids coming in with a rumoured TDP of 350W, it likely shows that SuperFin is really not that super. AMD aside, I feel ARM will be the more potent competitor for Intel when it comes to data center CPUs.

Judging by how Intel measures TDP, I'm already extremely worried about that 350W.
This is very true. Maybe that's why they need liquid immersion cooling now, instead of traditional heatsinks and fans.
 
Joined
May 4, 2009
Messages
1,952 (0.44/day)
Location
Bulgaria
System Name penguin
Processor R5 2400G
Motherboard Asrock B450M Pro4
Cooling Box
Memory 4 x Kingston HyperX Fury 2666MHz
Video Card(s) IGP
Storage ADATA SU800 512GB
Display(s) 27' LG
Case Zalman
Audio Device(s) stock
Power Supply Seasonic SS-620GM
Software win10
350W.

10nm still looks broken to me, although there's no mention of clockspeeds at that power draw, so maybe they've ironed out enough kinks to be competitive.

These are datacenter parts. Performance per watt is king there, and AMD already has Zen3 Epyc with 64C/128T and 2P support at 225W. Embarrassingly cheap compared to Xeon Platinums, too.
It's that high because these are 99% two dies glued together with a DMI link/Foveros. It's just unreasonable to put 40 fat x86 cores on a single die, so Intel's only option is to make smaller dies and link them. Since they don't have an energy-efficient interconnect, you end up needing a ton of power.
 
Joined
Feb 20, 2019
Messages
2,466 (3.03/day)
System Name Flavour of the month. I roll through hardware like it's not even mine (it often isn't).
Processor 3900X, 5800X, 2700U
Motherboard Aorus X570 Elite, B550 DS3H
Cooling Alphacool CPU+GPU soft-tubing loop (Laing D5 360mm+140mm), AMD Wraith Prism
Memory 32GB Patriot 3600CL17, 32GB Corsair LPX 3200CL16, 16GB HyperX 2400CL14
Video Card(s) 2070S, 5700XT, Vega10
Storage 1TB WD S100G, 2TB Adata SX8200 Pro, 1TB MX500, 500GB Hynix 2242 bastard thing, 16TB of rust + backup
Display(s) Dell SG3220 165Hz VA, Samsung 65" Q9FN 120Hz VA
Case NZXT H440NE, Silverstone GD04 (almost nothing original left inside, thanks 3D printer!)
Audio Device(s) CA DacMagic+ with Presonus Eris E5, Yamaha RX-V683 with Q Acoustics 3000-series, Sony MDR-1A
Power Supply BeQuiet StraightPower E9 680W, Corsair RM550, and a 45W Lenovo DC power brick, I guess.
Mouse G303, MX Anywhere 2, Another MX Anywhere 2.
Keyboard CM QuickFire Stealth (Cherry MX Brown), Logitech MX Keys (not Cherry MX at all)
Software W10
Benchmark Scores I once clocked a Celeron-300A to 564MHz on an Abit BE6 and it scored over 9000.
@watzupken

I just spotted that Anandtech have reviewed 3rd Gen Xeon and whilst it's not a turd, yeah it's not looking good either. To say it's better than the dumpster fire of previous-gen Xeons is a veiled insult. Maybe they're approaching AMD's older Rome (Zen2) Epyc lineup in performance and efficiency, so they're now only 2 years behind.

It looks like 10nm is still not good, but it's come far enough that it's now worth using over Skylake 14nm from 2017.

@HalfAHertz
This is Intel's first stab at glue in a long time and they're not very good at it. Reminds me of the power/efficiency scaling of Core2Quad :D
 
Joined
Apr 24, 2020
Messages
926 (2.40/day)
It's my understanding that HBM2 does NOT have enough capacity to be useful to a typical datacenter workload.

I'm guessing that these are going to be intended for supercomputers, as a competitor to the A64FX, and not as a typical datacenter CPU.
 
Joined
Sep 1, 2020
Messages
414 (1.62/day)
Location
Bulgaria
Hmm, what is this: Intel Optane 300 series "Crow Pass", with up to 2.6X random access?
 
Joined
Jan 8, 2017
Messages
6,656 (4.19/day)
System Name Good enough
Processor AMD Ryzen R7 1700X - 4.0 Ghz / 1.350V
Motherboard ASRock B450M Pro4
Cooling Deepcool Gammaxx L240 V2
Memory 16GB - Corsair Vengeance LPX - 3333 Mhz CL16
Video Card(s) OEM Dell GTX 1080 with Kraken G12 + Water 3.0 Performer C
Storage 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) 4K Samsung TV
Case Deepcool Matrexx 70
Power Supply GPS-750C
It's my understanding that HBM2 does NOT have enough capacity to be useful to a typical datacenter workload.
It could just be used as another level of cache.
 
Joined
Jun 29, 2018
Messages
135 (0.13/day)
This is Intel's first stab at glue in a long time and they're not very good at it. Reminds me of the power/efficiency scaling of Core2Quad :D
The 9000 series of Xeon Scalable (Cascade Lake) is using glue as well and is fairly recent:


The problem with those processors was that they only came from one source: Intel-built servers. The highest model, with 56 cores, required liquid cooling since it had a TDP of 400W.
 
Joined
Apr 24, 2020
Messages
926 (2.40/day)
It could just be used as another level of cache.

Cache normally has lower latency and higher bandwidth.

HBM2 has maybe slightly higher latency than DDR4/DDR5, and higher bandwidth. That makes it complicated to reason about. There will be many cases where HBM2 slows down code (any latency-critical code) while speeding up other cases (any bandwidth-critical code). As far as I'm aware, no one has ever figured out a way to predict ahead of time whether code is latency bound or bandwidth bound.
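To make that distinction concrete, here is a minimal sketch (my own illustration, nothing from the slide): a dependent pointer chase is gated almost entirely by per-access latency, while a streaming sum is gated by bandwidth, so the same machine can look very different on HBM2 versus DDR depending on which pattern dominates.

[CODE=c]
/* Illustrative access patterns (hypothetical microbenchmark helpers). */
#include <stddef.h>
#include <stdint.h>

/* Latency-bound: each load depends on the previous one, so nothing overlaps. */
size_t pointer_chase(const size_t *next, size_t start, size_t hops) {
    size_t i = start;
    for (size_t h = 0; h < hops; h++)
        i = next[i];          /* must finish this load before the next address is known */
    return i;
}

/* Bandwidth-bound: independent sequential loads that prefetchers can stream. */
uint64_t stream_sum(const uint64_t *data, size_t n) {
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += data[i];       /* many loads can be in flight at once */
    return sum;
}
[/CODE]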

-------

One of the Xeon Phi models had HMC (a competitor to HBM) + DDR4, and look how well that was figured out. It's hard as hell to optimize programs on those kinds of split-memory systems. There were all sorts of issues figuring out whether or not something should be in the HMC or in the DDR4.




It's not easy. Just give the Xeon Phi 7200 documents a look, and then imagine trying to write code against that.
 

AsRock

TPU addict
Joined
Jun 23, 2007
Messages
17,098 (3.37/day)
Location
UK\US
Processor 2500k \ AMD 3900X+NH-D15
Motherboard ASRock Z68 \ ASRock AM4 X570 Pro 4
Memory Samsung low profile 2x8GB \ Patriot 2x16GB PVS432G320C6K
Video Card(s) eVga GTX1060 SSC \ XFX R9 390X
Storage 2xIntel 80Gb (SATA2) Crucial MX500 \ Samsung 860 1TB +Samsung Evo 250GB+500GB Sabrent 1TB Rocket
Display(s) Samsung 1080P \ LG 43UN700
Case HTPC400 \ Thermaltake Armor case ( VE2000BWS ), With Zalman fan controller ( wattage usage ).
Audio Device(s) Yamaha RX-A820 \ Yamaha CX-830+Yamaha MX-630 Infinity RS4000\Paradigm P Studio 20, Blue Yeti
Power Supply PC&Power 750w \ Seasonic 750w MKII
Mouse Steelseries Sensei wireless \ Steelseries Sensei wireless
Keyboard Logitech K120 \ ROCCAT MK Pro ( modded amber leds )
Benchmark Scores Meh benchmarks.
350W.

10nm still looks broken to me, although there's no mention of clockspeeds at that power draw, so maybe they've ironed out enough kinks to be competitive.

These are datacenter parts. Performance per watt is king there, and AMD already has Zen3 Epyc with 64C/128T and 2P support at 225W. Embarrassingly cheap compared to Xeon Platinums, too.

+++
 
Joined
Jan 8, 2017
Messages
6,656 (4.19/day)
System Name Good enough
Processor AMD Ryzen R7 1700X - 4.0 Ghz / 1.350V
Motherboard ASRock B450M Pro4
Cooling Deepcool Gammaxx L240 V2
Memory 16GB - Corsair Vengeance LPX - 3333 Mhz CL16
Video Card(s) OEM Dell GTX 1080 with Kraken G12 + Water 3.0 Performer C
Storage 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) 4K Samsung TV
Case Deepcool Matrexx 70
Power Supply GPS-750C
Cache normally has lower latency and higher bandwidth.

HBM2 has maybe slightly higher latency than DDR4/DDR5, and higher bandwidth. That makes it complicated to reason about. There will be many cases where HBM2 slows down code (any latency-critical code) while speeding up other cases (any bandwidth-critical code). As far as I'm aware, no one has ever figured out a way to predict ahead of time whether code is latency bound or bandwidth bound.

-------

One of the Xeon Phi models had HMC (a competitor to HBM) + DDR4, and look how well that was figured out. It's hard as hell to optimize programs on those kinds of split-memory systems. There were all sorts of issues figuring out whether or not something should be in the HMC or in the DDR4.



It's not easy. Just give the Xeon Phi 7200 documents a look, and then imagine trying to write code against that.

AMD did figure out a solution with HBCC without having to resort to some kind of split-memory arrangement, and it was completely transparent to the application. It didn't really make sense back then on a GPU running graphics workloads, simply because it wasn't really required, but on a CPU processing a data set of several dozen or hundreds of GBs it might. Caches are never guaranteed to provide a definitive improvement.

The thing is, if whatever it is that you wrote is highly multithreaded and data-independent, latency typically doesn't matter.
 
Last edited:
Joined
Apr 24, 2020
Messages
926 (2.40/day)
The thing is, if whatever it is that you wrote is highly multithreaded and data-independent, latency typically doesn't matter.

And how much enterprise crap is single-threaded Python crunching a Django template and then piping it to a single-threaded home-grown Bash script?

Lol. Single-threaded, highly dependent data being passed around, turned into XML, into protobufs, back into XML, and then converted into JSON just to pass 50 bytes of text around. Yeaaaahhhhhhhh. Better companies may have good programmers who write more intelligent code. But seriously: most code out there (even "enterprise" code) is absolutely unoptimized single-threaded crap.
 
Joined
Nov 4, 2005
Messages
10,766 (1.90/day)
System Name MoFo 2
Processor AMD PhenomII 1100T @ 4.2Ghz
Motherboard Asus Crosshair IV
Cooling Swiftec 655 pump, Apogee GT,, MCR360mm Rad, 1/2 loop.
Memory 8GB DDR3-2133 @ 1900 8.9.9.24 1T
Video Card(s) HD7970 1250/1750
Storage Agility 3 SSD 6TB RAID 0 on RAID Card
Display(s) 46" 1080P Toshiba LCD
Case Rosewill R6A34-BK modded (thanks to MKmods)
Audio Device(s) ATI HDMI
Power Supply 750W PC Power & Cooling modded (thanks to MKmods)
Software A lot.
Benchmark Scores Its fast. Enough.
Cache normally has lower latency and higher bandwidth.

HBM2 has maybe slightly higher latency than DDR4/DDR5, and higher bandwidth. That makes it complicated to reason about. There will be many cases where HBM2 slows down code (any latency-critical code) while speeding up other cases (any bandwidth-critical code). As far as I'm aware, no one has ever figured out a way to predict ahead of time whether code is latency bound or bandwidth bound.

-------

One of the Xeon Phi models had HMC (a competitor to HBM) + DDR4, and look how well that was figured out. It's hard as hell to optimize programs on those kinds of split-memory systems. There were all sorts of issues figuring out whether or not something should be in the HMC or in the DDR4.



It's not easy. Just give the Xeon Phi 7200 documents a look, and then imagine trying to write code against that.
Uhhhh whut

Every console, and every optimized program, is coded and tuned for the performance of the hardware being used; an additional level of cache that close, and that finely tuned, will boost performance compared to system memory, and it can be tuned further once the specifics are known.

This is why consoles can get genuinely higher performance than x86 running generic code written to fit every configuration.

Even if it nets a 10 percent performance boost, Intel needs it to stay competitive.
 
Joined
Jun 24, 2015
Messages
2,753 (1.28/day)
Location
Western Canada
System Name Austere Box R1.4
Processor R9 5900X
Motherboard B550M TUF Wifi (2006)
Cooling NH-C14S iPPC
Memory 32GB 3600 16-19-19
Video Card(s) RTX 2060 Super FE (0.981V)
Storage 3TB SX8200/SN750/Blue3D
Case Sliger Cerberus
Power Supply Seasonic SGX-650
350W.

10nm still looks broken to me, although there's no mention of clockspeeds at that power draw, so maybe they've ironed out enough kinks to be competitive.

These are datacenter parts. Performance per watt is king there, and AMD already has Zen3 Epyc with 64C/128T and 2P support at 225W. Embarrassingly cheap compared to Xeon Platinums, too.

To be fair, I don't think Intel has general efficiency in mind when they say that 10 has "improved". They just mean that it's changed enough to behave much better at higher clocks. 10nm probably won't ever not be "broken" by efficiency standards so long as it's paired with Sunny Cove derivatives.
  • Second-gen 10nm in Sunny Cove clocked poorly at the top end, which placed 4.0GHz generally out of reach.
  • Third-gen 10SF in Willow Cove only improved that frequency scaling at the top end, relevant only really for consumer parts.
  • 10ESF for Willow Cove might improve by the same small margin efficiency-wise as 10SF, just increasing the clock envelope further.
The process hasn't changed much in the 2.0-3.5GHz range, which is where Xeons reside. Neither has the arch changed much. So add more cores (and more chiplets) like Intel is doing here, and power's just gonna go up - no surprises there.
 
Joined
Feb 20, 2019
Messages
2,466 (3.03/day)
System Name Flavour of the month. I roll through hardware like it's not even mine (it often isn't).
Processor 3900X, 5800X, 2700U
Motherboard Aorus X570 Elite, B550 DS3H
Cooling Alphacool CPU+GPU soft-tubing loop (Laing D5 360mm+140mm), AMD Wraith Prism
Memory 32GB Patriot 3600CL17, 32GB Corsair LPX 3200CL16, 16GB HyperX 2400CL14
Video Card(s) 2070S, 5700XT, Vega10
Storage 1TB WD S100G, 2TB Adata SX8200 Pro, 1TB MX500, 500GB Hynix 2242 bastard thing, 16TB of rust + backup
Display(s) Dell SG3220 165Hz VA, Samsung 65" Q9FN 120Hz VA
Case NZXT H440NE, Silverstone GD04 (almost nothing original left inside, thanks 3D printer!)
Audio Device(s) CA DacMagic+ with Presonus Eris E5, Yamaha RX-V683 with Q Acoustics 3000-series, Sony MDR-1A
Power Supply BeQuiet StraightPower E9 680W, Corsair RM550, and a 45W Lenovo DC power brick, I guess.
Mouse G303, MX Anywhere 2, Another MX Anywhere 2.
Keyboard CM QuickFire Stealth (Cherry MX Brown), Logitech MX Keys (not Cherry MX at all)
Software W10
Benchmark Scores I once clocked a Celeron-300A to 564MHz on an Abit BE6 and it scored over 9000.
To be fair, I don't think Intel has general efficiency in mind when they say that 10 has "improved". They just mean that it's changed enough to behave much better at higher clocks. 10nm probably won't ever not be "broken" by efficiency standards so long as it's paired with Sunny Cove derivatives.
  • Second-gen 10nm in Sunny Cove clocked poorly at the top end, which placed 4.0GHz generally out of reach.
  • Third-gen 10SF in Willow Cove only improved that frequency scaling at the top end, relevant only really for consumer parts.
  • 10ESF for Willow Cove might improve by the same small margin efficiency-wise as 10SF, just increasing the clock envelope further.
The process hasn't changed much in the 2.0-3.5GHz range, which is where Xeons reside. Neither has the arch changed much. So add more cores (and more chiplets) like Intel is doing here, and power's just gonna go up - no surprises there.
That's true, I guess. There are (expensive) niche-use Xeons/Epycs built for low core counts, high clockspeeds, and full cache - but yeah, servers generally run at 2.x GHz.

I was foolishly thinking this might be a sign that 10nm desktop parts are on the way in 2021 but actually, low-clock server parts can be viable whilst the process node is still wholly useless for consumer products at 4GHz+

I actually wish Intel would go back to making fanless CPUs - the sub 5W Core-M range was great for ultraportables and they ran at 0.8-2.0GHz which would be plenty for a general-purpose laptop. My experience with Ice Lake 10nm laptops was that they cooked themselves when boosting but otherwise ran very efficiently. What if Intel made a 4C/8T CPU with a 5W envelope and max clockspeed of, say, 1.6GHz? I'd buy that.
 
Last edited:
Joined
Mar 10, 2010
Messages
8,787 (2.15/day)
Location
Manchester uk
System Name RyzenGtEvo/ Asus strix scar II
Processor Amd R7 3800X@4.350/525/ Intel 8750H
Motherboard Crosshair hero7 @bios 2703/?
Cooling 360EK extreme rad+ 360$EK slim all push, cpu Monoblock Gpu full cover all EK
Memory Corsair Vengeance Rgb pro 3600cas14 16Gb in two sticks./16Gb
Video Card(s) Sapphire refference Rx vega 64 EK waterblocked/Rtx 2060
Storage Silicon power qlc nvmex3 in raid 0/8Tb external/1Tb samsung Evo nvme 2Tb sata ssd
Display(s) Samsung UAE28"850R 4k freesync, LG 49" 4K 60hz ,Oculus
Case Lianli p0-11 dynamic
Audio Device(s) Xfi creative 7.1 on board ,Yamaha dts av setup, corsair void pro headset
Power Supply corsair 1200Hxi
Mouse Roccat Kova/ Logitech G wireless
Keyboard Roccat Aimo 120
Software Win 10 Pro
Benchmark Scores 8726 vega 3dmark timespy/ laptop Timespy 6506
Cache normally has lower latency and higher bandwidth.

HBM2 has maybe slightly higher latency than DDR4/DDR5, and higher bandwidth. That makes it complicated to reason about. There will be many cases where HBM2 slows down code (any latency-critical code) while speeding up other cases (any bandwidth-critical code). As far as I'm aware, no one has ever figured out a way to predict ahead of time whether code is latency bound or bandwidth bound.

-------

One of the Xeon Phi models had HMC (a competitor to HBM) + DDR4, and look how well that was figured out. It's hard as hell to optimize programs on those kinds of split-memory systems. There were all sorts of issues figuring out whether or not something should be in the HMC or in the DDR4.



It's not easy. Just give the Xeon Phi 7200 documents a look, and then imagine trying to write code against that.
I think you're being harsh on HBM latency; the chips are significantly closer and the latency wasn't that far out anyway.
People are touting HBM memory near the CPU to help AI and ML applications specifically.
I would imagine the plan is to have tiered memory, personally: not to use it as an L4 cache, but rather as a special pool to be used as required depending on the application.

Given Intel always has a wide array of SKUs, it's relatively easy to imagine that HBM won't be on every SKU, since some applications might not require it.
 

Aquinus

Resident Wat-man
Joined
Jan 28, 2012
Messages
12,093 (3.56/day)
Location
Concord, NH
System Name Apollo
Processor Intel Core i9 9880H
Memory 64GB DDR4-2667
Video Card(s) AMD Radeon Pro 5600M, 8GB HBM2
Storage 1TB Apple NVMe, 4TB External
Display(s) Laptop @ 3072x1920 + 2x LG 5k Ultrafine TB3 displays
Case MacBook Pro (16", 2019)
Audio Device(s) AirPods Pro, Sennheiser HD 380s w/ FIIO Alpen 2, or Logitech 2.1 Speakers
Power Supply 96w Power Adapter
Mouse Logitech MX Master 3
Keyboard Full Size Wireless Apple Magic Keyboard
Software MacOS 10.15.7
HBM2 is maybe slightly more latency to DDR4 / DDR5 and higher bandwidth. That makes it complicated to reason about. There will be many cases where HBM2 slows down code (any latency-critical code), while speeding up other cases (speeds up bandwidth-critical code). As far as I'm aware, no one has ever figured out a way to predict whether code is latency bound or bandwidth bound ahead of time.
That might not matter if the HBM is used as something like an eviction cache, or to hold on to cache state across a context switch. For example, if you can quickly reload the cache after a context switch, you could get better performance than waiting for the cache to be repopulated by the new task, or even by a task that was previously being processed on another core. Using HBM to preserve cache context would be a huge benefit for systems that do a lot of tasks in parallel, like application servers.
 
Joined
Jan 8, 2017
Messages
6,656 (4.19/day)
System Name Good enough
Processor AMD Ryzen R7 1700X - 4.0 Ghz / 1.350V
Motherboard ASRock B450M Pro4
Cooling Deepcool Gammaxx L240 V2
Memory 16GB - Corsair Vengeance LPX - 3333 Mhz CL16
Video Card(s) OEM Dell GTX 1080 with Kraken G12 + Water 3.0 Performer C
Storage 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) 4K Samsung TV
Case Deepcool Matrexx 70
Power Supply GPS-750C
And how much enterprise crap is single-threaded Python crunching a Django template and then piping it to a single-threaded home-grown Bash script?

Lol. Single-threaded, highly dependent data being passed around, turned into XML, into protobufs, back into XML, and then converted into JSON just to pass 50 bytes of text around. Yeaaaahhhhhhhh. Better companies may have good programmers who write more intelligent code. But seriously: most code out there (even "enterprise" code) is absolutely unoptimized single-threaded crap.
Despite that, these companies race each other to create platforms that offer as many cores per node as possible, because that matters more. Who cares if your single-threaded Python crap runs 5% slower if you can get 30% more instances running per socket, for example?
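A quick sanity check of that trade-off, with illustrative numbers of my own:

[CODE=c]
/* 30% more instances per socket at 5% lower per-instance speed
 * still wins handily on aggregate throughput (illustrative figures). */
#include <stdio.h>

int main(void) {
    double instances = 1.30;           /* +30% instances per socket */
    double per_instance_speed = 0.95;  /* -5% single-thread performance */
    printf("relative socket throughput: %.3f\n",
           instances * per_instance_speed);   /* ~1.235, i.e. ~23.5% more */
    return 0;
}
[/CODE]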

And whatever gets written that way is probably not performance-critical anyway.
 
Joined
Mar 6, 2018
Messages
60 (0.05/day)
To be fair, I don't think Intel has general efficiency in mind when they say that 10 has "improved". They just mean that it's changed enough to behave much better at higher clocks. 10nm probably won't ever not be "broken" by efficiency standards so long as it's paired with Sunny Cove derivatives.
  • Second-gen 10nm in Sunny Cove clocked poorly at the top end, which placed 4.0GHz generally out of reach.
  • Third-gen 10SF in Willow Cove only improved that frequency scaling at the top end, relevant only really for consumer parts.
  • 10ESF for Willow Cove might improve by the same small margin efficiency-wise as 10SF, just increasing the clock envelope further.
The process hasn't changed much in the 2.0-3.5GHz range, which is where Xeons reside. Neither has the arch changed much. So add more cores (and more chiplets) like Intel is doing here, and power's just gonna go up - no surprises there.
At least Willow Cove has its IPC increased by 15-20%.
 
Joined
Apr 24, 2020
Messages
926 (2.40/day)
I think you're being harsh on HBM latency; the chips are significantly closer and the latency wasn't that far out anyway.

HBM and DDR4/DDR5 latency are mainly dictated by the physics of DRAM. Being forced to precharge the bank, activate the row into the sense amps (RAS), and only then CAS the data out is just a lot of latency. And those steps must be done on any DRAM, be it HBM, DDR4/DDR5, or GDDR6X. It's a big reason why SRAM caches exist and why DRAM caches (even on-chip eDRAM caches) remain unpopular.

In fact, most DDR4 / HBM / GDDR6X "latency improvements" aren't actually improving latency at all. They're just hiding latency behind more and more parallel requests. See DDR4 banks and bank groups: a DDR4 chip has 16 banks (four bank groups of four, IIRC), so up to 16 independent PRE-RAS-CAS sequences in flight; DDR5 pushes that to 32, and it will only go up.

Whenever we start seeing eDRAM / DRAM as a caching layer (ex: XBox 360, Xeon Phi 7200), we suddenly get an influx of frustrated programmers who have to deal with the reality of the architecture. You're simply not going to get much better than the ~50ns or so of latency on DRAM. Too many steps need to be done per request at the fundamental physics layer.

SRAM (typical L3 / L2 / L1 caches) just doesn't have to deal with any of those steps at all. So adding those caches and/or making them bigger is simple and obvious.
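A rough back-of-the-envelope (my own numbers, not from any slide) shows why ~50 ns is about the floor no matter which DRAM flavour you pick:

[CODE=c]
/* A row-miss DRAM access pays precharge (tRP) + activate (tRCD) + read (CL).
 * Assuming common DDR4-3200 CL22-22-22 timings for illustration. */
#include <stdio.h>

int main(void) {
    const double tck_ns = 1000.0 / 1600.0;        /* DDR4-3200: 1600 MHz clock, 0.625 ns */
    const int tRP = 22, tRCD = 22, CL = 22;       /* typical JEDEC timings */
    double dram_ns = (tRP + tRCD + CL) * tck_ns;  /* ~41 ns inside the DRAM itself */
    /* Add controller queuing and on-die fabric and you land near ~50 ns,
       whether the underlying DRAM is DDR4, DDR5, HBM, or GDDR6X. */
    printf("row-miss latency inside the DRAM: %.1f ns\n", dram_ns);
    return 0;
}
[/CODE]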
 

Aquinus

Resident Wat-man
Joined
Jan 28, 2012
Messages
12,093 (3.56/day)
Location
Concord, NH
System Name Apollo
Processor Intel Core i9 9880H
Memory 64GB DDR4-2667
Video Card(s) AMD Radeon Pro 5600M, 8GB HBM2
Storage 1TB Apple NVMe, 4TB External
Display(s) Laptop @ 3072x1920 + 2x LG 5k Ultrafine TB3 displays
Case MacBook Pro (16", 2019)
Audio Device(s) AirPods Pro, Sennheiser HD 380s w/ FIIO Alpen 2, or Logitech 2.1 Speakers
Power Supply 96w Power Adapter
Mouse Logitech MX Master 3
Keyboard Full Size Wireless Apple Magic Keyboard
Software MacOS 10.15.7
SRAM (typical L3 / L2 / L1 caches) just doesn't have to deal with any of those steps at all. So adding those caches and/or making them bigger is simple and obvious.
Except they make the die bigger and require more power for the same capacity. It's simpler, but it's also more expensive in several regards.
 
Joined
Mar 10, 2010
Messages
8,787 (2.15/day)
Location
Manchester uk
System Name RyzenGtEvo/ Asus strix scar II
Processor Amd R7 3800X@4.350/525/ Intel 8750H
Motherboard Crosshair hero7 @bios 2703/?
Cooling 360EK extreme rad+ 360$EK slim all push, cpu Monoblock Gpu full cover all EK
Memory Corsair Vengeance Rgb pro 3600cas14 16Gb in two sticks./16Gb
Video Card(s) Sapphire refference Rx vega 64 EK waterblocked/Rtx 2060
Storage Silicon power qlc nvmex3 in raid 0/8Tb external/1Tb samsung Evo nvme 2Tb sata ssd
Display(s) Samsung UAE28"850R 4k freesync, LG 49" 4K 60hz ,Oculus
Case Lianli p0-11 dynamic
Audio Device(s) Xfi creative 7.1 on board ,Yamaha dts av setup, corsair void pro headset
Power Supply corsair 1200Hxi
Mouse Roccat Kova/ Logitech G wireless
Keyboard Roccat Aimo 120
Software Win 10 Pro
Benchmark Scores 8726 vega 3dmark timespy/ laptop Timespy 6506
HBM and DDR4/DDR5 latency are mainly dictated by the physics of DRAM. Being forced to precharge the bank, activate the row into the sense amps (RAS), and only then CAS the data out is just a lot of latency. And those steps must be done on any DRAM, be it HBM, DDR4/DDR5, or GDDR6X. It's a big reason why SRAM caches exist and why DRAM caches (even on-chip eDRAM caches) remain unpopular.

In fact, most DDR4 / HBM / GDDR6X "latency improvements" aren't actually improving latency at all. They're just hiding latency behind more and more parallel requests. See DDR4 banks and bank groups: a DDR4 chip has 16 banks (four bank groups of four, IIRC), so up to 16 independent PRE-RAS-CAS sequences in flight; DDR5 pushes that to 32, and it will only go up.

Whenever we start seeing eDRAM / DRAM as a caching layer (ex: XBox 360, Xeon Phi 7200), we suddenly get an influx of frustrated programmers who have to deal with the reality of the architecture. You're simply not going to get much better than the ~50ns or so of latency on DRAM. Too many steps need to be done per request at the fundamental physics layer.

SRAM (typical L3 / L2 / L1 caches) just doesn't have to deal with any of those steps at all. So adding those caches and/or making them bigger is simple and obvious.
Look, the server industry these chips are aimed at can afford the effort, or they won't buy, or oneAPI isn't shit; there are many variables, and more yet to be disclosed, that could influence the viability of these chips. And I read about all of that way back too, when they were released. I'm aware of how the hardware works and how it's made, or I wouldn't be commenting on it.
 