Thursday, December 31st 2020

Intel Confirms HBM is Supported on Sapphire Rapids Xeons

Intel has just released its "Architecture Instruction Set Extensions and Future Features Programming Reference" manual, which gives developers information about Intel's upcoming hardware additions that they can use later on. Today, thanks to @InstLatX64 on Twitter, we have information that Intel is bringing an on-package High Bandwidth Memory (HBM) solution to its next-generation Sapphire Rapids Xeon processors. Specifically, the manual mentions two error codes: 0220H - HBM command/address parity error and 0221H - HBM data parity error. Both codes exist to flag parity errors in HBM, so that the CPU operates on correct data.
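For readers who want to handle the two codes programmatically, here is a minimal sketch of a lookup. The code values and descriptions are taken from the paragraph above; the table name, the helper names, and the way the code is passed in are hypothetical and do not come from Intel's manual.

```python
# Descriptions copied from the article; everything else here is a hypothetical helper.
HBM_ERROR_CODES = {
    0x0220: "HBM command/address parity error",
    0x0221: "HBM data parity error",
}


def parse_intel_hex(text: str) -> int:
    """Convert Intel-manual style hex notation such as '0220H' to an integer."""
    cleaned = text.strip().upper()
    if cleaned.endswith("H"):
        cleaned = cleaned[:-1]
    return int(cleaned, 16)


def describe(code_text: str) -> str:
    """Return the description for a code, or a placeholder if it is not in the table."""
    return HBM_ERROR_CODES.get(parse_intel_hex(code_text), "unknown code")


print(describe("0220H"))
print(describe("0221H"))
```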

The addition of HBM is just one of the many new technologies Sapphire Rapids brings. The platform is also expected to bring an eight-channel DDR5 memory controller complemented by Intel's Data Streaming Accelerator (DSA). To connect to external accelerators, the platform uses the PCIe 5.0 protocol paired with the CXL 1.1 standard to enable cache coherency in the system. As a reminder, this would not be the first time a server CPU has used HBM. Fujitsu developed the A64FX processor with 48 cores and HBM memory, and it powers today's most powerful supercomputer, Fugaku. That shows how much a processor can improve when faster memory is added on-package. We are waiting to see how Intel plays it out and what ends up on the market when Sapphire Rapids is delivered.
Source: @InstLatX64 on Twitter

14 Comments on Intel Confirms HBM is Supported on Sapphire Rapids Xeons

#1
londiste
I get that HBM has very nice upsides, but I am very afraid that soon CPUs will come with embedded memory, so no upgrades, and a premium can be charged for more memory.
Looks like HBM might be planned to be used as a sort of L4 cache, but still.
#2
ZoneDymo
londiste said: "I get that HBM has very nice upsides, but I am very afraid that soon CPUs will come with embedded memory, so no upgrades, and a premium can be charged for more memory.
Looks like HBM might be planned to be used as a sort of L4 cache, but still."
Personally I wonder if the HBM in that case can't be used as an extra in-between step.

AMD's new GPUs have that Infinity Cache, which is basically super fast memory, and then have standard GDDR6 next to that.
So why not have this HBM be the Intel version of that and still have memory next to it, just an extra step, just like RAM is an in-between for CPU and storage.

Heck maybe AMD in the future will have Infinity Cache > HBM > GDDR6
#3
londiste
Infinity Cache is just marketing - it is basically just L3 cache with a cool-sounding name.
But you are right though, like I said, I would expect HBM to become L4 cache for now.

Another layer of memory might bring some other possibilities to the table as well. XPoint DIMMs are what stands out as something that would gain from this.
#4
evernessince
ZoneDymo said: "Personally I wonder if the HBM in that case can't be used as an extra in-between step.

"AMD's new GPUs have that Infinity Cache, which is basically super fast memory, and then have standard GDDR6 next to that.
So why not have this HBM be the Intel version of that and still have memory next to it, just an extra step, just like RAM is an in-between for CPU and storage.

"Heck maybe AMD in the future will have Infinity Cache > HBM > GDDR6"
Infinity Cache = L3 Cache, it's been around on CPUs for a while. AMD just added it to their GPUs to reduce memory bandwidth requirements.

When you are adding HBM to a CPU via an interposer you are talking about considerably increasing the cost of manufacture and time to produce. AMD learned this with Vega.

Aside from a few professional scenarios, I don't really see how regular consumers would benefit from having both HBM and DDR. If you are having such a problem with cache misses (which neither AMD nor Intel do) that you need another layer of storage between the L4 and main memory, you'd be much wiser to improve the amount of cache your CPU has or tweak what it decides to store in cache. Cache is still vastly faster and of lower latency than HBM. HBM has much more bandwidth than DDR4 but consumer systems don't really need more bandwidth right now. Heck we are still using dual channel memory and you'd be hard pressed to find a game that actually benefits from quad channel.

The thing with AMD's L3 infinity cache is that it fixes a downside of their choice of memory. It doesn't go searching for a solution to a problem that doesn't exist.
#5
AnarchoPrimitiv
evernessince said: "Infinity Cache = L3 Cache, it's been around on CPUs for a while. AMD just added it to their GPUs to reduce memory bandwidth requirements.

"When you are adding HBM to a CPU via an interposer you are talking about considerably increasing the cost of manufacture and time to produce. AMD learned this with Vega.

"Aside from a few professional scenarios, I don't really see how regular consumers would benefit from having both HBM and DDR. If you are having such a problem with cache misses (which neither AMD nor Intel do) that you need another layer of storage between the L4 and main memory, you'd be much wiser to improve the amount of cache your CPU has or tweak what it decides to store in cache. Cache is still vastly faster and of lower latency than HBM. HBM has much more bandwidth than DDR4 but consumer systems don't really need more bandwidth right now. Heck we are still using dual channel memory and you'd be hard pressed to find a game that actually benefits from quad channel.

"The thing with AMD's L3 infinity cache is that it fixes a downside of their choice of memory. It doesn't go searching for a solution to a problem that doesn't exist."
HBM integrated into a powerful APU would be helpful, but market forces keep that from happening, as people would rather upgrade memory and CPUs/APUs separately.
#6
Caring1
Just because it can be supported doesn't mean it will be used, especially if Intel outsources chip production to partners making it optional.
#8
Vya Domus
londiste said: "I get that HBM has very nice upsides, but I am very afraid that soon CPUs will come with embedded memory, so no upgrades, and a premium can be charged for more memory.
Looks like HBM might be planned to be used as a sort of L4 cache, but still."
It's the only way to get around the ever-increasing gap between DRAM bandwidth and CPU throughput.
#9
Tech00
Vya Domus said: "It's the only way to get around the ever-increasing gap between DRAM bandwidth and CPU throughput."
Correct! Intel's Skylake/Cascade Lake generation of server CPUs is already bottlenecked not by the CPU's processing capability but by the memory subsystem. The memory system cannot keep up with the CPU and quickly becomes the bottleneck. The only way to improve significantly is to make the subsystem faster and lower latency, and a level 4 tier will help do some of that (DDR5 on its own is still not fast enough but will also help when combined with L4 HBM).
In other words: Level 4 cache HBM is the logical next step to feed the beast.
This will be expensive though, so I think this is going to be high-end Xeon server and workstation only. Unless Intel somehow figured out some new smart, cost-effective way to implement this... I can't see it though...
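One way to put numbers on the "memory system cannot keep up" point is the roofline model, where attainable throughput is the smaller of peak compute and memory bandwidth times arithmetic intensity. The sketch below is only an illustration: every figure in it is a made-up assumption, not a measurement of any Intel CPU.

```python
def attainable_gflops(peak_gflops: float, bandwidth_gbs: float, flops_per_byte: float) -> float:
    """Roofline model: throughput is capped either by compute or by memory traffic."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)


# All numbers below are illustrative assumptions, not real product specifications.
PEAK_GFLOPS = 2000.0        # hypothetical peak compute of a server CPU
DDR_BANDWIDTH_GBS = 130.0   # hypothetical multi-channel DDR bandwidth
HBM_BANDWIDTH_GBS = 1000.0  # hypothetical on-package HBM bandwidth
INTENSITY = 0.25            # flops per byte moved, typical of a streaming kernel

ddr_limit = attainable_gflops(PEAK_GFLOPS, DDR_BANDWIDTH_GBS, INTENSITY)
hbm_limit = attainable_gflops(PEAK_GFLOPS, HBM_BANDWIDTH_GBS, INTENSITY)

print(f"DDR-fed kernel: {ddr_limit} GFLOP/s of {PEAK_GFLOPS} peak")
print(f"HBM-fed kernel: {hbm_limit} GFLOP/s of {PEAK_GFLOPS} peak")
```

With these assumed numbers the kernel stays memory-bound in both cases, but the faster tier raises the ceiling considerably. A kernel with high arithmetic intensity would hit the compute peak instead and gain nothing from HBM.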
#10
TechLurker
Frankly, I'm surprised AMD wasn't the first to push out the embedded HBM concept with higher-end APUs, considering they've been trying to push HBM off and on. HBM would have been perfect for high-end APUs and would help relieve the memory bottleneck on Vega and RDNA-based APUs. For that matter, I wonder if future AMD mobos might have 2GB or 4GB of HBM3 embedded on the mobo chipset or even directly on the main lanes that connect the CPU to the GPU and first NVMe drive, serving as a sort of supplementary "Infinity Cache" to either or both the CPU and GPU as well as the NVMe. 2GB on a B-50 series and 4GB on an X-70 series could provide some benefit to iGPUs as well as dedicated GPUs, while also serving as extra cache for the CPU side should it need it more for certain tasks.
#11
evernessince
AnarchoPrimitiv said: "HBM integrated into a powerful APU would be helpful, but market forces keep that from happening, as people would rather upgrade memory and CPUs/APUs separately."
If you did create an APU with HBM, it would be priced to the point where it'd make more sense just to add a dGPU. Might be useful for professional applications but again, you'd need a significant amount of expensive HBM for those markets.
R0H1T said: "That's not a necessity especially when 3d stacking is already here ~
www.anandtech.com/show/16051/3dfabric-the-home-for-tsmc-2-5d-and-3d-stacking-roadmap"
You are replacing one expensive process with another. There are a lot of other concerns with vertical stacking as well that have to be taken into consideration during design.

First, the order of the stack is important. The CPU die essentially has to be at the bottom of the stack (assuming you still have IO and CPU cores together), as the die on the bottom will have the lowest latencies. The more layers in the stack, the higher the latency penalty for the dies at the top. At some point you'd certainly need to design an active interconnect so that the layers can communicate efficiently as well. If you just run dumb wires, routing between the layers is going to be suboptimal. The University of Toronto did a paper on the use of an active interposer for routing data between chiplets (same idea, only horizontal) and found that the more dies that are used, the greater the impact of using an active interposer. In principle, multi-chiplet designs (through either 3D stacking or otherwise) stand to benefit massively as they increase in complexity. By benefit I mean erase the latency penalty and, under ideal conditions, beat out monolithic designs.

Second, there's heat. If you put HBM on top, you are putting a barrier to heat transfer on your CPU die. Heat would first have to transfer through the HBM to get to the IHS. Aside from the potential for degradation of the HBM (which may be mitigable), you'd likely have to make performance compromises due to the thermal restrictions. If you are looking for maximum performance, going horizontal is far better. IMO 3D stacking is best used in conjunction with a horizontal interposer. You can split off low power components like HBM and IO and keep high power parts like CPU dies unstacked, all while retaining maximum performance and thermals without having a monstrous CPU size.

Last, any product has to be designed from the ground up for vertical stacking. Traces have to be made to properly connect the stacks and enable communication. There are likely many design considerations that need to be addressed.
#12
dragontamer5788
ZoneDymo said: "Personally I wonder if the HBM in that case can't be used as an extra in-between step.

"AMD's new GPUs have that Infinity Cache, which is basically super fast memory, and then have standard GDDR6 next to that.
So why not have this HBM be the Intel version of that and still have memory next to it, just an extra step, just like RAM is an in-between for CPU and storage.

"Heck maybe AMD in the future will have Infinity Cache > HBM > GDDR6"
Of course it "can" be used as an extra step, but I doubt it.

From a latency perspective, HBM has the same latency as any other DRAM (including DDR4), so you may win in bandwidth, but without a latency win... there's a huge chance you're just slowing things down. Xeon Phi had an HMC + DDR4 version (HMC was a stacked-RAM competitor to HBM), and that kind of architecture is really hard and non-obvious to optimize for. Latency-sensitive code would be better run off of DDR4 (which is cheaper, and therefore available in larger capacities). Bandwidth-sensitive code would prefer HBM.

As a programmer: it's very non-obvious whether you'll be latency-sensitive or bandwidth-sensitive. As a system engineer who combines multiple pieces of code, it is even less obvious... so configuring such a system is just too complicated in the real world.

----------

HBM-only would probably be the way to go. Unless someone figures out how to solve this complexity issue (or better predict latency vs bandwidth sensitive code).
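A rough way to see the latency-sensitive versus bandwidth-sensitive distinction is Little's law: sustained bandwidth is roughly the bytes in flight divided by memory latency, capped at the peak. The sketch below uses that estimate; the latencies, peak bandwidths, and number of outstanding misses are hypothetical values chosen only for illustration.

```python
CACHE_LINE_BYTES = 64  # common cache line size on x86 CPUs


def effective_bandwidth_gbs(outstanding_misses: int, latency_ns: float, peak_gbs: float) -> float:
    """Little's law estimate: bytes in flight divided by latency, capped at the peak.

    Bytes per nanosecond is numerically the same as GB/s.
    """
    in_flight = outstanding_misses * CACHE_LINE_BYTES / latency_ns
    return min(peak_gbs, in_flight)


# Hypothetical memory types: (latency in ns, peak bandwidth in GB/s).
MEMORIES = {"DDR": (90.0, 25.6), "HBM": (100.0, 256.0)}

for name, (latency, peak) in MEMORIES.items():
    # Pointer chasing: each load depends on the previous one, so one miss at a time.
    chase = effective_bandwidth_gbs(1, latency, peak)
    # Streaming: many independent loads in flight (64 is an assumed concurrency).
    stream = effective_bandwidth_gbs(64, latency, peak)
    print(f"{name}: pointer chasing {chase:.2f} GB/s, streaming {stream:.2f} GB/s")
```

In this model the dependent-load case is limited by latency alone, so the higher peak bandwidth of HBM does not help it, while the streaming case can exceed what the assumed DDR peak allows.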
#13
londiste
The very wide bus of HBM should allow high bandwidth without the latency losses incurred by data doubling and prefetching that affects latency on DDR4/5?
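For reference, the bandwidth side of the wide-bus argument is simple arithmetic: peak bandwidth is the bus width in bytes times the transfer rate. The sketch below applies it to one DDR4-3200 channel (64-bit) and one HBM2 stack (1024-bit at 2000 MT/s per pin); the function name is made up for this example, and the result says nothing about latency.

```python
def peak_bandwidth_gbs(bus_width_bits: int, transfer_rate_mts: float) -> float:
    """Peak bandwidth in GB/s: bus width in bytes times millions of transfers per second."""
    return bus_width_bits / 8 * transfer_rate_mts / 1000


ddr4_channel = peak_bandwidth_gbs(64, 3200)   # one DDR4-3200 channel
hbm2_stack = peak_bandwidth_gbs(1024, 2000)   # one HBM2 stack at 2000 MT/s per pin

print(f"DDR4-3200 channel: {ddr4_channel:.1f} GB/s")
print(f"HBM2 stack:        {hbm2_stack:.1f} GB/s")
```

HBM reaches its bandwidth through width rather than a high per-pin rate, which is the point the question above is getting at.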
#14
dragontamer5788
londiste said: "The very wide bus of HBM should allow high bandwidth without the latency losses incurred by data doubling and prefetching that affects latency on DDR4/5?"
The reason prefetching etc. exists is that most of the latency comes from the DRAM cell itself. It doesn't matter if you're using HBM, DDR4, DDR5, or GDDR6X, they are all using DRAM cells with significant amounts of latency.

If you do DDR4 -> HBM -> Cache, it means you're now incurring two DRAM latencies per read/write, instead of one. A more reasonable architecture is DDR4 -> Cache + HBM->Cache, splitting the two up. However, that architecture is very difficult to program. As such, the most reasonable in practice is HBM->Cache (and avoiding the use of DDR4 / DDR5).

Unless Intel wants another Xeon Phi I guess...
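The "two DRAM latencies" argument can be sketched as a small expected-latency model. This is a simplification that assumes the HBM lookup is fully serialized before the DDR access, and the latency figures and hit rates are hypothetical.

```python
def flat_latency_ns(ddr_ns: float) -> float:
    """Cache miss served directly from DDR: a single DRAM access."""
    return ddr_ns


def tiered_latency_ns(hbm_ns: float, ddr_ns: float, hbm_hit_rate: float) -> float:
    """Cache miss goes to HBM first; on an HBM miss it continues on to DDR.

    Assumes the HBM lookup completes before the DDR access starts.
    """
    return hbm_ns + (1.0 - hbm_hit_rate) * ddr_ns


HBM_NS = 80.0  # hypothetical latency, deliberately equal to DDR
DDR_NS = 80.0  # hypothetical latency

print(f"DDR only: {flat_latency_ns(DDR_NS):.0f} ns per cache miss")
for hit_rate in (0.5, 0.9, 1.0):
    tiered = tiered_latency_ns(HBM_NS, DDR_NS, hit_rate)
    print(f"DDR -> HBM -> cache, HBM hit rate {hit_rate:.0%}: {tiered:.0f} ns per cache miss")
```

When both memories have similar latency, the tiered layout in this model can at best match the flat one, and only if every access hits in HBM. That is the case against using HBM as a pure extra step when bandwidth is not the limit.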
Copyright © 2004-2021 www.techpowerup.com. All rights reserved.
All trademarks used are properties of their respective owners.