Apple Patents Multi-Level Hybrid Memory Subsystem

That lifestyle company can't be besting Intel... ;)
 
Update 21:14 UTC: Mr. Kerry Creeron, an attorney with the firm of Banner & Witcoff, reached out to us with additional insights about the patent. Mr. Creeron shared his personal commentary on it, and you can find his quote below.
I suspect that this patent, if approved, will in short order be contested and invalidated. Memory schemes like this have been in use for decades, and Apple's very minor "spin" on the concept is not enough for a patent to withstand critical scrutiny. This is Apple literally trying to be a patent troll.
 
Just put a 3D stacked DRAM chip underneath the CPU socket, in the center, tied to the socket and CPU directly.

I suspect that this patent, if approved, will in short order be contested and invalidated. Memory schemes like this have been in use for decades, and Apple's very minor "spin" on the concept is not enough for a patent to withstand critical scrutiny. This is Apple literally trying to be a patent troll.

I agree that this type of patent is bad for a level playing field of open competition. We've seen how this works with RAMBUS already; it doesn't benefit consumers.
 
Just put a 3D stacked DRAM chip underneath the CPU socket, in the center, tied to the socket and CPU directly.
DRAM can't be put on the interposer for the CPU if it's under the CPU. It would have to be mounted to the PCB under the interposer and would make for a very complicated PCB design. I wouldn't envy that engineer's task. It's also not that much closer to the CPU compared to putting it next to it like with M1's system memory. There are a lot of cons and not a lot of benefits.
 
That's probably true and valid, but things are shrinking, so it'll get easier to place it there in due time. Also, it's not meant to replace system memory, more to provide a quicker buffer in front of it. I was speaking about wiring it under the motherboard socket, as opposed to underneath the middle of the CPU's PCB. A PCIe-wired microSD card slot would be neat there as well. Consider that 3D stacked with 2TB of storage and PCIe 4.0 x16 wiring to it. If they could pull that off it would be rather amazing.
 
I suspect that this patent, if approved, will in short order be contested and invalidated. Memory schemes like this have been in use for decades, and Apple's very minor "spin" on the concept is not enough for a patent to withstand critical scrutiny. This is Apple literally trying to be a patent troll.
So, Apple is being accused of having another baseless patent, like the one on rounded corners?
 
I would have thought moving to stacked HBM would be a much better option and would resolve these issues. Apple could easily have 128 GB or more of unified memory on the package at around 4 TB/s of bandwidth (maybe 8 TB/s), then have a PCIe 5.0 memory interface to ultra-fast SSDs, which could offer terabytes of memory accessed faster than your average DDR4!
 
I would have thought moving to stacked HBM would be a much better option and would resolve these issues. Apple could easily have 128 GB or more of unified memory on the package at around 4 TB/s of bandwidth (maybe 8 TB/s), then have a PCIe 5.0 memory interface to ultra-fast SSDs, which could offer terabytes of memory accessed faster than your average DDR4!
The complication with HBM is that it's a slow but wide interface. There is a benefit to fast random access as opposed to relatively slow bulk access, depending on the workload. You would need a big cache and a good caching strategy to compensate for it. You also have to consider that most of the data you want probably isn't in the same consecutive 1k, 2k, or 4k region that you're reading from or writing to, so while the maximum theoretical bandwidth is really nice, it's really unlikely that you'd saturate it because of the nature of the memory requests that CPUs tend to make compared to GPUs. With that said, even a fraction of HBM's speed could keep up with traditional DRAM, so maybe it's not as big of a problem as I think it could be.
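A rough back-of-envelope way to see the point above (all of these numbers are made-up assumptions for illustration, not the specs of any real part):

Code:
# Why a wide-but-slow interface shines on streaming but not on scattered,
# CPU-style accesses. Peak rates and latencies are illustrative only.
def effective_bandwidth(peak_gbs, access_bytes, access_latency_ns):
    # 1 GB/s is 1 byte per nanosecond, so peak_gbs also works as bytes/ns
    transfer_ns = access_bytes / peak_gbs
    return access_bytes / (access_latency_ns + transfer_ns)  # bytes/ns == GB/s

wide_slow = dict(peak_gbs=400, access_latency_ns=120)   # hypothetical HBM-like stack
narrow_fast = dict(peak_gbs=50, access_latency_ns=80)    # hypothetical DDR-like channel

for size in (64, 4096, 65536):  # one cache line, one page, a long streaming burst
    a = effective_bandwidth(wide_slow["peak_gbs"], size, wide_slow["access_latency_ns"])
    b = effective_bandwidth(narrow_fast["peak_gbs"], size, narrow_fast["access_latency_ns"])
    print(f"{size:>6} B accesses: wide/slow ~{a:6.1f} GB/s, narrow/fast ~{b:6.1f} GB/s")

With 64-byte, cache-line-sized accesses the latency dominates and the wide interface barely helps; only the big streaming transfers get anywhere near its peak.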
 
I would have thought moving to stacked HBM would be a much better option and would resolve these issues. Apple could easily have 128 GB or more of unified memory on the package at around 4 TB/s of bandwidth (maybe 8 TB/s), then have a PCIe 5.0 memory interface to ultra-fast SSDs, which could offer terabytes of memory accessed faster than your average DDR4!
You have just described the 2024 Mac Pro. Starting at $15,000. If anyone develops HBM4 with much reduced latency/faster random access by then, that is.

The complication with HBM is that it's a slow but wide interface. There is a benefit to fast random access as opposed to relatively slow bulk access, depending on the workload. You would need a big cache and a good caching strategy to compensate for it. You also have to consider that most of the data you want probably isn't in the same consecutive 1k, 2k, or 4k region that you're reading from or writing to, so while the maximum theoretical bandwidth is really nice, it's really unlikely that you'd saturate it because of the nature of the memory requests that CPUs tend to make compared to GPUs. With that said, even a fraction of HBM's speed could keep up with traditional DRAM, so maybe it's not as big of a problem as I think it could be.
Yeah, you explained it nicely.
On the other hand, Apple could also make "traditional" DRAM wider. The M1 apparently has all the DRAM in two packages, and a more powerful processor could have four or eight close to the processor die.
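For a sense of scale, here's what "wider" buys you if you just multiply up the commonly reported M1 figures (128-bit LPDDR4X-4266 across two packages); treat these as ballpark estimates, not official specs:

Code:
# Ballpark bandwidth scaling for "just make the DRAM interface wider".
# Assumes the widely reported M1 configuration; figures are estimates.
def bandwidth_gbs(bus_width_bits, transfer_rate_mtps):
    return bus_width_bits / 8 * transfer_rate_mtps / 1000  # GB/s

print(bandwidth_gbs(128, 4266))  # two packages, ~68 GB/s (roughly the M1 as reported)
print(bandwidth_gbs(256, 4266))  # four packages, ~137 GB/s
print(bandwidth_gbs(512, 4266))  # eight packages, ~273 GB/s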
 
You have just described the 2024 Mac Pro. Starting at $15,000. If anyone develops HBM4 with much reduced latency/faster random access by then, that is.


Yeah, you explained it nicely.
On the other hand, Apple could also make "traditional" DRAM wider. The M1 apparently has all the DRAM in two packages, and a more powerful processor could have four or eight close to the processor die.
HBM is efficient because the transistors aren't switched as fast. You lose that advantage if you try to drive it as fast as traditional DRAM. I don't think people really realize how much more power higher switching frequencies require.
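To put some very rough numbers on that, first-order dynamic switching power scales as C·V²·f, and higher clocks usually need a voltage bump too (all values below are purely illustrative assumptions):

Code:
# First-order dynamic power model: P ~ C * V^2 * f.
# Ignores static, refresh, and termination power; numbers are made up.
def dynamic_power_watts(cap_farads, volts, freq_hz):
    return cap_farads * volts ** 2 * freq_hz

slow_wide = dynamic_power_watts(1e-9, 1.2, 1.0e9)     # slower, HBM-style I/O clock
fast_narrow = dynamic_power_watts(1e-9, 1.35, 1.8e9)  # ~1.8x the clock plus a voltage bump
print(f"~{fast_narrow / slow_wide:.1f}x the switching power")  # roughly 2.3x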
 
The complication with HBM is that it's a slow but wide interface. There is a benefit to fast random access as opposed to relatively slow bulk access, depending on the workload. You would need a big cache and a good caching strategy to compensate for it. You also have to consider that most of the data you want probably isn't in the same consecutive 1k, 2k, or 4k region that you're reading from or writing to, so while the maximum theoretical bandwidth is really nice, it's really unlikely that you'd saturate it because of the nature of the memory requests that CPUs tend to make compared to GPUs. With that said, even a fraction of HBM's speed could keep up with traditional DRAM, so maybe it's not as big of a problem as I think it could be.
Exactly. Caches solve the latency problem, not the bandwidth problem. HBM is the opposite of that.
 
Exactly. Caches solve the latency problem, not the bandwidth problem. HBM is the opposite of that.
Apple's "Cache DRAM" can only solve that problem if it's a special low-latency type of dynamic RAM. I don't know if anything like that is available, however, Intel's Crystal Well apparently was such a chip, with a latency of ~30 ns in addition to great bandwidth (measured by Anand).
 
Exactly. Caches solve the latency problem, not the bandwidth problem. HBM is the opposite of that.
Well, cache solves both the bandwidth and latency problems. It doesn't solve the capacity problem. HBM solves the capacity and bandwidth problems. It's not great on latency, but that's a problem that can be solved, or at the very least, mitigated.

Let me put it another way. HBM2 is the reason why my MacBook Pro is silent with two 5K displays plugged into it. All the other GDDR models would have the fan whirring away due to the memory being clocked up to drive them. That's heat and power that you can't afford on a mobile device. It's also how they could cram 40 CUs onto the Radeon Pro 5600M and stay within the 50 W power envelope, all while having almost 400 GB/s of max theoretical bandwidth. You can't tell me that's not an advantage.
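For what it's worth, the "almost 400 GB/s" figure falls out of the commonly cited 5600M memory specs (a 2048-bit HBM2 interface at roughly 1.54 Gbps per pin), assuming those numbers are right:

Code:
# Back-of-envelope check on the Radeon Pro 5600M's HBM2 bandwidth,
# assuming the commonly cited 2048-bit bus and ~1.54 Gbps/pin data rate.
bus_width_bits = 2048
data_rate_gbps_per_pin = 1.54
print(f"~{bus_width_bits * data_rate_gbps_per_pin / 8:.0f} GB/s")  # ~394 GB/s

The notable part is how low that per-pin rate is compared to GDDR6, which is exactly where the power savings come from.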
 
Well, cache solves both the bandwidth and latency problems. It doesn't solve the capacity problem. HBM solves the capacity and bandwidth problems.
And then Apple solves the HBM cost problem by putting the retail price somewhere in geosynchronous orbit. Voilà, problems gone.
 
And then Apple solves the HBM cost problem by putting the retail price somewhere in geosynchronous orbit. Voilà, problems gone.
Truth. I sold a kidney to afford my MacBook Pro. :laugh:
 
Well, cache solves both the bandwidth and latency problems. It doesn't solve the capacity problem. HBM solves the capacity and bandwidth problems.
Ok, that's more accurate than what I said.
It's not great on latency, but that's a problem that can be solved, or at the very least, mitigated.
I'm not so sure HBM's latency can be as low as required for use in a cache system.
Latency doesn't seem to move much at all. At least that's what happened with DDR.
 
I'm not so sure HBM's latency can be as low as required for use in a cache system.
Latency doesn't seem to move much at all. At least that's what happened with DDR.
Well, HBM does have a latency penalty, but it makes up for that through its ability to burst a lot of data, and because it's split into several channels, you can queue up a lot of memory requests and get data back rapid-fire. So while there is overhead involved, it might not actually be that bad depending on how much data you need to pull at once. Think about it: AMD beefed up the size of its last level of cache with the latest Zen chips. Why would they do that? The answer is simple: an off-die I/O chiplet introduces latency, and you need a way to buffer that latency. Depending on the caching strategy, that last level of cache might get a ton of hits, and the more hits you get, the more insulated you are from the latency cost.

You also have to consider what Apple is doing. This level in the memory hierarchy has to be able to support a GPU and AI circuitry as well. HBM is definitely well suited to those sorts of tasks, so all in all, it's probably a wash when it comes to latency. The real advantage comes from the memory bandwidth combined with relatively low power consumption and high memory density.
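One way to see how a big last-level cache insulates you from a slower backing store is the classic average-memory-access-time formula, AMAT = hit_time + miss_rate × miss_penalty (all latencies and hit rates below are illustrative assumptions):

Code:
# AMAT sketch: a large on-package cache in front of a higher-latency backing store.
def amat_ns(hit_ns, hit_rate, miss_penalty_ns):
    return hit_ns + (1 - hit_rate) * miss_penalty_ns

dram_ns, hbm_ns = 80, 110  # hypothetical miss penalties: plain DRAM vs. HBM

for hit_rate in (0.90, 0.95, 0.99):
    print(f"hit rate {hit_rate:.0%}: "
          f"DRAM-backed ~{amat_ns(10, hit_rate, dram_ns):.1f} ns, "
          f"HBM-backed ~{amat_ns(10, hit_rate, hbm_ns):.1f} ns")

The higher the hit rate, the smaller the gap between the two backing stores gets, which is the same logic behind AMD growing the last-level cache to hide the I/O-die hop.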
 