Apple Patents Multi-Level Hybrid Memory Subsystem

That lifestyle company can't be besting Intel... ;)
 
Update 21:14 UTC: Mr. Kerry Creeron, an attorney with the firm of Banner & Witcoff, reached out to us with additional insights about the patent. Mr. Creeron shared his personal commentary on it, and you can find his quote below.
I suspect that this patent, if approved, will in short order be contested and invalidated. Memory schemes like this have been in use for decades, and Apple's very minor "spin" on the concept is not enough for a patent to withstand critical scrutiny. This is Apple literally trying to be a patent troll.
 
Just put a 3D stacked DRAM chip underneath the CPU socket, in the center, tied to the socket and CPU directly.

I suspect that this patent, if approved, will in short order be contested and invalidated. Memory schemes like this have been in use for decades, and Apple's very minor "spin" on the concept is not enough for a patent to withstand critical scrutiny. This is Apple literally trying to be a patent troll.

I agree that this type of patent is bad for a level playing field of open competition. We've seen how this works with RAMBUS already; it doesn't benefit consumers.
 
Just put a 3D stacked DRAM chip underneath the CPU socket, in the center, tied to the socket and CPU directly.
DRAM can't be put on the interposer for the CPU if it's under the CPU. It would have to be mounted to the PCB under the interposer and would make for a very complicated PCB design. I wouldn't envy that engineer's task. It's also not that much closer to the CPU compared to putting it next to it like with M1's system memory. There are a lot of cons and not a lot of benefits.
 
That's probably true and valid, but things are shrinking, so it'll get easier to place it there in due time. Also, it's not meant to replace system memory, more to provide a quicker buffer in front of it. I was speaking about wiring it under the motherboard socket, as opposed to underneath the middle of the CPU's PCB. A PCIe-wired microSD card slot would be neat there as well. Consider that 3D stacked with 2TB of storage and PCIe 4.0 x16 wiring to it. If they could pull that off it would be rather amazing.
 
I suspect that this patent, if approved, will in short order be contested and invalidated. Memory schemes like this have been in use for decades, and Apple's very minor "spin" on the concept is not enough for a patent to withstand critical scrutiny. This is Apple literally trying to be a patent troll.
So, Apple is being accused of having another baseless patent, like the one on rounded corners?
 
I would have thought moving to stacked HBM would be a much better option and would resolve these issues. Apple could easily have 128 GB or more of unified memory on the package at around 4 TB/s of bandwidth (maybe 8 TB/s), then have a PCIe 5.0 memory interface to ultra-fast SSDs, which could offer terabytes of memory accessed faster than your average DDR4!
 
I would have thought moving to stacked HBM would be a much better option and would resolve these issues. Apple could easily have 128 GB or more of unified memory on the package at around 4 TB/s of bandwidth (maybe 8 TB/s), then have a PCIe 5.0 memory interface to ultra-fast SSDs, which could offer terabytes of memory accessed faster than your average DDR4!
The complication with HBM is that it's a slow but wide interface. There is a benefit to fast random access as opposed to relatively slow bulk access, depending on the workload. You would need a big cache and a good caching strategy to compensate for it. You also have to consider that most of the data you want probably isn't in the same consecutive 1k, 2k, or 4k region that you're reading from or writing to, so while the maximum theoretical bandwidth is really nice, it's really unlikely that you'd saturate it because of the nature of the memory requests that CPUs tend to make compared to GPUs. With that said, even a fraction of HBM's speed could keep up with traditional DRAM, so maybe it's not as big of a problem as I think it could be.
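A rough back-of-envelope way to see the point above (all of these numbers are made-up assumptions for illustration, not the specs of any real part):

Code:
# Why a wide-but-slow interface shines on streaming but not on scattered,
# CPU-style accesses. Peak rates and latencies are illustrative only.
def effective_bandwidth(peak_gbs, access_bytes, access_latency_ns):
    # 1 GB/s is 1 byte per nanosecond, so peak_gbs also works as bytes/ns
    transfer_ns = access_bytes / peak_gbs
    return access_bytes / (access_latency_ns + transfer_ns)  # bytes/ns == GB/s

wide_slow = dict(peak_gbs=400, access_latency_ns=120)   # hypothetical HBM-like stack
narrow_fast = dict(peak_gbs=50, access_latency_ns=80)    # hypothetical DDR-like channel

for size in (64, 4096, 65536):  # one cache line, one page, a long streaming burst
    a = effective_bandwidth(wide_slow["peak_gbs"], size, wide_slow["access_latency_ns"])
    b = effective_bandwidth(narrow_fast["peak_gbs"], size, narrow_fast["access_latency_ns"])
    print(f"{size:>6} B accesses: wide/slow ~{a:6.1f} GB/s, narrow/fast ~{b:6.1f} GB/s")

With 64-byte, cache-line-sized accesses the latency dominates and the wide interface barely helps; only the big streaming transfers get anywhere near its peak.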
 
I would have thought moving to stacked HBM would be a much better option and would resolve these issues. Apple could easily have 128 GB or more of unified memory on the package at around 4 TB/s of bandwidth (maybe 8 TB/s), then have a PCIe 5.0 memory interface to ultra-fast SSDs, which could offer terabytes of memory accessed faster than your average DDR4!
You have just described the 2024 Mac Pro. Starting at $15,000. If anyone develops HBM4 with much reduced latency/faster random access by then, that is.

The complication with HBM is that it's a slow but wide interface. There is a benefit to fast random access as opposed to relatively slow bulk access, depending on the workload. You would need a big cache and a good caching strategy to compensate for it. You also have to consider that most of the data you want probably isn't in the same consecutive 1k, 2k, or 4k region that you're reading from or writing to, so while the maximum theoretical bandwidth is really nice, it's really unlikely that you'd saturate it because of the nature of the memory requests that CPUs tend to make compared to GPUs. With that said, even a fraction of HBM's speed could keep up with traditional DRAM, so maybe it's not as big of a problem as I think it could be.
Yeah, you explained it nicely.
On the other hand, Apple could also make "traditional" DRAM wider. The M1 apparently has all the DRAM in two packages, and a more powerful processor could have four or eight close to the processor die.
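For a sense of scale, here's what "wider" buys you if you just multiply up the commonly reported M1 figures (128-bit LPDDR4X-4266 across two packages); treat these as ballpark estimates, not official specs:

Code:
# Ballpark bandwidth scaling for "just make the DRAM interface wider".
# Assumes the widely reported M1 configuration; figures are estimates.
def bandwidth_gbs(bus_width_bits, transfer_rate_mtps):
    return bus_width_bits / 8 * transfer_rate_mtps / 1000  # GB/s

print(bandwidth_gbs(128, 4266))  # two packages, ~68 GB/s (roughly the M1 as reported)
print(bandwidth_gbs(256, 4266))  # four packages, ~137 GB/s
print(bandwidth_gbs(512, 4266))  # eight packages, ~273 GB/s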
 
You have just described the 2024 Mac Pro. Starting at $15,000. If anyone develops HBM4 with much reduced latency/faster random access by then, that is.


Yeah, you explained it nicely.
On the other hand, Apple could also make "traditional" DRAM wider. The M1 apparently has all the DRAM in two packages, and a more powerful processor could have four or eight close to the processor die.
HBM is efficient because the transistors aren't switched as fast. You lose that advantage if you try to drive it as fast as traditional DRAM. I don't think people really realize how much more power higher switching frequencies require.
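To put some very rough numbers on that, first-order dynamic switching power scales as C·V²·f, and higher clocks usually need a voltage bump too (all values below are purely illustrative assumptions):

Code:
# First-order dynamic power model: P ~ C * V^2 * f.
# Ignores static, refresh, and termination power; numbers are made up.
def dynamic_power_watts(cap_farads, volts, freq_hz):
    return cap_farads * volts ** 2 * freq_hz

slow_wide = dynamic_power_watts(1e-9, 1.2, 1.0e9)     # slower, HBM-style I/O clock
fast_narrow = dynamic_power_watts(1e-9, 1.35, 1.8e9)  # ~1.8x the clock plus a voltage bump
print(f"~{fast_narrow / slow_wide:.1f}x the switching power")  # roughly 2.3x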
 
The complication with HBM is that it's a slow but wide interface. There is a benefit to fast random access as opposed to relatively slow bulk access, depending on the workload. You would need a big cache and a good caching strategy to compensate for it. You also have to consider that most of the data you want probably isn't in the same consecutive 1k, 2k, or 4k region that you're reading from or writing to, so while the maximum theoretical bandwidth is really nice, it's really unlikely that you'd saturate it because of the nature of the memory requests that CPUs tend to make compared to GPUs. With that said, even a fraction of HBM's speed could keep up with traditional DRAM, so maybe it's not as big of a problem as I think it could be.
Exactly. Caches solve the latency problem, not the bandwidth problem. HBM is the opposite of that.
 
Exactly. Caches solve the latency problem, not the bandwidth problem. HBM is the opposite of that.
Apple's "Cache DRAM" can only solve that problem if it's a special low-latency type of dynamic RAM. I don't know if anything like that is available, however, Intel's Crystal Well apparently was such a chip, with a latency of ~30 ns in addition to great bandwidth (measured by Anand).
 
Exactly. Caches solve the latency problem, not the bandwidth problem. HBM is the opposite of that.
Well, cache solves both the bandwidth and latency problems. It doesn't solve the capacity problem. HBM solves the capacity and bandwidth problems. It's not great on latency, but that's a problem that can be solved, or at the very least, mitigated.

Let me put it another way. HBM2 is the reason why my MacBook Pro is silent with two 5K displays plugged into it. All the other GDDR models would have the fan whirring away due to the memory being clocked up to drive them. That's heat and power that you can't afford on a mobile device. It's also how they could cram 40 CUs onto the Radeon Pro 5600M and stay within the 50 W power envelope, all while having almost 400 GB/s of max theoretical bandwidth. You can't tell me that's not an advantage.
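For what it's worth, the "almost 400 GB/s" figure falls out of the commonly cited 5600M memory specs (a 2048-bit HBM2 interface at roughly 1.54 Gbps per pin), assuming those numbers are right:

Code:
# Back-of-envelope check on the Radeon Pro 5600M's HBM2 bandwidth,
# assuming the commonly cited 2048-bit bus and ~1.54 Gbps/pin data rate.
bus_width_bits = 2048
data_rate_gbps_per_pin = 1.54
print(f"~{bus_width_bits * data_rate_gbps_per_pin / 8:.0f} GB/s")  # ~394 GB/s

The notable part is how low that per-pin rate is compared to GDDR6, which is exactly where the power savings come from.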
 
Well, cache solves both the bandwidth and latency problems. It doesn't solve the capacity problem. HBM solves the capacity and bandwidth problems.
And then Apple solves the HBM cost problem by putting the retail price somewhere in geosynchronous orbit. Voilà, problems gone.
 
And then Apple solves the HBM cost problem by putting the retail price somewhere in geosynchronous orbit. Voilà, problems gone.
Truth. I sold a kidney to afford my MacBook Pro. :laugh:
 
Well, cache solves both the bandwidth and latency problems. It doesn't solve the capacity problem. HBM solves the capacity and bandwidth problems.
Ok, that's more accurate than what I said.
It's not great on latency, but that's a problem that can be solved, or at the very least, mitigated.
I'm not so sure HBM's latency can be as low as required for use in a cache system.
Latency doesn't seem to move much at all. At least that's what happened with DDR.
 
I'm not so sure HBM's latency can be as low as required for use in a cache system.
Latency doesn't seem to move much at all. At least that's what happened with DDR.
Well, HBM does have a latency penalty, but it makes up for that through its ability to burst a lot of data, and because it's split into several channels, you can queue up a lot of memory requests and get data back rapid-fire. So while there is overhead involved, it might not actually be that bad depending on how much data you need to pull at once. Think about it: AMD beefed up the size of its last level of cache with the latest Zen chips. Why would they do that? The answer is simple: an off-die I/O chiplet introduces latency, and you need a way to buffer that latency. Depending on the caching strategy, that last level of cache might get a ton of hits, and the more hits you get, the more insulated you are from the latency cost.

You also have to consider what Apple is doing. This level in the memory hierarchy has to be able to support a GPU and AI circuitry as well. HBM is definitely well suited to those sorts of tasks, so all in all, it's probably a wash when it comes to latency. The real advantage comes from the memory bandwidth combined with relatively low power consumption and high memory density.
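One way to see how a big last-level cache insulates you from a slower backing store is the classic average-memory-access-time formula, AMAT = hit_time + miss_rate × miss_penalty (all latencies and hit rates below are illustrative assumptions):

Code:
# AMAT sketch: a large on-package cache in front of a higher-latency backing store.
def amat_ns(hit_ns, hit_rate, miss_penalty_ns):
    return hit_ns + (1 - hit_rate) * miss_penalty_ns

dram_ns, hbm_ns = 80, 110  # hypothetical miss penalties: plain DRAM vs. HBM

for hit_rate in (0.90, 0.95, 0.99):
    print(f"hit rate {hit_rate:.0%}: "
          f"DRAM-backed ~{amat_ns(10, hit_rate, dram_ns):.1f} ns, "
          f"HBM-backed ~{amat_ns(10, hit_rate, hbm_ns):.1f} ns")

The higher the hit rate, the smaller the gap between the two backing stores gets, which is the same logic behind AMD growing the last-level cache to hide the I/O-die hop.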
 