
Apple Patents Multi-Level Hybrid Memory Subsystem

AleksandarK

News Editor
Staff member
Apple has today patented a new approach to how it uses memory in its System-on-Chip (SoC) subsystem. With the announcement of the M1 processor, Apple moved away from the traditional Intel-supplied chips and transitioned to a fully custom SoC design called Apple Silicon. The new designs have to integrate every component, such as the Arm CPU cores and a custom GPU. Both of these processors need good memory access, and Apple has worked out a solution to the problem of having the CPU and the GPU access the same pool of memory. The so-called UMA (unified memory access) represents a bottleneck, because both processors share the bandwidth and the total memory capacity, which could leave one processor starving in some scenarios.

Apple has patented a design that aims to solve this problem by combining high-bandwidth cache DRAM with high-capacity main DRAM. "With two types of DRAM forming the memory system, one of which may be optimized for bandwidth and the other of which may be optimized for capacity, the goals of bandwidth increase and capacity increase may both be realized, in some embodiments," says the patent, "to implement energy efficiency improvements, which may provide a highly energy-efficient memory solution that is also high performance and high bandwidth." The patent was filed back in 2016, which means we could start seeing this technology in future Apple Silicon designs following the M1 chip.
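As a rough illustration of why splitting memory into a bandwidth-optimized tier and a capacity-optimized tier can help, here is a toy Python model of the effective bandwidth of such a two-tier system. All names and numbers below are invented for illustration; the patent does not specify figures like these.

```python
# Toy model: effective bandwidth of a two-tier DRAM system where a
# fraction of traffic is served by a fast "cache DRAM" and the rest
# by a slower, capacity-optimized "main DRAM".
# All numbers are illustrative assumptions, not figures from the patent.

def effective_bandwidth(hit_fraction, cache_bw, main_bw):
    """Harmonic-mean bandwidth: each byte served from the cache DRAM
    takes 1/cache_bw time units, each byte from main DRAM 1/main_bw."""
    time_per_byte = hit_fraction / cache_bw + (1 - hit_fraction) / main_bw
    return 1 / time_per_byte

# Hypothetical tiers: cache DRAM at 400 GB/s, main DRAM at 100 GB/s.
for hit in (0.5, 0.8, 0.95):
    print(f"{hit:.0%} of traffic hits cache DRAM -> "
          f"~{effective_bandwidth(hit, 400, 100):.0f} GB/s effective")
```

The more traffic the fast tier can absorb, the closer the whole pool behaves to the bandwidth-optimized DRAM while still offering the capacity of the larger tier.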

Update 21:14 UTC: We have been contacted by Mr. Kerry Creeron, an attorney with the firm of Banner & Witcoff, who provided us with additional insights about the patent. Mr. Creeron shared his personal commentary on it, which you can find quoted below.



Kerry Creeron, an attorney with the firm of Banner & Witcoff, said:
High-level, the patent covers a memory system having a cache DRAM that is coupled to a main DRAM. The cache DRAM is less dense and has lower energy consumption than the main DRAM. The cache DRAM may also have higher performance. A variety of different layouts are illustrated for connecting the main and cache DRAM ICs, e.g. in FIGS. 8-13. One interesting layout involves through-silicon vias (TSVs) that pass through a stack of main DRAM memory chips.

Theoretically, such layouts might be useful for adding additional slower DRAM to Apple's M1 chip architecture.

Finally, I note that the lead inventor, Biswas, was with PA Semi before Apple acquired it.

 
I thought the M1 already did something like this.
 
Broadwell had basically the same thing, except the DRAM was integrated on the chip. I am not sure how worthwhile it is to have it on the same package instead. Not very good if you're trying to save power either, versus having it integrated.

So it seems to me like this is just a worse version of what Intel did with their L4 cache.
 
So... Apple just invented cached RAM access?
 
Broadwell had basically the same thing, except the DRAM was integrated on the chip. I am not sure how worthwhile it is to have it on the same package instead. Not very good if you're trying to save power either, versus having it integrated.

So it seems to me like this is just a worse version of what Intel did with their L4 cache.

Haswell and Broadwell eDRAM L4 was also its own die wired through the substrate. They did this to reduce manufacturing complexity and offer two zones of clock control for performance binning. External tiers of cache are nothing new, and neither are stacked caches. This goes back decades. What is new is the order of cache access, landing size, wire latency, and bandwidth.

[attached image]
 
Haswell and Broadwell eDRAM L4 was also its own die wired through the substrate. They did this to reduce manufacturing complexity and offer two zones of clock control for performance binning. External tiers of cache are nothing new, and neither are stacked caches. This goes back decades. What is new is the order of cache access, landing size, wire latency, and bandwidth.

Just wondering how it compares to eDRAM. I'm not expecting this to be an order of magnitude faster, but it should at least provide a worthwhile speed boost over eDRAM by putting it on the SoC die(?) instead of a separate eDRAM die.
 
I wonder if AMD will eventually implement something similar for their own APUs, but maybe with HBM instead, or maybe even on the board again like the old Phenom-era mobo cache; except where the HBM could be used either by the CPU or the GPU as needed.
 
I wouldn't expect this to be a real performance-boosting design (maybe something between UMA and slightly wire-reduced latency?), but rather a change-up in the bill of materials and manufacturing approach. This is like Intel's recent efforts in a lot of ways.

This is also in the context of a big/small core setup, so there may be some in-house special sauce that calls for optimizing that architecture against an on-package bit of DRAM.

Throw on some degree of MRAM tech or whatever to provide some quasi-speedy permanent cache on the package... Really start veering off that "it's either L$, RAM, or on the disk" dogma, and incorporate layers of software priority for the OS and applications to help further optimize latency... That could get interesting for reducing the PCB footprint of these devices.
 
Tacking on eDRAM to a fast x86 processor is one thing, but adding it to a slow ARM chip is quite another.
 
Tacking on eDRAM to a fast x86 processor is one thing, but adding it to a slow ARM chip is quite another.

But what about adding it to a fast ARM chip, like is the case here?
 
But what about adding it to a fast ARM chip, like is the case here?
A 3.2 GHz core isn't exactly fast by today's standards, but what is its actual non-core speed that feeds the cache? When Intel first introduced it on Broadwell, core and uncore were running well in excess of 4 GHz.
 
I wonder if AMD will eventually implement similar for their own APUs, but maybe with HBM instead, or maybe even on the board again like the old Phenom-era mobo cache; except where the HBM could be used either by the CPU or the GPU as-needed.
I'm sure it never occurred to the engineers at AMD.
 
So, like Socket Super 7 boards in the late '90s that had on-board cache, Apple is now doing it?


I wonder who they are going to patent-troll with this as well.....
 
So... Apple just invented cached RAM access?
They marginally modified what others invented to avoid lawsuits and patented that as their very own cutting-edge innovation, which will cost a torso because an arm and a leg are no longer enough. AMD should just replace the DRAM on the substrate with an FPGA and patent that. They can figure out how or why to utilize it afterwards. Hey look, we did a thing too.
 
I wonder if AMD will eventually implement similar for their own APUs, but maybe with HBM instead, or maybe even on the board again like the old Phenom-era mobo cache; except where the HBM could be used either by the CPU or the GPU as-needed.
FTR, Socket 462 was their last FSB desktop platform! AMD is well known for moving stuff on-die before Intel. Intel, OTOH, kept using an FSB until the first-gen Core i-series.
 
This feels like just another level in the memory hierarchy. I had always felt that the CPU needed something between the last level of on-die CPU cache and system memory. HBM seemed like it would be the best option for this kind of thing because of its memory density and fairly low power usage and the on-package memory that Apple uses seems to be a lot like that. So it makes a boatload of sense that there would be a need for something between flash memory, which is already ungodly fast, and DRAM that's further away from the CPU/GPU/NPU/XYZPU. Imagine having 16GB of "really fast" memory, but having 128GB of "still pretty fast" memory to support it? If the caching strategy is sound, performance is good, and it's as good as having 144GB of memory, then that sounds pretty damn sexy if you ask me.
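The "16 GB of really fast memory backed by 128 GB of still-pretty-fast memory" idea can be sketched with a classic average-memory-access-time calculation. The latencies and hit rates below are invented round numbers, purely for illustration:

```python
# Average memory access time (AMAT) for a two-level memory system:
# a fast on-package tier backed by a larger, slower far-memory tier.
# Latencies here are invented examples, not measured figures.

def amat_ns(hit_rate, fast_ns, miss_penalty_ns):
    """Hits pay fast_ns; misses pay the lookup (fast_ns) plus the
    extra trip to the far tier (miss_penalty_ns)."""
    return fast_ns + (1 - hit_rate) * miss_penalty_ns

# Hypothetical: 50 ns on-package DRAM, +120 ns extra for the far DRAM.
for hit_rate in (0.90, 0.99):
    print(f"{hit_rate:.0%} hit rate -> {amat_ns(hit_rate, 50, 120):.1f} ns average")
```

With a high enough hit rate, the blended pool behaves almost like 144 GB of the fast memory, which is exactly the appeal of the scheme.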

With that said though, as the owner of a top of the line MacBook Pro that's expensive as a kidney, I know that Apple will absolutely make you pay through the nose for it.

So like socket Super 7 in the late 90s that had on board cache Apple is now doing it?
The Apple G3 and G4 had an off-die cache on the same package as the CPU. It's nothing new. It just makes sense to have cache closer to the actual cores using it for the sake of latency. It depends on where in the memory hierarchy you need improvement.

Edit: I mean, what does this look like? I used one of these when StarCraft was brand new. This isn't new tech. We just have better tech to use it with.
[attached image]
 
Looks like Rambus is at it again with the patents...............oh wait ;)
 
This feels like just another level in the memory hierarchy. I had always felt that the CPU needed something between the last level of on-die CPU cache and system memory. HBM seemed like it would be the best option for this kind of thing because of its memory density and fairly low power usage and the on-package memory that Apple uses seems to be a lot like that. So it makes a boatload of sense that there would be a need for something between flash memory, which is already ungodly fast, and DRAM that's further away from the CPU/GPU/NPU/XYZPU. Imagine having 16GB of "really fast" memory, but having 128GB of "still pretty fast" memory to support it? If the caching strategy is sound, performance is good, and it's as good as having 144GB of memory, then that sounds pretty damn sexy if you ask me.

With that said though, as the owner of a top of the line MacBook Pro that's expensive as a kidney, I know that Apple will absolutely make you pay through the nose for it.
Caches need to be fast (low-latency) and they are power hungry because of that. HBM is kinda the opposite of that.
 
Caches need to be fast (low-latency) and they are power hungry because of that. HBM is kinda the opposite of that.
Latency gets worse the further away from the compute cores you go when it comes to the memory hierarchy. I'm not suggesting another level of external cache because SRAM is expensive and runs hot. I'm suggesting that this is adding a level to the memory hierarchy between cache and system memory. It doesn't need to have latency better than cache, just better than system memory while maintaining high bandwidth.
 
Latency gets worse the further away from the compute cores you go when it comes to the memory hierarchy. I'm not suggesting another level of external cache because SRAM is expensive and runs hot. I'm suggesting that this is adding a level to the memory hierarchy between cache and system memory. It doesn't need to have latency better than cache, just better than system memory while maintaining high bandwidth.
Of course, but I was addressing this:
HBM seemed like it would be the best option for this kind of thing because of its memory density and fairly low power usage and the on-package memory that Apple uses seems to be a lot like that.
 
Caches need to be fast (low-latency) and they are power hungry because of that. HBM is kinda the opposite of that.
Caches need to be low-latency, but how does Apple's "cache DRAM" fit in here? What technology could it be based on if it's expected to offer considerably lower latency than the main DRAM?
 
Caches need to be low-latency, but how does Apple's "cache DRAM" fit in here? What technology could it be based on if it's expected to offer considerably lower latency than the main DRAM?
If it's not mentioned in the patent itself, we may never know. It could be either a form of cheaper SRAM or some plain DDR that lowers latency just by virtue of being on the same die.

Edit: Lo and behold, the update says it's just DDR.
 
Caches need to be low-latency but how does Apple's "cache DRAM" fit in here? What technology could it be based on if it's expected to offer cosiderably lower latency than the main DRAM?
Remember that little thing called Crystal Well that Intel made for certain Broadwell chips, an eDRAM cache paired with the beefier iGPUs? Think that, but with more capacity. While electricity travels pretty fast, a lot of latency is introduced by the length of the circuit. Having something like stacked DRAM really close to the CPU can offer better latency than DIMMs that are physically much further away. This is why I think something like HBM or stacked DRAM on the same package as the CPU could act as a level in the memory hierarchy between the last level of CPU cache and system memory.
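For a sense of scale on the circuit-length point, one can estimate the round-trip propagation delay a trace adds, assuming signals travel at very roughly half the speed of light in board traces. The distances below are rough guesses for illustration, not measurements of any real board:

```python
# Rough round-trip signal propagation time over a trace, assuming a
# propagation speed of ~0.5c. Distances are illustrative guesses.

SPEED_OF_LIGHT = 3.0e8            # m/s
TRACE_SPEED = 0.5 * SPEED_OF_LIGHT  # assumed signal speed in a trace

def round_trip_ns(distance_m):
    """Round-trip time in nanoseconds for a signal over distance_m."""
    return 2 * distance_m / TRACE_SPEED * 1e9

print(f"DIMM ~10 cm away : {round_trip_ns(0.10):.2f} ns round trip")
print(f"On-package ~1 cm : {round_trip_ns(0.01):.2f} ns round trip")
```

Under these assumptions the flight time itself is small next to typical DRAM access latency, so much of the practical benefit of short on-package links tends to come from simpler, lower-energy signaling and wider buses rather than raw distance alone.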
 