
AMD Envisions Stacked DRAM on top of Compute Chiplets in the Near Future

btarunr

Editor & Senior Moderator
AMD in its ISSCC 2023 presentation detailed how it has advanced data-center energy efficiency and managed to keep up with Moore's Law, even as semiconductor foundry node advances have tapered off. Perhaps its most striking prediction for server processors and HPC accelerators is multi-layer stacked DRAM. The company has, for some time now, made logic products, such as GPUs, with stacked HBM. These have been multi-chip modules (MCMs), in which the logic die and HBM stacks sit on top of a silicon interposer. While this conserves PCB real-estate compared to discrete memory chips/modules, it is inefficient in its use of the substrate; the interposer is essentially a silicon die with microscopic wiring running between the chips stacked on top of it.

AMD envisions that the high-density server processor of the near future will have many layers of DRAM stacked on top of logic chips. Such a method of stacking conserves both PCB and substrate real-estate, allowing chip designers to cram even more cores and memory per socket. The company also sees a greater role for in-memory compute, where simple compute and data-movement functions can be executed directly in memory, saving round-trips to the processor. Lastly, the company talked about the possibility of an on-package optical PHY, which would simplify network infrastructure.
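To make the "saved round-trips" point concrete, here is a minimal toy model; it is purely illustrative (not from AMD's presentation), and the element count and command size are assumptions. It counts the bytes that cross the external memory link for a simple reduction, done once by the CPU and once by a hypothetical compute unit inside the memory stack:

```python
# Toy model of the data-movement argument for in-memory compute.
# All sizes below are illustrative assumptions, not AMD figures.

ELEMENT_SIZE = 8           # bytes per value (64-bit)
N = 1_000_000_000          # one billion values resident in DRAM

def cpu_side_sum_traffic(n: int) -> int:
    """CPU reads every element over the memory link, then sums locally."""
    return n * ELEMENT_SIZE              # every operand crosses the link

def in_memory_sum_traffic(n: int) -> int:
    """A compute unit inside the DRAM stack performs the reduction;
    only the offload command and the final scalar cross the link."""
    command_bytes = 64                   # assumed size of the offload request
    result_bytes = ELEMENT_SIZE          # one 64-bit result comes back
    return command_bytes + result_bytes

print(f"CPU-side reduction moves  {cpu_side_sum_traffic(N) / 1e9:.1f} GB over the link")
print(f"In-memory reduction moves {in_memory_sum_traffic(N)} bytes over the link")
```

The only point is that trivial reductions and filters shrink link traffic from O(n) to O(1); the actual energy and latency savings depend entirely on the implementation.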



View at TechPowerUp Main Site | Source
 
Should be great for mobile gaming, as for desktop, not so much...
 
Given how well 3D V-Cache works, having some DRAM (an additional tier of memory) close to the CPU die will be greatly helpful for a whole range of applications.
 
Any thoughts on how to effectively cool this stacked tower?
Or is this only aimed at low-frequency, high-core-count server CPUs?

Anyway, nice innovation.
 
I see the point for mobile, where space and energy savings are important, or servers, where compute per area matters, but on desktop it's a tough ask. Current iterations of 3D V-Cache sit on top of the existing L3 on the chiplet. Even so, it compromises the chiplet's boost clock speeds due to voltage limitations and has issues with cooling.

So if AMD plans to utilize this in the future, they and TSMC first have to solve the cooling and compromised-clocks issues. The 7800X3D loses 400-700 MHz of potential boost because of them. That's a significant chunk of clock speed.

The other issue is release timing. Currently it takes half a year before desktop gets X3D models. I hope in the future these will be the default at the day-one launch of a new architecture.
But stacking clearly is the future. Sockets are already enormous: desktop sockets approach 2,000 pins and server sockets exceed 6,000.
 
This will be 'cool', to skip the CPU block and pass the water right through the inner-core layers :)
I give it zero applicability though, as today's CPUs have more than the quoted "10,000 interconnects per cm2", something like 10,000 times more.
 
HBM3 prices have just been increased enormously. Let's see what customers think of the new pricing down the track.
 
The optical part has me fascinated. It reminds me of a discussion I had raised about them, where I was trying to brainstorm where and how these might be applied in ways that could work, make sense, and provide innovation at the same time.

https://www.techpowerup.com/forums/...ged-photonics-for-nvlink.276139/#post-4418550

Another thing related to optics I had mentioned was this.

"I could see a optical path potentially interconnecting them all quickly as well."

That was in regard to FPGA/CPU/GPU chiplets as well as 3D stacking.

I spoke some about combining DIMMs with chiplets and felt it could be good, if for no other reason than for compression/decompression from those. This is neat though: combining TSVs with a chiplet and stacking DRAM directly on top of it. Perhaps some optical interconnects in place of TSVs could work too. That would add another layer of complications, though; if you had the optical connection on the substrate and then used an optical connection in place of TSVs, you could shave off some latency, I think. I don't know, maybe that's a bit of what they've already envisioned here.

Eventually they could maybe have an I/O die in the center and 8 surrounding chiplets on a substrate, and below that another substrate connected to it optically with 3D-stacked cache. The way it could connect with the substrate above is that each chiplet along the edge around the I/O die could connect via an optical link to the 3D-stacked cache below. In fact you could even cool it a bit as well, because the cache itself can be inverted and cooled on that side easily enough, regardless of the optics. The only barrier I see is the cost of optics and how well they can be shrunk down in size while still functioning as an interconnect.
 
Guys, don't get too excited yet. These technologies will surely be quite expensive in the beginning; slim chances of seeing them soon in consumer products.
 
I have been envisioning this ever since the first HBM GPU came out. Putting 16/32 GB of HBM/2/3 next to a CPU would allow for memory speeds in excess of any DDR4/5 stick, and it would save a tremendous amount of space on the motherboard. Put a decent iGPU in such a chip and it would all but eliminate the mid-range GPU market. And I'm sure that greedy CPU makers like Intel would cream their pants at the possibility of offering every CPU in 2-5 different SKUs based on built-in memory amount.
With chiplets it could be even more feasible: one CPU chiplet, one GPU chiplet, one memory controller, and one or more HBM stacks. It would be perfect for both consoles and SFF builds.
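For a rough sense of that bandwidth gap, here is a back-of-envelope comparison; the transfer rates are assumed round numbers (DDR5-6400 and 6.4 Gb/s per HBM3 pin), not any specific product's spec:

```python
# Back-of-envelope peak-bandwidth comparison: dual-channel DDR5 DIMMs
# vs. a single HBM3 stack. Transfer rates below are assumptions.

def peak_gbs(bus_bits: int, transfer_gtps: float) -> float:
    """Peak bandwidth in GB/s = bus width in bytes * transfers per second (GT/s)."""
    return (bus_bits / 8) * transfer_gtps

ddr5_dual_channel = peak_gbs(bus_bits=128,  transfer_gtps=6.4)   # 2x 64-bit DDR5-6400
hbm3_single_stack = peak_gbs(bus_bits=1024, transfer_gtps=6.4)   # one 1024-bit HBM3 stack

print(f"Dual-channel DDR5-6400 : ~{ddr5_dual_channel:.0f} GB/s")  # ~102 GB/s
print(f"Single HBM3 stack      : ~{hbm3_single_stack:.0f} GB/s")  # ~819 GB/s
```

Even a single stack lands roughly an order of magnitude above a dual-channel DIMM setup, which is the whole appeal for an iGPU-heavy package.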
 
HBM3 prices have just been increased enormously. Let's see what customers think of the new pricing down the track.
The same packaging tech is also usable for integrating DDR or LPDDR.

HBM isn't universally better; it requires more exotic, dense interconnects and giant memory controllers.
 
The same packaging tech is also usable for integrating DDR or LPDDR.

HBM isn't universally better; it requires more exotic, dense interconnects and giant memory controllers.

Are you sure? I do believe an HBM controller is smaller than a GDDR(x) one... it's certainly more efficient and requires less power.
 
I guessed a long time ago that this would be the future of all PCs.

But prosumer and consumer products will still need upgradable RAM of some sort.

I mean, the stacked DRAM is there for sure, but the CPU/SoC should still keep the 128/160-bit IMC for expandability. The thing is, a tiered memory hierarchy will prioritize processes for the stacked DRAM and/or regular RAM depending on power/bandwidth/latency requirements.

Depending on how hard and costly the stacked DRAM is, AMD might have one or two options, and big OEMs might request some custom orders.
 
This is the most exciting thing I've seen in a long time. This would certainly be a huge step forward for the server market.
 
HBM as a pseudo-L4 cache would be a big benefit to a lot of things, as X3D has proven there is a lot of software/games/databases that would take advantage of the wider/faster transfer speeds of HBM vs traditional DRAM.

However, there would nearly always be a need for DRAM connectivity, as datasets in most areas are growing quicker than what is feasible with direct memory. (Currently up to 6 TB of memory per socket in the server space.)

Also, you are adding more power requirements to the CPU die (~30 watts) just for the HBM and would need a radically different I/O die to accommodate its integration. So you start running into packaging issues, as the size of some of the server CPUs is getting into the realms of ridiculousness. (Consumer CPUs are 4x4 cm, whereas the newest AMD Genoa is already pushing 7.5x7.5 cm.)
 
Should be great for mobile gaming, as for desktop, not so much...
That depends; it would suck if they removed the ability to add more memory via DIMM slots on the motherboard. Otherwise it might not.
HBM as a pseudo-L4 cache would be a big benefit to a lot of things, as X3D has proven there is a lot of software/games/databases that would take advantage of the wider/faster transfer speeds of HBM vs traditional DRAM.

However, there would nearly always be a need for DRAM connectivity, as datasets in most areas are growing quicker than what is feasible with direct memory. (Currently up to 6 TB of memory per socket in the server space.)

Also, you are adding more power requirements to the CPU die (~30 watts) just for the HBM and would need a radically different I/O die to accommodate its integration. So you start running into packaging issues, as the size of some of the server CPUs is getting into the realms of ridiculousness. (Consumer CPUs are 4x4 cm, whereas the newest AMD Genoa is already pushing 7.5x7.5 cm.)

That really depends on how much memory they add on top of the die. If they add more than a few GB, a cache would be hard to maintain, and it's possible it would be better to just use that memory as standard memory (and bypass all the cache-lookup overhead). We can do that with NUMA. Memory tiering would be something new on desktop, but not on servers and mainframes.

It could require some time for Windows to adapt, but imagine you have 16 GB on top of your APU and then 32 GB or more of DDR5 DIMMs. You play a game and the OS puts the most-used data into the on-die memory, and all the extra stuff gets moved to the slower DIMMs.

If you use less than 16 GB of RAM, the DIMM memory could be powered down.
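A crude sketch of that kind of placement policy, with a hypothetical stacked-tier size and a simple hotness heuristic; this is purely illustrative and not how Windows or any real tiering driver works:

```python
# Toy two-tier memory placement: the hottest pages are pinned to the
# on-package stacked DRAM, everything else spills to DIMM-backed DRAM.
# Tier size, page size and the access-count heuristic are assumptions.

def place_pages(pages, fast_tier_bytes, page_size=4096):
    """pages: list of (page_id, access_count). Returns (fast, slow) page-id lists."""
    fast, slow = [], []
    budget = fast_tier_bytes
    # Hottest pages first, until the stacked tier is full.
    for page_id, hits in sorted(pages, key=lambda p: p[1], reverse=True):
        if budget >= page_size:
            fast.append(page_id)
            budget -= page_size
        else:
            slow.append(page_id)
    return fast, slow

# Example: three hot pages, two cold ones, room for three pages in the fast tier.
fast, slow = place_pages([(0, 900), (1, 5), (2, 850), (3, 2), (4, 700)],
                         fast_tier_bytes=3 * 4096)
print("stacked DRAM tier:", fast)  # [0, 2, 4] - the hot pages
print("DIMM tier        :", slow)  # [1, 3] - cold pages; an empty DIMM tier could be powered down
```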
 
Re-reading the article, they are talking not about using HBM, but about stacking HBM *on top* of the CPU/GPU. Basically imagine the 3D stacking on the 5800X3D, but with 8 GB, 1024-bit modules.

That makes a lot more sense than just adding HBM on an interposer, and may be simpler to manufacture (since they don't need the interposer itself). It would remove one of the big problems of using HBM and make it a lot more viable.
 
That really depends on how much memory they add on top of the die. If they add more than a few GB, a cache would be hard to maintain, and it's possible it would be better to just use that memory as standard memory (and bypass all the cache-lookup overhead). We can do that with NUMA. Memory tiering would be something new on desktop, but not on servers and mainframes.
That would then also require a separate I/O die for the HBM to interact with the rest of the memory, and those are VERY pin-heavy due to the inherent nature of HBM. Consider that DDR5 is the equivalent of 40 data lines per channel including ECC, vs the 1024 of HBM.

That depends; it would suck if they removed the ability to add more memory via DIMM slots on the motherboard. Otherwise it might not.


It could require some time for Windows to adapt, but imagine you have 16 GB on top of your APU and then 32 GB or more of DDR5 DIMMs. You play a game and the OS puts the most-used data into the on-die memory, and all the extra stuff gets moved to the slower DIMMs.

If you use less than 16 GB of RAM, the DIMM memory could be powered down.
So HBM3 seems to be sitting at 16 Gb per die (16 GB per 8-high stack) on a 1024-bit bus. Current figures from an H100 show a memory bandwidth of ~600 GB/s per stack, so four stacks would be ~2.4 TB/s of bandwidth with a capacity of 64 GB.
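Sanity-checking that arithmetic with the same per-stack figures (the 600 GB/s and 16 GB numbers are the rough values quoted above, not an official spec):

```python
# Aggregate bandwidth and capacity for four HBM3 stacks, using the
# rough per-stack figures from the post above.
stacks = 4
per_stack_bandwidth_gbs = 600   # ~GB/s per stack (H100-class ballpark)
per_stack_capacity_gb = 16      # 8-high stack of 16 Gb dies

print(f"Bandwidth: ~{stacks * per_stack_bandwidth_gbs / 1000:.1f} TB/s")  # ~2.4 TB/s
print(f"Capacity : {stacks * per_stack_capacity_gb} GB")                  # 64 GB
```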

Strangely enough, I can see consumer CPUs being easier to adapt to include HBM than server CPUs, just due to size constraints on current interposers: the extra pins on the I/O die to accommodate the HBM would be a big increase per die over current I/O dies, and looking at current Genoa chips, they're already fairly constrained in terms of size.


Re-reading the article, they are talking not about using HBM, but about stacking HBM *on top* of the CPU/GPU. Basically imagine the 3D stacking on the 5800X3D, but with 8 GB, 1024-bit modules.

That makes a lot more sense than just adding HBM on an interposer, and may be simpler to manufacture (since they don't need the interposer itself). It would remove one of the big problems of using HBM and make it a lot more viable.
Actually, they are talking about including DRAM via 3D stacking rather than HBM, for the reasons I describe above. What they are trying to say/show is that by including some/all of the DRAM directly on the CPU die vs having it external, the energy saved transferring data to and from memory is on the order of 60x.

DRAM stacking is FAR easier to do, given its relative simplicity versus an HBM controller.
 
That would then also require a separate I/O die for the HBM to interact with the rest of the memory, and those are VERY pin-heavy due to the inherent nature of HBM. Consider that DDR5 is the equivalent of 40 data lines per channel including ECC, vs the 1024 of HBM.


So HBM3 seems to be sitting at 16 Gb per die (16 GB per 8-high stack) on a 1024-bit bus. Current figures from an H100 show a memory bandwidth of ~600 GB/s per stack, so four stacks would be ~2.4 TB/s of bandwidth with a capacity of 64 GB.

Strangely enough, I can see consumer CPUs being easier to adapt to include HBM than server CPUs, just due to size constraints on current interposers: the extra pins on the I/O die to accommodate the HBM would be a big increase per die over current I/O dies, and looking at current Genoa chips, they're already fairly constrained in terms of size.



Actually, they are talking about including DRAM via 3D stacking rather than HBM, for the reasons I describe above. What they are trying to say/show is that by including some/all of the DRAM directly on the CPU die vs having it external, the energy saved transferring data to and from memory is on the order of 60x.

DRAM stacking is FAR easier to do, given its relative simplicity versus an HBM controller.
It's not known right now whether they would go with HBM or a new, different solution. But pin count is not an issue with TSVs in the chip. They already connect the L3 via those, and I'm not sure how many pins they use, but it's a significant number.

The memory controller is one thing, but so is the location of those pins. It would probably require a special type of DRAM that AMD would either make themselves or ask a third party to produce, to ensure it can connect with those TSVs. HBM has pins all around; you can't put TSVs all around the chip right now. (Or they use an interposer between the DRAM and the controller die, but that seems costly.)

I do not think the amount of silicon space is a real issue for now. They can probably package another 7 nm die or just have a bigger I/O die. We will see what they do.

They could, for example, have special DRAM dies that contain the control logic and only use the TSVs between those and the I/O die or CCD. There are a lot of possibilities.
 
A water block with a recess to hold the stacked layers and liquid metal as the interface medium would be no different than the existing stepped vapor chambers. Really, the ability to put a coating on the active die surface and to stop cooling through the inactive die side would make a much larger difference. Bonding to the fiberglass substrate is still the same, and wiring the die is more complicated, but maybe soon enough we will print cache on the inactive side anyway and have a two-layer piece of glass.
 
Unless the memory can transfer heat very well, I don't see this ever being great for the compute part's thermals. That thing will overheat once put under some intense mixed load. I remember seeing those in-silicon "water" channels; maybe that'll solve it?
 
I don't think this is such an expensive technology. Stacking memory on top of logic was already done by Intel with Lakefield in 2020, and CPUs derived from mobile applications, such as Apple silicon, have already introduced designs that directly connect memory to the CPU package. EMIB or equivalent packaging technology would be a low-cost and thermally tolerant solution for desktop packages. Of course, for faster applications, I would think it would be stacked directly on the logic using Cu-Cu bonding like V-Cache.

 
Regardless of the new software-side technologies that 'processing in/on RAM / memory in/on processing' can facilitate*, I have very mixed feelings about the concept.
Ever-increasing 'integration' has been the source of much performance uplift and reduction in latency, but all I see is less modularity.

* - Ovonic-junction compute-in-PCM looks to be a potential 'game changer' with regard to AI/ML hardware acceleration.
Tangentially related: apparently, that's a part of the whole 'what's going on with Optane' situation... IP shenanigans over tech recognized as potentially game-changing.
 