
AMD Envisions Stacked DRAM on top of Compute Chiplets in the Near Future

btarunr

Editor & Senior Moderator
Staff member
Joined
Oct 9, 2007
Messages
46,512 (7.66/day)
Location
Hyderabad, India
System Name RBMK-1000
Processor AMD Ryzen 7 5700G
Motherboard ASUS ROG Strix B450-E Gaming
Cooling DeepCool Gammax L240 V2
Memory 2x 8GB G.Skill Sniper X
Video Card(s) Palit GeForce RTX 2080 SUPER GameRock
Storage Western Digital Black NVMe 512GB
Display(s) BenQ 1440p 60 Hz 27-inch
Case Corsair Carbide 100R
Audio Device(s) ASUS SupremeFX S1220A
Power Supply Cooler Master MWE Gold 650W
Mouse ASUS ROG Strix Impact
Keyboard Gamdias Hermes E2
Software Windows 11 Pro
AMD in its ISSCC 2023 presentation detailed how it has advanced data-center energy efficiency and kept pace with Moore's Law, even as semiconductor foundry node advances have tapered. Perhaps its most striking prediction for server processors and HPC accelerators is multi-layer stacked DRAM. The company has for some time made logic products, such as GPUs, with stacked HBM. These have been multi-chip modules (MCMs), in which the logic die and HBM stacks sit on top of a silicon interposer. While this conserves PCB real estate compared to discrete memory chips/modules, it is inefficient on the substrate, and the interposer is essentially a silicon die with microscopic wiring between the chips stacked on top of it.

AMD envisions that the high-density server processor of the near future will have many layers of DRAM stacked on top of logic chips. This method of stacking conserves both PCB and substrate real estate, allowing chip designers to cram even more cores and memory per socket. The company also sees a greater role for in-memory compute, where simple compute and data-movement functions are executed directly on the memory, saving round-trips to the processor. Lastly, the company talked about the possibility of an on-package optical PHY, which would simplify network infrastructure.
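To put rough numbers on the energy argument, here is a back-of-envelope sketch in Python. The per-bit energy figures are illustrative assumptions, not values from AMD's presentation, but they show why pulling DRAM onto the package (and doing simple work inside it) pays off.

```python
# Back-of-envelope sketch of why stacking DRAM on top of logic saves energy.
# The pJ/bit figures below are illustrative assumptions, not AMD's numbers.

ENERGY_PJ_PER_BIT = {
    "off-package DDR (PCB trace + PHY)": 15.0,  # assumed
    "2.5D interposer HBM":                4.0,  # assumed
    "3D-stacked DRAM over logic (TSV)":   1.0,  # assumed
}

def transfer_energy_mj(bytes_moved: int, pj_per_bit: float) -> float:
    """Energy (in millijoules) to move a block of data once at a given per-bit cost."""
    return bytes_moved * 8 * pj_per_bit * 1e-12 * 1e3

block = 64 * 2**20  # a 64 MiB working set, moved once

for path, cost in ENERGY_PJ_PER_BIT.items():
    print(f"{path:36s} {transfer_energy_mj(block, cost):6.2f} mJ")
```

Under these assumed figures, the same 64 MiB transfer costs an order of magnitude less energy over a 3D stack than over a PCB-routed DDR interface, and in-memory compute avoids some of those transfers altogether.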



View at TechPowerUp Main Site | Source
 
Joined
Nov 11, 2016
Messages
3,139 (1.14/day)
System Name The de-ploughminator Mk-II
Processor i7 13700KF
Motherboard MSI Z790 Carbon
Cooling ID-Cooling SE-226-XT + Phanteks T30
Memory 2x16GB G.Skill DDR5 7200Cas34
Video Card(s) Asus RTX4090 TUF
Storage Kingston KC3000 2TB NVME
Display(s) LG OLED CX48"
Case Corsair 5000D Air
Audio Device(s) KEF LSX II LT speakers + KEF KC62 Subwoofer
Power Supply Corsair HX850
Mouse Razor Viper Ultimate
Keyboard Corsair K75
Software win11
Should be great for mobile gaming, as for desktop, not so much...
 
Joined
Mar 7, 2011
Messages
3,989 (0.83/day)
Given how well 3D V-Cache works, having some DRAM (an additional tier of memory) close to the CPU die will be greatly helpful for a whole range of applications.
 
Joined
Jul 15, 2020
Messages
982 (0.70/day)
System Name Dirt Sheep | Silent Sheep
Processor i5-2400 | 13900K (-0.025mV offset)
Motherboard Asus P8H67-M LE | Gigabyte AERO Z690-G, bios F26 with "Instant 6 GHz" on
Cooling Scythe Katana Type 1 | Noctua NH-U12A chromax.black
Memory G-skill 2*8GB DDR3 | Corsair Vengeance 4*32GB DDR5 5200Mhz C40 @4000MHz
Video Card(s) Gigabyte 970GTX Mini | NV 1080TI FE (cap at 85%, 800mV)
Storage 2*SN850 1TB, 230S 4TB, 840EVO 128GB, WD green 2TB HDD, IronWolf 6TB, 2*HC550 18TB in RAID1
Display(s) LG 21` FHD W2261VP | Lenovo 27` 4K Qreator 27
Case Thermaltake V3 Black|Define 7 Solid, stock 3*14 fans+ 2*12 front&buttom+ out 1*8 (on expansion slot)
Audio Device(s) Beyerdynamic DT 990 (or the screen speakers when I'm too lazy)
Power Supply Enermax Pro82+ 525W | Corsair RM650x (2021)
Mouse Logitech Master 3
Keyboard Roccat Isku FX
VR HMD Nop.
Software WIN 10 | WIN 11
Benchmark Scores CB23 SC: i5-2400=641 | i9-13900k=2325-2281 MC: i5-2400=i9 13900k SC | i9-13900k=37240-35500
Any thoughts on how to effectively cool this stacked tower?
Or is this only aimed at low-frequency, high-core-count server CPUs?

Anyway, nice innovation.
 
Joined
Aug 21, 2013
Messages
1,709 (0.43/day)
I see the point for mobile, where space and energy savings are important, or servers, where compute per area matters, but on desktop it's a tough ask. Current iterations of stacked L3 cache sit on top of the existing L3 on the chiplet. Even so, it compromises the chiplet's boost clock speeds due to voltage limitations and has issues with cooling.

So if AMD plans to utilize this in the future, they and TSMC first have to solve the cooling and compromised-clocks issues. The 7800X3D loses 400-700 MHz of potential boost due to these issues. That's a significant chunk of clock speed.

The other issue is release timing. Currently it takes half a year before desktop gets X3D models. I hope in the future these will be the default at the day-one launch of a new architecture.
But stacking clearly is the future. Sockets are already enormous: desktop sockets approach 2,000 pins and server sockets exceed 6,000.
 
Joined
Jun 22, 2006
Messages
1,056 (0.16/day)
System Name Beaver's Build
Processor AMD Ryzen 9 5950X
Motherboard Asus ROG Crosshair VIII Hero (WI-FI) - X570
Cooling Corsair H115i RGB PLATINUM 97 CFM Liquid
Memory G.Skill Trident Z Neo 32 GB (2 x 16 GB) DDR4-3600 Memory - 16-19-19-39
Video Card(s) NVIDIA GeForce RTX 4090 Founders Edition
Storage Inland 1TB NVMe M.2 (Phison E12) / Samsung 950 Pro M.2 NVMe 512G / WD Black 6TB - 256M cache
Display(s) Alienware AW3225QF 32" 4K 240 Hz OLED
Case Fractal Design Design Define R6 USB-C
Audio Device(s) Focusrite 2i4 USB Audio Interface
Power Supply SuperFlower LEADEX TITANIUM 1600W
Mouse Razer DeathAdder V2
Keyboard Razer Cynosa V2 (Membrane)
Software Microsoft Windows 10 Pro x64
Benchmark Scores 3dmark = https://www.3dmark.com/spy/32087054 Cinebench R15 = 4038 Cinebench R20 = 9210
Joined
Jul 15, 2020
Messages
982 (0.70/day)
System Name Dirt Sheep | Silent Sheep
Processor i5-2400 | 13900K (-0.025mV offset)
Motherboard Asus P8H67-M LE | Gigabyte AERO Z690-G, bios F26 with "Instant 6 GHz" on
Cooling Scythe Katana Type 1 | Noctua NH-U12A chromax.black
Memory G-skill 2*8GB DDR3 | Corsair Vengeance 4*32GB DDR5 5200Mhz C40 @4000MHz
Video Card(s) Gigabyte 970GTX Mini | NV 1080TI FE (cap at 85%, 800mV)
Storage 2*SN850 1TB, 230S 4TB, 840EVO 128GB, WD green 2TB HDD, IronWolf 6TB, 2*HC550 18TB in RAID1
Display(s) LG 21` FHD W2261VP | Lenovo 27` 4K Qreator 27
Case Thermaltake V3 Black|Define 7 Solid, stock 3*14 fans+ 2*12 front&buttom+ out 1*8 (on expansion slot)
Audio Device(s) Beyerdynamic DT 990 (or the screen speakers when I'm too lazy)
Power Supply Enermax Pro82+ 525W | Corsair RM650x (2021)
Mouse Logitech Master 3
Keyboard Roccat Isku FX
VR HMD Nop.
Software WIN 10 | WIN 11
Benchmark Scores CB23 SC: i5-2400=641 | i9-13900k=2325-2281 MC: i5-2400=i9 13900k SC | i9-13900k=37240-35500
This will be 'cool': skip the CPU block and pass the water right through the inner core layers :)
I give it zero applicability though, as today's CPUs have more than "10,000 interconnects per cm2", like 10,000 times more.
 
Last edited:
Joined
May 3, 2018
Messages
2,373 (1.07/day)
HBM3 prices have just been increased enormously. Let's see what customers think of the new pricing down the track.
 
Joined
Mar 21, 2016
Messages
2,209 (0.74/day)
The optical part has me fascinated. It reminds me of a discussion I had raised about optical interconnects, where I was trying to brainstorm where and how they might be applied in a way that works, makes sense, and provides innovation at the same time.

https://www.techpowerup.com/forums/...ged-photonics-for-nvlink.276139/#post-4418550

Another thing related to optics I had mentioned was this.

"I could see a optical path potentially interconnecting them all quickly as well."

That was in regard to FPGA/CPU/GPU chiplets as well as 3D stacking.

I spoke some about combining DIMMs with chiplets and felt it could be good, if for no other reason than compression/decompression from those. This is neat though: combining TSVs with a chiplet and stacking DRAM directly on top of it. Perhaps some optical interconnects in place of TSVs could work too. I think that would add another layer of complications, though if you had the optical connection on the substrate and then used an optical connect in place of TSVs, you could shave off some latency. I don't know, maybe that's a bit of what they've already envisioned here.

Eventually they could maybe have an I/O die in the center with 8 surrounding chiplets on a substrate, and below that another substrate connected to it optically with 3D-stacked cache. The way it could connect with the substrate above is that each chiplet along the edge around the I/O die could have an optical connection to the 3D-stacked cache below. In fact you could even cool it a bit as well, because the cache itself can be inverted and cooled on that side easily enough, regardless of the optics. The only barrier I see is the cost of optics and how well they can be shrunk down in size while still functioning as an interconnect.
 
Joined
Jul 12, 2017
Messages
6 (0.00/day)
Location
Romania
Guys, don't get too excited yet. These technologies will surely be quite expensive in the beginning; slim chances of seeing them in consumer products soon.
 
Joined
Jul 29, 2022
Messages
383 (0.58/day)
I have been envisioning this ever since the first HBM GPU came out. Putting 16/32 GB of HBM/HBM2/HBM3 next to a CPU would allow for memory speeds in excess of any DDR4/5 stick, and it would save a tremendous amount of space on the motherboard. Put a decent iGPU in such a chip and it would all but eliminate the mid-range GPU market. And I'm sure that greedy CPU makers like Intel would cream their pants at the possibility of offering every CPU in 2-5 different SKUs based on built-in memory amount.
With chiplets it could be even more feasible: one CPU chiplet, one GPU chiplet, one memory controller, and one or more HBM stacks. It would be perfect for both consoles and SFF builds.
 
Joined
Jan 3, 2021
Messages
2,781 (2.25/day)
Location
Slovenia
Processor i5-6600K
Motherboard Asus Z170A
Cooling some cheap Cooler Master Hyper 103 or similar
Memory 16GB DDR4-2400
Video Card(s) IGP
Storage Samsung 850 EVO 250GB
Display(s) 2x Oldell 24" 1920x1200
Case Bitfenix Nova white windowless non-mesh
Audio Device(s) E-mu 1212m PCI
Power Supply Seasonic G-360
Mouse Logitech Marble trackball, never had a mouse
Keyboard Key Tronic KT2000, no Win key because 1994
Software Oldwin
HBM3 prices have just been increased enormously. Let's see what customers think of the new pricing down the track.
The same packaging tech is also usable for integrating DDR or LPDDR.

HBM isn't universally better; it requires more exotic, dense interconnects and giant memory controllers.
 
Joined
May 12, 2017
Messages
2,207 (0.86/day)
The same packaging tech is also usable for integrating DDR or LPDDR.

HBM isn't universally better; it requires more exotic, dense interconnects and giant memory controllers.

Are you sure? I do believe an HBM controller is smaller than a GDDR(x) one. It's certainly more efficient and requires less power.
 
Joined
Apr 8, 2008
Messages
329 (0.06/day)
I guessed a long time ago that this would be the future of all PCs.

But prosumer and consumer products will still need upgradable RAM of some sort.

I mean, the stacked DRAM is there for sure, but the CPU/SoC should still keep a 128/160-bit IMC for expandability. The thing is, a tiered memory hierarchy will prioritize processes for the stacked DRAM and/or regular RAM depending on power/bandwidth/latency requirements.

Depending on how hard and costly the stacked DRAM is, AMD might have one or two options, and big OEMs might request some custom orders.
 
Joined
Oct 6, 2021
Messages
1,500 (1.56/day)
This is the most exciting thing I've seen in a long time. This would certainly be a huge step forward for the server market.
 
Joined
Mar 13, 2021
Messages
410 (0.35/day)
Processor AMD 7600x
Motherboard Asrock x670e Steel Legend
Cooling Silver Arrow Extreme IBe Rev B with 2x 120 Gentle Typhoons
Memory 4x16Gb Patriot Viper Non RGB @ 6000 30-36-36-36-40
Video Card(s) XFX 6950XT MERC 319
Storage 2x Crucial P5 Plus 1Tb NVME
Display(s) 3x Dell Ultrasharp U2414h
Case Coolermaster Stacker 832
Power Supply Thermaltake Toughpower PF3 850 watt
Mouse Logitech G502 (OG)
Keyboard Logitech G512
HBM as a pseudo-L4 cache would be a big benefit to a lot of things, as X3D has proven that there is a lot of software/games/databases that would take advantage of the wider/faster transfer speeds of HBM vs traditional DRAM.

However, there would nearly always be a need for external DRAM connectivity, as datasets in most areas are growing quicker than what is feasible with direct memory. (Currently up to 6 TB of memory per socket in the server space.)

Also, you are adding more power requirements to the CPU die (~30 watts) just for the HBM, and would need a radically different I/O die to accommodate its integration. So you start running into packaging issues, as the size of some server CPUs is getting into the realm of ridiculousness. (Consumer CPUs are 4x4 cm, whereas the newest AMD Genoa is already pushing 7.5x7.5 cm.)
 
Joined
Oct 12, 2005
Messages
682 (0.10/day)
Should be great for mobile gaming, as for desktop, not so much...
That depends: it would suck if they removed the ability to add more memory via DIMM slots on the motherboard. Otherwise it might not.
HBM as a pseudo-L4 cache would be a big benefit to a lot of things, as X3D has proven that there is a lot of software/games/databases that would take advantage of the wider/faster transfer speeds of HBM vs traditional DRAM.

However, there would nearly always be a need for external DRAM connectivity, as datasets in most areas are growing quicker than what is feasible with direct memory. (Currently up to 6 TB of memory per socket in the server space.)

Also, you are adding more power requirements to the CPU die (~30 watts) just for the HBM, and would need a radically different I/O die to accommodate its integration. So you start running into packaging issues, as the size of some server CPUs is getting into the realm of ridiculousness. (Consumer CPUs are 4x4 cm, whereas the newest AMD Genoa is already pushing 7.5x7.5 cm.)

That really depends on how much memory they add on top of the die. If they add more than a few GB, a cache would be hard to maintain, and it's possible it would be better to just use that memory as standard memory (and bypass all the cache-lookup overhead). We can do that with NUMA. Memory tiering would be something new on desktop, but not on servers and mainframes.

It could require some time for Windows to adapt, but imagine you have 16 GB on top of your APU and then 32 GB or more of DDR5 DIMMs. You play a game, and the OS puts the most-used data into the on-die memory while all the extra stuff gets moved to the slower DIMMs.

If you use less than 16 GB of RAM, the DIMM memory could be powered down.
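Purely as a hypothetical sketch of that tiering idea (nothing AMD or Microsoft has announced), the policy could be as simple as keeping the most recently touched pages in the 16 GB stacked tier and spilling everything else to the DIMMs:

```python
# Minimal sketch of a two-tier memory placement policy: a fixed-capacity
# on-package tier holds the most recently touched pages, the rest spill to
# DIMM-backed DRAM. Hypothetical illustration only.
from collections import OrderedDict

PAGE = 4096  # bytes

class TieredMemory:
    def __init__(self, on_package_bytes: int):
        self.capacity_pages = on_package_bytes // PAGE
        self.on_package = OrderedDict()   # page -> None, ordered by recency
        self.on_dimm = set()

    def touch(self, page: int) -> str:
        """Record an access; return which tier the page ends up in."""
        if page in self.on_package:
            self.on_package.move_to_end(page)               # refresh recency
            return "hit: on-package"
        self.on_dimm.discard(page)                          # promote if it sat on a DIMM
        self.on_package[page] = None
        if len(self.on_package) > self.capacity_pages:
            victim, _ = self.on_package.popitem(last=False) # evict the coldest page
            self.on_dimm.add(victim)                        # demote it to DIMM-backed DRAM
        return "promoted to on-package"

tier = TieredMemory(on_package_bytes=16 * 2**30)  # the 16 GB stacked tier from the example above
print(tier.touch(0), tier.touch(1), tier.touch(0))
```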
 
Joined
Jul 29, 2022
Messages
383 (0.58/day)
Re-reading the article, they are not talking about using HBM beside the die, but about stacking HBM *on top* of the CPU/GPU. Basically, imagine the 3D stacking on the 5800X3D, but with 8 GB, 1024-bit modules.

That makes a lot more sense than just adding HBM on an interposer, and may be simpler to manufacture (since they don't need the interposer itself). It would remove one of the big problems of using HBM and make it a lot more viable.
 
Joined
Mar 13, 2021
Messages
410 (0.35/day)
Processor AMD 7600x
Motherboard Asrock x670e Steel Legend
Cooling Silver Arrow Extreme IBe Rev B with 2x 120 Gentle Typhoons
Memory 4x16Gb Patriot Viper Non RGB @ 6000 30-36-36-36-40
Video Card(s) XFX 6950XT MERC 319
Storage 2x Crucial P5 Plus 1Tb NVME
Display(s) 3x Dell Ultrasharp U2414h
Case Coolermaster Stacker 832
Power Supply Thermaltake Toughpower PF3 850 watt
Mouse Logitech G502 (OG)
Keyboard Logitech G512
That really depends on how much memory they add on top of the die. If they add more than a few GB, a cache would be hard to maintain, and it's possible it would be better to just use that memory as standard memory (and bypass all the cache-lookup overhead). We can do that with NUMA. Memory tiering would be something new on desktop, but not on servers and mainframes.
That would then also require a separate I/O die for the HBM to interact with the rest of the memory, and those are VERY pin-heavy due to the inherent nature of HBM. Consider that DDR5 is the equivalent of 40 data lines per channel including ECC, vs the 1024 of HBM.

That depends: it would suck if they removed the ability to add more memory via DIMM slots on the motherboard. Otherwise it might not.


It could require some time for Windows to adapt, but imagine you have 16 GB on top of your APU and then 32 GB or more of DDR5 DIMMs. You play a game, and the OS puts the most-used data into the on-die memory while all the extra stuff gets moved to the slower DIMMs.

If you use less than 16 GB of RAM, the DIMM memory could be powered down.
So HBM3 seems to be sitting at 16 GB per stack on a 1024-bit bus. Current figures from an H100 show a memory bandwidth of ~600 GB/s per stack, so 4 stacks would be ~2.4 TB/s of bandwidth with a capacity of 64 GB.
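Rough math behind those figures (the per-pin rates here are assumptions for illustration, roughly HBM3 at H100-like speeds vs a DDR5-6000 channel):

```python
# Rough peak-bandwidth arithmetic for HBM3 stacks vs a DDR5 channel.
# Per-pin data rates below are assumptions for illustration.

def peak_gbs(bus_width_bits: int, gbps_per_pin: float) -> float:
    """Peak bandwidth in GB/s = interface width (bits) x per-pin rate (Gb/s) / 8."""
    return bus_width_bits * gbps_per_pin / 8

hbm3_stack = peak_gbs(1024, 5.2)   # ~5.2 Gb/s per pin, roughly H100-class (assumed)
ddr5_chan  = peak_gbs(64, 6.0)     # one 64-bit DDR5-6000 channel, ECC lines excluded

print(f"one HBM3 stack        : ~{hbm3_stack:.0f} GB/s")
print(f"four stacks (4x16 GB) : ~{4 * hbm3_stack / 1000:.2f} TB/s, 64 GB total")
print(f"one DDR5-6000 channel : ~{ddr5_chan:.0f} GB/s")
```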

Strangely enough, I can see consumer CPUs being easier to adapt to include HBM than server CPUs, just due to size constraints on current interposers: the extra pins on the I/O die to accommodate the HBM would be a big increase per die over current I/O dies, and looking at current Genoa chips, it's already fairly constrained in terms of size.


Re-reading the article, they are not talking about using HBM beside the die, but about stacking HBM *on top* of the CPU/GPU. Basically, imagine the 3D stacking on the 5800X3D, but with 8 GB, 1024-bit modules.

That makes a lot more sense than just adding HBM on an interposer, and may be simpler to manufacture (since they don't need the interposer itself). It would remove one of the big problems of using HBM and make it a lot more viable.
Actually, they are talking about including DRAM via 3D stacking rather than HBM, for the reasons I describe above. What they are trying to show is that by including some or all of the DRAM directly on the CPU die instead of having it external, the energy saved transferring data to and from memory is on the order of 60x.

DRAM stacking is FAR easier to do due to its relative simplicity vs an HBM controller.
 
Joined
Oct 12, 2005
Messages
682 (0.10/day)
That would then also require a separate I/O die for the HBM to interact with the rest of the memory, and those are VERY pin-heavy due to the inherent nature of HBM. Consider that DDR5 is the equivalent of 40 data lines per channel including ECC, vs the 1024 of HBM.


So HBM3 seems to be sitting at 16 GB per stack on a 1024-bit bus. Current figures from an H100 show a memory bandwidth of ~600 GB/s per stack, so 4 stacks would be ~2.4 TB/s of bandwidth with a capacity of 64 GB.

Strangely enough, I can see consumer CPUs being easier to adapt to include HBM than server CPUs, just due to size constraints on current interposers: the extra pins on the I/O die to accommodate the HBM would be a big increase per die over current I/O dies, and looking at current Genoa chips, it's already fairly constrained in terms of size.



Actually, they are talking about including DRAM via 3D stacking rather than HBM, for the reasons I describe above. What they are trying to show is that by including some or all of the DRAM directly on the CPU die instead of having it external, the energy saved transferring data to and from memory is on the order of 60x.

DRAM stacking is FAR easier to do due to its relative simplicity vs an HBM controller.
It's not known right now if they would go with HBM or a new, different solution. But the pin count is not an issue with TSVs in the chip. They already connect the L3 that way, and I'm not sure how many pins they use, but it's a significant number.

The memory controller is one thing, but so is the location of those pins. It would probably require a special type of DRAM that AMD would either make themselves or ask a third party to produce, to ensure it can connect with those TSVs. HBM has pins all around, and you can't put TSVs all around the chip right now. (Or they could use an interposer between the DRAM and the controller die, but that seems costly.)

I do not think the amount of silicon space is a real issue for now. They can probably package another 7 nm die or just have a bigger I/O die. We will see what they do.

They could, for example, have a special DRAM die that has the control logic and only use the TSVs between that and the I/O die or CCD. There are a lot of possibilities.
 
Last edited:
Joined
Nov 4, 2005
Messages
11,746 (1.73/day)
System Name Compy 386
Processor 7800X3D
Motherboard Asus
Cooling Air for now.....
Memory 64 GB DDR5 6400Mhz
Video Card(s) 7900XTX 310 Merc
Storage Samsung 990 2TB, 2 SP 2TB SSDs, 24TB Enterprise drives
Display(s) 55" Samsung 4K HDR
Audio Device(s) ATI HDMI
Mouse Logitech MX518
Keyboard Razer
Software A lot.
Benchmark Scores Its fast. Enough.
A water block with a recess to hold the stacked layers, and liquid metal as the interface medium, would be no different than the existing stepped vapor chambers. Really, the ability to put a coating on the active die surface and to stop cooling through the inactive die side would make a much larger difference. Bonding to the fiberglass substrate is still the same; wiring the die is more complicated, but maybe soon enough we will print cache on the inactive side anyway and have a two-layer piece of glass.
 
Joined
Dec 26, 2020
Messages
367 (0.29/day)
System Name Incomplete thing 1.0
Processor Ryzen 2600
Motherboard B450 Aorus Elite
Cooling Gelid Phantom Black
Memory HyperX Fury RGB 3200 CL16 16GB
Video Card(s) Gigabyte 2060 Gaming OC PRO
Storage Dual 1TB 970evo
Display(s) AOC G2U 1440p 144hz, HP e232
Case CM mb511 RGB
Audio Device(s) Reloop ADM-4
Power Supply Sharkoon WPM-600
Mouse G502 Hero
Keyboard Sharkoon SGK3 Blue
Software W10 Pro
Benchmark Scores 2-5% over stock scores
Unless the memory can transfer heat very well, I don't see this ever being great for the compute part's thermals. That thing will overheat once put under some intense mixed load. I remember seeing those in-silicon "water" channels; maybe that'll solve it?
 

hs4

Joined
Feb 15, 2022
Messages
106 (0.13/day)
I don't think this is such an expensive technology. Stacking memory on top of logic was already done by Intel with Lakefield in 2020, and CPUs derived from mobile applications, such as Apple silicon, have already introduced designs that directly connect memory to the CPU package. EMIB or an equivalent packaging technology would be a low-cost and thermally tolerant solution for desktop packages. Of course, for faster applications, I would think it would be stacked directly on the logic using Cu-Cu bonding, like V-Cache.

 
Last edited:
Joined
Apr 18, 2019
Messages
2,104 (1.13/day)
Location
Olympia, WA
System Name Sleepy Painter
Processor AMD Ryzen 5 3600
Motherboard Asus TuF Gaming X570-PLUS/WIFI
Cooling FSP Windale 6 - Passive
Memory 2x16GB F4-3600C16-16GVKC @ 16-19-21-36-58-1T
Video Card(s) MSI RX580 8GB
Storage 2x Samsung PM963 960GB nVME RAID0, Crucial BX500 1TB SATA, WD Blue 3D 2TB SATA
Display(s) Microboard 32" Curved 1080P 144hz VA w/ Freesync
Case NZXT Gamma Classic Black
Audio Device(s) Asus Xonar D1
Power Supply Rosewill 1KW on 240V@60hz
Mouse Logitech MX518 Legend
Keyboard Red Dragon K552
Software Windows 10 Enterprise 2019 LTSC 1809 17763.1757
Regardless of the new software-side technologies that 'Processing in/on RAM / Memory in/on Processing' can facilitate,* I have very mixed feelings about the concept.
Ever-increasing 'integration' has been the source of much performance uplift and reduction in latency, but all I see is less modularity.

* - Ovonic Junction compute-in-PCM looks to be a potential 'game changer' in regards to AI/MI hardware acceleration.
Tangentially related: apparently, that's a part of the whole 'What's going on w/ Optane' situation... IP shenanigans over tech recognized as potentially game-changing.
 