
AMD Envisions Stacked DRAM on top of Compute Chiplets in the Near Future

btarunr

Editor & Senior Moderator
AMD in its ISSCC 2023 presentation detailed how it has advanced data-center energy efficiency and managed to keep up with Moore's Law, even as semiconductor foundry node advances have tapered off. Perhaps its most striking prediction for server processors and HPC accelerators is multi-layer stacked DRAM. The company has, for some time now, made logic products, such as GPUs, with stacked HBM. These have been multi-chip modules (MCMs), in which the logic die and HBM stacks sit on top of a silicon interposer. While this conserves PCB real-estate compared to discrete memory chips/modules, it is inefficient in its use of the substrate; the interposer is essentially a silicon die with microscopic wiring running between the chips stacked on top of it.

AMD envisions that the high-density server processor of the near future will have many layers of DRAM stacked on top of logic chips. Such a method of stacking conserves both PCB and substrate real-estate, allowing chip designers to cram even more cores and memory per socket. The company also sees a greater role for in-memory compute, where simple compute and data-movement functions can be executed directly in memory, saving round-trips to the processor. Lastly, the company talked about the possibility of an on-package optical PHY, which would simplify network infrastructure.
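To make the "saved round-trips" point concrete, here is a minimal toy model; it is purely illustrative (not from AMD's presentation), and the element count and command size are assumptions. It counts the bytes that cross the external memory link for a simple reduction, done once by the CPU and once by a hypothetical compute unit inside the memory stack:

```python
# Toy model of the data-movement argument for in-memory compute.
# All sizes below are illustrative assumptions, not AMD figures.

ELEMENT_SIZE = 8           # bytes per value (64-bit)
N = 1_000_000_000          # one billion values resident in DRAM

def cpu_side_sum_traffic(n: int) -> int:
    """CPU reads every element over the memory link, then sums locally."""
    return n * ELEMENT_SIZE              # every operand crosses the link

def in_memory_sum_traffic(n: int) -> int:
    """A compute unit inside the DRAM stack performs the reduction;
    only the offload command and the final scalar cross the link."""
    command_bytes = 64                   # assumed size of the offload request
    result_bytes = ELEMENT_SIZE          # one 64-bit result comes back
    return command_bytes + result_bytes

print(f"CPU-side reduction moves  {cpu_side_sum_traffic(N) / 1e9:.1f} GB over the link")
print(f"In-memory reduction moves {in_memory_sum_traffic(N)} bytes over the link")
```

The only point is that trivial reductions and filters shrink link traffic from O(n) to O(1); the actual energy and latency savings depend entirely on the implementation.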



View at TechPowerUp Main Site | Source
 
Should be great for mobile gaming, as for desktop, not so much...
 
Given how well 3D V-Cache works, having some DRAM (an additional tier of memory) close to the CPU die will be greatly helpful for a whole range of applications.
 
Any thoughts on how to effectively cool this stacked tower?
Or is this only aimed at low-frequency, high-core-count server CPUs?

Anyway, nice innovation.
 
I see the point for mobile, where space and energy savings are important, or servers, where compute per area matters, but on desktop it's a tough ask. Current iterations of 3D V-Cache sit on top of the existing L3 on the chiplet. Even so, it compromises the chiplet's boost clock speeds due to voltage limitations and has issues with cooling.

So if AMD plans to utilize this in the future, they and TSMC first have to solve the cooling and compromised-clocks issues. The 7800X3D loses 400-700 MHz of potential boost because of them. That's a significant chunk of clock speed.

The other issue is release timing. Currently it takes half a year before desktop gets X3D models. I hope in the future these will be the default at the day-one launch of a new architecture.
But stacking clearly is the future. Sockets are already enormous: desktop sockets approach 2,000 pins and server sockets exceed 6,000.
 
This will be 'cool', to skip the CPU block and pass the water right through the inner-core layers :)
I give it zero applicability though, as today's CPUs have more than the quoted "10,000 interconnects per cm2", something like 10,000 times more.
 
HBM3 prices have just been increased enormously. Let's see what customers think of the new pricing down the track.
 
The optical part has me fascinated. It reminds me of a discussion I had raised about them, where I was trying to brainstorm where and how these might be applied in ways that could work, make sense, and provide innovation at the same time.

https://www.techpowerup.com/forums/...ged-photonics-for-nvlink.276139/#post-4418550

Another thing related to optics I had mentioned was this.

"I could see a optical path potentially interconnecting them all quickly as well."

That was in regard to FPGA/CPU/GPU chiplets as well as 3D stacking.

I spoke some about combining DIMMs with chiplets and felt it could be good, if for no other reason than for compression/decompression from those. This is neat though: combining TSVs with a chiplet and stacking DRAM directly on top of it. Perhaps some optical interconnects in place of TSVs could work too. That would add another layer of complications, though; if you had the optical connection on the substrate and then used an optical connection in place of TSVs, you could shave off some latency, I think. I don't know, maybe that's a bit of what they've already envisioned here.

Eventually they could maybe have an I/O die in the center and 8 surrounding chiplets on a substrate, and below that another substrate connected to it optically with 3D-stacked cache. The way it could connect with the substrate above is that each chiplet along the edge around the I/O die could connect via an optical link to the 3D-stacked cache below. In fact you could even cool it a bit as well, because the cache itself can be inverted and cooled on that side easily enough, regardless of the optics. The only barrier I see is the cost of optics and how well they can be shrunk down in size while still functioning as an interconnect.
 
Guys, don't get too excited yet. These technologies will surely be quite expensive in the beginning; slim chances of seeing them soon in consumer products.
 
I have been envisioning this ever since the first HBM GPU came out. Putting 16/32 GB of HBM/2/3 next to a CPU would allow for memory speeds in excess of any DDR4/5 stick, and it would save a tremendous amount of space on the motherboard. Put a decent iGPU in such a chip and it would all but eliminate the mid-range GPU market. And I'm sure that greedy CPU makers like Intel would cream their pants at the possibility of offering every CPU in 2-5 different SKUs based on built-in memory amount.
With chiplets it could be even more feasible: one CPU chiplet, one GPU chiplet, one memory controller, and one or more HBM stacks. It would be perfect for both consoles and SFF builds.
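For a rough sense of that bandwidth gap, here is a back-of-envelope comparison; the transfer rates are assumed round numbers (DDR5-6400 and 6.4 Gb/s per HBM3 pin), not any specific product's spec:

```python
# Back-of-envelope peak-bandwidth comparison: dual-channel DDR5 DIMMs
# vs. a single HBM3 stack. Transfer rates below are assumptions.

def peak_gbs(bus_bits: int, transfer_gtps: float) -> float:
    """Peak bandwidth in GB/s = bus width in bytes * transfers per second (GT/s)."""
    return (bus_bits / 8) * transfer_gtps

ddr5_dual_channel = peak_gbs(bus_bits=128,  transfer_gtps=6.4)   # 2x 64-bit DDR5-6400
hbm3_single_stack = peak_gbs(bus_bits=1024, transfer_gtps=6.4)   # one 1024-bit HBM3 stack

print(f"Dual-channel DDR5-6400 : ~{ddr5_dual_channel:.0f} GB/s")  # ~102 GB/s
print(f"Single HBM3 stack      : ~{hbm3_single_stack:.0f} GB/s")  # ~819 GB/s
```

Even a single stack lands roughly an order of magnitude above a dual-channel DIMM setup, which is the whole appeal for an iGPU-heavy package.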
 
HBM3 prices have just been increased enormously. Let's see what customers think of the new pricing down the track.
The same packaging tech is also usable for integrating DDR or LPDDR.

HBM isn't universally better; it requires more exotic, dense interconnects and giant memory controllers.
 
The same packaging tech is also usable for integrating DDR or LPDDR.

HBM isn't universally better; it requires more exotic, dense interconnects and giant memory controllers.

Are you sure? I do believe an HBM controller is smaller than a GDDR(x) one... it's certainly more efficient and requires less power.
 
I guessed a long time ago that this would be the future of all PCs.

But prosumer and consumer products will still need upgradable RAM of some sort.

I mean, the stacked DRAM is there for sure, but the CPU/SoC should still keep the 128/160-bit IMC for expandability. The thing is, a tiered memory hierarchy will prioritize processes for the stacked DRAM and/or regular RAM depending on power/bandwidth/latency requirements.

Depending on how hard and costly the stacked DRAM is, AMD might have one or two options, and big OEMs might request some custom orders.
 
This is the most exciting thing I've seen in a long time. This would certainly be a huge step forward for the server market.
 
HBM as a pseudo-L4 cache would be a big benefit to a lot of things, as X3D has proven there is a lot of software/games/databases that would take advantage of the wider/faster transfer speeds of HBM vs traditional DRAM.

However, there would nearly always be a need for DRAM connectivity, as datasets in most areas are growing quicker than what is feasible with direct memory. (Currently up to 6 TB of memory per socket in the server space.)

Also, you are adding more power requirements to the CPU die (~30 watts) just for the HBM and would need a radically different I/O die to accommodate its integration. So you start running into packaging issues, as the size of some of the server CPUs is getting into the realms of ridiculousness. (Consumer CPUs are 4x4 cm, whereas the newest AMD Genoa is already pushing 7.5x7.5 cm.)
 
Should be great for mobile gaming, as for desktop, not so much...
That depends; it would suck if they removed the ability to add more memory via DIMM slots on the motherboard. Otherwise it might not.
HBM as a pseudo-L4 cache would be a big benefit to a lot of things, as X3D has proven there is a lot of software/games/databases that would take advantage of the wider/faster transfer speeds of HBM vs traditional DRAM.

However, there would nearly always be a need for DRAM connectivity, as datasets in most areas are growing quicker than what is feasible with direct memory. (Currently up to 6 TB of memory per socket in the server space.)

Also, you are adding more power requirements to the CPU die (~30 watts) just for the HBM and would need a radically different I/O die to accommodate its integration. So you start running into packaging issues, as the size of some of the server CPUs is getting into the realms of ridiculousness. (Consumer CPUs are 4x4 cm, whereas the newest AMD Genoa is already pushing 7.5x7.5 cm.)

That really depends on how much memory they add on top of the die. If they add more than a few GB, a cache would be hard to maintain, and it's possible it would be better to just use that memory as standard memory (and bypass all the cache-lookup overhead). We can do that with NUMA. Memory tiering would be something new on desktop, but not on servers and mainframes.

It could require some time for Windows to adapt, but imagine you have 16 GB on top of your APU and then 32 GB or more of DDR5 DIMMs. You play a game and the OS puts the most-used data into the on-die memory, and all the extra stuff gets moved to the slower DIMMs.

If you use less than 16 GB of RAM, the DIMM memory could be powered down.
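A crude sketch of that kind of placement policy, with a hypothetical stacked-tier size and a simple hotness heuristic; this is purely illustrative and not how Windows or any real tiering driver works:

```python
# Toy two-tier memory placement: the hottest pages are pinned to the
# on-package stacked DRAM, everything else spills to DIMM-backed DRAM.
# Tier size, page size and the access-count heuristic are assumptions.

def place_pages(pages, fast_tier_bytes, page_size=4096):
    """pages: list of (page_id, access_count). Returns (fast, slow) page-id lists."""
    fast, slow = [], []
    budget = fast_tier_bytes
    # Hottest pages first, until the stacked tier is full.
    for page_id, hits in sorted(pages, key=lambda p: p[1], reverse=True):
        if budget >= page_size:
            fast.append(page_id)
            budget -= page_size
        else:
            slow.append(page_id)
    return fast, slow

# Example: three hot pages, two cold ones, room for three pages in the fast tier.
fast, slow = place_pages([(0, 900), (1, 5), (2, 850), (3, 2), (4, 700)],
                         fast_tier_bytes=3 * 4096)
print("stacked DRAM tier:", fast)  # [0, 2, 4] - the hot pages
print("DIMM tier        :", slow)  # [1, 3] - cold pages; an empty DIMM tier could be powered down
```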
 
Re-reading the article, they are talking not about using HBM, but about stacking HBM *on top* of the CPU/GPU. Basically imagine the 3D stacking on the 5800X3D, but with 8 GB, 1024-bit modules.

That makes a lot more sense than just adding HBM on an interposer, and may be simpler to manufacture (since they don't need the interposer itself). It would remove one of the big problems of using HBM and make it a lot more viable.
 
That really depends on how much memory they add on top of the die. If they add more than a few GB, a cache would be hard to maintain, and it's possible it would be better to just use that memory as standard memory (and bypass all the cache-lookup overhead). We can do that with NUMA. Memory tiering would be something new on desktop, but not on servers and mainframes.
That would then also require a separate I/O die for the HBM to interact with the rest of the memory, and those are VERY pin-heavy due to the inherent nature of HBM. Consider that DDR5 is the equivalent of 40 data lines per channel including ECC, vs the 1024 of HBM.

That depends; it would suck if they removed the ability to add more memory via DIMM slots on the motherboard. Otherwise it might not.


It could require some time for Windows to adapt, but imagine you have 16 GB on top of your APU and then 32 GB or more of DDR5 DIMMs. You play a game and the OS puts the most-used data into the on-die memory, and all the extra stuff gets moved to the slower DIMMs.

If you use less than 16 GB of RAM, the DIMM memory could be powered down.
So HBM3 seems to be sitting at 16 Gb per die (16 GB per 8-high stack) on a 1024-bit bus. Current figures from an H100 show a memory bandwidth of ~600 GB/s per stack, so four stacks would be ~2.4 TB/s of bandwidth with a capacity of 64 GB.
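Sanity-checking that arithmetic with the same per-stack figures (the 600 GB/s and 16 GB numbers are the rough values quoted above, not an official spec):

```python
# Aggregate bandwidth and capacity for four HBM3 stacks, using the
# rough per-stack figures from the post above.
stacks = 4
per_stack_bandwidth_gbs = 600   # ~GB/s per stack (H100-class ballpark)
per_stack_capacity_gb = 16      # 8-high stack of 16 Gb dies

print(f"Bandwidth: ~{stacks * per_stack_bandwidth_gbs / 1000:.1f} TB/s")  # ~2.4 TB/s
print(f"Capacity : {stacks * per_stack_capacity_gb} GB")                  # 64 GB
```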

Strangely enough, I can see consumer CPUs being easier to adapt to include HBM than server CPUs, just due to size constraints on current interposers: the extra pins on the I/O die to accommodate the HBM would be a big increase per die over current I/O dies, and looking at current Genoa chips, they're already fairly constrained in terms of size.


Re-reading the article, they are talking not about using HBM, but about stacking HBM *on top* of the CPU/GPU. Basically imagine the 3D stacking on the 5800X3D, but with 8 GB, 1024-bit modules.

That makes a lot more sense than just adding HBM on an interposer, and may be simpler to manufacture (since they don't need the interposer itself). It would remove one of the big problems of using HBM and make it a lot more viable.
Actually, they are talking about including DRAM via 3D stacking rather than HBM, for the reasons I describe above. What they are trying to say/show is that by including some/all of the DRAM directly on the CPU die vs having it external, the energy saved transferring data to and from memory is on the order of 60x.

DRAM stacking is FAR easier to do, given its relative simplicity versus an HBM controller.
 
That would then also require a separate I/O die for the HBM to interact with the rest of the memory, and those are VERY pin-heavy due to the inherent nature of HBM. Consider that DDR5 is the equivalent of 40 data lines per channel including ECC, vs the 1024 of HBM.


So HBM3 seems to be sitting at 16 Gb per die (16 GB per 8-high stack) on a 1024-bit bus. Current figures from an H100 show a memory bandwidth of ~600 GB/s per stack, so four stacks would be ~2.4 TB/s of bandwidth with a capacity of 64 GB.

Strangely enough, I can see consumer CPUs being easier to adapt to include HBM than server CPUs, just due to size constraints on current interposers: the extra pins on the I/O die to accommodate the HBM would be a big increase per die over current I/O dies, and looking at current Genoa chips, they're already fairly constrained in terms of size.



Actually, they are talking about including DRAM via 3D stacking rather than HBM, for the reasons I describe above. What they are trying to say/show is that by including some/all of the DRAM directly on the CPU die vs having it external, the energy saved transferring data to and from memory is on the order of 60x.

DRAM stacking is FAR easier to do, given its relative simplicity versus an HBM controller.
It's not known right now whether they would go with HBM or a new, different solution. But pin count is not an issue with TSVs in the chip. They already connect the L3 via those, and I'm not sure how many pins they use, but it's a significant number.

The memory controller is one thing, but so is the location of those pins. It would probably require a special type of DRAM that AMD would either make themselves or ask a third party to produce, to ensure it can connect with those TSVs. HBM has pins all around; you can't put TSVs all around the chip right now. (Or they use an interposer between the DRAM and the controller die, but that seems costly.)

I do not think the amount of silicon space is a real issue for now. They can probably package another 7 nm die or just have a bigger I/O die. We will see what they do.

They could, for example, have special DRAM dies that contain the control logic and only use the TSVs between those and the I/O die or CCD. There are a lot of possibilities.
 
A water block with a recess to hold the stacked layers and liquid metal as the interface medium would be no different than the existing stepped vapor chambers. Really, the ability to put a coating on the active die surface and to stop cooling through the inactive die side would make a much larger difference. Bonding to the fiberglass substrate is still the same, and wiring the die is more complicated, but maybe soon enough we will print cache on the inactive side anyway and have a two-layer piece of glass.
 
Unless the memory can transfer heat very well, I don't see this ever being great for the compute part's thermals. That thing will overheat once put under some intense mixed load. I remember seeing those in-silicon "water" channels; maybe that'll solve it?
 
I don't think this is such an expensive technology. Stacking memory on top of logic was already done by Intel with Lakefield in 2020, and CPUs derived from mobile applications, such as Apple silicon, have already introduced designs that directly connect memory to the CPU package. EMIB or equivalent packaging technology would be a low-cost and thermally tolerant solution for desktop packages. Of course, for faster applications, I would think it would be stacked directly on the logic using Cu-Cu bonding like V-Cache.

 
Regardless of the new software-side technologies that 'processing in/on RAM / memory in/on processing' can facilitate*, I have very mixed feelings about the concept.
Ever-increasing 'integration' has been the source of much performance uplift and reduction in latency, but all I see is less modularity.

* - Ovonic-junction compute-in-PCM looks to be a potential 'game changer' with regard to AI/ML hardware acceleration.
Tangentially related: apparently, that's a part of the whole 'what's going on with Optane' situation... IP shenanigans over tech recognized as potentially game-changing.
 