
Micron Ships HBM4 Samples: 12-Hi 36 GB Modules with 2 TB/s Bandwidth

AleksandarK

News Editor
Staff member
Joined
Aug 19, 2017
Messages
3,136 (1.10/day)
Micron has reached a significant milestone with HBM4: the new stacks combine 12 DRAM dies (12-Hi) to provide 36 GB of capacity per package. According to company representatives, initial engineering samples are scheduled to ship to key partners in the coming weeks, paving the way for full production in early 2026. The HBM4 design relies on Micron's established 1β ("one-beta") process node for the DRAM tiles, in production since 2022, while the company prepares to introduce its EUV-enabled 1γ ("one-gamma") node for DDR5 later this year. By doubling the interface width from 1,024 to 2,048 bits per stack, each HBM4 device can sustain a memory bandwidth of 2 TB/s, alongside a 20% improvement in power efficiency over the existing HBM3E generation.
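As a rough sanity check of those figures (back-of-the-envelope arithmetic, not numbers from the article): peak per-stack bandwidth is simply interface width times per-pin data rate. A minimal Python sketch, assuming an HBM3E-class stack drives its 1,024-bit interface at roughly 9.2 Gb/s per pin, while the 2,048-bit HBM4 interface only needs about 7.8 Gb/s per pin to land at 2 TB/s:

```python
# Back-of-the-envelope peak-bandwidth arithmetic for one HBM stack.
# Per-pin data rates below are illustrative assumptions, not figures from the article.

def stack_bandwidth_gb_s(interface_bits: int, pin_rate_gbps: float) -> float:
    """Peak bandwidth in GB/s = interface width (bits) * per-pin rate (Gb/s) / 8 bits per byte."""
    return interface_bits * pin_rate_gbps / 8

print(stack_bandwidth_gb_s(1024, 9.2))  # HBM3E-class stack: ~1178 GB/s (~1.2 TB/s)
print(stack_bandwidth_gb_s(2048, 7.8))  # HBM4 sample as described: ~1997 GB/s (~2 TB/s)
```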

NVIDIA and AMD are expected to be early adopters of Micron's HBM4. NVIDIA plans to integrate these memory modules into its upcoming Rubin-Vera AI accelerators in the second half of 2026. AMD is anticipated to incorporate HBM4 into its next-generation Instinct MI400 series, with further information to be revealed at the company's Advancing AI 2025 conference. The increased capacity and bandwidth of HBM4 will address growing demands in generative AI, high-performance computing, and other data-intensive applications. Larger stack heights and expanded interface widths enable more efficient data movement, a critical factor in multi-chip configurations and memory-coherent interconnects. As Micron moves toward mass production of HBM4, the key open questions will be thermal performance and real-world results, which will determine how effectively this new memory standard can support the most demanding AI workloads.



View at TechPowerUp Main Site | Source
 
Place your bets, ladies and gentlemen: will we have high-end nVidia 6090 and/or AMD RTXx090 GPUs with 2 TB/s 36GB HBM4?
 
Place your bets, ladies and gentlemen: will we have high-end nVidia 6090 and/or AMD RTXx090 GPUs with 2 TB/s 36GB HBM4?
I doubt the HBM part.
But we're pretty close to 2TB/s already with the 5090, so the next consumer gen with better binned GDDR7 might manage to achieve that.
 
Place your bets, ladies and gentlemen: will we have high-end nVidia 6090 and/or AMD RTXx090 GPUs with 2 TB/s 36GB HBM4?
Well, you most certainly can have that......
For ~$15-20k, nottaproblemo, hahahahaha :D
 
I doubt the HBM part.
But we're pretty close to 2TB/s already with the 5090, so the next consumer gen with better binned GDDR7 might manage to achieve that.
GDDR7 requires 16 devices for that bandwidth whereas HBM4 can manage it with just one stack. Of course, HBM is far too expensive to be used in consumer GPUs.
 
Didn't the Radeon 6900 XT have some cache that was 1.5 TB/s five years ago? That wasn't HBM either.
 
Place your bets, ladies and gentlemen: will we have high-end nVidia 6090 and/or AMD RTXx090 GPUs with 2 TB/s 36GB HBM4?
Extremely low chances of that happening - especially with Nvidia. Slightly higher chance with AMD, since UDNA likely includes both G7 and HBM controllers. And not because of price - I think mostly because of supply: all HBM supply is going to data centers. Besides, GDDR7 just launched earlier this year and it's not maxed out yet. As much as I like HBM and its compact size, I have to be realistic on this.

G7 also has 3 GB modules now. Coupled with speeds approaching 40 Gbps, that would mean the following bandwidths at these bus widths:

96-bit: 460 GB/s. Likely 4x2GB for 8GB capacity on an entry-level card.
128-bit: 640 GB/s. Likely 6x2GB for 12GB capacity on a low-end card.
192-bit: 960 GB/s. Likely 8x2GB for 16GB capacity on a midrange card.
256-bit: 1.3 TB/s. Likely 6x3GB for 18GB capacity on an upper-midrange card.
320-bit: 1.6 TB/s. Likely 8x3GB for 24GB capacity on a high-end card.
352-bit: 1.8 TB/s. Likely 10x3GB for 30GB capacity on a high-end card.
384-bit: 2.0 TB/s. Likely 12x3GB for 36GB capacity on an enthusiast card.
512-bit: 2.5 TB/s. Likely 12x3GB for 36GB capacity on an enthusiast card.

G7 still needs multiples of 2 chips. At least that's the way I see it. I'm sure Nvidia will see fit to give us lower speeds and predominantly still use 2GB modules.
I included both 384-bit and 512-bit, but generally it has been an OR situation where only one of these has been used for the flagship. Hence the same capacity.
Obviously it's possible to do a clamshell 24x2GB=48GB or 24x3GB=72GB, but such capacities are not really needed for gaming GPUs.

Overall it's possible to match one stack of HBM4 in capacity and speed with a 12x3GB 384-bit G7 configuration. Obviously it will take up more space on the PCB and will likely consume twice as much power vs HBM4.
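As a quick sanity check of that equivalence (my own arithmetic, assuming one 32-bit GDDR7 device per channel and the ~40 Gbps per-pin figure from above):

```python
# Rough GDDR7 configuration arithmetic (illustrative; assumes one 32-bit device
# per 32 bits of bus width and ~40 Gb/s per pin).

def gddr7_config(bus_bits: int, module_gb: int, pin_rate_gbps: float = 40.0):
    devices = bus_bits // 32                       # 32-bit GDDR7 devices on the bus
    capacity_gb = devices * module_gb              # total VRAM without clamshell
    bandwidth_gb_s = bus_bits * pin_rate_gbps / 8  # peak bandwidth in GB/s
    return devices, capacity_gb, bandwidth_gb_s

# 384-bit bus with 3 GB modules: 12 devices, 36 GB, ~1.92 TB/s -- roughly one HBM4 stack
print(gddr7_config(384, 3))
```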
HBM is far too expensive to be used in consumer GPUs.
Which version? AMD was able to make a profit on a 16 GB HBM2 card six years ago. I don't see why the same couldn't be done today with HBM3 or HBM3E on a high-end card. HBM4 will obviously be reserved for data center monsters. Also, using HBM would reduce PCB complexity and lower cost because it's already on an interposer. It would also lower the card's power consumption, either in total or by leaving more power budget for the GPU itself.
 
GDDR7 requires 16 devices for that bandwidth whereas HBM4 can manage it with just one stack. Of course, HBM is far too expensive to be used in consumer GPUs.
Yeah, on a 512-bit bus. But I don't think we'd be seeing a single HBM stack on an entry/mid-level GPU anyway, so the comparison still stands.
I may be wrong on that, but I believe that 16x GDDR7 modules should be cheaper than an HBM stack, especially when we include the production cost.

96-bit: 460 GB/s. Likely 4x2GB for 8GB capacity on an entry-level card.
128-bit: 640 GB/s. Likely 6x2GB for 12GB capacity on a low-end card.
192-bit: 960 GB/s. Likely 8x2GB for 16GB capacity on a midrange card.
256-bit: 1.3 TB/s. Likely 6x3GB for 18GB capacity on an upper-midrange card.
320-bit: 1.6 TB/s. Likely 8x3GB for 24GB capacity on a high-end card.
352-bit: 1.8 TB/s. Likely 10x3GB for 30GB capacity on a high-end card.
384-bit: 2.0 TB/s. Likely 12x3GB for 36GB capacity on an enthusiast card.
512-bit: 2.5 TB/s. Likely 12x3GB for 36GB capacity on an enthusiast card.
Minor nit, but I guess you got some numbers wrong.
As an example, 320-bit would be 10 channels, so either 10x 2GB or 3GB modules for 20 or 30GB in total, assuming those 40 Gbps modules.
512-bit would be 16 channels, so 16x 2GB or 3GB modules for 32 or 48GB in total (not accounting for clamshell).

G7 still needs multiples of 2 chips.
It doesn't, that's a matter of how many controllers you have. Nothing stops you from having 5 controllers for a 160-bit bus, or 11 controllers for a 352-bit bus.
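A tiny illustration of that point (my own sketch, assuming one 32-bit device per channel and 2 GB or 3 GB modules):

```python
# Device/controller count is just bus width divided by the 32-bit channel width;
# nothing forces an even number of devices (illustrative arithmetic only).

for bus_bits in (160, 320, 352, 512):
    devices = bus_bits // 32
    print(f"{bus_bits}-bit bus: {devices} devices -> "
          f"{devices * 2} GB with 2 GB modules, {devices * 3} GB with 3 GB modules")
```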
 
It doesn't, that's a matter of how many controllers you have. Nothing stops you from having 5 controllers for a 160-bit bus, or 11 controllers for a 352-bit bus.
It's doable, but not a good idea for performance and complexity reasons. Look how the GTX 970 turned out.
 
It's doable, but not a good idea for performance and complexity reasons. Look how the GTX 970 turned out.
The 970 had a different issue with how the controllers were wired up: it had a proper 256-bit bus with all of its 4x 64-bit controllers in place. The issue was how one of those controllers was connected to the L2 cache/crossbar (one ROP/L2 partition was disabled), not the bus width itself.
There are no complexity or performance issues related to having an odd number of controllers.
The 1080 Ti and 2080 Ti are great counter-examples with their 352-bit buses, along with the RX 6700 and its 160-bit bus.
 
Yeah, on a 512-bit bus. But I don't think we'd be seeing a single HBM stack on an entry/mid-level GPU anyway, so the comparison still stands.
I may be wrong on that, but I believe that 16x GDDR7 modules should be cheaper than an HBM stack, especially when we include the production cost.

...
Given previous estimates for the cost of HBM, I wouldn't be surprised if one stack of HBM was significantly more expensive than 32 GB of GDDR7. As @Tomorrow pointed out, an additional benefit of HBM is reduced chip area devoted to memory PHYs. This would allow an even larger GPU or a slightly smaller GPU with reduced TDP due to the greater power efficiency of HBM when compared to GDDR.
...
Which version? AMD was able to make a profit on a 16 GB HBM2 card six years ago. I don't see why the same couldn't be done today with HBM3 or HBM3E on a high-end card. HBM4 will obviously be reserved for data center monsters. Also, using HBM would reduce PCB complexity and lower cost because it's already on an interposer. It would also lower the card's power consumption, either in total or by leaving more power budget for the GPU itself.
A product like the 5090 could certainly make money using HBM, but at least in 2023, CoWoS, the packaging required for HBM, was a bottleneck. Given that bottleneck, it makes sense to utilize HBM only for the most expensive data center products such as Nvidia's B200 and AMD's MI325X.
 