
AMD Patents Chiplet-based GPU Design With Active Cache Bridge

Joined
Jan 3, 2021
Messages
130 (1.34/day)
Location
Exexfirstladyland
AMD may be experimenting with ways to separate the processing cores, built on the latest node they can get their hands on, from the cache. The cache could be built on a second-best node - GlobalFoundries' 12nm for now, later something like TSMC 7nm. Static RAM doesn't scale well with node shrinks - at least its area doesn't; I don't know about performance and power. So the cache is possibly a good candidate for offloading to a cheaper die. Latency would obviously go up, but maintaining cache coherence would be an easier task, higher latency can be mitigated with a larger cache, and AMD needs to keep buying something from GloFo anyway.
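To put rough numbers on the "higher latency can be mitigated with increased size" point, here is a minimal sketch using the textbook average memory access time formula. Every figure in it (hit times, miss rates, miss penalty) is a made-up assumption for illustration, not a measurement of any AMD part.

```python
def amat(hit_time_ns, miss_rate, miss_penalty_ns):
    """Average memory access time: hit time plus the expected cost of a miss."""
    return hit_time_ns + miss_rate * miss_penalty_ns

# Hypothetical figures, chosen only to illustrate the trade-off.
# Small on-die cache: faster to hit, but misses more often.
on_die = amat(hit_time_ns=20.0, miss_rate=0.50, miss_penalty_ns=200.0)
# Larger cache on a separate die: slower to hit, but misses less often.
off_die = amat(hit_time_ns=30.0, miss_rate=0.35, miss_penalty_ns=200.0)

print(f"small on-die cache:  {on_die:.1f} ns average")
print(f"large off-die cache: {off_die:.1f} ns average")
```

Under these assumed numbers the bigger, slower cache still comes out ahead on average, but that depends entirely on how much the miss rate actually drops.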
 
Joined
Jul 13, 2016
Messages
1,009 (0.58/day)
Processor Ryzen 3700X
Motherboard ASRock X570 Taichi
Cooling Le Grand Macho
Memory 32GB DDR4 3600 CL16
Video Card(s) EVGA 1080 Ti
Storage Too much
Display(s) Acer 144Hz 1440p IPS 27"
Case Thermaltake Core X9
Audio Device(s) JDS labs The Element II, Dan Clark Audio Aeon II
Power Supply EVGA 850w P2
Mouse G305
Keyboard iGK64 w/ 30n optical switches
Joined
Nov 4, 2005
Messages
10,741 (1.91/day)
System Name MoFo 2
Processor AMD PhenomII 1100T @ 4.2Ghz
Motherboard Asus Crosshair IV
Cooling Swiftec 655 pump, Apogee GT,, MCR360mm Rad, 1/2 loop.
Memory 8GB DDR3-2133 @ 1900 8.9.9.24 1T
Video Card(s) HD7970 1250/1750
Storage Agility 3 SSD 6TB RAID 0 on RAID Card
Display(s) 46" 1080P Toshiba LCD
Case Rosewill R6A34-BK modded (thanks to MKmods)
Audio Device(s) ATI HDMI
Power Supply 750W PC Power & Cooling modded (thanks to MKmods)
Software A lot.
Benchmark Scores Its fast. Enough.
The biggest gains will be in clock speed: multiple domains for multiple chiplets, each of which can be engineered for IPC, clock speed, and/or latency as required.

Imagine 4 chiplets with 4Ghz boost speeds, a 2Ghz cache that is massively parallel with compression technology, a couple tiny chiplets for video encode/decode and for low power applications.

Now add on the stacked die tech that has been learned to create a parallel pipeline for pure vector math for Ray tracing stacked on each of the main 4 chiplets that can read and write to caches on the primary die. Ray tracing with the only performance penalty being extra heat and a fraction of the latency.
 
Joined
May 3, 2018
Messages
313 (0.29/day)
Possibly a glimpse of RDNA4's future; I doubt we'll see this in RDNA3. Most likely it will go up against Hopper, which was delayed and replaced by Lovelace for next gen.
 
Joined
Jul 13, 2016
Messages
1,009 (0.58/day)
Processor Ryzen 3700X
Motherboard ASRock X570 Taichi
Cooling Le Grand Macho
Memory 32GB DDR4 3600 CL16
Video Card(s) EVGA 1080 Ti
Storage Too much
Display(s) Acer 144Hz 1440p IPS 27"
Case Thermaltake Core X9
Audio Device(s) JDS labs The Element II, Dan Clark Audio Aeon II
Power Supply EVGA 850w P2
Mouse G305
Keyboard iGK64 w/ 30n optical switches
Doesn't it say MCM adds as much as +1GHz!

Correct. By separating the CPU cores into a separate die you gain the ability to further bin which CPU die ends up on which CPU. This is how AMD is able to have its 16 core 5950X consume less power than its 12 core part. The 5950X is about 28% more power efficient than other Ryzen 5000 series CPUs through binning alone. AMD likely decided to go for efficiency instead of extra clocks for two reasons: 1) Intel doesn't have anything competitive with its 12 and 16 core mainstream CPUs; 2) power consumption goes up much faster above the sweet spot. Increasing the GHz would improve ST performance but at a cost. AMD likely calculated that given Intel's current prospects, it would be better to focus on efficiency.
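A rough sketch of why power climbs so quickly above the sweet spot: dynamic power scales roughly as C * V^2 * f, and past a certain clock the voltage has to rise too. The voltage/frequency curve below (`core_voltage`, its 3.5 GHz sweet spot, 0.90 V floor and 0.25 V per GHz slope) is entirely hypothetical, as is the assumption that performance scales linearly with clock.

```python
def core_voltage(freq_ghz, sweet_spot_ghz=3.5, v_floor=0.90, volts_per_ghz=0.25):
    """Assumed V/f curve: flat up to the sweet spot, then rising linearly."""
    return v_floor + max(0.0, freq_ghz - sweet_spot_ghz) * volts_per_ghz

def relative_dynamic_power(freq_ghz):
    """Dynamic power ~ C * V^2 * f; the capacitance C is folded into the units."""
    v = core_voltage(freq_ghz)
    return v * v * freq_ghz

base = relative_dynamic_power(3.5)
for f in (3.0, 3.5, 4.0, 4.5):
    power = relative_dynamic_power(f) / base
    perf = f / 3.5  # assumption: performance scales linearly with clock
    print(f"{f:.1f} GHz: power x{power:.2f}, perf x{perf:.2f}, perf/W x{perf / power:.2f}")
```

With this toy curve, the extra clock above the sweet spot costs proportionally far more power than it returns in performance, which is the trade-off described above.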
 
Joined
Jul 12, 2017
Messages
19 (0.01/day)
System Name ROU-Think-Fast
Processor AMD Ryzen 7 5800X
Motherboard B550 AORUS PRO V2 (rev. 1.0)
Cooling Corsair H110i w/ Noctua 2000 RPM fans
Memory 4x8 GB Kingston Hyper X KHX3466C16D4/8GX (B-Die) @ 3600, 15-15-15-30
Video Card(s) ROG-STRIX-GTX1080TI-O11G-GAMING
Storage ADATA SX8200 Pro 1 TB + 250 GB Samsung 850 Evo + 2 x 2 TB Seagate Barracuda
Display(s) Acer Predator XB271HU
Case Fractal Design Define S
Audio Device(s) -
Power Supply EVGA 750W Gold
This is mostly true, although less and less so as more and more techniques reuse generated data (temporal AA, screen-space reflections, etc.). This is also why SLI/Crossfire is dead: the latency to move this data was just way too big.

Can't you have one chiplet dealing with frame/scene level calculations after you've powered through the more easily parallelizable tasks? As in 1 Bigger (perhaps on the hub chip to reduce latency to the cache) + N Small(er)?
 
Joined
Jun 3, 2010
Messages
1,746 (0.44/day)
Increasing the GHz would improve ST performance but at a cost.
You approach this from a CPU standpoint. On a GPU, ST isn't the only factor; internal bandwidth is a major component. There is a lot of bandwidth on a GPU, but bandwidth per CU takes a lot of use to leverage fully, since the memory is external to the chip. Running it faster solves that problem.
Bets: 3.5GHz gpus over the horizon, or not?
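As a back-of-the-envelope illustration of the bandwidth point: external memory bandwidth is fixed by the bus width and data rate, while on-chip bandwidth scales with the core clock. The function names and all the figures below (a 256-bit bus at 16 Gbps per pin, 72 CUs, 64 bytes per clock per CU) are illustrative assumptions, not specifications quoted in this thread.

```python
def dram_bandwidth_gbs(bus_width_bits, data_rate_gbps):
    """External memory bandwidth in GB/s: bus width times per-pin data rate."""
    return bus_width_bits * data_rate_gbps / 8

def on_chip_bandwidth_gbs(num_cus, bytes_per_clock_per_cu, clock_ghz):
    """Aggregate on-chip bandwidth in GB/s; it grows linearly with core clock."""
    return num_cus * bytes_per_clock_per_cu * clock_ghz

external = dram_bandwidth_gbs(bus_width_bits=256, data_rate_gbps=16)
print(f"external memory: {external:.0f} GB/s (independent of core clock)")

for clock in (2.0, 2.5, 3.5):
    internal = on_chip_bandwidth_gbs(num_cus=72, bytes_per_clock_per_cu=64, clock_ghz=clock)
    print(f"core at {clock:.1f} GHz: {internal:.0f} GB/s on-chip, "
          f"{internal / external:.1f}x the external figure")
```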
 
Joined
Jul 13, 2016
Messages
1,009 (0.58/day)
Processor Ryzen 3700X
Motherboard ASRock X570 Taichi
Cooling Le Grand Macho
Memory 32GB DDR4 3600 CL16
Video Card(s) EVGA 1080 Ti
Storage Too much
Display(s) Acer 144Hz 1440p IPS 27"
Case Thermaltake Core X9
Audio Device(s) JDS labs The Element II, Dan Clark Audio Aeon II
Power Supply EVGA 850w P2
Mouse G305
Keyboard iGK64 w/ 30n optical switches
You approach this from a CPU standpoint. On a GPU, ST isn't the only factor; internal bandwidth is a major component. There is a lot of bandwidth on a GPU, but bandwidth per CU takes a lot of use to leverage fully, since the memory is external to the chip. Running it faster solves that problem.
Bets: 3.5GHz gpus over the horizon, or not?
I'd say it's equally possible that we see MCM GPU architectures that simply target the frequency sweet spot and spend any extra power budget on adding more cores, cache, etc. It really depends though; for all we know AMD or Nvidia could design their GPU chiplets to clock very high, and the sweet spot would follow suit. I'm not knowledgeable enough on the topic to say to what extent Nvidia / AMD and TSMC can influence the ideal GPU clock speed based on design / node.
 
Joined
Jun 3, 2010
Messages
1,746 (0.44/day)
I'd say it's equally possible that we see MCM GPU architectures that simply target the frequency sweet spot and spend any extra power budget on adding more cores, cache, etc. It really depends though; for all we know AMD or Nvidia could design their GPU chiplets to clock very high, and the sweet spot would follow suit. I'm not knowledgeable enough on the topic to say to what extent Nvidia / AMD and TSMC can influence the ideal GPU clock speed based on design / node.
Me neither, although some would consider me an old timer.
GPUs do lend themselves to high frequency because the power cost is already paid for. Remember the Hawaii series? AMD never integrated tiled 'buffered' rasterization until Vega, and thus the memory interface never slowed down since it was always running in immediate mode, whereas Nvidia can keep tabs at various memory clocks.
It could improve utilization if the shaders request at a higher rate - gpus are throughput oriented, after all...
 
Joined
Jan 8, 2017
Messages
6,593 (4.25/day)
System Name Good enough
Processor AMD Ryzen R7 1700X - 4.0 Ghz / 1.350V
Motherboard ASRock B450M Pro4
Cooling Deepcool Gammaxx L240 V2
Memory 16GB - Corsair Vengeance LPX - 3333 Mhz CL16
Video Card(s) OEM Dell GTX 1080 with Kraken G12 + Water 3.0 Performer C
Storage 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) 4K Samsung TV
Case Deepcool Matrexx 70
Power Supply GPS-750C
Imagine 4 chiplets with 4Ghz boost speeds

It is going to be a long time before we see that, if ever. Every kind of chip seems to start scaling horribly past the 3 GHz mark, and a GPU in particular will be horrendous efficiency-wise at those kinds of speeds.
 
Joined
Apr 16, 2019
Messages
368 (0.51/day)
So for those of you waiting for AMD to do to nVidia what they did to Intel....

Here it is.

Sounds like RDNA 3 will be an interesting generation for sure!
What they did to Intel? You mean, as soon as they got competitive, they also became both more expensive and hard to get in the first place - what a fantastic prospect for the already beleaguered graphics cards market indeed!
 
Joined
Mar 30, 2021
Messages
10 (0.91/day)
System Name Dell Alienware Aurora R10
Processor Ryzen 5600x
Motherboard Dell 570 or B550
Cooling Alienware AIO sandwiched between two Corsair ML120 Pro's
Memory G.SKILL Ripjaws V Series 32GB cl16
Video Card(s) Radeon RX 6800 XT
Storage Western Digital WD BLACK SN750 NVMe M.2 2280 2TB
Display(s) GIGABYTE G34WQC 34" 144Hz (plus 2 Dell 19" 1280x1024 to flank it)
Case Alienware Auraor r10
Audio Device(s) onboard
Power Supply Dell 1KW
Mouse Logitech Trackman Marble
Keyboard blue glowy thinhy 104 key KB
What they did to Intel? You mean, as soon as they got competitive, they also became both more expensive and hard to get in the first place - what a fantastic prospect for the already beleaguered graphics cards market indeed!
You DO realize that this is market forces at work right?

Demand outstripped supply so far that even though TSMC is running FLAT OUT they still cannot keep up!
They are now spending 100 BILLION DOLLARS over the next three years to build more plants so they can deal with the demand.

Then you have people buying them by the millisecond so fast with their bots that you cannot buy them through normal channels making a bad situation even worse.
But hey they do it because they can make 25 to 50% profit selling on ebay and through the gray market.

AMD made the decision to focus on supplying computer manufacturers and not direct sellers like newegg and amazon.
I just got a 6800xt and 5600x from Dell.
Placed my order, waited a month and here it is! AND I got both for what appears to be MSRP or close to it.


Be sure you are looking at the BIG PICTURE before lambasting people and companies for things that are out of their control.
 
Joined
Jun 3, 2010
Messages
1,746 (0.44/day)
Every kind of chip seems to start scaling horribly past the 3 Ghz mark, a GPU in particular will be horrendous efficiency wise at those kinds of speeds.
This could bring a split multiplier to run internal caches faster than the gpu. Don't dismiss it, the scaling isn't linear because memory is external and not helpful in the gpu pipeline flow directly - gpu speed, however, is. Nothing outside of cache speed changes that (maybe texture caching, too).
 
Joined
Jan 8, 2017
Messages
6,593 (4.25/day)
System Name Good enough
Processor AMD Ryzen R7 1700X - 4.0 Ghz / 1.350V
Motherboard ASRock B450M Pro4
Cooling Deepcool Gammaxx L240 V2
Memory 16GB - Corsair Vengeance LPX - 3333 Mhz CL16
Video Card(s) OEM Dell GTX 1080 with Kraken G12 + Water 3.0 Performer C
Storage 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) 4K Samsung TV
Case Deepcool Matrexx 70
Power Supply GPS-750C
This could bring a split multiplier to run internal caches faster than the gpu. Don't dismiss it, the scaling isn't linear because memory is external and not helpful in the gpu pipeline flow directly - gpu speed, however, is. Nothing outside of cache speed changes that (maybe texture caching, too).

Caches are power hogs with very high energy density per area, and for that reason they usually run slower than the processor itself. The only portions of memory that run as fast as the processor are the registers; everything else, including L1 caches, typically runs slower.
 
Joined
Jun 3, 2010
Messages
1,746 (0.44/day)
Caches are power hogs with very high energy density per area, and for that reason they usually run slower than the processor itself. The only portions of memory that run as fast as the processor are the registers; everything else, including L1 caches, typically runs slower.
Well, guess what consumes power at an even higher rate than the caches - memory devices. The futility of saving power by cutting the effective rate is self-explanatory. There is a way that uses buffering to reduce accesses to memory, and texture caching to supplant memory with SRAM. It ties in with the actual data flow across the die, whereas the memory devices don't solve any bottlenecks; they are last level.
I'm not well versed enough, but there is no free lunch. SRAM offers much more than its substitutes.
 
Joined
Jan 8, 2017
Messages
6,593 (4.25/day)
System Name Good enough
Processor AMD Ryzen R7 1700X - 4.0 Ghz / 1.350V
Motherboard ASRock B450M Pro4
Cooling Deepcool Gammaxx L240 V2
Memory 16GB - Corsair Vengeance LPX - 3333 Mhz CL16
Video Card(s) OEM Dell GTX 1080 with Kraken G12 + Water 3.0 Performer C
Storage 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) 4K Samsung TV
Case Deepcool Matrexx 70
Power Supply GPS-750C
Well, guess what consumes power at an even higher rate than the caches - memory devices. The futility of saving power by cutting the effective rate is self-explanatory. There is a way that uses buffering to reduce accesses to memory, and texture caching to supplant memory with SRAM. It ties in with the actual data flow across the die, whereas the memory devices don't solve any bottlenecks; they are last level.
I'm not well versed enough, but there is no free lunch. SRAM offers much more than its substitutes.
Yes, access to global memory is very inefficient power-wise, and cache hits improve that. But the problem is that caches live on die, need to be cooled, and eat away at the power budget of the chip.
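A minimal sketch of that trade-off, splitting the energy of an average access into the part spent on die (the cache lookup) and the part spent going out to memory (misses only). The per-access energies of 10 pJ for SRAM and 200 pJ for DRAM are placeholder assumptions, not measured values.

```python
def access_energy_pj(hit_rate, sram_pj=10.0, dram_pj=200.0):
    """Return (on_die, off_die) energy in pJ for an average memory access."""
    on_die = sram_pj                      # every access probes the cache
    off_die = (1.0 - hit_rate) * dram_pj  # only misses go out to DRAM
    return on_die, off_die

for hit_rate in (0.0, 0.5, 0.8):
    on_die, off_die = access_energy_pj(hit_rate)
    print(f"hit rate {hit_rate:.0%}: on-die {on_die:.0f} pJ, "
          f"off-die {off_die:.0f} pJ, total {on_die + off_die:.0f} pJ")
```

In this toy model the total drops as the hit rate rises, but the on-die share stays constant per access, so whatever the cache spends still counts against the chip's own power and cooling budget.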

[Attached image: 1617720116916.png]


Remember how the Infinity Cache is placed around the CUs and not between them, as you'd expect it to be? I think it was a deliberate choice to place this huge chunk of cache at the extremities of the chip to reduce hot spots.
 
Joined
Dec 24, 2020
Messages
651 (6.08/day)
Location
Austria
System Name 12 y/o me's dream
Processor AMD Ryzen 9 3900X (stock, chipset driver 2.13.27.501, Ryzen Balanced power plan)
Motherboard ASUS ROG Strix B550-F (BIOS 2006, AGESA 1.2.0.1 Patch A)
Cooling Noctua NH-D15(S) Chromax Black (70% max fan speed @ >60C) | Noctua NT-H2 10g
Memory 4x 8 GB G.Skill Trident Z Neo 3600 16-16-16-36-52-324 65ns
Video Card(s) MSI RTX 3070 Gaming X Trio (Re-Bar On / 2100 MHz @ 950mV / +1 GHz memory)
Storage 1x Samsung 980 PRO 500GB / 1x SanDisk X400 512 GB / 2x Crucial MX500 1 TB
Display(s) 1440p144 ASUS TUF VG27AQ / 1080p75 LG 22MP68VQ-P
Case be quiet! Pure Base 500DX Black
Power Supply Seasonic Prime PX-750 80+ Platinum Fully Modular
Mouse ASUS ROG Chakram
Keyboard ASUS ROG Strix Flare Cherry MX Red RGB
Software Windows 10 Pro 20H2 Build 19042.906
Benchmark Scores can play minesweeper
So for those of you waiting for AMD to do to nVidia what they did to Intel....

Here it is.

Sounds like RDNA 3 will be an interesting generation for sure!
Didn't they say they'll take this chiplet approach on CDNA first and not RDNA?

they also became both more expensive and hard to get in the first place
Wasn't the case until Zen 3 and this chipocalypse... Zen 2 swept the floor with Intel and it was a real market disruptor.

All AMD did was force Intel to get off their ass and make reasonable products at a more reasonable price, and even force down the price on their 10th gens, which is always good for everyone. If it weren't for them I wouldn't have a 12 core in my system right now, and would probably have to make do with 6 cores from Intel, on my old 8700.

Now, if they could make Ngreedia do the same, that'd be great... but I'm not having high hopes here. Unlike Intel, NVIDIA has never been sleeping. They are a worthy competitor to AMD. We'll see how this approach works on CDNA first - doubt the next RDNA gen will have this. Maybe the one after.
 
Joined
Jun 3, 2010
Messages
1,746 (0.44/day)
Yes, access to global memory is very inefficient power-wise, and cache hits improve that. But the problem is that caches live on die, need to be cooled, and eat away at the power budget of the chip.

View attachment 195489

Remember how the Infinity Cache is placed around the CUs and not between them, as you'd expect it to be? I think it was a deliberate choice to place this huge chunk of cache at the extremities of the chip to reduce hot spots.
Thanks for citing fancy references. I agree with most points, but I think we are being repetitive.
 