AMD Patents Chiplet-based GPU Design With Active Cache Bridge

Wirko · Apr 5, 2021

AMD may be experimenting with ways to separate the processing cores, built on the latest tech they can get their hands on, from the cache. The cache could be built on the second-best node: GlobalFoundries' 12nm now, something like TSMC 7nm later. Static RAM doesn't scale well with node shrinks; at least the surface area doesn't, I don't know about performance and power. So the cache is possibly a good candidate for being offloaded to a cheaper die. The latency would obviously go up, but maintaining cache coherence would be an easier task, higher latency can also be mitigated with increased size, and AMD needs to keep buying something from GloFo anyway.
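
To put rough numbers on the latency-versus-size trade-off, here is a minimal sketch using the textbook average memory access time formula (hit time plus miss rate times miss penalty). Every figure in it is a made-up assumption for illustration, not a measurement of any AMD part.

```python
def amat_ns(hit_ns: float, miss_rate: float, miss_penalty_ns: float) -> float:
    """Average memory access time: hit time + miss rate * miss penalty."""
    return hit_ns + miss_rate * miss_penalty_ns


# Hypothetical numbers, chosen only to illustrate the trade-off.
DRAM_PENALTY_NS = 300.0

# Smaller cache on the compute die: faster hits, more misses.
on_die = amat_ns(hit_ns=20.0, miss_rate=0.40, miss_penalty_ns=DRAM_PENALTY_NS)
# Larger cache on a separate, cheaper die: slower hits, fewer misses.
off_die = amat_ns(hit_ns=35.0, miss_rate=0.25, miss_penalty_ns=DRAM_PENALTY_NS)

print(f"small on-die cache:  {on_die:.1f} ns")
print(f"large off-die cache: {off_die:.1f} ns")
```

With these assumed numbers the bigger, slower cache still comes out ahead on average, because the lower miss rate saves more trips to memory than the slower hits cost. Different assumptions can flip the result.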

evernessince · Apr 5, 2021

TheoneandonlyMrK said:
While I agree with most of your points, I do think you're wrong on efficiency and IPC, because people (not AMD, but scientists I can't recall, including those of Nvidia) have already proven that it can be both more efficient and give higher IPC. Forget those people, even: AMD themselves also proved it with the Zen architecture.

Correct. This is what you are referring to in particular: http://www.cs.ucy.ac.cy/conferences/pact2018/material/PACT2008-public.pdf

Steevo · Apr 6, 2021

The biggest gains will be in clock speed: multiple domains for multiple chiplets, each of which can be engineered for IPC, clock speed, and/or latency as required.

Imagine 4 chiplets with 4 GHz boost speeds, a 2 GHz cache that is massively parallel with compression technology, and a couple of tiny chiplets for video encode/decode and for low-power applications.

Now add the stacked-die tech that has been learned, to create a parallel pipeline for pure vector math for ray tracing, stacked on each of the 4 main chiplets, that can read and write to caches on the primary die. Ray tracing with the only performance penalty being extra heat and a fraction of the latency.

mtcn77 · Apr 6, 2021

evernessince said:
Correct. This is what you are referring to in particular: http://www.cs.ucy.ac.cy/conferences/pact2018/material/PACT2008-public.pdf

Doesn't it say MCM adds as much as +1GHz!

Minus Infinity · Apr 6, 2021

Possibly a glimpse of RDNA4's future; doubt we'll see this in RDNA3. Most likely it will go up against Hopper, which was delayed and replaced by Lovelace for next gen.

evernessince · Apr 6, 2021

mtcn77 said:
Doesn't it say MCM adds as much as +1GHz!

Correct. By separating the CPU cores into a separate die you gain the ability to further bin which CPU die ends up on which CPU. This is how AMD is able to have its 16-core 5950X consume less power than its 12-core part. The 5950X is about 28% more power efficient than other Ryzen 5000 series CPUs through binning alone. AMD likely decided to go for efficiency instead of extra clocks for two reasons: 1) Intel doesn't have anything competitive with its 12- and 16-core mainstream CPUs; 2) power consumption goes up much faster above the sweet spot. Increasing the GHz would improve ST performance but at a cost. AMD likely calculated that given Intel's current prospects, it would be better to focus on efficiency.
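
A rough way to see why power climbs faster than clocks above the sweet spot: dynamic power scales roughly with capacitance times voltage squared times frequency, and higher clocks usually need more voltage. The sketch below uses only that first-order relation. The voltage/frequency points are invented for illustration and are not taken from any real CPU.

```python
def relative_dynamic_power(freq_ghz: float, volts: float,
                           base_freq_ghz: float, base_volts: float) -> float:
    """First-order CMOS dynamic power, P ~ C * V^2 * f, relative to a baseline.

    The capacitance term cancels out because it is the same chip.
    """
    return (volts / base_volts) ** 2 * (freq_ghz / base_freq_ghz)


# Hypothetical operating points: (frequency in GHz, core voltage in V).
BASE = (4.0, 1.00)
POINTS = [(4.0, 1.00), (4.5, 1.15), (5.0, 1.35)]

for freq, volts in POINTS:
    power = relative_dynamic_power(freq, volts, *BASE)
    clock = freq / BASE[0]
    print(f"{freq:.1f} GHz @ {volts:.2f} V: {clock:.2f}x clock, {power:.2f}x power")
```

Under these assumptions the last 25% of clock speed more than doubles dynamic power, which is the kind of trade-off that makes staying near the sweet spot attractive.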

ltkAlpha · Apr 6, 2021

Punkenjoy said:
This is mostly true, although less and less true as there are more and more techniques that reuse generated data. This is also why SLI/Crossfire is dead. The latency to move this data was just way too big. Temporal AA, screen-space reflections, etc.

Can't you have one chiplet dealing with frame/scene level calculations after you've powered through the more easily parallelizable tasks? As in 1 Bigger (perhaps on the hub chip to reduce latency to the cache) + N Small(er)?

mtcn77 · Apr 6, 2021

evernessince said:
Increasing the GHz would improve ST performance but at a cost.

You're approaching this from a CPU standpoint. On a GPU, ST isn't the only factor; internal bandwidth is a major component. A GPU has a lot of bandwidth, however bandwidth per CU needs a lot of use to leverage fully, since the memory unit is external to the chip. Running it faster solves that problem.
Bets: 3.5 GHz GPUs over the horizon, or not?

evernessince · Apr 6, 2021

mtcn77 said:
You're approaching this from a CPU standpoint. On a GPU, ST isn't the only factor; internal bandwidth is a major component. A GPU has a lot of bandwidth, however bandwidth per CU needs a lot of use to leverage fully, since the memory unit is external to the chip. Running it faster solves that problem.
Bets: 3.5 GHz GPUs over the horizon, or not?

I'd say it's equally possible that we see MCM GPU architectures that simply target the frequency sweet spot and spend any extra power budget on more cores, cache, etc. It really depends though; for all we know AMD or Nvidia could design their GPU chiplets to clock very high, and thus the sweet spot would follow suit. I'm not knowledgeable enough on the topic to say to what extent Nvidia / AMD and TSMC can influence ideal GPU clock speed based on design / node.

mtcn77 · Apr 6, 2021

evernessince said:
I'd say it's equally possible that we see MCM GPU architectures that simply target the frequency sweet spot and spend any extra power budget on more cores, cache, etc. It really depends though; for all we know AMD or Nvidia could design their GPU chiplets to clock very high, and thus the sweet spot would follow suit. I'm not knowledgeable enough on the topic to say to what extent Nvidia / AMD and TSMC can influence ideal GPU clock speed based on design / node.

Me neither, although some would consider me an old timer.
GPUs do associate with high frequency, because the power cost is already paid for. Remember the Hawaii series? AMD never integrated tiled 'buffered' rasterization up until Vega, and thus the memory interface never slowed down, since it was always running in immediate mode, whereas Nvidia can keep tabs at various memory clocks.
It could improve utilization if the shaders request at a higher rate - gpus are throughput oriented, after all...

Vya Domus · Apr 6, 2021

Steevo said:
Imagine 4 chiplets with 4 GHz boost speeds

It is going to be a long time before we see that, if ever. Every kind of chip seems to start scaling horribly past the 3 GHz mark; a GPU in particular will be horrendous efficiency-wise at those kinds of speeds.

HenrySomeone · Apr 6, 2021

Aranarth said:
So for those of you waiting for AMD to do to nVidia what they did to Intel....

Here it is.

Sounds like RDNA 3 will be an interesting generation for sure!

What they did to Intel? You mean, as soon as they got competitive, they also became both more expensive and hard to get in the first place - what a fantastic prospect for the already beleaguered graphics cards market indeed!

Aranarth · Apr 6, 2021

HenrySomeone said:
What they did to Intel? You mean, as soon as they got competitive, they also became both more expensive and hard to get in the first place - what a fantastic prospect for the already beleaguered graphics cards market indeed!

You DO realize that this is market forces at work, right?

Demand outstripped supply so far that even though TSMC is running FLAT OUT they still cannot keep up!
They are now spending 100 BILLION DOLLARS over the next three years to build more plants so they can deal with the demand.

Then you have people buying them by the millisecond, so fast with their bots that you cannot buy them through normal channels, making a bad situation even worse.
But hey, they do it because they can make 25 to 50% profit selling on eBay and through the gray market.

AMD made the decision to focus on supplying computer manufacturers and not direct sellers like Newegg and Amazon.
I just got a 6800 XT and 5600X from Dell.
Placed my order, waited a month and here it is! AND I got both for what appears to be MSRP or close to it.

Be sure you are looking at the BIG PICTURE before lambasting people and companies for things that are out of their control.

mtcn77 · Apr 6, 2021

Vya Domus said:
Every kind of chip seems to start scaling horribly past the 3 GHz mark; a GPU in particular will be horrendous efficiency-wise at those kinds of speeds.

This could bring a split multiplier to run internal caches faster than the GPU. Don't dismiss it; the scaling isn't linear because memory is external and not directly helpful in the GPU pipeline flow. GPU speed, however, is. Nothing outside of cache speed changes that (maybe texture caching, too).

Vya Domus · Apr 6, 2021

mtcn77 said:
This could bring a split multiplier to run internal caches faster than the GPU. Don't dismiss it; the scaling isn't linear because memory is external and not directly helpful in the GPU pipeline flow. GPU speed, however, is. Nothing outside of cache speed changes that (maybe texture caching, too).

Caches are power hogs, with very high energy density per area; for that reason they usually run slower than the processor itself. The only portions of memory that run as fast as the processor are the registers. Everything else, including L1 caches, typically runs slower.

mtcn77 · Apr 6, 2021

Vya Domus said:
Caches are power hogs, with very high energy density per area; for that reason they usually run slower than the processor itself. The only portions of memory that run as fast as the processor are the registers. Everything else, including L1 caches, typically runs slower.

Well, guess what consumes power at an even higher rate than the caches: memory devices. The futility of saving power by cutting the effective rate is self-explanatory. There is a way that uses buffering to reduce accesses to memory, and texture caching to supplant memory with SRAM. It ties in with actual data flow across the die, whereas the memory devices don't solve any bottlenecks; they are the last level.
I'm not well versed enough, but there is no free lunch. SRAM offers much more than its substitutes.

Vya Domus · Apr 6, 2021

mtcn77 said:
Well, guess what consumes power at an even higher rate than the caches: memory devices. The futility of saving power by cutting the effective rate is self-explanatory. There is a way that uses buffering to reduce accesses to memory, and texture caching to supplant memory with SRAM. It ties in with actual data flow across the die, whereas the memory devices don't solve any bottlenecks; they are the last level.
I'm not well versed enough, but there is no free lunch. SRAM offers much more than its substitutes.

Yes, access to global memory is very inefficient power-wise, and cache hits improve that. But the problem is that caches live on die, need to be cooled, and eat away at the power budget of the chip.

Remember how the Infinity Cache is placed around the CUs and not between them, as you'd expect it to be? I think it was a deliberate choice to place this huge chunk of cache on the extremities of the chip to reduce hot spots.

Deleted member 205776 · Apr 6, 2021

Aranarth said:
So for those of you waiting for AMD to do to nVidia what they did to Intel....

Here it is.

Sounds like RDNA 3 will be an interesting generation for sure!

Didn't they say they'll take this chiplet approach on CDNA first and not RDNA?

HenrySomeone said:
they also became both more expensive and hard to get in the first place

Wasn't the case until Zen 3 and this chipocalypse... Zen 2 swept the floor with Intel and it was a real market disruptor.

All AMD did was force Intel to get off their ass and make reasonable products at a more reasonable price, and even force down the price on their 10th gens, which is always good for everyone. If it weren't for them I wouldn't have a 12 core in my system right now, and would probably have to make do with 6 cores from Intel, on my old 8700.

Now, if they could make Ngreedia do the same, that'd be great... but I'm not having high hopes here. Unlike Intel, NVIDIA has never been sleeping. They are a worthy competitor to AMD. We'll see how this approach works on CDNA first - doubt the next RDNA gen will have this. Maybe the one after.

mtcn77 · Apr 6, 2021

Vya Domus said:
Yes, access to global memory is very inefficient power-wise, and cache hits improve that. But the problem is that caches live on die, need to be cooled, and eat away at the power budget of the chip.

Remember how the Infinity Cache is placed around the CUs and not between them, as you'd expect it to be? I think it was a deliberate choice to place this huge chunk of cache on the extremities of the chip to reduce hot spots.

Thanks for citing fancy references. I agree with most points, but I think we are being repetitive.

