
[Theory] Multi-GPU graphic card that behaves like a single GPU?

Joined
Oct 2, 2004
Messages
13,791 (1.93/day)
I've been brainstorming around AMD's Navi architecture, particularly the "scaling" part they mention, which suggests multi-GPU setups on a single PCB. So I've expanded on that a bit...

We know creating huge GPUs is very costly, mostly because of defective dies coming off the wafers.
More small GPUs fit on a single wafer, meaning you get more of them and fewer defective ones -> cheaper.
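To put rough numbers on that yield argument, here's a back-of-the-envelope sketch. The defect density and die sizes are invented purely for illustration, using the simple Poisson yield model (yield = exp(-defect density x die area)):

```cpp
// Back-of-the-envelope Poisson yield model: yield = exp(-defect_density * die_area).
// All numbers below are made up purely for illustration.
#include <cmath>
#include <cstdio>

int main() {
    const double wafer_area_mm2 = 70686.0;   // ~300 mm wafer, pi * 150^2
    const double defects_per_mm2 = 0.001;    // assumed defect density
    const double big_die_mm2 = 600.0;        // one large monolithic GPU die
    const double small_die_mm2 = 150.0;      // one small "stackable" GPU die

    auto good_dies = [&](double area) {
        double per_wafer = wafer_area_mm2 / area;          // ignores edge loss
        double yield = std::exp(-defects_per_mm2 * area);  // Poisson yield model
        return per_wafer * yield;
    };

    std::printf("600 mm^2 dies per wafer that work: %.0f\n", good_dies(big_die_mm2));
    std::printf("150 mm^2 dies per wafer that work: %.0f\n", good_dies(small_die_mm2));
    // Four small dies add up to the same silicon area as one big one, but far
    // more of them survive, which is the whole "smaller GPUs are cheaper" argument.
}
```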

Fitting more GPUs the way we do it now requires SLI/CrossFireX profiles, which are always a pain in the ass, and it just doesn't scale well.

Now, this is pure brainstorming with very little actual electronics knowledge...

Would it be possible to design smaller GPUs in such a way that you could stack 4 or 6 of them on a single PCB, but they would present themselves to the system and behave as a single GPU, without the use of ANY software or driver feature for that? I mean, having several separate GPU chips that work together as a single physical GPU, not as a multi-GPU setup tied together with SLI/Crossfire software.
Imagine its internal compute units being connected to each other, but spread across multiple actual chips.
Or to go even further, separating the compute units from the memory controller and decoders and having them as individual chips on the card. Would such a radical change to the way graphics cards are designed even be theoretically possible?

This would bring:
- Cheaper high-end graphics due to simple stacking of slower, smaller cores
- No scaling issues
- Decentralized heat output (multiple moderate heat points as opposed to the current single, highly concentrated one)

What I suspect the limitations are:
- Latency between GPUs once the signal leaves the actual chip
- Memory connections & cache design
 

Mussels

Freshwater Moderator
Staff member
Joined
Oct 6, 2004
Messages
58,413 (8.18/day)
Location
Oystralia
System Name Rainbow Sparkles (Power efficient, <350W gaming load)
Processor Ryzen R7 5800x3D (Undervolted, 4.45GHz all core)
Motherboard Asus x570-F (BIOS Modded)
Cooling Alphacool Apex UV - Alphacool Eisblock XPX Aurora + EK Quantum ARGB 3090 w/ active backplate
Memory 2x32GB DDR4 3600 Corsair Vengeance RGB @3866 C18-22-22-22-42 TRFC704 (1.4V Hynix MJR - SoC 1.15V)
Video Card(s) Galax RTX 3090 SG 24GB: Underclocked to 1700Mhz 0.750v (375W down to 250W))
Storage 2TB WD SN850 NVME + 1TB Samsung 970 Pro NVME + 1TB Intel 6000P NVME USB 3.2
Display(s) Phillips 32 32M1N5800A (4k144), LG 32" (4K60) | Gigabyte G32QC (2k165) | Phillips 328m6fjrmb (2K144)
Case Fractal Design R6
Audio Device(s) Logitech G560 | Corsair Void pro RGB |Blue Yeti mic
Power Supply Fractal Ion+ 2 860W (Platinum) (This thing is God-tier. Silent and TINY)
Mouse Logitech G Pro wireless + Steelseries Prisma XL
Keyboard Razer Huntsman TE ( Sexy white keycaps)
VR HMD Oculus Rift S + Quest 2
Software Windows 11 pro x64 (Yes, it's genuinely a good OS) OpenRGB - ditch the branded bloatware!
Benchmark Scores Nyooom.
DX12 multi-GPU could simplify it greatly in the coming years. You could effectively slap a bunch of laptop-size/TDP cards onto a single PCB and away you go.


It's harder to backport to DX11 and older, though, so it won't happen any time soon.
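For what it's worth, here's a minimal sketch of what "explicit" multi-GPU means under DX12: the application enumerates every adapter itself instead of relying on an SLI/CrossFire driver profile. Windows-only, standard DXGI/D3D12 enumeration, with error handling trimmed down:

```cpp
// Minimal sketch: DX12 does not hide extra GPUs behind a driver profile.
// The app enumerates every adapter itself (explicit multi-adapter), or sees
// one device with several "nodes" when GPUs are linked SLI/CFX-style.
// Windows-only; link with d3d12.lib and dxgi.lib.
#include <windows.h>
#include <d3d12.h>
#include <dxgi1_4.h>
#include <wrl/client.h>
#include <cwchar>

using Microsoft::WRL::ComPtr;

int main() {
    ComPtr<IDXGIFactory1> factory;
    if (FAILED(CreateDXGIFactory1(IID_PPV_ARGS(&factory)))) return 1;

    ComPtr<IDXGIAdapter1> adapter;
    for (UINT i = 0; factory->EnumAdapters1(i, &adapter) != DXGI_ERROR_NOT_FOUND; ++i) {
        DXGI_ADAPTER_DESC1 desc;
        adapter->GetDesc1(&desc);

        ComPtr<ID3D12Device> device;
        if (SUCCEEDED(D3D12CreateDevice(adapter.Get(), D3D_FEATURE_LEVEL_11_0,
                                        IID_PPV_ARGS(&device)))) {
            // GetNodeCount() > 1 means linked GPUs show up as one device with
            // several nodes; the app still has to target each node explicitly.
            std::wprintf(L"Adapter %u: %ls, D3D12 nodes: %u\n",
                         i, desc.Description, device->GetNodeCount());
        }
    }
}
```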
 
Joined
Oct 2, 2004
Messages
13,791 (1.93/day)
Basically, DX12 does in software what I'm thinking should be done entirely in hardware: combining separate compute "pools" into a single unit. So AMD is kinda betting everything on that with the Navi architecture.

They will probably create a dual-GPU card at first, where a single GPU is fast enough to handle DX11 games even without any scaling, and for DX12 it would harness the full dual-GPU power. At least in theory...

I'm just a bit concerned knowing the history of the VSA-100. Though granted, back then there was NO support for it. Now we have DX12, even if only partially.
 

Mussels

Freshwater Moderator
Staff member
Well, imagine if they wrote a DX12 wrapper that converted it to other systems like DX9/10/11?
Bam, super-duper fancy pants multi-GPU support.

That'd bring up their share price.
 
Joined
May 9, 2012
Messages
8,408 (1.92/day)
Location
Ovronnaz, Wallis, Switzerland
System Name main/SFFHTPCARGH!(tm)/Xiaomi Mi TV Stick/Samsung Galaxy S23/Ally
Processor Ryzen 7 5800X3D/i7-3770/S905X/Snapdragon 8 Gen 2/Ryzen Z1 Extreme
Motherboard MSI MAG B550 Tomahawk/HP SFF Q77 Express/uh?/uh?/Asus
Cooling Enermax ETS-T50 Axe aRGB /basic HP HSF /errr.../oh! liqui..wait, no:sizable vapor chamber/a nice one
Memory 64gb Corsair Vengeance Pro 3600mhz DDR4/8gb DDR3 1600/2gb LPDDR3/8gb LPDDR5x 4200/16gb LPDDR5
Video Card(s) Hellhound Spectral White RX 7900 XTX 24gb/GT 730/Mali 450MP5/Adreno 740/RDNA3 768 core
Storage 250gb870EVO/500gb860EVO/2tbSandisk/NVMe2tb+1tb/4tbextreme V2/1TB Arion/500gb/8gb/256gb/2tb SN770M
Display(s) X58222 32" 2880x1620/32"FHDTV/273E3LHSB 27" 1920x1080/6.67"/AMOLED 2X panel FHD+120hz/FHD 120hz
Case Cougar Panzer Max/Elite 8300 SFF/None/back/back-front Gorilla Glass Victus 2+ UAG Monarch Carbon
Audio Device(s) Logi Z333/SB Audigy RX/HDMI/HDMI/Dolby Atmos/KZ x HBB PR2/Edifier STAX Spirit S3 & SamsungxAKG beans
Power Supply Chieftec Proton BDF-1000C /HP 240w/12v 1.5A/4Smart Voltplug PD 30W/Asus USB-C 65W
Mouse Speedlink Sovos Vertical-Asus ROG Spatha-Logi Ergo M575/Xiaomi XMRM-006/touch/touch
Keyboard Endorfy Thock 75% <3/none/touch/virtual
VR HMD Medion Erazer
Software Win10 64/Win8.1 64/Android TV 8.1/Android 13/Win11 64
Benchmark Scores bench...mark? i do leave mark on bench sometime, to remember which one is the most comfortable. :o
The design you're talking about was already a thing back in the late 90s / early 2000s (yep, indeed the VSA-100):
[Attached images: 3dfx Voodoo 5 6000 AGP 128MB Rev A1 octa-fan card, front view]

And as of today, a multi-GPU single card would still behave as a CFX/SLI card (look back at the GTX 590, GTX 690, R9 295X2, Radeon Pro Duo, ARES III, etc.), although they are preferable to multi-GPU, multi-card setups.

Or do you mean a logic unit (a coordinator) that dispatches the load across the chips on the card? In that case it would not make the card cheaper but rather more expensive, as it would require additional R&D work to see the light of day.
 
Joined
Mar 23, 2012
Messages
777 (0.18/day)
Location
Norway
System Name Games/internet/usage
Processor I7 5820k 4.2 Ghz
Motherboard ASUS X99-A2
Cooling custom water loop for cpu and gpu
Memory 16GiB Crucial Ballistix Sport 2666 MHz
Video Card(s) Radeon Rx 6800 XT
Storage Samsung XP941 500 GB + 1 TB SSD
Display(s) Dell 3008WFP
Case Caselabs Magnum M8
Audio Device(s) Shiit Modi 2 Uber -> Matrix m-stage -> HD650
Power Supply beQuiet dark power pro 1200W
Mouse Logitech MX518
Keyboard Corsair K95 RGB
Software Win 10 Pro
I think you would need the chips to be on an interposer (like Fiji is with HBM) for the latencies to be low enough for it to work. This makes the interposer the biggest "chip", but making wires is simpler than making transistors, so the interposer can use a coarser lithography and still get good yield.
 

Aquinus

Resident Wat-man
Joined
Jan 28, 2012
Messages
13,147 (2.94/day)
Location
Concord, NH, USA
System Name Apollo
Processor Intel Core i9 9880H
Motherboard Some proprietary Apple thing.
Memory 64GB DDR4-2667
Video Card(s) AMD Radeon Pro 5600M, 8GB HBM2
Storage 1TB Apple NVMe, 4TB External
Display(s) Laptop @ 3072x1920 + 2x LG 5k Ultrafine TB3 displays
Case MacBook Pro (16", 2019)
Audio Device(s) AirPods Pro, Sennheiser HD 380s w/ FIIO Alpen 2, or Logitech 2.1 Speakers
Power Supply 96w Power Adapter
Mouse Logitech MX Master 3
Keyboard Logitech G915, GL Clicky
Software MacOS 12.1
Would it be possible to design smaller GPUs in such a way that you could stack 4 or 6 of them on a single PCB, but they would present themselves to the system and behave as a single GPU, without the use of ANY software or driver feature for that? I mean, having several separate GPU chips that work together as a single physical GPU, not as a multi-GPU setup tied together with SLI/Crossfire software.
You just explained why the Cell processor on the PS3 was garbage for devs. Trying to orchestrate a bunch of different "cores," if you will, would put an incredible burden on a co-processor or the CPU. All of this adds latency.

I think you're over-simplifying the problem. You may want to leave the design aspect to electrical engineers, because there is a lot more to a GPU than merely rendering frames. A lot happens between the time commands are dispatched and the actual result. Just like in a CPU, there are internal components that need to be tightly coupled, which makes the horizontal scaling you suggest infeasible.

- Latency between GPUs once the signal leaves the actual chip
This is exactly why micro-stutter is a thing: unequal latencies between two GPUs under the same load. A secondary GPU would need to get its data to the primary early to eliminate this problem, or the primary would need latency purposely introduced to keep per-frame latencies consistent.
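A toy illustration of that mechanism, with invented numbers: two AFR GPUs render equally fast, but the secondary's frames pay an extra transfer hop before display, so the on-screen frame-to-frame gaps alternate even though the average frame rate looks fine:

```cpp
// Toy micro-stutter model. Both GPUs render in ~16 ms, offset by half a frame,
// but odd frames (from the secondary GPU) pay an extra copy to the primary.
// All numbers are invented for illustration.
#include <cstdio>
#include <vector>

int main() {
    const double render_ms = 16.0;     // both GPUs render equally fast
    const double transfer_ms = 6.0;    // assumed cost of moving a frame off the 2nd GPU

    std::vector<double> present_times;
    for (int frame = 0; frame < 8; ++frame) {
        double start = frame * (render_ms / 2.0);   // GPUs work offset by half a frame
        double done  = start + render_ms;
        if (frame % 2 == 1) done += transfer_ms;    // extra hop for the secondary GPU
        present_times.push_back(done);
    }
    for (size_t i = 1; i < present_times.size(); ++i)
        std::printf("frame %zu -> %zu gap: %.1f ms\n",
                    i - 1, i, present_times[i] - present_times[i - 1]);
    // Gaps alternate ~14 ms / ~2 ms instead of an even ~8 ms:
    // the classic micro-stutter pattern.
}
```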

Simply put, making the engine wider already results in poor scaling. We've seen this with AMD even on single-GPU setups, where beefier CUs/SMs do better than a larger number of CUs/SMs with smaller shader counts.

Now, this is pure brainstorming with very little actual electronics knowledge...

Quick question: what makes you think EEs haven't been considering all of this for years? I suspect they know how to do their job, and it's probably unwise to try to simplify such a complex problem, unless you've been withholding your EE credentials from us. ;)
 
Joined
Oct 2, 2004
Messages
13,791 (1.93/day)
While I agree that knowing how things actually work helps a ton, sometimes the dumbest ideas deliver the best results. When you're so focused on pre-existing designs and base all your work off them, it's hard to truly think outside the box.

I know that quite well from the software development side of things, and I think hardware is no different. Experts are unable to think on a "dumb level". But when dumb people like myself have an idea, it may spark something glorious in an expert's head, because they never thought of it on that level. Maybe not for this exact example, but you get the drift.
 

Mussels

Freshwater Moderator
Staff member
we probably just described the idiot explanation for async shaders in DX12
 

Aquinus

Resident Wat-man
While I agree that knowing how things actually work helps a ton, sometimes the dumbest ideas deliver the best results. When you're so focused on pre-existing designs and base all your work off them, it's hard to truly think outside the box.

I know that quite well from the software development side of things, and I think hardware is no different. Experts are unable to think on a "dumb level". But when dumb people like myself have an idea, it may spark something glorious in an expert's head, because they never thought of it on that level. Maybe not for this exact example, but you get the drift.
You underestimate the overhead of doing stuff in parallel, though (at least I think that's the case). I've been doing multi-threaded system integration for years now, and let me tell you, it's not an easy problem to solve. Orchestrating asynchronous processes is a bear, to say the least.
 
Joined
Oct 29, 2012
Messages
1,926 (0.46/day)
Location
UK
System Name TITAN Slayer / CPUCannon / MassFX
Processor i7 5960X @ 4.6Ghz / i7 3960x @5.0Ghz / FX6350 @ 4.?Ghz
Motherboard Rampage V Extreme / Rampage IV Extreme / MSI 970 Gaming
Cooling Phanteks PHTC14PE 2.5K 145mm TRs / Custom waterloop / Phanteks PHTC14PE + 3K 140mm Noctuas
Memory Crucial 2666 11-13-13-25 1.45V / G.skill RipjawsX 2400 10-12-12-34 1.7V / Crucial 2133 9-9-9-27 1.7V
Video Card(s) 3 Fury X in CF / R9 Fury 3840 cores 1145/570 1.3V / Nothing ATM
Storage 500GB Crucial SSD and 3TB WD Black / WD 1TB Black(OS) + WD 3TB Green / WD 1TB Blue
Display(s) LG 29UM67 80Hz/Asus mx299q 2560x1080 @ 84Hz / Asus VX239 1920x1080 @60hz
Case Dismatech easy v3.0 / Xigmatek Alfar (Open side panel)
Audio Device(s) M-audio M-track / realtek ALC 1150
Power Supply EVGA G2 1600W / CoolerMaster V1000 / Seasonic 620 M12-II
Mouse Mouse in review process/Razer Naga Epic 2011/Razer Naga 2014
Keyboard Keyboard in review process / Razer Blackwidow Ultimate 2014/Razer Blackwidow Ultimate 2011
Software Windows 7 Ultimate / Windows 7 ultimate / Windows 7 ultimate
Benchmark Scores cinebench 15.41 3960x @ 5.3ghz Wprime32m 3.352 3960x @ 5.25ghz Super PI 32m: 6m 42s 472ms @5.25ghz
The way I see it, it would be a huge gain if you just took each frame, split it into as many chunks as there are GPUs, told each GPU to render its chunk, and then passed it all to the displaying GPU, which would stitch everything back together. Sure, it wouldn't get anywhere near 100% performance scaling if one section of a frame had a vastly different amount of work in it than another, but it would be highly scalable and stutter-free, while also being relatively easy to support, unlike AFR, which right now barely ever works. All the GPUs would still need basically all the same data in memory, so you wouldn't see an improvement in memory efficiency, but it would still, IMO, be the simplest approach.

Now, I'm not a 3D graphics dev, but if I had to implement this I would basically just make one camera for each GPU and then stitch the different camera views together on the primary card. Sure, some of the GPUs would spend large amounts of time idling due to disparities in workload, but on average frame pushing would get faster and faster even with silly numbers of GPUs. CPU workload would also end up pretty high in my approach, which would suck, but hey, you gotta use those i7s for something, right?
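Here's a rough sketch of that split-frame idea with made-up per-strip workloads: the frame is only as fast as its busiest strip, so scaling is real but far from linear, while frame pacing stays even because every GPU works on the same frame:

```cpp
// Sketch of split-frame rendering: each GPU gets one horizontal strip, and the
// frame finishes when the slowest strip finishes. Workload numbers are made up.
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    const int gpus = 4;
    // Assumed relative cost of each strip (e.g. the sky is cheap, the strip
    // full of geometry and particles is expensive).
    std::vector<double> strip_cost_ms = {3.0, 6.0, 12.0, 5.0};

    double frame_time = *std::max_element(strip_cost_ms.begin(), strip_cost_ms.end());
    double total_work = 0.0;
    for (double c : strip_cost_ms) total_work += c;

    double single_gpu_time = total_work;                    // one GPU does everything
    double efficiency = total_work / (frame_time * gpus);   // how much of 4 GPUs is used

    std::printf("1 GPU frame time : %.1f ms\n", single_gpu_time);
    std::printf("%d GPU frame time: %.1f ms (limited by the busiest strip)\n",
                gpus, frame_time);
    std::printf("scaling efficiency: %.0f%%\n", efficiency * 100.0);
    // ~54% here: better than nothing, nowhere near linear, and no AFR-style
    // pacing problem because all GPUs are rendering the same frame.
}
```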
 

Aquinus

Resident Wat-man
Aren't GPU's already just that? A huge parallel computing thingie?
Parts of it are; there are a lot of parts that aren't. The ACEs and command processor aren't, the L2 cache isn't (which is shared by the entire GPU), the memory controller isn't, and the PCI-E controller isn't. This is all stuff that would need to be replicated, and things like the memory controller and L2 cache tend to be pretty big. Not sharing those would increase latency by a huge amount and would make the entire design unrealistic. Not all of this stuff used to be this close together, either: we used to have things like off-die L2 cache and off-die memory controllers, but they've all been moved closer to the compute cores because the further away they are, the worse latency and bandwidth get.
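A toy average-latency model (the cycle counts are invented, not measured) of why pulling something like the L2 off-die hurts so much:

```cpp
// Toy average memory latency model: avg = hit_rate * L2_latency + miss_rate * VRAM_latency.
// All cycle counts are invented for illustration only.
#include <cstdio>

int main() {
    const double hit_rate   = 0.70;   // assumed L2 hit rate for a shader workload
    const double on_die_l2  = 30.0;   // cycles to hit L2 on the same die
    const double off_die_l2 = 120.0;  // cycles if L2 lived on a neighbouring chip
    const double vram       = 350.0;  // cycles to go all the way out to VRAM

    auto avg = [&](double l2_cycles) {
        return hit_rate * l2_cycles + (1.0 - hit_rate) * vram;
    };

    std::printf("avg latency, shared on-die L2 : %.0f cycles\n", avg(on_die_l2));
    std::printf("avg latency, off-die L2       : %.0f cycles\n", avg(off_die_l2));
    // Every access that used to be a cheap local hit now pays an extra hop,
    // which is why those blocks keep migrating closer to the compute cores.
}
```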

The way I see it, it would be a huge gain if you just took each frame, split it into as many chunks as there are GPUs, told each GPU to render its chunk, and then passed it all to the displaying GPU, which would stitch everything back together. Sure, it wouldn't get anywhere near 100% performance scaling if one section of a frame had a vastly different amount of work in it than another, but it would be highly scalable and stutter-free, while also being relatively easy to support, unlike AFR, which right now barely ever works. All the GPUs would still need basically all the same data in memory, so you wouldn't see an improvement in memory efficiency, but it would still, IMO, be the simplest approach.

Now, I'm not a 3D graphics dev, but if I had to implement this I would basically just make one camera for each GPU and then stitch the different camera views together on the primary card. Sure, some of the GPUs would spend large amounts of time idling due to disparities in workload, but on average frame pushing would get faster and faster even with silly numbers of GPUs. CPU workload would also end up pretty high in my approach, which would suck, but hey, you gotta use those i7s for something, right?
The problem that needs to be solved is the latency between the time a frame is done rendering and the time it finds its way to the frame buffer that will be displaying it. Two GPUs at the same clock are going to (mostly) render frames at the same rate. You start running into the micro-stutter problem due to the latency of transferring that frame to the primary card in order to be displayed. I would argue that an on-board frame buffer that both GPUs send their frames to would be a far better solution. I say this because instead of trying to do something weird to smooth out the difference in latencies, you're actively putting both GPUs an equal distance away from the thing that will be displaying each frame. The cost is slightly higher overall render latency; the benefit would be reduced frame-to-frame latency variation.

As long as there is a primary GPU that handles the displays, I would argue that multi-GPU setups will never solve the micro-stutter problem.
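Continuing the invented numbers from the earlier AFR sketch: if every frame pays the same hop into a shared on-card buffer, the gaps even out at the cost of a little extra overall latency:

```cpp
// Follow-up to the AFR toy: both GPUs hand finished frames to a shared on-card
// buffer that sits the same "distance" from each of them, so every frame pays
// the same extra hop. Numbers remain invented.
#include <cstdio>
#include <vector>

int main() {
    const double render_ms = 16.0;
    const double hop_to_shared_buffer_ms = 6.0;  // paid by *every* frame now

    std::vector<double> present_times;
    for (int frame = 0; frame < 8; ++frame) {
        double start = frame * (render_ms / 2.0);
        double done  = start + render_ms + hop_to_shared_buffer_ms;
        present_times.push_back(done);
    }
    for (size_t i = 1; i < present_times.size(); ++i)
        std::printf("frame %zu -> %zu gap: %.1f ms\n",
                    i - 1, i, present_times[i] - present_times[i - 1]);
    // Every gap is now an even ~8 ms; frames just arrive ~6 ms later overall.
    // That's the "slightly higher render latency, lower frame-to-frame variation" trade-off.
}
```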
 
Joined
Oct 2, 2004
Messages
13,791 (1.93/day)
So, how does DX12 solve that? I mean, I think I've heard right that DX12 not only scales better, it also has almost no frame pacing issues. With that, once DX12 becomes the norm, graphics cards with stacked lower-end GPUs would make sense, wouldn't they?
 

cadaveca

My name is Dave
Joined
Apr 10, 2006
Messages
17,232 (2.61/day)
So, how does DX12 solve that? I mean, I think I've heard right that DX12 not only scales better, it also has almost no frame pacing issues. With that, once DX12 becomes the norm, graphics cards with stacked lower-end GPUs would make sense, wouldn't they?
Not really. NUMA. You know, the thing that basically makes AMD's CPUs slower.
 

Toothless

Tech, Games, and TPU!
Supporter
Joined
Mar 26, 2014
Messages
9,278 (2.52/day)
Location
Washington, USA
System Name Veral
Processor 5950x
Motherboard MSI MEG x570 Ace
Cooling Corsair H150i RGB Elite
Memory 4x16GB G.Skill TridentZ
Video Card(s) Powercolor 7900XTX Red Devil
Storage Crucial P5 Plus 1TB, Samsung 980 1TB, Teamgroup MP34 4TB
Display(s) Acer Nitro XZ342CK Pbmiiphx + 2x AOC 2425W
Case Fractal Design Meshify Lite 2
Audio Device(s) Blue Yeti + SteelSeries Arctis 5 / Samsung HW-T550
Power Supply Corsair HX850
Mouse Corsair Nightsword
Keyboard Corsair K55
VR HMD HP Reverb G2
Software Windows 11 Professional
Benchmark Scores PEBCAK
Sooo what we need is a laptop-based i5 and two GTX 1080s, all on one PCB with touching dies to be butt buddies, and four 8-pin PCIe power connectors to balance the 16GB VRAM and 20MB of L4 cache.

did i do it right
 