
[Theory] Multi-GPU graphic card that behaves like a single GPU?

Joined
Oct 2, 2004
Messages
13,791 (1.93/day)
I've been brainstorming around AMD's Navi architecture, particularly the "scaling" part they mention, which suggests multi-GPU setups on a single PCB. So I've expanded on that a bit...

We know creating huge GPUs is very costly, mostly because of defective dies coming off the wafers.
More small GPUs fit on a single wafer, meaning you get more of them and fewer defective ones -> cheaper.
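To put rough numbers on that yield argument, here's a back-of-the-envelope sketch. The defect density and die sizes are invented purely for illustration, using the simple Poisson yield model (yield = exp(-defect density x die area)):

```cpp
// Back-of-the-envelope Poisson yield model: yield = exp(-defect_density * die_area).
// All numbers below are made up purely for illustration.
#include <cmath>
#include <cstdio>

int main() {
    const double wafer_area_mm2 = 70686.0;   // ~300 mm wafer, pi * 150^2
    const double defects_per_mm2 = 0.001;    // assumed defect density
    const double big_die_mm2 = 600.0;        // one large monolithic GPU die
    const double small_die_mm2 = 150.0;      // one small "stackable" GPU die

    auto good_dies = [&](double area) {
        double per_wafer = wafer_area_mm2 / area;          // ignores edge loss
        double yield = std::exp(-defects_per_mm2 * area);  // Poisson yield model
        return per_wafer * yield;
    };

    std::printf("600 mm^2 dies per wafer that work: %.0f\n", good_dies(big_die_mm2));
    std::printf("150 mm^2 dies per wafer that work: %.0f\n", good_dies(small_die_mm2));
    // Four small dies add up to the same silicon area as one big one, but far
    // more of them survive, which is the whole "smaller GPUs are cheaper" argument.
}
```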

Fitting more GPUs the way we do it now requires SLI/CrossFireX profiles, which are always a pain in the ass, and it just doesn't scale well.

Now, this is pure brainstorming with very little actual electronics knowledge...

Would it be possible to design smaller GPUs in such a way that you could stack 4 or 6 of them on a single PCB, but they would present themselves to the system and behave as a single GPU, without the use of ANY software or driver feature for that? I mean, having several separate GPU chips that work together as a single physical GPU, not as a multi-GPU setup tied together with SLI/Crossfire software.
Imagine its internal compute units being connected to each other, but spread across multiple actual chips.
Or to go even further, separating the compute units from the memory controller and decoders and having them as individual chips on the card. Would such a radical change to the way graphics cards are designed even be theoretically possible?

This would bring:
- Cheaper high-end graphics due to simple stacking of slower, smaller cores
- No scaling issues
- Decentralized heat output (multiple moderate heat points as opposed to the current single, highly concentrated one)

What I suspect the limitations are:
- Latency between GPUs once the signal leaves the actual chip
- Memory connections & cache design
 

Mussels

Freshwater Moderator
Staff member
Joined
Oct 6, 2004
Messages
58,413 (8.18/day)
Location
Oystralia
System Name Rainbow Sparkles (Power efficient, <350W gaming load)
Processor Ryzen R7 5800x3D (Undervolted, 4.45GHz all core)
Motherboard Asus x570-F (BIOS Modded)
Cooling Alphacool Apex UV - Alphacool Eisblock XPX Aurora + EK Quantum ARGB 3090 w/ active backplate
Memory 2x32GB DDR4 3600 Corsair Vengeance RGB @3866 C18-22-22-22-42 TRFC704 (1.4V Hynix MJR - SoC 1.15V)
Video Card(s) Galax RTX 3090 SG 24GB: Underclocked to 1700Mhz 0.750v (375W down to 250W))
Storage 2TB WD SN850 NVME + 1TB Samsung 970 Pro NVME + 1TB Intel 6000P NVME USB 3.2
Display(s) Phillips 32 32M1N5800A (4k144), LG 32" (4K60) | Gigabyte G32QC (2k165) | Phillips 328m6fjrmb (2K144)
Case Fractal Design R6
Audio Device(s) Logitech G560 | Corsair Void pro RGB |Blue Yeti mic
Power Supply Fractal Ion+ 2 860W (Platinum) (This thing is God-tier. Silent and TINY)
Mouse Logitech G Pro wireless + Steelseries Prisma XL
Keyboard Razer Huntsman TE ( Sexy white keycaps)
VR HMD Oculus Rift S + Quest 2
Software Windows 11 pro x64 (Yes, it's genuinely a good OS) OpenRGB - ditch the branded bloatware!
Benchmark Scores Nyooom.
DX12 multi-GPU could simplify it greatly in the coming years. You could effectively slap a bunch of laptop-size/TDP cards onto a single PCB and away you go.


It's harder to backport to DX11 and older, though, so it won't happen any time soon.
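For what it's worth, here's a minimal sketch of what "explicit" multi-GPU means under DX12: the application enumerates every adapter itself instead of relying on an SLI/CrossFire driver profile. Windows-only, standard DXGI/D3D12 enumeration, with error handling trimmed down:

```cpp
// Minimal sketch: DX12 does not hide extra GPUs behind a driver profile.
// The app enumerates every adapter itself (explicit multi-adapter), or sees
// one device with several "nodes" when GPUs are linked SLI/CFX-style.
// Windows-only; link with d3d12.lib and dxgi.lib.
#include <windows.h>
#include <d3d12.h>
#include <dxgi1_4.h>
#include <wrl/client.h>
#include <cwchar>

using Microsoft::WRL::ComPtr;

int main() {
    ComPtr<IDXGIFactory1> factory;
    if (FAILED(CreateDXGIFactory1(IID_PPV_ARGS(&factory)))) return 1;

    ComPtr<IDXGIAdapter1> adapter;
    for (UINT i = 0; factory->EnumAdapters1(i, &adapter) != DXGI_ERROR_NOT_FOUND; ++i) {
        DXGI_ADAPTER_DESC1 desc;
        adapter->GetDesc1(&desc);

        ComPtr<ID3D12Device> device;
        if (SUCCEEDED(D3D12CreateDevice(adapter.Get(), D3D_FEATURE_LEVEL_11_0,
                                        IID_PPV_ARGS(&device)))) {
            // GetNodeCount() > 1 means linked GPUs show up as one device with
            // several nodes; the app still has to target each node explicitly.
            std::wprintf(L"Adapter %u: %ls, D3D12 nodes: %u\n",
                         i, desc.Description, device->GetNodeCount());
        }
    }
}
```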
 
Joined
Oct 2, 2004
Messages
13,791 (1.93/day)
Basically, DX12 does in software what I'm thinking should be done entirely in hardware: combining separate compute "pools" into a single unit. So AMD is kinda betting everything on that with the Navi architecture.

They will probably create a dual-GPU card at first, where a single GPU is fast enough to handle DX11 games even without any scaling, and for DX12 it would harness the full dual-GPU power. At least in theory...

I'm just a bit concerned knowing the history of the VSA-100. Though granted, back then there was NO support for it. Now we have DX12, even if only partially.
 

Mussels

Freshwater Moderator
Staff member
Well, imagine if they wrote a DX12 wrapper that converted it to other systems like DX9/10/11?
Bam, super-duper fancy pants multi-GPU support.

That'd bring up their share price.
 
Joined
May 9, 2012
Messages
8,408 (1.92/day)
Location
Ovronnaz, Wallis, Switzerland
System Name main/SFFHTPCARGH!(tm)/Xiaomi Mi TV Stick/Samsung Galaxy S23/Ally
Processor Ryzen 7 5800X3D/i7-3770/S905X/Snapdragon 8 Gen 2/Ryzen Z1 Extreme
Motherboard MSI MAG B550 Tomahawk/HP SFF Q77 Express/uh?/uh?/Asus
Cooling Enermax ETS-T50 Axe aRGB /basic HP HSF /errr.../oh! liqui..wait, no:sizable vapor chamber/a nice one
Memory 64gb Corsair Vengeance Pro 3600mhz DDR4/8gb DDR3 1600/2gb LPDDR3/8gb LPDDR5x 4200/16gb LPDDR5
Video Card(s) Hellhound Spectral White RX 7900 XTX 24gb/GT 730/Mali 450MP5/Adreno 740/RDNA3 768 core
Storage 250gb870EVO/500gb860EVO/2tbSandisk/NVMe2tb+1tb/4tbextreme V2/1TB Arion/500gb/8gb/256gb/2tb SN770M
Display(s) X58222 32" 2880x1620/32"FHDTV/273E3LHSB 27" 1920x1080/6.67"/AMOLED 2X panel FHD+120hz/FHD 120hz
Case Cougar Panzer Max/Elite 8300 SFF/None/back/back-front Gorilla Glass Victus 2+ UAG Monarch Carbon
Audio Device(s) Logi Z333/SB Audigy RX/HDMI/HDMI/Dolby Atmos/KZ x HBB PR2/Edifier STAX Spirit S3 & SamsungxAKG beans
Power Supply Chieftec Proton BDF-1000C /HP 240w/12v 1.5A/4Smart Voltplug PD 30W/Asus USB-C 65W
Mouse Speedlink Sovos Vertical-Asus ROG Spatha-Logi Ergo M575/Xiaomi XMRM-006/touch/touch
Keyboard Endorfy Thock 75% <3/none/touch/virtual
VR HMD Medion Erazer
Software Win10 64/Win8.1 64/Android TV 8.1/Android 13/Win11 64
Benchmark Scores bench...mark? i do leave mark on bench sometime, to remember which one is the most comfortable. :o
The design you're talking about was already a thing back in the late 90s / early 2000s (yep, indeed the VSA-100):
[Attached images: 3dfx Voodoo 5 6000 AGP 128MB Rev A1 octa-fan card, front view]

And as of today, a multi-GPU single card would still behave as a CFX/SLI card (look back at the GTX 590, GTX 690, R9 295X2, Radeon Pro Duo, ARES III, etc.), although they are preferable to multi-GPU, multi-card setups.

Or do you mean a logic unit (a coordinator) that dispatches the load across the chips on the card? In that case it would not make the card cheaper but rather more expensive, as it would require additional R&D work to see the light of day.
 
Joined
Mar 23, 2012
Messages
777 (0.18/day)
Location
Norway
System Name Games/internet/usage
Processor I7 5820k 4.2 Ghz
Motherboard ASUS X99-A2
Cooling custom water loop for cpu and gpu
Memory 16GiB Crucial Ballistix Sport 2666 MHz
Video Card(s) Radeon Rx 6800 XT
Storage Samsung XP941 500 GB + 1 TB SSD
Display(s) Dell 3008WFP
Case Caselabs Magnum M8
Audio Device(s) Shiit Modi 2 Uber -> Matrix m-stage -> HD650
Power Supply beQuiet dark power pro 1200W
Mouse Logitech MX518
Keyboard Corsair K95 RGB
Software Win 10 Pro
I think you would need the chips to be on an interposer (like Fiji is with HBM) for the latencies to be low enough for it to work. This makes the interposer the biggest "chip", but making wires is simpler than making transistors, so the interposer can use a coarser lithography and still get good yield.
 

Aquinus

Resident Wat-man
Joined
Jan 28, 2012
Messages
13,147 (2.94/day)
Location
Concord, NH, USA
System Name Apollo
Processor Intel Core i9 9880H
Motherboard Some proprietary Apple thing.
Memory 64GB DDR4-2667
Video Card(s) AMD Radeon Pro 5600M, 8GB HBM2
Storage 1TB Apple NVMe, 4TB External
Display(s) Laptop @ 3072x1920 + 2x LG 5k Ultrafine TB3 displays
Case MacBook Pro (16", 2019)
Audio Device(s) AirPods Pro, Sennheiser HD 380s w/ FIIO Alpen 2, or Logitech 2.1 Speakers
Power Supply 96w Power Adapter
Mouse Logitech MX Master 3
Keyboard Logitech G915, GL Clicky
Software MacOS 12.1
Would it be possible to design smaller GPUs in such a way that you could stack 4 or 6 of them on a single PCB, but they would present themselves to the system and behave as a single GPU, without the use of ANY software or driver feature for that? I mean, having several separate GPU chips that work together as a single physical GPU, not as a multi-GPU setup tied together with SLI/Crossfire software.
You just explained why the Cell processor on the PS3 was garbage for devs. Trying to orchestrate a bunch of different "cores," if you will, would put an incredible burden on a co-processor or the CPU. All of this adds latency.

I think you're over-simplifying the problem. You may want to leave the design aspect to electrical engineers, because there is a lot more to a GPU than merely rendering frames. A lot happens between the time commands are dispatched and the actual result. Just like in a CPU, there are internal components that need to be tightly coupled, which makes the horizontal scaling you suggest infeasible.

- Latency between GPUs once the signal leaves the actual chip
This is exactly why micro-stutter is a thing: unequal latencies between two GPUs under the same load. A secondary GPU would need to get its data to the primary early to eliminate this problem, or the primary would need latency purposely introduced to keep per-frame latencies consistent.
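A toy illustration of that mechanism, with invented numbers: two AFR GPUs render equally fast, but the secondary's frames pay an extra transfer hop before display, so the on-screen frame-to-frame gaps alternate even though the average frame rate looks fine:

```cpp
// Toy micro-stutter model. Both GPUs render in ~16 ms, offset by half a frame,
// but odd frames (from the secondary GPU) pay an extra copy to the primary.
// All numbers are invented for illustration.
#include <cstdio>
#include <vector>

int main() {
    const double render_ms = 16.0;     // both GPUs render equally fast
    const double transfer_ms = 6.0;    // assumed cost of moving a frame off the 2nd GPU

    std::vector<double> present_times;
    for (int frame = 0; frame < 8; ++frame) {
        double start = frame * (render_ms / 2.0);   // GPUs work offset by half a frame
        double done  = start + render_ms;
        if (frame % 2 == 1) done += transfer_ms;    // extra hop for the secondary GPU
        present_times.push_back(done);
    }
    for (size_t i = 1; i < present_times.size(); ++i)
        std::printf("frame %zu -> %zu gap: %.1f ms\n",
                    i - 1, i, present_times[i] - present_times[i - 1]);
    // Gaps alternate ~14 ms / ~2 ms instead of an even ~8 ms:
    // the classic micro-stutter pattern.
}
```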

Simply put, making the engine wider already results in poor scaling. We've seen this with AMD even on single-GPU setups, where beefier CUs/SMs do better than a larger number of CUs/SMs with smaller shader counts.

Now, this is pure brainstorming with very little actual electronics knowledge...

Quick question: what makes you think EEs haven't been considering all of this for years? I suspect they know how to do their job, and it's probably unwise to try to simplify such a complex problem, unless you've been withholding your EE credentials from us. ;)
 
Joined
Oct 2, 2004
Messages
13,791 (1.93/day)
While I agree that knowing how things actually work helps a ton, sometimes the dumbest ideas deliver the best results. When you're so focused on pre-existing designs and base all your work off them, it's hard to truly think outside the box.

I know that quite well from the software development side of things, and I think hardware is no different. Experts are unable to think on a "dumb level". But when dumb people like myself have an idea, it may spark something glorious in an expert's head, because they never thought of it on that level. Maybe not for this exact example, but you get the drift.
 

Mussels

Freshwater Moderator
Staff member
we probably just described the idiot explanation for async shaders in DX12
 

Aquinus

Resident Wat-man
While I agree that knowing how things actually work helps a ton, sometimes the dumbest ideas deliver the best results. When you're so focused on pre-existing designs and base all your work off them, it's hard to truly think outside the box.

I know that quite well from the software development side of things, and I think hardware is no different. Experts are unable to think on a "dumb level". But when dumb people like myself have an idea, it may spark something glorious in an expert's head, because they never thought of it on that level. Maybe not for this exact example, but you get the drift.
You underestimate the overhead of doing stuff in parallel, though (at least I think that's the case). I've been doing multi-threaded system integration for years now, and let me tell you, it's not an easy problem to solve. Orchestrating asynchronous processes is a bear, to say the least.
 
Joined
Oct 29, 2012
Messages
1,926 (0.46/day)
Location
UK
System Name TITAN Slayer / CPUCannon / MassFX
Processor i7 5960X @ 4.6Ghz / i7 3960x @5.0Ghz / FX6350 @ 4.?Ghz
Motherboard Rampage V Extreme / Rampage IV Extreme / MSI 970 Gaming
Cooling Phanteks PHTC14PE 2.5K 145mm TRs / Custom waterloop / Phanteks PHTC14PE + 3K 140mm Noctuas
Memory Crucial 2666 11-13-13-25 1.45V / G.skill RipjawsX 2400 10-12-12-34 1.7V / Crucial 2133 9-9-9-27 1.7V
Video Card(s) 3 Fury X in CF / R9 Fury 3840 cores 1145/570 1.3V / Nothing ATM
Storage 500GB Crucial SSD and 3TB WD Black / WD 1TB Black(OS) + WD 3TB Green / WD 1TB Blue
Display(s) LG 29UM67 80Hz/Asus mx299q 2560x1080 @ 84Hz / Asus VX239 1920x1080 @60hz
Case Dismatech easy v3.0 / Xigmatek Alfar (Open side panel)
Audio Device(s) M-audio M-track / realtek ALC 1150
Power Supply EVGA G2 1600W / CoolerMaster V1000 / Seasonic 620 M12-II
Mouse Mouse in review process/Razer Naga Epic 2011/Razer Naga 2014
Keyboard Keyboard in review process / Razer Blackwidow Ultimate 2014/Razer Blackwidow Ultimate 2011
Software Windows 7 Ultimate / Windows 7 ultimate / Windows 7 ultimate
Benchmark Scores cinebench 15.41 3960x @ 5.3ghz Wprime32m 3.352 3960x @ 5.25ghz Super PI 32m: 6m 42s 472ms @5.25ghz
The way I see it, it would be a huge gain if you just took each frame, split it into as many chunks as there are GPUs, told each GPU to render its chunk, and then passed it all to the displaying GPU, which would stitch everything back together. Sure, it wouldn't get anywhere near 100% performance scaling if one section of a frame had a vastly different amount of work in it than another, but it would be highly scalable and stutter-free, while also being relatively easy to support, unlike AFR, which right now barely ever works. All the GPUs would still need basically all the same data in memory, so you wouldn't see an improvement in memory efficiency, but it would still, IMO, be the simplest approach.

Now, I'm not a 3D graphics dev, but if I had to implement this I would basically just make one camera for each GPU and then stitch the different camera views together on the primary card. Sure, some of the GPUs would spend large amounts of time idling due to disparities in workload, but on average frame pushing would get faster and faster even with silly numbers of GPUs. CPU workload would also end up pretty high in my approach, which would suck, but hey, you gotta use those i7s for something, right?
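Here's a rough sketch of that split-frame idea with made-up per-strip workloads: the frame is only as fast as its busiest strip, so scaling is real but far from linear, while frame pacing stays even because every GPU works on the same frame:

```cpp
// Sketch of split-frame rendering: each GPU gets one horizontal strip, and the
// frame finishes when the slowest strip finishes. Workload numbers are made up.
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    const int gpus = 4;
    // Assumed relative cost of each strip (e.g. the sky is cheap, the strip
    // full of geometry and particles is expensive).
    std::vector<double> strip_cost_ms = {3.0, 6.0, 12.0, 5.0};

    double frame_time = *std::max_element(strip_cost_ms.begin(), strip_cost_ms.end());
    double total_work = 0.0;
    for (double c : strip_cost_ms) total_work += c;

    double single_gpu_time = total_work;                    // one GPU does everything
    double efficiency = total_work / (frame_time * gpus);   // how much of 4 GPUs is used

    std::printf("1 GPU frame time : %.1f ms\n", single_gpu_time);
    std::printf("%d GPU frame time: %.1f ms (limited by the busiest strip)\n",
                gpus, frame_time);
    std::printf("scaling efficiency: %.0f%%\n", efficiency * 100.0);
    // ~54% here: better than nothing, nowhere near linear, and no AFR-style
    // pacing problem because all GPUs are rendering the same frame.
}
```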
 

Aquinus

Resident Wat-man
Aren't GPU's already just that? A huge parallel computing thingie?
Parts of it are; there are a lot of parts that aren't. The ACEs and command processor aren't, the L2 cache isn't (which is shared by the entire GPU), the memory controller isn't, and the PCI-E controller isn't. This is all stuff that would need to be replicated, and things like the memory controller and L2 cache tend to be pretty big. Not sharing those would increase latency by a huge amount and would make the entire design unrealistic. Not all of this stuff used to be this close together, either: we used to have things like off-die L2 cache and off-die memory controllers, but they've all been moved closer to the compute cores because the further away they are, the worse latency and bandwidth get.
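A toy average-latency model (the cycle counts are invented, not measured) of why pulling something like the L2 off-die hurts so much:

```cpp
// Toy average memory latency model: avg = hit_rate * L2_latency + miss_rate * VRAM_latency.
// All cycle counts are invented for illustration only.
#include <cstdio>

int main() {
    const double hit_rate   = 0.70;   // assumed L2 hit rate for a shader workload
    const double on_die_l2  = 30.0;   // cycles to hit L2 on the same die
    const double off_die_l2 = 120.0;  // cycles if L2 lived on a neighbouring chip
    const double vram       = 350.0;  // cycles to go all the way out to VRAM

    auto avg = [&](double l2_cycles) {
        return hit_rate * l2_cycles + (1.0 - hit_rate) * vram;
    };

    std::printf("avg latency, shared on-die L2 : %.0f cycles\n", avg(on_die_l2));
    std::printf("avg latency, off-die L2       : %.0f cycles\n", avg(off_die_l2));
    // Every access that used to be a cheap local hit now pays an extra hop,
    // which is why those blocks keep migrating closer to the compute cores.
}
```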

The way I see it, it would be a huge gain if you just took each frame, split it into as many chunks as there are GPUs, told each GPU to render its chunk, and then passed it all to the displaying GPU, which would stitch everything back together. Sure, it wouldn't get anywhere near 100% performance scaling if one section of a frame had a vastly different amount of work in it than another, but it would be highly scalable and stutter-free, while also being relatively easy to support, unlike AFR, which right now barely ever works. All the GPUs would still need basically all the same data in memory, so you wouldn't see an improvement in memory efficiency, but it would still, IMO, be the simplest approach.

Now, I'm not a 3D graphics dev, but if I had to implement this I would basically just make one camera for each GPU and then stitch the different camera views together on the primary card. Sure, some of the GPUs would spend large amounts of time idling due to disparities in workload, but on average frame pushing would get faster and faster even with silly numbers of GPUs. CPU workload would also end up pretty high in my approach, which would suck, but hey, you gotta use those i7s for something, right?
The problem that needs to be solved is the latency between the time a frame is done rendering and the time it finds its way to the frame buffer that will be displaying it. Two GPUs at the same clock are going to (mostly) render frames at the same rate. You start running into the micro-stutter problem due to the latency of transferring that frame to the primary card in order to be displayed. I would argue that an on-board frame buffer that both GPUs send their frames to would be a far better solution. I say this because instead of trying to do something weird to smooth out the difference in latencies, you're actively putting both GPUs an equal distance away from the thing that will be displaying each frame. The cost is slightly higher overall render latency; the benefit would be reduced frame-to-frame latency variation.

As long as there is a primary GPU that handles the displays, I would argue that multi-GPU setups will never solve the micro-stutter problem.
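Continuing the invented numbers from the earlier AFR sketch: if every frame pays the same hop into a shared on-card buffer, the gaps even out at the cost of a little extra overall latency:

```cpp
// Follow-up to the AFR toy: both GPUs hand finished frames to a shared on-card
// buffer that sits the same "distance" from each of them, so every frame pays
// the same extra hop. Numbers remain invented.
#include <cstdio>
#include <vector>

int main() {
    const double render_ms = 16.0;
    const double hop_to_shared_buffer_ms = 6.0;  // paid by *every* frame now

    std::vector<double> present_times;
    for (int frame = 0; frame < 8; ++frame) {
        double start = frame * (render_ms / 2.0);
        double done  = start + render_ms + hop_to_shared_buffer_ms;
        present_times.push_back(done);
    }
    for (size_t i = 1; i < present_times.size(); ++i)
        std::printf("frame %zu -> %zu gap: %.1f ms\n",
                    i - 1, i, present_times[i] - present_times[i - 1]);
    // Every gap is now an even ~8 ms; frames just arrive ~6 ms later overall.
    // That's the "slightly higher render latency, lower frame-to-frame variation" trade-off.
}
```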
 
Joined
Oct 2, 2004
Messages
13,791 (1.93/day)
So, how does DX12 solve that? I mean, I think I've heard right that DX12 not only scales better, it also has almost no frame pacing issues. With that, once DX12 becomes the norm, graphics cards with stacked lower-end GPUs would make sense, wouldn't they?
 

cadaveca

My name is Dave
Joined
Apr 10, 2006
Messages
17,232 (2.61/day)
So, how does DX12 solve that? I mean, I think I've heard right that DX12 not only scales better, it also has almost no frame pacing issues. With that, once DX12 becomes the norm, graphics cards with stacked lower-end GPUs would make sense, wouldn't they?
Not really. NUMA. You know, the thing that basically makes AMD's CPUs slower.
 

Toothless

Tech, Games, and TPU!
Supporter
Joined
Mar 26, 2014
Messages
9,278 (2.52/day)
Location
Washington, USA
System Name Veral
Processor 5950x
Motherboard MSI MEG x570 Ace
Cooling Corsair H150i RGB Elite
Memory 4x16GB G.Skill TridentZ
Video Card(s) Powercolor 7900XTX Red Devil
Storage Crucial P5 Plus 1TB, Samsung 980 1TB, Teamgroup MP34 4TB
Display(s) Acer Nitro XZ342CK Pbmiiphx + 2x AOC 2425W
Case Fractal Design Meshify Lite 2
Audio Device(s) Blue Yeti + SteelSeries Arctis 5 / Samsung HW-T550
Power Supply Corsair HX850
Mouse Corsair Nightsword
Keyboard Corsair K55
VR HMD HP Reverb G2
Software Windows 11 Professional
Benchmark Scores PEBCAK
Sooo what we need is a laptop-based i5 and two GTX 1080s, all on one PCB with touching dies to be butt buddies, and four 8-pin PCIe power connectors to balance the 16GB VRAM and 20MB of L4 cache.

did i do it right
 