
[Theory] Multi-GPU graphic card that behaves like a single GPU?

I've got some brainstorming going on around AMD's Navi architecture, particularly the "scaling" part they mention, which suggests multi-GPU setups on a single PCB. So, I've expanded that a bit...

We know creating huge GPUs is very costly, mostly because of defective GPUs on the wafers.
So, more small GPUs fit on a single wafer, meaning you get more of them and fewer defective ones -> cheaper.
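Some rough back-of-the-envelope math on that (the die sizes and defect density below are just assumed numbers, and the Poisson yield model is only a first-order estimate, nothing foundry-specific):

```cpp
#include <cmath>
#include <cstdio>

// Toy yield comparison: one big monolithic die vs. four small dies per "GPU".
// Assumed numbers (NOT real foundry data): 300 mm wafer, 0.1 defects/cm^2,
// 600 mm^2 monolithic die vs. 150 mm^2 chiplets. Edge and scribe losses ignored.
int main()
{
    const double waferArea     = 3.14159265 * 150.0 * 150.0; // mm^2 of a 300 mm wafer
    const double defectsPerMm2 = 0.1 / 100.0;                // 0.1 defects per cm^2

    auto goodDies = [&](double dieArea) {
        double diesPerWafer = waferArea / dieArea;                // crude die count
        double yield        = std::exp(-defectsPerMm2 * dieArea); // Poisson yield model
        return diesPerWafer * yield;
    };

    printf("good 600 mm^2 monolithic dies per wafer: %.0f\n", goodDies(600.0));
    printf("complete 4x150 mm^2 chip sets per wafer: %.0f\n", goodDies(150.0) / 4.0);
}
```

With those made-up numbers you get roughly 65 big dies but about 100 complete four-chip sets from the same wafer, which is the whole point.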

Fitting more GPUs the way we do it now requires SLI/CrossFireX profiles, which are always a pain in the ass, and it just doesn't scale well.

Now, this is pure brainstorming with very little actual electronics knowledge...

Would it be possible to design smaller GPUs in such a way that you could stack 4 or 6 of them on a single PCB, but they would present themselves to the system and behave as a single GPU without the use of ANY software or driver feature for that? I mean, like having multiple separate GPU chips that work together as a single physical GPU, and not as a multi-GPU setup tied together with SLI/Crossfire software.
Imagine its internal compute units being connected to each other, but spread across multiple actual chips.
Or, to go even further, separating the compute units from the memory controller and decoders and having them as individual chips on a card. Would it even be theoretically possible to make such a radical change to the way graphics cards are designed?

This would bring:
- Cheaper high-end graphics due to simple stacking of slower, smaller cores
- No scaling issues
- Decentralized heat output (multiple moderate heat points as opposed to the current single highly concentrated one)

What I suspect the limitations are:
- Latency between GPUs once the signal leaves the actual chip
- Memory connections & cache design
 
DX12 multi GPU could simplify it greatly, in coming years. Could effectively slap a bunch of laptop size/TDP cards onto a single PCB and away you go.


Harder to backport to DX11 and older, so it won't happen any time soon.
 
Basically, DX12 does in software what I'm thinking should be done entirely in hardware. It's combining separate compute "pools" into a single unit. So, AMD is kinda betting everything on that with the Navi architecture.

They will probably create a dual-GPU card at first, where a single GPU will be fast enough to handle DX11 games even without any scaling, and for DX12 it would harness the full dual-GPU power. At least in theory...
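For what it's worth, this is roughly how DX12 exposes the hardware today (just an enumeration sketch in C++, nothing AMD- or Navi-specific, and all the actual work-sharing would still be on the engine):

```cpp
#include <d3d12.h>
#include <dxgi1_4.h>
#include <wrl/client.h>
#include <cstdio>
#include <vector>
// Link with d3d12.lib and dxgi.lib.

using Microsoft::WRL::ComPtr;

int main()
{
    ComPtr<IDXGIFactory4> factory;
    CreateDXGIFactory1(IID_PPV_ARGS(&factory));

    // Explicit multi-adapter: every GPU shows up as its own adapter and the
    // engine has to create a device (and split the work) for each of them.
    std::vector<ComPtr<ID3D12Device>> devices;
    ComPtr<IDXGIAdapter1> adapter;
    for (UINT i = 0; factory->EnumAdapters1(i, &adapter) != DXGI_ERROR_NOT_FOUND; ++i)
    {
        ComPtr<ID3D12Device> device;
        if (SUCCEEDED(D3D12CreateDevice(adapter.Get(), D3D_FEATURE_LEVEL_11_0,
                                        IID_PPV_ARGS(&device))))
        {
            // Linked-node mode is the closer analogue to "one GPU made of many
            // chips": GPUs behind one adapter appear as nodes of a single device.
            printf("adapter %u: %u node(s)\n", i, device->GetNodeCount());
            devices.push_back(device);
        }
    }
}
```

So even in the best case, the "single GPU" picture only exists because one device reports multiple nodes; the scheduling across them is still software's job, which is exactly the part I'd want moved into hardware.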

I'm just concerned a bit knowing the history of VSA100. Though granted, back then there was NO support for it. Now we have DX12, even if only partially.
 
well imagine if they wrote a DX12 wrapper that converted it to other systems like DX9/10/11?
Bam, superduper fancy pants multi GPU support.

that'd bring up their share price.
 
The design you're talking about was already a thing back in the late '90s and early 2000s (yep, indeed the VSA100).
[Images: 3dfx Voodoo 5 6000 AGP 128MB Rev A1]

And as of today, a multi-GPU single card will always behave as a CFX/SLI card (look back at the 590, 690, 295X2, Radeon Pro Duo, ARES III, etc.), although they are preferable to multi-GPU, multi-card setups.

Or do you mean a logic unit (a coordinator) that dispatches the load across the chips on the card? In that case, it would not make the card cheaper but rather more expensive, as it would require additional R&D work to see the light of day.
 
I think you would need the chips to be on an interposer (like Fiji is with HBM) for the latencies to be low enough for it to work. This makes the interposer the biggest "chip" but making wires is simpler than making transistors, and the lithography can be bigger and with better yield.
 
Would it be possible to design smaller GPUs in such a way that you could stack 4 or 6 of them on a single PCB, but they would present themselves to the system and behave as a single GPU without the use of ANY software or driver feature for that? I mean, like having multiple separate GPU chips that work together as a single physical GPU, and not as a multi-GPU setup tied together with SLI/Crossfire software.
You just explained why the Cell processor on the PS3 was garbage for devs. Trying to orchestrate a bunch of different "cores," if you will, would put an incredible burden on a co-processor or the CPU. All of this adds latency.

I think you're over-simplifying the problem. You may want to leave the design aspect to electrical engineers, because there is a lot more to a GPU than merely rendering frames. A lot happens between the time that commands are dispatched and the actual result. Just like a CPU, there are internal components that need to be tightly coupled, which makes the horizontal scaling you suggest infeasible.

- Latency between GPUs once the signal leaves the actual chip
This is exactly why micro-stutter is a thing: unequal latencies between two GPUs with the same load. A secondary GPU would need to get its data to the primary early to eliminate this problem, or the primary would need to have latency purposely introduced to keep per-frame latencies consistent.
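To put some toy numbers on that (purely an illustrative simulation, not real driver behaviour):

```cpp
#include <cstdio>

// Toy AFR timeline: two GPUs, each taking 32 ms per frame, starting half a
// frame apart. Frames from the secondary GPU pay an extra 4 ms transfer to the
// primary before they can be displayed. All numbers are made up.
int main()
{
    const double render = 32.0; // ms per frame on each GPU
    const double xfer   = 4.0;  // ms to move a secondary-GPU frame to the primary
    double prevRaw = 0.0, prevPaced = 0.0;

    for (int f = 0; f < 8; ++f)
    {
        int    gpu   = f % 2;              // even frames: primary, odd: secondary
        double start = f * (render / 2.0); // GPUs start half a frame apart
        double raw   = start + render + (gpu ? xfer : 0.0);
        double paced = start + render + xfer; // delay primary frames by the same amount

        if (f > 0)
            printf("frame %d: raw delta %4.1f ms, paced delta %4.1f ms\n",
                   f, raw - prevRaw, paced - prevPaced);
        prevRaw = raw;
        prevPaced = paced;
    }
}
```

The raw deltas alternate 20/12 ms even though both GPUs average 16 ms per frame, and that alternation is the micro-stutter; artificially delaying the primary's frames by the same transfer cost evens the pacing at the price of a bit of extra latency.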

Simply put, making the engine wider already results in poor performance. We've seen this already with AMD, even on single-GPU setups, where fewer, beefier CUs/SMs do better than more CUs/SMs with smaller shader counts.

Now, this is pure brainstorming with very little actual electronics knowledge...

Quick question: what makes you think EEs haven't been considering all of this for years? I suspect that they know how to do their job, and it is probably unwise to try to simplify such a complex problem, unless you've been withholding your EE credentials from us. ;)
 
While I agree that knowing how things actually work helps a ton, sometimes the dumbest ideas deliver the best results. When you're so focused on pre-existing designs and you base all your work off them, it's hard to truly think outside the box.

I know that quite well from the software development side of things, and I think hardware is no different. Experts are unable to think on a "dumb level". But when dumb people like myself have an idea, it may spark something glorious in experts' heads, because they never thought of it on that level. Maybe not for this exact example, but you get the drift.
 
we probably just described the idiot explanation for async shaders in DX12
 
While I agree that knowing how things actually work helps a ton, sometimes the dumbest ideas deliver the best results. When you're so focused on pre-existing designs and you base all your work off them, it's hard to truly think outside the box.

I know that quite well from the software development side of things, and I think hardware is no different. Experts are unable to think on a "dumb level". But when dumb people like myself have an idea, it may spark something glorious in experts' heads, because they never thought of it on that level. Maybe not for this exact example, but you get the drift.
You underestimate the overhead of doing stuff in parallel though (at least I think that's the case.) I've been doing multi-threaded system integration for years now and let me tell you something, it's not an easy problem to solve. Orchestrating asynchronous processes is a bear to say the least.
 
Aren't GPUs already just that? A huge parallel computing thingie?
 
The way I see it, it would be a huge gain if you just took each frame, split it into as many chunks as there are GPUs, told each GPU to render its chunk, and then passed all that to the displaying GPU, which would stitch everything back together. Sure, it wouldn't get anywhere near 100% performance scaling if one section of a frame had a vastly different amount of work in it than another, but it would be highly scalable and stutter-free, while also being relatively easy to support, unlike AFR, which right now barely ever works. All the GPUs would still need basically all the same data in memory, so you wouldn't see an improvement in memory efficiency, but it would still IMO be the simplest approach.

Now, I'm not a 3D graphics dev, but if I had to implement this, I would basically just make one camera for each GPU and then stitch the different camera views together on the primary card. Sure, some of the GPUs would spend large amounts of time idling due to disparities in workload, but on average, frame pushing would get faster and faster even with silly amounts of GPUs. CPU workload would also end up pretty high in my approach, which would suck, but hey, you gotta use those i7s for something, right?
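Just to make the "chunks" idea concrete, here's the strip math I have in mind (a C++ sketch with made-up names; the "rendering" is obviously faked):

```cpp
#include <cstdio>
#include <vector>

// Split-frame sketch: carve the render target into horizontal strips, let each
// GPU fill its own strip of the SAME camera view, then the primary presents the
// stitched result. Hypothetical helper, not any real engine's API.
struct Strip { int y0, y1; };

std::vector<Strip> splitFrame(int height, int gpuCount)
{
    std::vector<Strip> strips;
    for (int g = 0; g < gpuCount; ++g)
        strips.push_back({ height * g / gpuCount, height * (g + 1) / gpuCount });
    return strips;
}

int main()
{
    const int width = 1920, height = 1080, gpus = 4;
    std::vector<unsigned> frame(width * height, 0);
    std::vector<Strip> strips = splitFrame(height, gpus);

    for (int g = 0; g < gpus; ++g)
    {
        // "Render": each GPU would rasterize only rows [y0, y1); here we just
        // tag the pixels with the GPU index that owns them.
        for (int y = strips[g].y0; y < strips[g].y1; ++y)
            for (int x = 0; x < width; ++x)
                frame[y * width + x] = g;
        printf("GPU %d rendered rows %d..%d\n", g, strips[g].y0, strips[g].y1 - 1);
    }
    // Primary GPU now owns the stitched frame and presents it.
}
```

This is basically the old split-frame/scissor approach; in practice you'd also have to balance strip heights by workload and copy each finished strip back to the primary card.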
 
Aren't GPUs already just that? A huge parallel computing thingie?
Parts of it are; there are a lot of parts that aren't. The ACEs and command processor aren't, the L2 cache isn't (it's shared by the entire GPU), the memory controller isn't, and the PCI-E controller isn't. This is all stuff that would need to be replicated, and things like the memory controller and L2 cache tend to be pretty big. Not sharing those would increase latency by a huge amount and would make the entire design unrealistic. Not all of this stuff used to be close together, though. We used to have things like off-die L2 cache and off-die memory controllers, but they've all been moved closer to the compute cores because the further away they are, the worse latency and bandwidth get.

The way I see it, it would be a huge gain if you just took each frame, split it into as many chunks as there are GPUs, told each GPU to render its chunk, and then passed all that to the displaying GPU, which would stitch everything back together. Sure, it wouldn't get anywhere near 100% performance scaling if one section of a frame had a vastly different amount of work in it than another, but it would be highly scalable and stutter-free, while also being relatively easy to support, unlike AFR, which right now barely ever works. All the GPUs would still need basically all the same data in memory, so you wouldn't see an improvement in memory efficiency, but it would still IMO be the simplest approach.

Now, I'm not a 3D graphics dev, but if I had to implement this, I would basically just make one camera for each GPU and then stitch the different camera views together on the primary card. Sure, some of the GPUs would spend large amounts of time idling due to disparities in workload, but on average, frame pushing would get faster and faster even with silly amounts of GPUs. CPU workload would also end up pretty high in my approach, which would suck, but hey, you gotta use those i7s for something, right?
The problem that needs to be solved is the latency between the time that the frame is done being rendered and the time it can find its way to the frame buffer that will be displaying it. Two GPUs at the same clock are going to (mostly) render frames at the same rate. You start running into the micro-stutter problem due to the latency of transferring that frame to the primary card in order to be displayed. I would argue that an on-board frame buffer that the GPUs send the frames to would be a far better solution. I say this because, instead of trying to do something weird to smooth out the difference in latencies, you're actively putting both GPUs an equal distance away from the thing that will be displaying each frame. The cost is slightly higher overall render latency; the benefit would be reduced frame-to-frame latency variation.

So long as there is a primary GPU that handles the displays, I would argue that multi-GPU setups will never solve the micro-stutter problem.
 
So, how does DX12 solve that? I mean, I think I've heard right that DX12 not only scales better, it also has almost no frame pacing issues. With that, once DX12 becomes the norm, graphics cards with stacked lower-end GPUs would make sense, wouldn't they?
 
So, how does DX12 solve that? I mean, I think I've heard right that DX12 not only scales better, it also has almost no frame pacing issues. With that, once DX12 becomes the norm, graphics cards with stacked lower-end GPUs would make sense, wouldn't they?
Not really. NUMA. You know, the thing that basically makes AMD's CPUs slower.
 
Sooo what we need is a laptop based i5 and two GTX1080s all on one PCB with touching dies to be butt buddies and four 8 pin pci power connectors to balance the 16gb vram and 20mb of l4 cache.

did i do it right
 