
MCM-GPU: Multi-Chip-Module GPUs for Continued Performance Scalability

[Image: mcm_0.png]


Historically, improvements in GPU-based high performance computing have been tightly coupled to transistor scaling. As Moore's law slows down, and the number of transistors per die no longer grows at historical rates, the performance curve of single monolithic GPUs will ultimately plateau. However, the need for higher performing GPUs continues to exist in many domains. To address this need, in this paper we demonstrate that package-level integration of multiple GPU modules to build larger logical GPUs can enable continuous performance scaling beyond Moore's law. Specifically, we propose partitioning GPUs into easily manufacturable basic GPU Modules (GPMs), and integrating them on package using high bandwidth and power efficient signaling technologies. We lay out the details and evaluate the feasibility of a basic Multi-Chip-Module GPU (MCM-GPU) design. We then propose three architectural optimizations that significantly improve GPM data locality and minimize the sensitivity on inter-GPM bandwidth. Our evaluation shows that the optimized MCM-GPU achieves 22.8% speedup and 5x inter-GPM bandwidth reduction when compared to the basic MCM-GPU architecture. Most importantly, the optimized MCM-GPU design is 45.5% faster than the largest implementable monolithic GPU, and performs within 10% of a hypothetical (and unbuildable) monolithic GPU. Lastly we show that our optimized MCM-GPU is 26.8% faster than an equally equipped Multi-GPU system with the same total number of SMs and DRAM bandwidth.

Source: http://research.nvidia.com/publication/2017-06_MCM-GPU:-Multi-Chip-Module-GPUs
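A quick back-of-the-envelope that puts the abstract's relative figures on one scale. Treating each percentage as a simple speedup ratio against its stated baseline is an assumption; the paper reports averages over a workload suite.

```python
# Back-of-the-envelope: putting the abstract's relative figures on one scale.
# Assumption: each percentage is a simple speedup ratio against its stated
# baseline; the largest buildable monolithic GPU is normalized to 1.0.

monolithic_buildable = 1.00                      # largest implementable monolithic GPU
optimized_mcm = monolithic_buildable * 1.455     # optimized MCM-GPU is 45.5% faster than it
basic_mcm = optimized_mcm / 1.228                # optimized is 22.8% faster than basic MCM-GPU
multi_gpu = optimized_mcm / 1.268                # optimized is 26.8% faster than the multi-GPU setup
hypothetical_monolithic = optimized_mcm / 0.90   # optimized is "within 10%" of it (upper bound)

for name, perf in [
    ("multi-GPU (same SMs / DRAM BW)", multi_gpu),
    ("basic MCM-GPU", basic_mcm),
    ("largest buildable monolithic", monolithic_buildable),
    ("optimized MCM-GPU", optimized_mcm),
    ("hypothetical monolithic (<=)", hypothetical_monolithic),
]:
    print(f"{name:32s} {perf:.2f}x")
```

On this reading, the basic MCM-GPU lands around 1.19x the biggest buildable monolithic GPU, the optimized one around 1.46x, and the unbuildable monolithic design would top out near 1.62x.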
 
I wonder if this is why their SLI support has been slipping. They're working on what are basically multi-GPU, single-package setups instead.
 
this would be awesome
 
nvidia threadripper GTX1280Ti / Titan XPXp :slap: :D (just kidding, btw)
 
nvidia threadripper GTX1280Ti / Titan XPXp :slap: :D (just kidding, btw)
Every joke has a grain of truth :laugh:
'Cause it looks exactly like a Threadripper block diagram, which is not necessarily a bad thing.
 
I think it's cost effective (smaller dies = better yields), but what's the point?
That design is stupid at this point in time (more stacked RAM = a LOT more expensive package and end product).
Not to mention HBM 2.0 memory is in short supply (so all the yield gained from smaller GPU dies will be wasted on stacked RAM).
 
Yep, it's the way to go: cram more silicon into a package without the increased cost and difficulty of manufacturing a massive die. The bonus for GPUs is that you can dramatically increase shading power without having to worry as much about improving power efficiency, since you can use lower clocks on smaller dies (rough numbers in the sketch below). It's the only way to win some more time until a big revolution happens in chip manufacturing (like using materials other than silicon, or something else).

It's not just AMD that will do this; Nvidia and Intel will be obligated to do it too at some point. The first to do it can have a huge advantage.
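A minimal sketch of the lower-clocks trade-off, assuming dynamic power scales roughly as C·V²·f and that voltage can be dropped more or less in proportion to clock within the normal operating range. The 4x-units / 0.7x-clock split is an illustrative assumption, not a measured design point.

```python
# Toy model of "more, slower silicon" vs. one big fast die.
# Assumptions: dynamic power ~ units * V^2 * f, voltage tracks frequency,
# and throughput scales ideally with parallel units times clock.

def relative_power(units, clock, voltage):
    # Switching capacitance grows with the number of active execution units.
    return units * voltage**2 * clock

def relative_throughput(units, clock):
    # Idealized: throughput = parallel units * clock.
    return units * clock

# Baseline: one big die at full clock and voltage.
base_power = relative_power(units=1.0, clock=1.0, voltage=1.0)
base_perf = relative_throughput(units=1.0, clock=1.0)

# MCM-style: 4x the execution units, clocked (and volted) at ~70%.
mcm_power = relative_power(units=4.0, clock=0.7, voltage=0.7)
mcm_perf = relative_throughput(units=4.0, clock=0.7)

print(f"throughput: {mcm_perf / base_perf:.2f}x")                            # ~2.8x
print(f"power:      {mcm_power / base_power:.2f}x")                          # ~1.37x
print(f"perf/watt:  {(mcm_perf / mcm_power) / (base_perf / base_power):.2f}x")  # ~2.0x
```

Even with these toy numbers, spending area instead of clocks buys roughly 2x perf/W, which is the whole appeal of the approach.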

I think it's cost effective (smaller dies = better yields), but what's the point?

You already answered your question: smaller dies = better yields. It's not the memory that's relevant; you can use conventional memory too and still reap the advantages of this approach (rough yield math below).
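To put rough numbers on the yield point: a quick sketch using the simple Poisson defect-yield model Y = exp(-D·A). The defect density and die areas below are illustrative assumptions, not foundry data.

```python
import math

# Poisson defect-yield model: Y = exp(-D * A).
# Assumed defect density and die areas; one ~600 mm^2 monolithic die
# vs. four ~150 mm^2 GPU modules (GPMs).

DEFECT_DENSITY = 0.1  # defects per cm^2 (assumed)

def poisson_yield(area_mm2, d0=DEFECT_DENSITY):
    area_cm2 = area_mm2 / 100.0
    return math.exp(-d0 * area_cm2)

big_die = 600.0    # mm^2, one monolithic GPU (assumed)
small_die = 150.0  # mm^2, one GPM (assumed)

y_big = poisson_yield(big_die)
y_small = poisson_yield(small_die)

print(f"600 mm^2 monolithic die yield: {y_big:.1%}")   # ~55%
print(f"150 mm^2 GPM yield:            {y_small:.1%}") # ~86%

# Good GPMs can be binned from anywhere on the wafer, so the fraction of
# wafer area that ends up in sellable products tracks the per-die yield.
print(f"usable-silicon advantage: {y_small / y_big:.2f}x")
```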
 
You already answered your question: smaller dies = better yields.
1) HBM 2.0 is expensive and scarce right now, and GDDR5X/6 isn't economically capable of driving four proper GPUs on the same package.
The memory bandwidth simply won't be there (128-bit per GPU = 512-bit overall for a four-GPU MCM setup; back-of-the-envelope numbers in the sketch below).
2) The package will have to get a LOT bigger (and it's kind of at its limit right now).
3) Useful package space will go down (because of the interconnect), and multiple chips will use more space than a single chip (assuming transistor density is equal/similar for both MCM and single die).
4) Four GTX 1050 Tis on a single package?
Not the best idea from a performance standpoint.
5) Making software will be a nightmare (Ryzen game performance ain't got nothing on frame pacing).
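For reference, rough aggregate bandwidth for the bus widths being discussed, using typical 2017-era pin rates (GDDR5X around 10 Gb/s per pin, HBM2 around 2 Gb/s per pin on a 1024-bit interface per stack). The exact memory configuration of a hypothetical four-module package is of course an assumption.

```python
# Back-of-the-envelope memory bandwidth for the "128-bit per GPU" objection.

def bandwidth_gbs(bus_width_bits, pin_rate_gbps):
    return bus_width_bits * pin_rate_gbps / 8.0  # GB/s

# Four modules, each with a 128-bit GDDR5X interface at 10 Gb/s per pin.
per_module_gddr = bandwidth_gbs(128, 10.0)   # 160 GB/s per module
total_gddr = 4 * per_module_gddr             # 640 GB/s aggregate

# For comparison: one HBM2 stack per module at 2.0 Gb/s per pin.
per_stack_hbm2 = bandwidth_gbs(1024, 2.0)    # 256 GB/s per stack
total_hbm2 = 4 * per_stack_hbm2              # 1024 GB/s aggregate

print(f"GDDR5X, 4 x 128-bit: {per_module_gddr:.0f} GB/s per module, {total_gddr:.0f} GB/s total")
print(f"HBM2,   4 x 1 stack: {per_stack_hbm2:.0f} GB/s per stack,  {total_hbm2:.0f} GB/s total")
```

For scale, a GTX 1080 Ti's 352-bit GDDR5X interface delivers roughly 484 GB/s, so 640 GB/s aggregate isn't unreasonable; whether it's enough per module depends on how big each module is.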
 
1) HBM 2.0 is expensive and scarce right now, and GDDR5X/6 isn't economically capable of driving four proper GPUs on the same package.
The memory bandwidth simply won't be there (128-bit per GPU = 512-bit overall for a four-GPU MCM setup).
2) The package will have to get a LOT bigger (and it's kind of at its limit right now).
3) Useful package space will go down (because of the interconnect), and multiple chips will use more space than a single chip (assuming transistor density is equal/similar for both MCM and single die).
4) Four GTX 1050 Tis on a single package?
Not the best idea from a performance standpoint.
5) Making software will be a nightmare (Ryzen game performance ain't got nothing on frame pacing).

1) You don't know that; memory bandwidth is what's important, and I don't see why conventional memory wouldn't be able to "drive" multiple dies while HBM would. Of course it's not going to perform as well, but honestly it's more of a proof-of-concept thing. I'm sure this is ultimately designed with HBM in mind, and the price will come down eventually. You are focusing too much on memory; that's not the point of this.
2) You are confusing things: it's the die sizes that are the limit, not the package. That's the whole idea of this: to not be limited by the processor size. Putting many dies on an interposer is certainly not going to be an extremely easy task, but I don't see it as impractical. The main problem is apparently fragility; the Fury X was notorious for how easy it was to damage the die while removing the cooler.
3) Again, it's not the size that's the issue.
4) I do not understand what you mean by that. 4x GTX 1050 Ti dies = something between a GTX 1080 and a 1080 Ti in terms of shading power (give or take; rough core counts in the sketch below), with whatever clocks and power consumption such a chip would have (between a 1080 and a 1080 Ti), BUT here's the advantage: it can cost LESS to manufacture. How is that not the best idea? GPUs are meant for high parallelism and near-perfect scalability as the main design paradigm; it doesn't matter that those shaders are not on the same die nearly as much as you think it does. And the goal is to use chips that are already at the limit of the manufacturing process to get something faster when you simply couldn't do it otherwise.
5) No it will not; you are again making the mistake of comparing CPUs with GPUs, two things that couldn't be more different. The point of all this is to use the multiple dies as a single pool of resources, like you would normally do; otherwise why bother? Don't you see the single SYS+I/O block in the diagram connected to all the GPU modules? It suggests that by design there is one initial level of instruction pipelining and scheduling; think of it as a global GigaThread engine, which is what Nvidia calls it in their architectures.
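For point 4, the rough core-count math, using the published CUDA core counts for these Pascal parts (clocks and memory are ignored entirely, which is a simplification):

```python
# Rough shader-count comparison behind the "4x GTX 1050 Ti dies" point.
cuda_cores = {
    "GTX 1050 Ti": 768,
    "GTX 1080":    2560,
    "GTX 1080 Ti": 3584,
}

four_small_dies = 4 * cuda_cores["GTX 1050 Ti"]   # 3072 cores

print(f"4 x GTX 1050 Ti dies: {four_small_dies} CUDA cores")
print(f"GTX 1080:             {cuda_cores['GTX 1080']} CUDA cores")
print(f"GTX 1080 Ti:          {cuda_cores['GTX 1080 Ti']} CUDA cores")
# 3072 lands between the 1080 and the 1080 Ti, as claimed above.
```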

The only inherent design flaw of this is latency, but it can be mitigated to an extent by an efficient way of connecting the dies and by bigger caches.

To add to all of this: gaming is, you guessed it, NOT the main thing that will benefit from this. It's compute that will benefit greatly. And this is not just for GPUs; CPUs can benefit to a certain degree too (well, really, they already do), as can any other form of ASIC.
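A toy illustration of the "single pool of resources" idea in point 5, and of why locality matters for the caches mentioned just above: one global front-end hands out the thread blocks of a kernel launch across the GPU modules, so software still sees one logical GPU. The module count and both policies are illustrative assumptions, not how NVIDIA's GigaThread engine is actually implemented.

```python
# Toy distribution of a kernel's thread blocks (CTAs) across GPU modules.
NUM_GPMS = 4  # assumed number of GPU modules in the package

def round_robin(num_blocks):
    # Blocks interleaved across modules: spreads load, poor data locality.
    return {g: [b for b in range(num_blocks) if b % NUM_GPMS == g]
            for g in range(NUM_GPMS)}

def chunked(num_blocks):
    # Contiguous ranges of blocks per module: neighbouring blocks (which often
    # touch neighbouring data) stay on the same module's local DRAM and caches.
    per_gpm = (num_blocks + NUM_GPMS - 1) // NUM_GPMS
    return {g: list(range(g * per_gpm, min((g + 1) * per_gpm, num_blocks)))
            for g in range(NUM_GPMS)}

print(round_robin(12))  # {0: [0, 4, 8], 1: [1, 5, 9], 2: [2, 6, 10], 3: [3, 7, 11]}
print(chunked(12))      # {0: [0, 1, 2], 1: [3, 4, 5], 2: [6, 7, 8], 3: [9, 10, 11]}
```

The chunked variant is the kind of locality-aware assignment that keeps traffic on a module's own memory and reduces pressure on the inter-module links.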
 
If they want to keep clocks up and be able to continue sucking down power, they need to spread the heat out, which means pulling the GPU apart, and this is the way to do it. Even if HBM isn't part of the picture, physically spreading computation resources out will help scaling. Two dies on one interposer with a massive IMC in the middle to drive GDDR5X isn't a bad idea.
 