
MCM-GPU: Multi-Chip-Module GPUs for Continued Performance Scalability

Joined
Aug 22, 2010
Messages
756 (0.15/day)
Location
Germany
System Name Acer Nitro 5 (AN515-45-R715)
Processor AMD Ryzen 9 5900HX
Motherboard AMD Promontory / Bixby FCH
Cooling Acer Nitro Sense
Memory 32 GB
Video Card(s) AMD Radeon Graphics (Cezanne) / NVIDIA RTX 3080 Laptop GPU
Storage WDC PC SN530 SDBPNPZ
Display(s) BOE CQ NE156QHM-NY3
Software Windows 11 beta channel


Historically, improvements in GPU-based high performance computing have been tightly coupled to transistor scaling. As Moore's law slows down, and the number of transistors per die no longer grows at historical rates, the performance curve of single monolithic GPUs will ultimately plateau. However, the need for higher performing GPUs continues to exist in many domains. To address this need, in this paper we demonstrate that package-level integration of multiple GPU modules to build larger logical GPUs can enable continuous performance scaling beyond Moore's law. Specifically, we propose partitioning GPUs into easily manufacturable basic GPU Modules (GPMs), and integrating them on package using high bandwidth and power efficient signaling technologies. We lay out the details and evaluate the feasibility of a basic Multi-Chip-Module GPU (MCM-GPU) design. We then propose three architectural optimizations that significantly improve GPM data locality and minimize the sensitivity on inter-GPM bandwidth. Our evaluation shows that the optimized MCM-GPU achieves 22.8% speedup and 5x inter-GPM bandwidth reduction when compared to the basic MCM-GPU architecture. Most importantly, the optimized MCM-GPU design is 45.5% faster than the largest implementable monolithic GPU, and performs within 10% of a hypothetical (and unbuildable) monolithic GPU. Lastly we show that our optimized MCM-GPU is 26.8% faster than an equally equipped Multi-GPU system with the same total number of SMs and DRAM bandwidth.

Source: http://research.nvidia.com/publication/2017-06_MCM-GPU:-Multi-Chip-Module-GPUs
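
To put the abstract's percentages side by side, here is a quick back-of-envelope script; all figures are taken from the abstract, and normalizing to the basic MCM-GPU is just an illustrative choice, not something the paper does.

```python
# Rough sanity check of the relative-performance figures quoted in the abstract.
# Performance is normalized to the basic MCM-GPU design; all percentages come from the abstract.

basic_mcm = 1.00
optimized_mcm = basic_mcm * 1.228            # 22.8% faster than the basic MCM-GPU

# Work backwards to the other baselines the paper compares against.
largest_monolithic = optimized_mcm / 1.455   # optimized is 45.5% faster than the largest buildable monolithic GPU
multi_gpu_system   = optimized_mcm / 1.268   # optimized is 26.8% faster than an equally equipped multi-GPU system
hypothetical_mono  = optimized_mcm / 0.90    # optimized performs "within 10%" of an unbuildable monolithic GPU

print(f"Largest buildable monolithic GPU: {largest_monolithic:.2f}")
print(f"Basic MCM-GPU:                    {basic_mcm:.2f}")
print(f"Equivalent multi-GPU system:      {multi_gpu_system:.2f}")
print(f"Optimized MCM-GPU:                {optimized_mcm:.2f}")
print(f"Hypothetical monolithic GPU:      {hypothetical_mono:.2f} (upper bound)")
```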
 

cdawall

where the hell are my stars
Joined
Jul 23, 2006
Messages
27,680 (4.27/day)
Location
Houston
System Name All the cores
Processor 2990WX
Motherboard Asrock X399M
Cooling CPU-XSPC RayStorm Neo, 2x240mm+360mm, D5PWM+140mL, GPU-2x360mm, 2xbyski, D4+D5+100mL
Memory 4x16GB G.Skill 3600
Video Card(s) (2) EVGA SC BLACK 1080Ti's
Storage 2x Samsung SM951 512GB, Samsung PM961 512GB
Display(s) Dell UP2414Q 3840X2160@60hz
Case Caselabs Mercury S5+pedestal
Audio Device(s) Fischer HA-02->Fischer FA-002W High edition/FA-003/Jubilate/FA-011 depending on my mood
Power Supply Seasonic Prime 1200w
Mouse Thermaltake Theron, Steam controller
Keyboard Keychron K8
Software W10P
I wonder if this is why their SLI support has been slipping. They're working on basically multi-GPU, single-chip setups instead.
 
Joined
Nov 13, 2007
Messages
10,233 (1.70/day)
Location
Austin Texas
Processor 13700KF Undervolted @ 5.6/ 5.5, 4.8Ghz Ring 200W PL1
Motherboard MSI 690-I PRO
Cooling Thermalright Peerless Assassin 120 w/ Arctic P12 Fans
Memory 48 GB DDR5 7600 MHZ CL36
Video Card(s) RTX 4090 FE
Storage 2x 2TB WDC SN850, 1TB Samsung 960 prr
Display(s) Alienware 32" 4k 240hz OLED
Case SLIGER S620
Audio Device(s) Yes
Power Supply Corsair SF750
Mouse Xlite V2
Keyboard RoyalAxe
Software Windows 11
Benchmark Scores They're pretty good, nothing crazy.
this would be awesome
 
Joined
May 28, 2005
Messages
4,994 (0.72/day)
Location
South of England
System Name Box of Distraction
Processor Ryzen 7 1800X
Motherboard Crosshair VI Hero
Cooling Custom watercooling
Memory G.Skill TridentZ 2x8GB @ 3466MHz CL14 1T
Video Card(s) EVGA 1080Ti FE. WC'd & TDP limit increased to 360W.
Storage Samsung 960 Evo 500GB & WD Black 2TB storage drive.
Display(s) Asus ROG Swift PG278QR 27" 1440P 165hz Gsync
Case Phanteks Enthoo Pro M
Audio Device(s) Phillips Fidelio X2 headphones / basic Bose speakers
Power Supply EVGA Supernova 750W G3
Mouse Logitech G602
Keyboard Cherry MX Board 6.0 (mx red switches)
Software Win 10 & Linux Mint
Benchmark Scores https://hwbot.org/user/infrared
nvidia threadripper GTX1280Ti / Titan XPXp :slap: :D (just kidding about btw)
 

silentbogo

Moderator
Staff member
Joined
Nov 20, 2013
Messages
5,474 (1.44/day)
Location
Kyiv, Ukraine
System Name WS#1337
Processor Ryzen 7 3800X
Motherboard ASUS X570-PLUS TUF Gaming
Cooling Xigmatek Scylla 240mm AIO
Memory 4x8GB Samsung DDR4 ECC UDIMM
Video Card(s) Inno3D RTX 3070 Ti iChill
Storage ADATA Legend 2TB + ADATA SX8200 Pro 1TB
Display(s) Samsung U24E590D (4K/UHD)
Case ghetto CM Cosmos RC-1000
Audio Device(s) ALC1220
Power Supply SeaSonic SSR-550FX (80+ GOLD)
Mouse Logitech G603
Keyboard Modecom Volcano Blade (Kailh choc LP)
VR HMD Google dreamview headset(aka fancy cardboard)
Software Windows 11, Ubuntu 20.04 LTS
nvidia threadripper GTX1280Ti / Titan XPXp :slap: :D (just kidding about btw)
Every joke has a grain of truth :laugh:
'Cause it looks exactly like a Threadripper block diagram, which is not necessarily a bad thing.
 
Joined
May 8, 2016
Messages
1,741 (0.60/day)
System Name BOX
Processor Core i7 6950X @ 4,26GHz (1,28V)
Motherboard X99 SOC Champion (BIOS F23c + bifurcation mod)
Cooling Thermalright Venomous-X + 2x Delta 38mm PWM (Push-Pull)
Memory Patriot Viper Steel 4000MHz CL16 4x8GB (@3240MHz CL12.12.12.24 CR2T @ 1,48V)
Video Card(s) Titan V (~1650MHz @ 0.77V, HBM2 1GHz, Forced P2 state [OFF])
Storage WD SN850X 2TB + Samsung EVO 2TB (SATA) + Seagate Exos X20 20TB (4Kn mode)
Display(s) LG 27GP950-B
Case Fractal Design Meshify 2 XL
Audio Device(s) Motu M4 (audio interface) + ATH-A900Z + Behringer C-1
Power Supply Seasonic X-760 (760W)
Mouse Logitech RX-250
Keyboard HP KB-9970
Software Windows 10 Pro x64
I think it's cost effective (smaller dies = better yields), but what's the point?
That design is stupid at this point in time (more stacked RAM = a LOT more expensive package and end product).
Not to mention HBM 2.0 memory is in short supply, so all the yield gained from smaller GPU dies will be wasted on the stacked RAM.
 
Joined
Jan 8, 2017
Messages
8,933 (3.35/day)
System Name Good enough
Processor AMD Ryzen R9 7900 - Alphacool Eisblock XPX Aurora Edge
Motherboard ASRock B650 Pro RS
Cooling 2x 360mm NexXxoS ST30 X-Flow, 1x 360mm NexXxoS ST30, 1x 240mm NexXxoS ST30
Memory 32GB - FURY Beast RGB 5600 Mhz
Video Card(s) Sapphire RX 7900 XT - Alphacool Eisblock Aurora
Storage 1x Kingston KC3000 1TB 1x Kingston A2000 1TB, 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) LG UltraGear 32GN650-B + 4K Samsung TV
Case Phanteks NV7
Power Supply GPS-750C
Yep, it's the way to go: cram more silicon into a package without the increased cost and difficulty of manufacturing a massive die. The bonus for GPUs is that you can dramatically increase shading power without having to worry as much about improving power efficiency, since you can run the smaller dies at lower clocks. It's the only way to buy some more time until a big revolution happens in chip manufacturing (like using materials other than silicon, or something else entirely).

It's not just AMD that will do this; Nvidia and Intel will be obliged to do it too at some point. The first to do it could have a huge advantage.

I think it's cost effective (smaller dies = better yields), but what's the point?

You already answered your own question: smaller dies = better yields. The memory isn't what's relevant here; you can use conventional memory too and still reap the advantages of this approach.
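
To put rough numbers on the "smaller dies = better yields" point, here is a sketch using the textbook Poisson yield model; the die areas and defect density are illustrative assumptions, not figures from the paper.

```python
import math

def poisson_yield(die_area_mm2: float, defect_density_per_mm2: float) -> float:
    """Classic Poisson yield model: Y = exp(-A * D0)."""
    return math.exp(-die_area_mm2 * defect_density_per_mm2)

D0 = 0.001          # assumed defect density, defects per mm^2 (illustrative)
big_die = 600.0     # one large monolithic die, mm^2 (illustrative)
small_die = 150.0   # one of four GPU modules, mm^2 (illustrative)

y_big = poisson_yield(big_die, D0)
y_small = poisson_yield(small_die, D0)

# Good silicon per wafer scales with yield, so compare the fraction of usable
# area for one 600 mm^2 die vs. four independently tested 150 mm^2 modules.
print(f"Yield of one {big_die:.0f} mm^2 die:      {y_big:.1%}")
print(f"Yield of one {small_die:.0f} mm^2 module: {y_small:.1%}")
print(f"Usable-silicon ratio (4 small vs 1 big): {y_small / y_big:.2f}x")
```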
 
Last edited:
Joined
May 8, 2016
Messages
1,741 (0.60/day)
System Name BOX
Processor Core i7 6950X @ 4,26GHz (1,28V)
Motherboard X99 SOC Champion (BIOS F23c + bifurcation mod)
Cooling Thermalright Venomous-X + 2x Delta 38mm PWM (Push-Pull)
Memory Patriot Viper Steel 4000MHz CL16 4x8GB (@3240MHz CL12.12.12.24 CR2T @ 1,48V)
Video Card(s) Titan V (~1650MHz @ 0.77V, HBM2 1GHz, Forced P2 state [OFF])
Storage WD SN850X 2TB + Samsung EVO 2TB (SATA) + Seagate Exos X20 20TB (4Kn mode)
Display(s) LG 27GP950-B
Case Fractal Design Meshify 2 XL
Audio Device(s) Motu M4 (audio interface) + ATH-A900Z + Behringer C-1
Power Supply Seasonic X-760 (760W)
Mouse Logitech RX-250
Keyboard HP KB-9970
Software Windows 10 Pro x64
You already answered your own question: smaller dies = better yields.
1) HBM 2.0 is expensive and scarce right now, and GDDR5X/6 isn't economically capable of feeding four proper GPUs on the same package. The memory bandwidth simply won't be there (128-bit per GPU = 512-bit overall for a four-module MCM setup; rough numbers sketched below).
2) The package will have to get a LOT bigger (and it's kinda at its limit right now).
3) Useful package space will go down (because of the interconnect), and multiple chips will use more space than a single chip (assuming transistor density is equal/similar for both MCM and a single die).
4) Four GTX 1050 Tis on a single package? Not the best idea from a performance standpoint.
5) Writing software for it will be a nightmare (Ryzen game performance ain't got nothing on frame pacing).
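
The rough bandwidth math behind point 1, assuming roughly GTX 1080-era signaling rates; the per-pin and per-stack figures below are assumptions, not numbers from the paper or the post.

```python
# Back-of-envelope memory bandwidth for a four-module package.
# Signaling rates are assumptions, roughly GTX 1080-era GDDR5X and first-generation HBM2.

MODULES = 4

# GDDR5X option: 128-bit bus per module at an assumed 10 Gb/s per pin.
GDDR5X_GBPS_PER_PIN = 10
BUS_WIDTH_PER_MODULE = 128
gddr5x_total = MODULES * BUS_WIDTH_PER_MODULE * GDDR5X_GBPS_PER_PIN / 8   # GB/s
print(f"4 x 128-bit GDDR5X @ {GDDR5X_GBPS_PER_PIN} Gb/s: {gddr5x_total:.0f} GB/s total")

# HBM2 option: one stack per module at an assumed ~256 GB/s per stack.
HBM2_GBS_PER_STACK = 256
hbm2_total = MODULES * HBM2_GBS_PER_STACK
print(f"4 x HBM2 stacks @ {HBM2_GBS_PER_STACK} GB/s:     {hbm2_total} GB/s total")
```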
 
Joined
Jan 8, 2017
Messages
8,933 (3.35/day)
System Name Good enough
Processor AMD Ryzen R9 7900 - Alphacool Eisblock XPX Aurora Edge
Motherboard ASRock B650 Pro RS
Cooling 2x 360mm NexXxoS ST30 X-Flow, 1x 360mm NexXxoS ST30, 1x 240mm NexXxoS ST30
Memory 32GB - FURY Beast RGB 5600 Mhz
Video Card(s) Sapphire RX 7900 XT - Alphacool Eisblock Aurora
Storage 1x Kingston KC3000 1TB 1x Kingston A2000 1TB, 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) LG UltraGear 32GN650-B + 4K Samsung TV
Case Phanteks NV7
Power Supply GPS-750C
1) HBM 2.0 is expensive and scarce right now, and GDDR5X/6 isn't economically capable of feeding four proper GPUs on the same package. The memory bandwidth simply won't be there (128-bit per GPU = 512-bit overall for a four-module MCM setup).
2) The package will have to get a LOT bigger (and it's kinda at its limit right now).
3) Useful package space will go down (because of the interconnect), and multiple chips will use more space than a single chip (assuming transistor density is equal/similar for both MCM and a single die).
4) Four GTX 1050 Tis on a single package? Not the best idea from a performance standpoint.
5) Writing software for it will be a nightmare (Ryzen game performance ain't got nothing on frame pacing).

1) You don't know that; memory bandwidth is what's important, and I don't see why conventional memory wouldn't be able to "drive" multiple dies while HBM would. Of course it's not going to perform as well, but honestly it's more of a proof-of-concept thing. I'm sure this is ultimately designed with HBM in mind, and the price will come down eventually. You're focusing too much on the memory; that's not the point of this.
2) You're confusing things: it's the die sizes that are the limit, not the package. That's the whole idea, to not be limited by processor size. Putting many dies on an interposer is certainly not going to be an easy task, but I don't see it as impractical. The main problem is apparently fragility; the Fury X was notorious for how easy it was to damage the die while removing the cooler.
3) Again, it's not the size that's the issue.
4) I don't understand what you mean by that. 4x GTX 1050 Ti dies = something between a GTX 1080 and a 1080 Ti in terms of shading power (give or take), with whatever clocks and power consumption such a chip would have (somewhere between a 1080 and a 1080 Ti; see the core-count sketch below). BUT here's the advantage: it can cost LESS to manufacture. How is that not the best idea? GPUs are built around high parallelism and near-perfect scalability as the main design paradigm; it doesn't matter that those shaders aren't on the same die nearly as much as you think it does. And the goal is to use chips that are already at the limit of the manufacturing process to get something faster when you simply couldn't do it otherwise.
5) No it will not. You're again making the mistake of comparing CPUs with GPUs, two things that couldn't be more different. The point of all this is to use the multiple dies as a single pool of resources, like you normally would; otherwise why bother? Don't you see the single SYS+I/O block in the diagram connected to all the GPMs? It suggests that by design there is one initial level of instruction pipelining and scheduling: think of it as a global GigaThread Engine, which is what Nvidia calls it in their architectures.

The only inherent design flaw of this is latency, but it can be mitigated to an extent by an efficient way of connecting the dies and by bigger caches.

To add to all of this, gaming is, you guessed it, NOT the main thing that will benefit from this. It's compute that will benefit greatly. And this isn't just for GPUs; CPUs can benefit to a certain degree too (well, really they already do), as can any other form of ASIC.
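
For point 4, the core-count arithmetic looks roughly like this; the CUDA core counts are the public Pascal specs, and clocks, memory and scaling losses are ignored.

```python
# Raw CUDA-core comparison behind the "4x GTX 1050 Ti" example.
# Core counts are the public Pascal specs; clocks, memory and scaling losses are ignored.

cores = {
    "GTX 1050 Ti": 768,
    "GTX 1080":    2560,
    "GTX 1080 Ti": 3584,
}

four_small_dies = 4 * cores["GTX 1050 Ti"]   # 3072 CUDA cores
print(f"4x GTX 1050 Ti dies: {four_small_dies} cores")
print(f"GTX 1080:            {cores['GTX 1080']} cores")
print(f"GTX 1080 Ti:         {cores['GTX 1080 Ti']} cores")
# 3072 lands between the 1080 and the 1080 Ti, which is the point being made above.
```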
 
Last edited:

Aquinus

Resident Wat-man
Joined
Jan 28, 2012
Messages
13,147 (2.94/day)
Location
Concord, NH, USA
System Name Apollo
Processor Intel Core i9 9880H
Motherboard Some proprietary Apple thing.
Memory 64GB DDR4-2667
Video Card(s) AMD Radeon Pro 5600M, 8GB HBM2
Storage 1TB Apple NVMe, 4TB External
Display(s) Laptop @ 3072x1920 + 2x LG 5k Ultrafine TB3 displays
Case MacBook Pro (16", 2019)
Audio Device(s) AirPods Pro, Sennheiser HD 380s w/ FIIO Alpen 2, or Logitech 2.1 Speakers
Power Supply 96w Power Adapter
Mouse Logitech MX Master 3
Keyboard Logitech G915, GL Clicky
Software MacOS 12.1
If they want to keep clocks up and continue sucking down power, they need to spread the heat out, which means pulling the GPU apart, and this is the way to do it. Even if HBM isn't part of the picture, physically spreading the compute resources out will help scaling. Two dies on one interposer with a massive IMC in the middle to drive GDDR5X isn't a bad idea.
 