
MCM-GPU: Multi-Chip-Module GPUs for Continued Performance Scalability

[Image: mcm_0.png]


Historically, improvements in GPU-based high performance computing have been tightly coupled to transistor scaling. As Moore's law slows down, and the number of transistors per die no longer grows at historical rates, the performance curve of single monolithic GPUs will ultimately plateau. However, the need for higher performing GPUs continues to exist in many domains. To address this need, in this paper we demonstrate that package-level integration of multiple GPU modules to build larger logical GPUs can enable continuous performance scaling beyond Moore's law. Specifically, we propose partitioning GPUs into easily manufacturable basic GPU Modules (GPMs), and integrating them on package using high bandwidth and power efficient signaling technologies. We lay out the details and evaluate the feasibility of a basic Multi-Chip-Module GPU (MCM-GPU) design. We then propose three architectural optimizations that significantly improve GPM data locality and minimize the sensitivity on inter-GPM bandwidth. Our evaluation shows that the optimized MCM-GPU achieves 22.8% speedup and 5x inter-GPM bandwidth reduction when compared to the basic MCM-GPU architecture. Most importantly, the optimized MCM-GPU design is 45.5% faster than the largest implementable monolithic GPU, and performs within 10% of a hypothetical (and unbuildable) monolithic GPU. Lastly we show that our optimized MCM-GPU is 26.8% faster than an equally equipped Multi-GPU system with the same total number of SMs and DRAM bandwidth.

Source: http://research.nvidia.com/publication/2017-06_MCM-GPU:-Multi-Chip-Module-GPUs
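A quick back-of-the-envelope that puts the abstract's relative figures on one scale. Treating each percentage as a simple speedup ratio against its stated baseline is an assumption; the paper reports averages over a workload suite.

```python
# Back-of-the-envelope: putting the abstract's relative figures on one scale.
# Assumption: each percentage is a simple speedup ratio against its stated
# baseline; the largest buildable monolithic GPU is normalized to 1.0.

monolithic_buildable = 1.00                      # largest implementable monolithic GPU
optimized_mcm = monolithic_buildable * 1.455     # optimized MCM-GPU is 45.5% faster than it
basic_mcm = optimized_mcm / 1.228                # optimized is 22.8% faster than basic MCM-GPU
multi_gpu = optimized_mcm / 1.268                # optimized is 26.8% faster than the multi-GPU setup
hypothetical_monolithic = optimized_mcm / 0.90   # optimized is "within 10%" of it (upper bound)

for name, perf in [
    ("multi-GPU (same SMs / DRAM BW)", multi_gpu),
    ("basic MCM-GPU", basic_mcm),
    ("largest buildable monolithic", monolithic_buildable),
    ("optimized MCM-GPU", optimized_mcm),
    ("hypothetical monolithic (<=)", hypothetical_monolithic),
]:
    print(f"{name:32s} {perf:.2f}x")
```

On this reading, the basic MCM-GPU lands around 1.19x the biggest buildable monolithic GPU, the optimized one around 1.46x, and the unbuildable monolithic design would top out near 1.62x.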
 
I wonder if this is why their SLI support has been slipping. They're working on what are basically multi-GPU, single-package setups instead.
 
this would be awesome
 
nvidia threadripper GTX1280Ti / Titan XPXp :slap: :D (just kidding, btw)
 
nvidia threadripper GTX1280Ti / Titan XPXp :slap: :D (just kidding, btw)
Every joke has a grain of truth :laugh:
'Cause it looks exactly like a Threadripper block diagram, which is not necessarily a bad thing.
 
I think it's cost effective (smaller dies = better yields), but what's the point?
That design is stupid at this point in time (more stacked RAM = a LOT more expensive package and end product).
Not to mention HBM 2.0 memory is in short supply (so all the yield gained from smaller GPU dies will be wasted on stacked RAM).
 
Yep, it's the way to go: cram more silicon into a package without the increased cost and difficulty of manufacturing a massive die. The bonus for GPUs is that you can dramatically increase shading power without having to worry as much about improving power efficiency, since you can use lower clocks on smaller dies (rough numbers in the sketch below). It's the only way to win some more time until a big revolution happens in chip manufacturing (like using materials other than silicon, or something else).

It's not just AMD that will do this; Nvidia and Intel will be obligated to do it too at some point. The first to do it can have a huge advantage.
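A minimal sketch of the lower-clocks trade-off, assuming dynamic power scales roughly as C·V²·f and that voltage can be dropped more or less in proportion to clock within the normal operating range. The 4x-units / 0.7x-clock split is an illustrative assumption, not a measured design point.

```python
# Toy model of "more, slower silicon" vs. one big fast die.
# Assumptions: dynamic power ~ units * V^2 * f, voltage tracks frequency,
# and throughput scales ideally with parallel units times clock.

def relative_power(units, clock, voltage):
    # Switching capacitance grows with the number of active execution units.
    return units * voltage**2 * clock

def relative_throughput(units, clock):
    # Idealized: throughput = parallel units * clock.
    return units * clock

# Baseline: one big die at full clock and voltage.
base_power = relative_power(units=1.0, clock=1.0, voltage=1.0)
base_perf = relative_throughput(units=1.0, clock=1.0)

# MCM-style: 4x the execution units, clocked (and volted) at ~70%.
mcm_power = relative_power(units=4.0, clock=0.7, voltage=0.7)
mcm_perf = relative_throughput(units=4.0, clock=0.7)

print(f"throughput: {mcm_perf / base_perf:.2f}x")                            # ~2.8x
print(f"power:      {mcm_power / base_power:.2f}x")                          # ~1.37x
print(f"perf/watt:  {(mcm_perf / mcm_power) / (base_perf / base_power):.2f}x")  # ~2.0x
```

Even with these toy numbers, spending area instead of clocks buys roughly 2x perf/W, which is the whole appeal of the approach.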

I think it's cost effective (smaller dies = better yields), but what's the point?

You already answered your question: smaller dies = better yields. It's not the memory that's relevant; you can use conventional memory too and still reap the advantages of this approach (rough yield math below).
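To put rough numbers on the yield point: a quick sketch using the simple Poisson defect-yield model Y = exp(-D·A). The defect density and die areas below are illustrative assumptions, not foundry data.

```python
import math

# Poisson defect-yield model: Y = exp(-D * A).
# Assumed defect density and die areas; one ~600 mm^2 monolithic die
# vs. four ~150 mm^2 GPU modules (GPMs).

DEFECT_DENSITY = 0.1  # defects per cm^2 (assumed)

def poisson_yield(area_mm2, d0=DEFECT_DENSITY):
    area_cm2 = area_mm2 / 100.0
    return math.exp(-d0 * area_cm2)

big_die = 600.0    # mm^2, one monolithic GPU (assumed)
small_die = 150.0  # mm^2, one GPM (assumed)

y_big = poisson_yield(big_die)
y_small = poisson_yield(small_die)

print(f"600 mm^2 monolithic die yield: {y_big:.1%}")   # ~55%
print(f"150 mm^2 GPM yield:            {y_small:.1%}") # ~86%

# Good GPMs can be binned from anywhere on the wafer, so the fraction of
# wafer area that ends up in sellable products tracks the per-die yield.
print(f"usable-silicon advantage: {y_small / y_big:.2f}x")
```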
 
You already answered your question: smaller dies = better yields.
1) HBM 2.0 is expensive and scarce right now, and GDDR5X/6 isn't economically capable of driving four proper GPUs on the same package.
The memory bandwidth simply won't be there (128-bit per GPU = 512-bit overall for a four-GPU MCM setup; back-of-the-envelope numbers in the sketch below).
2) The package will have to get a LOT bigger (and it's kind of at its limit right now).
3) Useful package space will go down (because of the interconnect), and multiple chips will use more space than a single chip (assuming transistor density is equal/similar for both MCM and single die).
4) Four GTX 1050 Tis on a single package?
Not the best idea from a performance standpoint.
5) Making software will be a nightmare (Ryzen game performance ain't got nothing on frame pacing).
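For reference, rough aggregate bandwidth for the bus widths being discussed, using typical 2017-era pin rates (GDDR5X around 10 Gb/s per pin, HBM2 around 2 Gb/s per pin on a 1024-bit interface per stack). The exact memory configuration of a hypothetical four-module package is of course an assumption.

```python
# Back-of-the-envelope memory bandwidth for the "128-bit per GPU" objection.

def bandwidth_gbs(bus_width_bits, pin_rate_gbps):
    return bus_width_bits * pin_rate_gbps / 8.0  # GB/s

# Four modules, each with a 128-bit GDDR5X interface at 10 Gb/s per pin.
per_module_gddr = bandwidth_gbs(128, 10.0)   # 160 GB/s per module
total_gddr = 4 * per_module_gddr             # 640 GB/s aggregate

# For comparison: one HBM2 stack per module at 2.0 Gb/s per pin.
per_stack_hbm2 = bandwidth_gbs(1024, 2.0)    # 256 GB/s per stack
total_hbm2 = 4 * per_stack_hbm2              # 1024 GB/s aggregate

print(f"GDDR5X, 4 x 128-bit: {per_module_gddr:.0f} GB/s per module, {total_gddr:.0f} GB/s total")
print(f"HBM2,   4 x 1 stack: {per_stack_hbm2:.0f} GB/s per stack,  {total_hbm2:.0f} GB/s total")
```

For scale, a GTX 1080 Ti's 352-bit GDDR5X interface delivers roughly 484 GB/s, so 640 GB/s aggregate isn't unreasonable; whether it's enough per module depends on how big each module is.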
 
1) HBM 2.0 is expensive and scarce right now, and GDDR5X/6 isn't economically capable of driving four proper GPUs on the same package.
The memory bandwidth simply won't be there (128-bit per GPU = 512-bit overall for a four-GPU MCM setup).
2) The package will have to get a LOT bigger (and it's kind of at its limit right now).
3) Useful package space will go down (because of the interconnect), and multiple chips will use more space than a single chip (assuming transistor density is equal/similar for both MCM and single die).
4) Four GTX 1050 Tis on a single package?
Not the best idea from a performance standpoint.
5) Making software will be a nightmare (Ryzen game performance ain't got nothing on frame pacing).

1) You don't know that; memory bandwidth is what's important, and I don't see why conventional memory wouldn't be able to "drive" multiple dies while HBM would. Of course it's not going to perform as well, but honestly it's more of a proof-of-concept thing. I'm sure this is ultimately designed with HBM in mind, and the price will come down eventually. You are focusing too much on memory; that's not the point of this.
2) You are confusing things: it's the die sizes that are the limit, not the package. That's the whole idea of this: to not be limited by the processor size. Putting many dies on an interposer is certainly not going to be an extremely easy task, but I don't see it as impractical. The main problem is apparently fragility; the Fury X was notorious for how easy it was to damage the die while removing the cooler.
3) Again, it's not the size that's the issue.
4) I do not understand what you mean by that. 4x GTX 1050 Ti dies = something between a GTX 1080 and a 1080 Ti in terms of shading power (give or take; rough core counts in the sketch below), with whatever clocks and power consumption such a chip would have (between a 1080 and a 1080 Ti), BUT here's the advantage: it can cost LESS to manufacture. How is that not the best idea? GPUs are meant for high parallelism and near-perfect scalability as the main design paradigm; it doesn't matter that those shaders are not on the same die nearly as much as you think it does. And the goal is to use chips that are already at the limit of the manufacturing process to get something faster when you simply couldn't do it otherwise.
5) No it will not; you are again making the mistake of comparing CPUs with GPUs, two things that couldn't be more different. The point of all this is to use the multiple dies as a single pool of resources, like you would normally do; otherwise why bother? Don't you see the single SYS+I/O block in the diagram connected to all the GPU modules? It suggests that by design there is one initial level of instruction pipelining and scheduling; think of it as a global GigaThread engine, which is what Nvidia calls it in their architectures.
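For point 4, the rough core-count math, using the published CUDA core counts for these Pascal parts (clocks and memory are ignored entirely, which is a simplification):

```python
# Rough shader-count comparison behind the "4x GTX 1050 Ti dies" point.
cuda_cores = {
    "GTX 1050 Ti": 768,
    "GTX 1080":    2560,
    "GTX 1080 Ti": 3584,
}

four_small_dies = 4 * cuda_cores["GTX 1050 Ti"]   # 3072 cores

print(f"4 x GTX 1050 Ti dies: {four_small_dies} CUDA cores")
print(f"GTX 1080:             {cuda_cores['GTX 1080']} CUDA cores")
print(f"GTX 1080 Ti:          {cuda_cores['GTX 1080 Ti']} CUDA cores")
# 3072 lands between the 1080 and the 1080 Ti, as claimed above.
```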

The only inherent design flaw of this is latency, but it can be mitigated to an extent by an efficient way of connecting the dies and by bigger caches.

To add to all of this: gaming is, you guessed it, NOT the main thing that will benefit from this. It's compute that will benefit greatly. And this is not just for GPUs; CPUs can benefit to a certain degree too (well, really, they already do), as can any other form of ASIC.
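A toy illustration of the "single pool of resources" idea in point 5, and of why locality matters for the caches mentioned just above: one global front-end hands out the thread blocks of a kernel launch across the GPU modules, so software still sees one logical GPU. The module count and both policies are illustrative assumptions, not how NVIDIA's GigaThread engine is actually implemented.

```python
# Toy distribution of a kernel's thread blocks (CTAs) across GPU modules.
NUM_GPMS = 4  # assumed number of GPU modules in the package

def round_robin(num_blocks):
    # Blocks interleaved across modules: spreads load, poor data locality.
    return {g: [b for b in range(num_blocks) if b % NUM_GPMS == g]
            for g in range(NUM_GPMS)}

def chunked(num_blocks):
    # Contiguous ranges of blocks per module: neighbouring blocks (which often
    # touch neighbouring data) stay on the same module's local DRAM and caches.
    per_gpm = (num_blocks + NUM_GPMS - 1) // NUM_GPMS
    return {g: list(range(g * per_gpm, min((g + 1) * per_gpm, num_blocks)))
            for g in range(NUM_GPMS)}

print(round_robin(12))  # {0: [0, 4, 8], 1: [1, 5, 9], 2: [2, 6, 10], 3: [3, 7, 11]}
print(chunked(12))      # {0: [0, 1, 2], 1: [3, 4, 5], 2: [6, 7, 8], 3: [9, 10, 11]}
```

The chunked variant is the kind of locality-aware assignment that keeps traffic on a module's own memory and reduces pressure on the inter-module links.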
 
If they want to keep clocks up and be able to continue sucking down power, they need to spread the heat out, which means pulling the GPU apart, and this is the way to do it. Even if HBM isn't part of the picture, physically spreading computation resources out will help scaling. Two dies on one interposer with a massive IMC in the middle to drive GDDR5X isn't a bad idea.
 