Wednesday, July 5th 2017

NVIDIA Laying Groundwork for Multi-Chip-Module GPUs

Multi-Chip-Module accelerators are nothing new, really. Though there are earlier implementations, when it comes to recognizable hardware most of us have already heard of, these solutions harken back to Intel's Kentsfield and Yorkfield quad-core processors (65 nm and 45 nm parts, respectively, for the LGA 775 package). However, the key issue with this kind of approach is having a powerful, performant-enough interconnect that allows the different dies in each module to really "talk" to each other and work in tandem. More recently, AMD has demonstrated the advantages of a true MCM (Multi-Chip-Module) approach with its Ryzen CPUs. These result from the development of a modular CPU architecture tied together by a capable interconnect (Infinity Fabric), which has allowed AMD to keep die size to a minimum (for a true 8-core design, at least), while enabling the company to profitably scale up to 16 cores (two dies) with Threadripper and 32 cores (four dies) with EPYC.

AMD has already hinted that its still-distant Navi architecture (I mean, we're still waiting for Vega) will bring a true MCM design to GPUs. Vega already supports AMD's Infinity Fabric interconnect as well, paving the way not only for future APU designs from the company, but also for MCM GPUs leveraging the same technology. And NVIDIA itself seems to be making strides towards an MCM-enabled future, looking to abandon the monolithic die approach it has relied on for a long time now.
NVIDIA believes a modular approach is the best technically feasible answer to a stagnating Moore's Law. CPU and GPU performance has leaned heavily on increasing transistor counts and density, and their development and, more importantly, production deployment are slowing down (the curve that seemed exponential is actually sigmoidal, eh!). In fact, the biggest die size achievable with today's technology is currently estimated at around 800 mm². The point is driven home when we consider that the company's Tesla V100 comes in at a staggering 815 mm², already straining that limit. This fact, coupled with the industry's ever-increasing appetite for performance, leads us to believe that the GV100 will be one of NVIDIA's last monolithic GPUs (there is still a chance that 7 nm manufacturing will buy the company a little more time to develop a true MCM solution, but I would say the odds are that NVIDIA's next product already manifests such a design).
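To get a feel for why that roughly 800 mm² ceiling matters, here is a back-of-the-envelope sketch using the textbook Poisson yield model, Y = exp(-A * D0). The defect density of 0.1 defects/cm² is an assumed, purely illustrative figure (it does not come from NVIDIA's paper), but it shows how strongly yields favor smaller dies:

/* Back-of-the-envelope die yield comparison under a simple Poisson yield model.
   The defect density below is an assumption chosen only for illustration. */
#include <math.h>
#include <stdio.h>

static double poisson_yield(double area_mm2, double defects_per_cm2)
{
    /* Y = exp(-A * D0), with the die area converted from mm^2 to cm^2. */
    return exp(-(area_mm2 / 100.0) * defects_per_cm2);
}

int main(void)
{
    const double d0 = 0.1;                        /* assumed defects per cm^2     */
    double big_die = poisson_yield(800.0, d0);    /* one reticle-limit-sized die  */
    double module  = poisson_yield(200.0, d0);    /* one quarter-sized GPU module */

    printf("800 mm^2 monolithic die yield: %.1f%%\n", big_die * 100.0);
    printf("200 mm^2 module yield:         %.1f%%\n", module * 100.0);
    /* Small modules can be tested and binned individually before packaging, so the
       effective yield of a four-module MCM is far better than one huge die. */
    return 0;
}

Under those assumptions, roughly 45% of reticle-sized dies come out defect-free versus about 82% of quarter-sized modules, which is the economic heart of the MCM argument.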
In a paper published by the company, NVIDIA itself says that the way forward is the integration of multiple GPU processing modules in a single package, thus allowing the GPU world to achieve what Ryzen and its Threadripper and EPYC older brothers are already achieving: scaling performance with small dies, and therefore higher yields. Specifically, NVIDIA says that they "(...) propose partitioning GPUs into easily manufacturable basic GPU Modules (GPMs), and integrating them on package using high bandwidth and power efficient signaling technologies." In its white paper, NVIDIA says that "the optimized MCM-GPU design is 45.5% faster than the largest implementable monolithic GPU, and performs within 10% of a hypothetical (and unbuildable) monolithic GPU (...)", and that their "optimized MCM-GPU is 26.8% faster than an equally equipped Multi-GPU system with the same total number of SMs and DRAM bandwidth."
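For a quick sense of how those quoted figures relate to one another, the small sketch below simply normalizes everything against the largest buildable monolithic GPU; it uses nothing beyond the percentages quoted above:

/* Pure arithmetic on the percentages quoted from NVIDIA's white paper;
   no data beyond the quotes themselves. */
#include <stdio.h>

int main(void)
{
    double buildable = 1.000;               /* largest implementable monolithic GPU */
    double mcm       = buildable * 1.455;   /* MCM-GPU is "45.5% faster" than that  */
    double multi_gpu = mcm / 1.268;         /* MCM is "26.8% faster" than multi-GPU */
    /* The MCM performs "within 10%" of the unbuildable monolithic GPU, so that
       hypothetical chip can be at most mcm / 0.9, i.e. ~1.62x the buildable one. */
    double unbuildable_cap = mcm / 0.9;

    printf("MCM-GPU:                      %.3f\n", mcm);
    printf("Equivalent multi-GPU system:  %.3f\n", multi_gpu);
    printf("Unbuildable monolithic (max): %.3f\n", unbuildable_cap);
    return 0;
}

Read that way, even the hypothetical unbuildable monolithic GPU would top the largest manufacturable one by at most around 60%, while the MCM design captures most of that headroom.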
These developments showcase engineering's ingenuity and drive to improve, and look extremely promising for the companies involved: abandoning the monolithic design philosophy and scaling with a variable number of smaller dies should allow for greater yields and better performance scaling, keeping the high-performance market's needs sated and the tech companies' bottom lines a little better off than they (mostly) already are. Go ahead and follow the source NVIDIA link for the white paper; it's a very interesting read.
Sources: NVIDIA MCM Paper, Radar.O'Reilly.com

49 Comments on NVIDIA Laying Groundwork for Multi-Chip-Module GPUs

#1
FordGT90Concept
"I go fast!1!11!1!"
I think AMD was a little too obvious talking about Navi. Of course NVIDIA is going to try to head them off.

That said, I'm all for MCM because it means higher yields and higher yields means more bang for the buck. My concern/fear is that they'll run into the same problems as SLI/Crossfire where support is spotty.
Posted on Reply
#2
XiGMAKiD
I don't think it will be as spotty as SLI, more like Ryzen module spotty
Posted on Reply
#3
newtekie1
Semi-Retired Folder
Raevenlord: Though there are earlier implementations, when it comes to recognizable hardware most of us have already heard of, these solutions harken back to Intel's Kentsfield and Yorkfield quad-core processors (65 nm and 45 nm parts, respectively, for the LGA 775 package).
Even earlier than that, the Pentium D Smithfield and Presler processors were MCM.
Posted on Reply
#4
DeathtoGnomes
FordGT90Concept: I think AMD was a little too obvious talking about Navi. Of course NVIDIA is going to try to head them off.

That said, I'm all for MCM because it means higher yields and higher yields means more bang for the buck. My concern/fear is that they'll run into the same problems as SLI/Crossfire where support is spotty.
support was spotty cuz even the support/CS was clueless how to fix anything
Posted on Reply
#5
cryohellinc
And this means only one thing: Vulkan / DX12 and anything else that comes in the future will benefit from lots of CPU cores and multiple GPUs.
Posted on Reply
#6
TheoneandonlyMrK
FordGT90Concept: I think AMD was a little too obvious talking about Navi. Of course NVIDIA is going to try to head them off.

That said, I'm all for MCM because it means higher yields and higher yields means more bang for the buck. My concern/fear is that they'll run into the same problems as SLI/Crossfire where support is spotty.
If you do the maths, big Volta is never seeing consumer land; its die size is unworkable at consumer volumes and prices.
Smaller nodes will roughly double the cost per chip due to doubling the number of manufacturing steps, at least until EUV hits mass manufacturing, which likely isn't soon (full nine-step EUV production, not merely a step or two done with it). Pellicles that are up to the job have yet to be made; while the other issues are on their way to resolution, they still haven't settled on an appropriate material that can withstand EUV. This is a four-year issue at worst.
All this means NVIDIA and AMD are going to have to make small dies work.
Going to 7 nm (and they will) would require massive price rises to accommodate the production cost.
They're all going the way of EPYC for a time: small chips in a cluster. It's the only way forward.
Question is who realised it first, AMD or NVIDIA; EPYC says AMD, but NVIDIA could have had this up their sleeve for a while too.
Posted on Reply
#7
Joss
theoneandonlymrk: Question is who realised it first, AMD or NVIDIA; EPYC says AMD, but NVIDIA could have had this up their sleeve for a while too
Question is: will there be a multi vega? Does AMD have that up their sleeve?
It would be epic.
Posted on Reply
#8
dir_d
Joss: Question is: will there be a multi vega? Does AMD have that up their sleeve?
It would be epic.
Infinity Fabric
Posted on Reply
#9
jabbadap
newtekie1: Even earlier than that, the Pentium D Smithfield and Presler processors were MCM.
And even earlier than that, Pentium Pro from 1995.
Posted on Reply
#10
Steevo
FordGT90Concept: I think AMD was a little too obvious talking about Navi. Of course NVIDIA is going to try to head them off.

That said, I'm all for MCM because it means higher yields and higher yields means more bang for the buck. My concern/fear is that they'll run into the same problems as SLI/Crossfire where support is spotty.
The biggest difference is going to be in the hardware dispatch controller. If I glued four Polaris chips together with a single chip fast enough and smart enough to keep all four busy and fed, and able to communicate with each of them to assemble the output, it would perform like a card 4x as powerful. That could never quite happen, as each step adds latency, and latency is essentially the enemy; the only (long) way around it is to have the GPU dies run at a set frequency, let's say 1 GHz, and the controller run at 4.4 GHz, allowing it to decode and fetch for cores 1-4 as fast or slightly faster while receiving the computed results on the next clock cycle.
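A minimal, purely hypothetical sketch of that idea in C, just to make the "keep all four busy and fed" part concrete; none of these structures or names reflect real NVIDIA hardware, and the genuinely hard parts (latency, coherency, assembling the output) are exactly what a toy like this glosses over:

/* Hypothetical front-end controller handing work batches to four GPU modules.
   Everything here is illustrative; no real hardware or driver API is modeled. */
#include <stdio.h>

#define NUM_MODULES 4

typedef struct {
    int id;
    int queued_batches;   /* work currently waiting on this module */
} gpu_module;

/* Pick the least-loaded module. A real controller would also have to weigh data
   locality, cache state and interconnect latency, which is where the pain lives. */
static int pick_module(const gpu_module mods[])
{
    int best = 0;
    for (int i = 1; i < NUM_MODULES; ++i)
        if (mods[i].queued_batches < mods[best].queued_batches)
            best = i;
    return best;
}

int main(void)
{
    gpu_module mods[NUM_MODULES] = { {0, 0}, {1, 0}, {2, 0}, {3, 0} };

    for (int batch = 0; batch < 12; ++batch) {
        int target = pick_module(mods);
        mods[target].queued_batches++;
        printf("batch %2d -> module %d\n", batch, target);
    }
    return 0;
}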

I would wager that MCM implementations won't allow or handle overclocking, and that they are busy finding ways to time everything through on-die cache, with cache misses being one of the holdups.

**I'm on mobile and between autocorrect, and fat fingers.....
Posted on Reply
#11
Basard
Yeah.... laying the groundwork alright.... Cuz AMD didn't already lay it down with Fiji... (an interposer is all you need)
Posted on Reply
#12
Blueberries
If NVIDIA maintains their dominance in Perf/Watt over AMD, an MCM GPU would in fact be groundbreaking.
Posted on Reply
#14
FordGT90Concept
"I go fast!1!11!1!"
Steevo: The biggest difference is going to be in the hardware dispatch controller.
That's the problem though: I don't think these appear to the OS as one GPU, they appear as four. Unless they found a revolutionary way to merge the capabilities of them all so they mimic one monolithic GPU, this is mostly only going to appeal to the compute market.

Direct3D 12 and Vulkan allow draw and compute calls to be sent to multiple GPUs easily, the problem is that how to execute it is not intrinsic to their design. Software developers have to go the extra mile to enable multi-GPU capabilities and the lion's share of them don't.
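For the record, this is roughly what that "extra mile" looks like today: a minimal C sketch against Vulkan 1.1's device-group API, which is how linked multi-GPU is exposed to applications (error handling and the actual work-splitting, the genuinely hard part, are left out):

/* Enumerate Vulkan device groups: the explicit multi-GPU path applications
   must opt into. Sketch only; no rendering or work distribution is shown. */
#include <stdio.h>
#include <vulkan/vulkan.h>

int main(void)
{
    /* Boilerplate instance creation targeting Vulkan 1.1, where device groups are core. */
    VkApplicationInfo app = { 0 };
    app.sType = VK_STRUCTURE_TYPE_APPLICATION_INFO;
    app.apiVersion = VK_API_VERSION_1_1;

    VkInstanceCreateInfo ici = { 0 };
    ici.sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;
    ici.pApplicationInfo = &app;

    VkInstance instance;
    if (vkCreateInstance(&ici, NULL, &instance) != VK_SUCCESS)
        return 1;

    /* Ask how many device groups exist, then fetch their properties. */
    uint32_t count = 0;
    vkEnumeratePhysicalDeviceGroups(instance, &count, NULL);
    if (count > 8) count = 8;

    VkPhysicalDeviceGroupProperties groups[8];
    for (uint32_t i = 0; i < count; ++i) {
        groups[i].sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_GROUP_PROPERTIES;
        groups[i].pNext = NULL;
    }
    vkEnumeratePhysicalDeviceGroups(instance, &count, groups);

    /* A GPU that presents itself as a single device (the MCM goal) shows up as a
       group of one; explicit multi-GPU shows up as a group of several, and splitting
       work across them is then entirely the application's problem. */
    for (uint32_t i = 0; i < count; ++i)
        printf("device group %u: %u physical device(s)\n", i, groups[i].physicalDeviceCount);

    vkDestroyInstance(instance, NULL);
    return 0;
}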

The only solution is to "fake it until you make it." I just don't know if that is even possible and if it is, there will be drawbacks.


I think the fact they don't use the word "core" to describe these may be telling. If there is, in fact, only one dispatcher that covers all the modules, the dispatcher design limits how many modules can be attached. Certainly fits the "fake it until you make it" approach. Problem is, only benefit is improving yields. Because they give up so much space for each module on the interposer, I doubt these MCM'd chips can keep up with the larger monolithic chips with far more transistors.
Posted on Reply
#15
phanbuey
FordGT90Concept: That's the problem though: I don't think these appear to the OS as one GPU, they appear as four. Unless they found a revolutionary way to merge the capabilities of them all so they mimic one monolithic GPU, this is mostly only going to appeal to the compute market.

Direct3D 12 and Vulkan allow draw and compute calls to be sent to multiple GPUs easily, the problem is that how to execute it is not intrinsic to their design. Software developers have to go the extra mile to enable multi-GPU capabilities and the lion's share of them don't.

The only solution is to "fake it until you make it." I just don't know if that is even possible and if it is, there will be drawbacks.


I think the fact they don't use the word "core" to describe these may be telling. If there is, in fact, only one dispatcher that covers all the modules, the dispatcher design limits how many modules can be attached. Certainly fits the "fake it until you make it" approach. Problem is, only benefit is improving yields. Because they give up so much space for each module on the interposer, I doubt these MCM'd chips can keep up with the larger monolithic chips with far more transistors.
Except when you slap the largest manufacturable chips together into an MCM, there is nothing to keep up with. Forgot where I read it, but that seems to be their plan.
Posted on Reply
#16
Steevo
FordGT90Concept: That's the problem though: I don't think these appear to the OS as one GPU, they appear as four. Unless they found a revolutionary way to merge the capabilities of them all so they mimic one monolithic GPU, this is mostly only going to appeal to the compute market.

Direct3D 12 and Vulkan allow draw and compute calls to be sent to multiple GPUs easily, the problem is that how to execute it is not intrinsic to their design. Software developers have to go the extra mile to enable multi-GPU capabilities and the lion's share of them don't.

The only solution is to "fake it until you make it." I just don't know if that is even possible and if it is, there will be drawbacks.


I think the fact they don't use the word "core" to describe these may be telling. If there is, in fact, only one dispatcher that covers all the modules, the dispatcher design limits how many modules can be attached. Certainly fits the "fake it until you make it" approach. Problem is, only benefit is improving yields. Because they give up so much space for each module on the interposer, I doubt these MCM'd chips can keep up with the larger monolithic chips with far more transistors.
I think that's what's going to have to happen: a controller smart enough to communicate via base drivers to start, then vendor-specific drivers that reveal hardware capabilities to the kernel, with the controller then acting like a hardware-accelerated RAID card doing the grunt work while the drivers run on CPU time to ensure better branch prediction to feed the controller. We have been missing the hardware-accelerated part; instead the drivers do this work, but just like software RAID that has limitations, and what we are asking for is beyond the feasible power of software control.

I really think advanced AI will come into play in the GPU domain, like it supposedly does in Ryzen, to keep MCM on track. All the hardware limitations can be overcome with more transistors and better logic, and maybe the answer is a branch.xml of the future that can be kept in memory to allow hardware-agnostic performance, so that scenarios can be loaded partially at each major branch and weighted against X number of runs.
Posted on Reply
#17
evernessince
I'm pretty sure it's going to be much harder for NVIDIA to get an MCM GPU to work than for AMD. After all, AMD already has Infinity Fabric, while NVIDIA has zero experience creating such an interconnect. If it were easy, Intel would be doing it too, but Intel's best MCM implementation isn't nearly as good as Infinity Fabric.
Posted on Reply
#18
Brusfantomet
evernessince: I'm pretty sure it's going to be much harder for NVIDIA to get an MCM GPU to work than for AMD. After all, AMD already has Infinity Fabric, while NVIDIA has zero experience creating such an interconnect. If it were easy, Intel would be doing it too, but Intel's best MCM implementation isn't nearly as good as Infinity Fabric.
NVIDIA could be using on-package NVLink for the MCM communication.

As for the dispatcher, one could have a central dispatch core that uses NVLink or Infinity Fabric to send jobs to a number of cores, with each core having its own memory interface and dedicated L1 and L2 caches, plus access to a shared PCIe link.
Posted on Reply
#19
Xzibit
I don't see a dedicated PhysX core :(
Posted on Reply
#20
erocker
*
Somebody let me know when they sell one of these that works as a single GPU. This seems to be a GPU maker's pipe dream 'till it happens.
Posted on Reply
#21
efikkan
cryohellinc: And this means only one thing: Vulkan / DX12 and anything else that comes in the future will benefit from lots of CPU cores and multiple GPUs.
The whole point of MCM GPUs is that the GPU will appear as one unit, not as multiple GPUs, so multi-GPU support in APIs should be irrelevant.
Posted on Reply
#22
dorsetknob
"YOUR RMA REQUEST IS CON-REFUSED"
erocker: Somebody let me know when they sell one of these that works as a single GPU
you in the market for a "PUP" then :)
Posted on Reply
#23
Steevo
efikkan: The whole point of MCM GPUs is that the GPU will appear as one unit, not as multiple GPUs, so multi-GPU support in APIs should be irrelevant.
I agree, but so far it's mostly been the drivers' fault for not supporting it, not being smart enough to subdivide instruction sets, or for the latency introduced by dependent code paths that can't be broken apart for split or multi-core rendering.

Closer cores, plus a piece of hardware that checks the code path in advance and communicates that back through the drivers to fetch data for each branch possibility and run otherwise-unused cycles even if they're wasted, would still be faster than some of the negative scaling we have seen. And you are correct that if it's presented this way to all games (unless they request lower-level access and are optimized for it), it should run as fast if not faster, by at least some percentage, all of the time.
Posted on Reply
#24
Fluffmeister
evernessince: I'm pretty sure it's going to be much harder for NVIDIA to get an MCM GPU to work than for AMD. After all, AMD already has Infinity Fabric, while NVIDIA has zero experience creating such an interconnect. If it were easy, Intel would be doing it too, but Intel's best MCM implementation isn't nearly as good as Infinity Fabric.
I suspect NVIDIA will be fine; after all, we heard similar rumblings in the past that their HBM implementations were doomed to failure too: no experience, AMD has a massive head start, etc.

Yet GP100 is doing just dandy in the HPC market, whilst Vega is enjoyed by a handful of "pioneers".
Posted on Reply
#25
Solidstate89
FordGT90Concept: I think AMD was a little too obvious talking about Navi. Of course NVIDIA is going to try to head them off.

That said, I'm all for MCM because it means higher yields and higher yields means more bang for the buck. My concern/fear is that they'll run into the same problems as SLI/Crossfire where support is spotty.
These companies work on timelines of several years when it comes to arch development. AMD wasn't revealing anything that nVidia wouldn't have already been working on themselves. No company can adapt and build a new arch that quickly.

Hell, look at Intel trying to move up its LGA 2066 motherboard and CPU launch by just a few months, and the disaster that ended up being, with an underprepared and buggy platform launch. And that was an already expected platform launch that was accelerated by only a few months.
Posted on Reply