Wednesday, July 5th 2017

NVIDIA Laying Groundwork for Multi-Chip-Module GPUs

Multi-Chip-Module accelerators are nothing new, really. Though there are earlier implementations, when it comes to recognizable hardware most of us have already heard of, these solutions harken back to Intel's Kentsfield and Yorkfield quad-core processors (built on the 65 nm process for the LGA 775 package). However, the singular issue with this kind of approach is having a powerful enough interconnect that allows the different cores in each module to really "talk" to each other and work in tandem. More recently, AMD has demonstrated the advantages of a true MCM (Multi-Chip-Module) approach with its Ryzen CPUs. These result from the development of a modular CPU architecture with a powerful interconnect (Infinity Fabric), which has allowed AMD to keep die size to a minimum (a single eight-core die, itself built from two four-core CCXes), while enabling the company to profitably scale up to 16 cores with Threadripper (two dies per package) and 32 cores with Epyc (four dies per package).

AMD has already hinted that its still-distant Navi architecture (we're still waiting for Vega, after all) will bring a true MCM design to GPUs. Vega already supports AMD's Infinity Fabric interconnect, paving the way not only for future APU designs from the company, but also for MCM GPUs leveraging the same technology. NVIDIA, too, seems to be making strides towards an MCM-enabled future, looking to abandon the monolithic die design approach it has followed for a long time now.
NVIDIA believes a modular approach is the best currently feasible answer, technically and technologically, to a stagnating Moore's Law. CPU and GPU performance and complexity have leaned heavily on increasing transistor counts and density, whose development and, more importantly, production deployment are slowing down (the curve that seemed exponential is actually sigmoidal, eh!). In fact, the biggest die size achievable with today's technology is currently estimated at ~800 mm². The point is driven home when we consider that the company's Tesla V100 comes in at a staggering 815 mm², already straining that technical die-size limit. This fact, coupled with the industry's ever-increasing need for ever-increasing performance, leads us to believe that the GV100 will be one of NVIDIA's last monolithic-design GPUs (there is still a chance that 7 nm manufacturing will buy the company a little more time to develop a true MCM solution, but I would say odds are that NVIDIA's next product will already manifest such a design).
In a paper published by the company, NVIDIA says the way ahead is the integration of multiple GPU processing modules in a single package, thus allowing the GPU world to achieve what Ryzen and its Threadripper and EPYC older brothers are already achieving: scaling performance with small dies and, therefore, higher yields. Specifically, NVIDIA says that they "(...) propose partitioning GPUs into easily manufacturable basic GPU Modules (GPMs), and integrating them on package using high bandwidth and power efficient signaling technologies." In its white paper, NVIDIA says that "the optimized MCM-GPU design is 45.5% faster than the largest implementable monolithic GPU, and performs within 10% of a hypothetical (and unbuildable) monolithic GPU (...)", and that their "optimized MCM-GPU is 26.8% faster than an equally equipped Multi-GPU system with the same total number of SMs and DRAM bandwidth."
These developments showcase engineering's ingenuity and drive to improve, and look extremely promising for the companies involved: abandoning the monolithic design philosophy and scaling with a variable number of smaller dies should allow for greater yields and improved performance scaling, both keeping the high-performance market's needs sated and the tech companies' bottom lines a little better off than they (mostly) already are. Go ahead and follow the source NVIDIA link for the white paper; it's a very interesting read.
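The yield argument can be made concrete with a toy calculation. The sketch below assumes a simple Poisson defect model and an illustrative defect density (both are assumptions for illustration, not figures from NVIDIA's paper), and ignores wafer-edge losses and scribe lines:

```python
import math

WAFER_AREA = 70685  # mm^2, area of a 300 mm wafer (pi * 150^2)
D0 = 0.001          # assumed defect density: 0.1 fatal defects per cm^2

def yield_poisson(die_area):
    """Probability that a die of the given area has no fatal defect."""
    return math.exp(-die_area * D0)

def good_dies_per_wafer(die_area):
    # Candidate dies per wafer, scaled by the fraction that are defect-free.
    candidates = WAFER_AREA // die_area
    return int(candidates * yield_poisson(die_area))

mono = good_dies_per_wafer(800)  # monolithic ~800 mm^2 GPUs
gpm = good_dies_per_wafer(200)   # smaller ~200 mm^2 GPU modules (GPMs)
mcm_gpus = gpm // 4              # each MCM package built from four known-good modules

print(f"Good monolithic GPUs per wafer: {mono}")
print(f"Four-module MCM GPUs per wafer: {mcm_gpus}")
```

Under these assumed numbers the wafer yields roughly twice as many four-module MCM GPUs as monolithic ones, because bad small dies are discarded individually instead of scrapping an entire huge die.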
Sources: NVIDIA MCM Paper, Radar.O'Reilly.com

49 Comments on NVIDIA Laying Groundwork for Multi-Chip-Module GPUs

#26
erocker
dorsetknobyou in the market for a "PUP" then :)
No idea.
Posted on Reply
#28
efikkan
SteevoI agree, but so far its been mostly the drivers fault for not supporting it, or not being smart enough to subdivide instruction sets, or the latency introduced by some dependent code paths unable to be broken apart for split or multicore rendering.

Closer cores and a piece of hardware to check the code path in advance and communicate that back through drivers to fetch data from each branch possibility and run unused cycles even if its wasted would still be faster than some of the negative scaling we have seen, and you are correct that if its presented to all games unless they request lower level access and are optimized for it, that it should run as fast if not faster by at least some percent all of the time.
Low level scheduling has to be done in hardware, the latency from the CPU will never let this be done in the driver. For MCM GPUs to work well, the bandwidth between them has to be good and the latency very low, which is why traditional multi-GPU never will work well like this for games.
Posted on Reply
#29
evernessince
BrusfantometNvidia could be using on package NVLink for the MCM communication.

As for the dispatcher, one could have a central dispatch core that uses NVLink or infinity fabric to send jobs to a number of cores, with each core having a memory interface and dedicated L1 and L2 memory with access to a shared PCIe link.
NVLink isn't designed to be used to make two GPU cores appear as one. Right now AMD is the only one to have multiple dies appear as one, and that's Ryzen.
Posted on Reply
#30
justimber
Nvidia engineers must be busy reverse engineering a Ryzen now :p
Posted on Reply
#31
TheGuruStud
evernessinceI'm pretty sure it's going to be much harder for Nvidia to get a MCM GPU to work then AMD. After all, AMD already has infinity fabric while Nvidia has zero experience creating one. If it were something easy Intel would be doing it too but Intel's best MCM implementation isn't nearly as good as infinity fabric.
They employed three universities to do the work for them. Does anyone think those dogs are going to invent something themselves? LOL
nem..T800 chip pic
Daaaamn, Cameron is good. He predicted the future...maybe it's AMDnet.
Posted on Reply
#32
HopelesslyFaithful
RaevenlordMulti-Chip-Module accelerators are nothing new, really. (...)
.....ryzen is a 2x4 core OP.....your article has historical errors everywhere. You really need to retract and update with all the fixes that the community has stated.....Do we need to start writing for you? Good god this was awful.

CCX is 4 cores not 8 cores.

www.anandtech.com/show/11170/the-amd-zen-and-ryzen-7-review-a-deep-dive-on-1800x-1700x-and-1700/5

Another Ravenlord epic fail of an article...retract please!
Posted on Reply
#33
Prima.Vera
XiGMAKiDI don't think it will be as spotty as SLI, more like Ryzen module spotty
SLI/CFX kind of support will be only hardware this time, not software (game optimizations)
Posted on Reply
#34
Octopuss
Haven't multi GPU solutions been proved as useless in the recent years? Be it either SLI or multi GPU cards.

Also, what sort of technological limit is there to die size?
Posted on Reply
#35
TheoneandonlyMrK
HopelesslyFaithful.....ryzen is a 2x4 core OP.....your article has historical errors everywhere. You really need to retract and update with all the fixes that the community has stated.....Do we need to start writing for you? Good god this was awful.

CCX is 4 cores not 8 cores.

www.anandtech.com/show/11170/the-amd-zen-and-ryzen-7-review-a-deep-dive-on-1800x-1700x-and-1700/5

Another Ravenlord epic fail of an article...retract please!
a ccx might be 4 cores but he was talking about the die which is 8.
Posted on Reply
#36
bug
FordGT90ConceptThat's the problem though: I don't think these appear to the OS as one GPU, they appear as four.
Look at the first block diagram: there's only one I/O block in there.
Posted on Reply
#37
HopelesslyFaithful
theoneandonlymrka ccx might be 4 cores but he was talking about the die which is 8.
which has allowed AMD to keep die size to a minimum (as it relates to a true 8-core design, at least
he said this which is factually false.


CCX in of itself is a MCM.
Posted on Reply
#38
FordGT90Concept
"I go fast!1!11!1!"
bugLook at the first block diagram: there's only one I/O block in there.
Yes, but where's the "Sys I/O" at? The modules appear uniform in size. Do they all have a Sys I/O and only one is functional?

Another concern is that the memory controller: if it isn't in the Sys I/O, that strongly suggests each module has it's own memory controller which makes accesses between pools higher latency. The easiest solution is like SLI and Crossfire: mirroring the VRAM. That's extremely wasteful.

TL;DR I'll believe it when I see it.
Posted on Reply
#39
bug
FordGT90ConceptYes, but where's the "Sys I/O" at? The modules appear uniform in size. Do they all have a Sys I/O and only one is functional?

Another concern is that the memory controller: if it isn't in the Sys I/O, that strongly suggests each module has it's own memory controller which makes accesses between pools higher latency. The easiest solution is like SLI and Crossfire: mirroring the VRAM. That's extremely wasteful.

TL;DR I'll believe it when I see it.
Just look at the damn thing. Memory is clearly connected to each module, the Sys I/O isn't.
Posted on Reply
#40
Raevenlord
News Editor
HopelesslyFaithful.....ryzen is a 2x4 core OP.....your article has historical errors everywhere. You really need to retract and update with all the fixes that the community has stated.....Do we need to start writing for you? Good god this was awful.

CCX is 4 cores not 8 cores.

www.anandtech.com/show/11170/the-amd-zen-and-ryzen-7-review-a-deep-dive-on-1800x-1700x-and-1700/5

Another Ravenlord epic fail of an article...retract please!
You need to take a huge pill of chill and learn how to read.
Posted on Reply
#41
bug
RaevenlordYou need to take a huge pill of chill and learn how to read.
On teh internetz? NEVAAAAA!!!
Posted on Reply
#42
TheoneandonlyMrK
HopelesslyFaithfulhe said this which is factually false.


CCX in of itself is a MCM.
no it's not, and it's not the same: the die is the design, two such designs are not just stuck together, they make up a single die. An MCM is multiple dies on an interposer, ala 2.5D
Posted on Reply
#43
FordGT90Concept
"I go fast!1!11!1!"
bugJust look at the damn thing. Memory is clearly connected to each module, the Sys I/O isn't.
It doesn't tell us how the memory is structured/accessed and I was talking about the physical picture where all the modules appear the same on the surface so where is the Sys I/O physically at?
Posted on Reply
#44
HopelesslyFaithful
theoneandonlymrkno its not its not the same the die is the design two such design are not just stuck together, this makes a die, an Mcm is multiple Die on an interposer ala 2.5d
it still isn't a true 8 core. It is 2x4 cores packed on 1 die...basically the same thing with all the cons of 2 separate CPUs. (large penalty switching between clusters) Functionally, CCX on die/interposer are the same and definitely not a 8 core.
RaevenlordYou need to take a huge pill of chill and learn how to read.
maybe be factually accurate and admit you wrote a crap article?

CCX is a 4 core clusters and not a true 8 core. You need to learn how to read and have integrity in your articles and redact and fix them.
FordGT90ConceptIt doesn't tell us how the memory is structured/accessed and I was talking about the physical picture where all the modules appear the same on the surface so where is the Sys I/O physically at?
Is this that confusing? look at the lines........
Posted on Reply
#45
TheoneandonlyMrK
HopelesslyFaithfulit still isn't a true 8 core. It is 2x4 cores packed on 1 die...basically the same thing with all the cons of 2 separate CPUs. (large penalty switching between clusters) Functionally, CCX on die/interposer are the same and definitely not a 8 core.

maybe be factually accurate and admit you wrote a crap article?

CCX is a 4 core clusters and not a true 8 core. You need to learn how to read and have integrity in your articles and redact and fix them.



Is this that confusing? look at the lines........
The design means nothing to this debate.

An MCM is more than one separately made die put into the same package .............. full stop, that's the fact

Thats all but please feel free to argue about random shit to my ignore hand as much as you like.
Posted on Reply
#46
HopelesslyFaithful
theoneandonlymrkThe design means nothing to this debate.

An Mcm is more than one seperately made die put into the same package .............. full stop thats the fact

Thats all but please feel free to argue about random shit to my ignore hand as much as you like.
he was still factually false in multiple places in that terribly written article....full stop:


Stop scapegoating for him.

CCX is 4 core CPU. 100% fact. Not an 8 core CPU.
Posted on Reply
#47
newtekie1
Semi-Retired Folder
jabbadapAnd even earlier than that, Pentium Pro from 1995.
Good call! I was thinking of only chips that used the same chip multiple times. But I forgot the Pentium Pro had a separate L2 chip on the package..

But then again, if we think of it like that, wouldn't anything with HBM be considered a MCM?
Posted on Reply
#48
bug
newtekie1Good call! I was thinking of only chips that used the same chip multiple times. But I forgot the Pentium Pro had a separate L2 chip on the package..

But then again, if we think of it like that, wouldn't anything with HBM be considered a MCM?
Well, it is not called Multi-SameChip-Module...
Posted on Reply
#49
ratirt
I wonder, guys, have you seen this?
This link is a presentation of multi-GPUs and AMD's plan. I suggest you all watch the entire video to understand the situation and all the information given, since it also reaches a conclusion. The video explains a lot about multi-GPUs, and now we have NVIDIA's MCM tech, which was officially announced for GPUs not so long ago. What do you think about it? Is this indeed AMD's plan, with NVIDIA seeing it and having to adjust, or is it unrelated? Maybe there's some truth in this and we're seeing it happen now.
Posted on Reply