Sunday, April 12th 2015

NVIDIA "Pascal" GP100 Silicon Detailed

The upcoming "Pascal" GPU architecture from NVIDIA is shaping up to be a pixel-crunching monstrosity. NVIDIA introduced it as more of a number-cruncher with the Tesla P100 unveil at GTC 2016, and we got our hands on the block diagram of the "GP100" silicon that drives it. To begin with, the GP100 is a multi-chip module, much like AMD's "Fiji," consisting of a large GPU die, four memory stacks, and a silicon wafer (interposer) that acts as a substrate for the GPU and memory stacks, letting NVIDIA run microscopic wires between them. The GP100 features a 4096-bit wide HBM2 memory interface, which allows memory bandwidth of up to 1 TB/s. On the P100, the memory bandwidth is 720 GB/s.
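
As a rough sanity check on those bandwidth figures, peak bandwidth is simply bus width multiplied by the per-pin data rate. The sketch below assumes that relationship holds; the function names are hypothetical, and the 2 Gbps per-pin rate is an assumption chosen because it lands near the quoted 1 TB/s, not a figure from the block diagram.

```python
def peak_bandwidth_gbs(bus_width_bits: int, pin_rate_gbps: float) -> float:
    """Peak memory bandwidth in GB/s for a given bus width and per-pin rate."""
    return bus_width_bits * pin_rate_gbps / 8  # 8 bits per byte


def implied_pin_rate_gbps(bus_width_bits: int, bandwidth_gbs: float) -> float:
    """Per-pin data rate in Gbps implied by a quoted bandwidth figure."""
    return bandwidth_gbs * 8 / bus_width_bits


BUS_WIDTH_BITS = 4096      # HBM2 interface width quoted in the article
P100_BANDWIDTH_GBS = 720   # Tesla P100 bandwidth quoted in the article

# Per-pin rate the P100 figure implies on a 4096-bit bus.
p100_pin_rate = implied_pin_rate_gbps(BUS_WIDTH_BITS, P100_BANDWIDTH_GBS)
print(p100_pin_rate)  # 1.40625

# Assumed 2 Gbps per pin: roughly the "up to 1 TB/s" figure.
full_rate_bandwidth = peak_bandwidth_gbs(BUS_WIDTH_BITS, 2.0)
print(full_rate_bandwidth)  # 1024.0
```

In other words, the P100's 720 GB/s is consistent with the memory running at roughly 70% of the per-pin rate that would be needed for 1 TB/s.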

At the top level of its hierarchy, the GP100 is structured much like other NVIDIA GPUs, with the exception of two key interfaces: bus and memory. A PCI-Express gen 3.0 x16 host interface connects the GPU to your system, and the GigaThread Engine distributes workload among six graphics processing clusters (GPCs). Eight memory controllers make up the 4096-bit wide HBM2 memory interface, and a new "High-speed Hub" component wires out four NVLink ports. At this point it's not known whether the 80 GB/s (per-direction) throughput applies to each port or to all four ports put together.

The GP100 features six GPCs. These are highly independent subdivisions of the GPU, with their own render front and back-ends. With the "Pascal" architecture, at least in the way it's implemented on the GP100, each GPC features 10 streaming multiprocessors (SMs), the basic number-crunching machinery of the GPU. Each SM holds 64 CUDA cores. The GPC hence holds a total of 640 CUDA cores, and the entire GP100 chip holds 3,840 CUDA cores. Other vital specs include 240 TMUs. On the Tesla P100, NVIDIA enabled just 56 of the 60 streaming multiprocessors, working out to a CUDA core count of 3,584.
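
The core-count arithmetic above can be reproduced in a few lines. This is only a sketch using the figures quoted in this article; the constant and function names are made up for illustration, and the TMU count for the cut-down part assumes TMUs are spread evenly across SMs.

```python
GPCS = 6              # graphics processing clusters on GP100
SMS_PER_GPC = 10      # streaming multiprocessors per GPC
CORES_PER_SM = 64     # CUDA cores per SM
FULL_CHIP_TMUS = 240  # TMUs on the fully enabled chip
P100_ENABLED_SMS = 56 # SMs enabled on the Tesla P100


def cuda_cores(enabled_sms: int) -> int:
    """CUDA cores available with the given number of enabled SMs."""
    return enabled_sms * CORES_PER_SM


full_sms = GPCS * SMS_PER_GPC             # 60 SMs on the full chip
tmus_per_sm = FULL_CHIP_TMUS // full_sms  # assumes an even split across SMs

print(cuda_cores(SMS_PER_GPC))         # 640 cores per GPC
print(cuda_cores(full_sms))            # 3840 cores on the full GP100
print(cuda_cores(P100_ENABLED_SMS))    # 3584 cores on the Tesla P100
print(P100_ENABLED_SMS * tmus_per_sm)  # 224 TMUs under the even-split assumption
```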

The "Pascal" architecture appears to facilitate very high clock speeds. The Tesla P100, despite being an enterprise part, features a core clock speed as high as 1328 MHz, with a GPU Boost frequency of 1480 MHz, and a TDP of 300W. The TDP might scare you, but you have to take into account that the memory stacks have been moved onto the GPU package, so the heatsink interfacing with it all will have to cope with the combined thermal loads of the GPU die, the memory stacks, and whatever else makes heat on the multi-chip module.
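
To put those clocks in perspective, a back-of-the-envelope estimate of peak single-precision throughput can be made from the core count and clock speed. This sketch assumes the usual convention of counting one fused multiply-add per CUDA core per clock as two floating-point operations; that convention and the function name are assumptions here, not something read off the block diagram.

```python
FLOPS_PER_CORE_PER_CLOCK = 2  # assumption: one fused multiply-add = 2 FLOPs


def peak_fp32_tflops(cuda_cores: int, clock_mhz: float) -> float:
    """Theoretical peak FP32 throughput in TFLOP/s."""
    return cuda_cores * FLOPS_PER_CORE_PER_CLOCK * clock_mhz / 1_000_000


P100_CUDA_CORES = 3584  # 56 SMs x 64 cores, as quoted in the article

base_tflops = peak_fp32_tflops(P100_CUDA_CORES, 1328)   # core clock
boost_tflops = peak_fp32_tflops(P100_CUDA_CORES, 1480)  # GPU Boost clock

print(f"{base_tflops:.2f}")   # 9.52
print(f"{boost_tflops:.2f}")  # 10.61
```

These are theoretical ceilings under the stated assumption; sustained throughput depends on the workload and on how long the card can hold its boost clock within the 300W TDP.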

Lastly, there's the concept of NVLink. This interconnect, developed in-house by NVIDIA, makes multi-GPU setups work much like a modern multi-socket CPU machine, in which QPI (Intel) or HyperTransport (AMD) links provide super-highways between neighboring sockets. NVLink offers a bandwidth of up to 80 GB/s (per direction), enabling true memory virtualization between multiple GPUs. This could prove useful for GPU-accelerated HPC systems, in which one GPU has to access memory controlled by a neighboring GPU, while the software sees the sum of the two GPUs' memory as one unified and contiguous block. The Pascal Unified Memory system lets advanced GPU programming models like CUDA 8 oversubscribe memory beyond what the GPU physically controls, up to the size of the system memory.

50 Comments on NVIDIA "Pascal" GP100 Silicon Detailed

#2
xkm1948
Ikaruga said:
My body is ready
You are right, I bet the next Titan is gonna cost both arms and legs, leaving the body ready. :D
#3
R-T-B
xkm1948 said:
You are right, I bet the next Titan is gonna cost both arms and legs, leaving the body ready. :D
Is that why they are selling us so hard on VR? We won't be able to move our arms and legs anywhere but in the VR realm because they have been stolen/sold?
#4
Frick
Fishfaced Nincompoop
R-T-B said:
Is that why they are selling us so hard on VR? We won't be able to move our arms and legs anywhere but in the VR realm because they have been stolen/sold?
Not until we have implants attached directly to our brains yo.
#5
R-T-B
Frick said:
Not until we have implants attached directly to our brains yo.
That's the part you sell your other arm for. You know, the one you masturbate with. But it's ok, you can VR masturbate now.
#6
ensabrenoir


...........take the green pill.....and enter a new V.R. world beyond the matrix .....


.............yes......................... I dug that up....
#7
xvi
Frick said:
Not until we have implants attached directly to our brains yo.
Things I need. ^
#8
FordGT90Concept
"I go fast!1!11!1!"
This doesn't sound like much/any improvement on the async shaders front. It also only represents an increase of 16.7% in CUDA core count compared to Titan X. I think I'm disappointed.

Maxwell SM
Titan X

Graphics Processing Clusters: 6 -> 6
Streaming Multiprocessors: 4 -> 10
CUDA Cores: 128 -> 64
Register File: 16,384 -> 32,768

It should be noted that this Tesla card has 2:1 DP Units which means its double precision performance should be quite substantial. I doubt those DP Units will find their way into the graphics card.

The high GPU clock speeds are really the only positive thing I'm taking away from this, but that could also be a function of being a Tesla card.
#9
HumanSmoke
FordGT90Concept said:
This doesn't sound like much/any improvement on the async shaders front. It also only represents an increase of 16.7% in CUDA core count compared to Titan X. I think I'm disappointed.

Maxwell SM
Titan X

Graphics Processing Clusters: 6 -> 6
Streaming Multiprocessors: 8 -> 10
CUDA Cores: 64 -> 64
Register File: 16,384 -> 32,768

It should be noted that this Tesla card has 2:1 DP Units which means its double precision performance should be quite substantial. I doubt those DP Units will find their way into the graphics card.

The high GPU clock speeds are really the only positive thing I'm taking away from this, but that could also be a function of being a Tesla card.
I wouldn't get too caught up in what is an HPC card. The four NVLink interfaces have no consumer value, and neither do the 1,920 FP64 units. More likely a GeForce card would/should be a doubled-in-size GP104 (GP102?) for consumer use. Not too sure how you got "CUDA Cores 64 -> 64". Maxwell has 128 cores per SM, Pascal has 64 per SM; GM200 has 24 SMs, GP100 has 60 (or 56 in its initial iteration); GM200 has 192 TMUs, GP100 has 240 (224 in the announced version); and in its initial guise GP100 has 14MB of register file versus 6MB in the fully enabled GM200, along with double the registers per thread compared with GM200. The only metric where it falls behind is that its 4MB of L2 is spread across more ALUs than the 3MB of GM200.
I think I'd wait to make a judgement for any application aside from enterprise until we find out whether the consumer card is a version of this, or a purpose designed gaming card. I'd also like to see some definitive ROP count just for the sake of completeness.

I'd also note that Teslas are invariably clocked lower than GeForce variants. Even the high-clocked M40 (the only exception in the Tesla family) has a max boost of 1114MHz, where the average guaranteed boost of the GM200 is 1075MHz, and the GeForce card usually peaks higher than the Tesla card.
#10
Ferrum Master
HumanSmoke said:
NVLink interfaces have no consumer value
Unless they use it in dual cards instead of the PLX bridge or some laptop setups... most probably the block will be omitted in consumer cards.
#11
john_
ensabrenoir said:


...........take the green pill.....and enter a new V.R. world beyond the matrix .....


.............yes......................... I dug that up....
Take the green pill and enter a VR world full of dizziness

https://www.youtube.com/watch?v=PmJM0X3mVnc
#12
Ebo
Until Nvidia has actually shown Pascal and AMD has shown Polaris, I remain "skeptical".

I'm keeping my money in my pocket until the products are tested on several sites.
#13
The Quim Reaper
Who cares....GP100 cards are still nearly a year away.

Wake me up a week before launch.
#14
ZoneDymo
Ebo said:
Until Nvidia has actually shown Pascal and AMD has shown Polaris, I remain "skeptical".

I'm keeping my money in my pocket until the products are tested on several sites.
Well, it's not like you could buy them before that anyway.....
#15
MxPhenom 216
Corsair Fanboy
I really hope I get this hardware engineering internship at Microsoft this summer now!!!
#16
FordGT90Concept
"I go fast!1!11!1!"
HumanSmoke said:
Not too sure how you got "CUDA Cores 64 -> 64".
Edit: I see it now. The diagrams are misleading because there's 4 clusters in each SM instead of 2. That means Maxwell had 4 SMs per GPC instead of 8 SMs. The architecture is quite different there. I corrected the numbers in the post.
#17
xenocide
In all likelihood GP104 will be on par with Titan X and GP100 will be a notable increase in performance over Titan X. I'm okay with that. If it means I can get 980/Titan X level performance for ~$300-400, I'm alright with that. Plus the HBM2 on GP100 will probably make the GPU scale very well with resolution, making it more suitable for 4K, which is at the very least a selling point.
#18
bug
Ferrum Master said:
Unless they use it in dual cards instead of the PLX bridge or some laptop setups... most probably the block will be omitted in consumer cards.
It was officially announced NVLink only works with POWER CPUs at this time. So no, it's not for home use.
#19
jabbadap
bug said:
It was officially announced NVLink only works with POWER CPUs at this time. So no, it's not for home use.
Well yeah, there's no x86 processor which has NVLink support (and I don't believe there ever will be such a processor). But a GPU-to-GPU link should be possible with an x86 processor, thus dual-GPU cards could use it between GPUs and use PCIe to communicate with the CPU (GTC 2016 NVLink graph).
#20
medi01
R-T-B said:
Is that why they are selling us so hard on VR?
That's mostly fear, imo.
VR MIGHT be next big thing, and they are scared they'll miss it. ("I feel dizzy" / no async (c) Maxwell, cough)
#21
BorisDG
The Quim Reaper said:
Who cares....GP100 cards are still nearly a year away.

Wake me up a week before launch.
GP100 for sure is coming this year. Probably oct/nov. ;) Every year is like this. Summer = GP104; Winter = GP100.

My body is also ready for GP100. Glad that I skipped the ugly Maxwell architecture.
#22
Ferrum Master
jabbadap said:
Well yeah, there's no x86 processor which has NVLink support (and I don't believe there ever will be such a processor). But a GPU-to-GPU link should be possible with an x86 processor, thus dual-GPU cards could use it between GPUs and use PCIe to communicate with the CPU (GTC 2016 NVLink graph).
They might repeat history and make something like NF200 bridge... nvidia shelling a useless piece of silicon to reside on motherboards to have some sort of shady SLI gimmick :D.
#23
HisDivineOrder
BorisDG said:
GP100 for sure is coming this year. Probably oct/nov. ;) Every year is like this. Summer = GP104; Winter = GP100.

My body is also ready for GP100. Glad that I skipped the ugly Maxwell architecture.
Titan products usually show up at the beginning of a year, around Feb. Geforce products directly derived from those Titan products usually arrive the June after that.

So... I think Pascal Titan will show up Feb/March of next year (giving them a year's worth of production for other uses) and then magically a cut-down version of it will appear as a high end Geforce product in May/June.
#24
Xzibit
jabbadap said:
Well yeah, there's no x86 processor which has NVLink support (and I don't believe there ever will be such a processor). But a GPU-to-GPU link should be possible with an x86 processor, thus dual-GPU cards could use it between GPUs and use PCIe to communicate with the CPU (GTC 2016 NVLink graph).
They could do what Apple did with the dual AMD FirePros in the Mac Pro.



Your new SLI "NVLink" bridge
#25
Ferrum Master
Xzibit said:
They could do what Apple did with the dual AMD FirePros in the Mac Pro.
Put them into a trash can?