Thursday, May 14th 2020

NVIDIA GA100 Scalar Processor Specs Sheet Released

NVIDIA today unveiled the GTC 2020, online event, and the centerpiece of it all is the GA100 scalar processor GPU, which debuts the "Ampere" graphics architecture. Sifting through a mountain of content, we finally found the slide that matters the most - the specifications sheet of GA100. The GA100 is a multi-chip module that has the 7 nm GPU die at the center, and six HBM2E memory stacks at its either side. The GPU die is built on the TSMC N7P 7 nm silicon fabrication process, measures 826 mm², and packing an unfathomable 54 billion transistors - and we're not even counting the transistors on the HBM2E stakcs of the interposer.

The GA100 packs 6,912 FP32 CUDA cores, and independent 3,456 FP64 (double-precision) CUDA cores. It has 432 third-generation tensor cores that have FP64 capability. The three are spread across a gargantuan 108 streaming multiprocessors. The GPU has 40 GB of total memory, across a 6144-bit wide HBM2E memory interface, and 1.6 TB/s total memory bandwidth. It has two interconnects: a PCI-Express 4.0 x16 (64 GB/s), and an NVLink interconnect (600 GB/s). Compute throughput values are mind-blowing: 19.5 TFLOPs classic FP32, 9.7 TFLOPs classic FP64, and 19.5 TFLOPs tensor cores; TF32 156 TFLOPs single-precision (312 TFLOPs with neural net sparsity enabled); 312 TFLOPs BFLOAT16 throughout (doubled with sparsity enabled); 312 TFLOPs FP16; 624 TOPs INT8, and 1,248 TOPS INT4. The GPU has a typical power draw of 400 W in the SXM form-factor. We also found the architecture diagram that reveals GA100 to be two almost-independent GPUs placed on a single slab of silicon. We also have our first view of the "Ampere" streaming multiprocessor with its FP32 and FP64 CUDA cores, and 3rd gen tensor cores. The GeForce version of this SM could feature 2nd gen RT cores.
Add your own comment

100 Comments on NVIDIA GA100 Scalar Processor Specs Sheet Released

#26
IceShroom
400W??? Isn't Nvidia suppose to be efficient??
Posted on Reply
#27
theoneandonlymrk
T4C Fantasy
There is no difference, A100 is just the Tesla name it uses a GA100
I don't know about no difference one's cut down and the price will vary , so they carry the same name though, weird.
Posted on Reply
#28
Fluffmeister
IceShroom
400W??? Isn't Nvidia suppose to be efficient??
Compared to what exactly?
Posted on Reply
#29
T4C Fantasy
CPU & GPU DB Maintainer
theoneandonlymrk
I don't know about no difference one's cut down and the price will vary , so they carry the same name though, weird.
The different one will be GA102, No HBM but just as many cuda cores more or less.

But there won't be 2 different 100s, technically that is what the 102 is.
Posted on Reply
#30
IceShroom
Fluffmeister
Compared to what exactly?
Compared to AMD. Nvidia's 12nm GPUs have same efficiency of AMD's 7nm GPUs, as a result Nvidia's 7nm GPU's should be more efficient.
Posted on Reply
#31
Fluffmeister
IceShroom
Compared to AMD. Nvidia's 12nm GPUs have same efficiency of AMD's 7nm GPUs, as a result Nvidia's 7nm GPU's should be more efficient.
So your comparing a 10.3 billion transistor Navi to a 54 billion transistor 40GB HBM2 HPC AI compute monster.

Got ya.
Posted on Reply
#32
dyonoctis
IceShroom
400W??? Isn't Nvidia suppose to be efficient??
For what it's supposed to be the perf/watt ratio is actually great. A Single rack of a DGX A100 can replace several old racks.
From this :


To this:

Posted on Reply
#33
Fluffmeister
A picture paints a thousand words, thank you.
Posted on Reply
#34
Vya Domus
T4C Fantasy
But there won't be 2 different 100s, technically that is what the 102 is.
But this one has an entire GPC disabled due to horrendous yields, I presume, and probably because it would throw even that eye watering 400W TDP out the window. There has to be one fully enabled chip right ? One would assume there would be different 100s.

To be honest this is borderline Thermi 2.0, a great compute architecture that can barley be implemented in actual silicon due to power and yields. These aren't exactly Nvidia's brightest hours in terms of chip design, it seems like they bit more than what they could chew, the chip was probably cut down in a last minute decision.

Suffice to say I doubt we'll see the full 8192 shaders in any GPU this generation, I doubt they could realistically fit that in a 250W power envelope and it seems like GA100 runs at 1.4 Ghz, no change from Volta nor from Turing probably. Let's see 35% more shaders than Volta but 60% more power and same clocks. It's not shaping up to be the "50% more efficient and 50% faster per SM" some hoped for.
Posted on Reply
#35
Fluffmeister
Well they can scrap the FP64 performance that the 5700XT offers in FP32 for starters so that is a bonus, with the TU102 being 18.6 billion transistors I'd suggest they have wiggle room. Just a thought.
Posted on Reply
#36
RH92
theoneandonlymrk
Some of what your saying is wrong ,it takes up quite a lot of die space relatively hence Nvidia's large die sizes which are added to by the requirements of extra cache resources and hardware needed to keep the special units busy.
I'm afraid you are wrong . The myth that larger die sized are correlated to fixed function hardware has been already debunked , im trying to find the source , might be TPU , Anandtech , or Youtube but it might take time until i find it so i will link it here ASAP .

There is no real correlation between die size increase and fixed function as the latter eats relatively very low die space , more likely than not the higher die size in Turing is explained by the fact that it has more SMs .

This is further backed up by GA100 which has increased dies size compared to GV100 ( 826mm^2 vs 815mm^2 ) but significantly lower TensorCore count ( 432 vs 640 ) . So it is pretty obvious that fixed function hardware is not responsible for the die size expansion !
theoneandonlymrk
The other reason being because they can, and to make more money, it's not rocket science just business, people should have chosen with their wallet's.
This was exactly my point , the only tangible argument that justifies higher prices for Turing ( other than the increased silicon size ) is because the lack of competition allows them to do so .
Posted on Reply
#37
M2B
Vya Domus
. These aren't exactly Nvidia's brightest hours in terms of chip design
These are exactly Nvidia's brightest hours in terms of chip design.
The A100 packs 54 billion transistors, 2.5 times as much as a V100, and those transistors aren't there for nothing.
You can't just compare SM counts and base stupid assumptions upon that. The A100 is clearly a much more efficient solution for what it's been designed for.
Posted on Reply
#38
theoneandonlymrk
RH92
I'm afraid you are wrong . The myth that larger die sized are correlated to fixed function hardware has been already debunked , im trying to find the source , might be TPU , Anandtech , or Youtube but it might take time until i find it so i will link it here ASAP .

There is no real correlation between die size increase and fixed function as the latter eats relatively very low die space , more likely than not the higher die size in Turing is explained by the fact that it has more SMs .

This is further backed up by GA100 which has increased dies size compared to GV100 ( 826mm^2 vs 815mm^2 ) but significantly lower TensorCore count ( 432 vs 640 ) . So it is pretty obvious that fixed function hardware is not responsible for the die size expansion !



This was exactly my point , the only tangible argument that justifies higher prices for Turing ( other than the increased silicon size ) is because the lack of competition allows them to do so .
We disagree , so be it.
Posted on Reply
#39
Vya Domus
M2B
These are exactly Nvidia's brightest hours in terms of chip design.
Why is almost 20% of the chip disabled then ? That's great design, right ?
M2B
You can't just compare SM counts and base stupid assumptions upon that.
Comparing SM counts and power is a totally legit way of inferring efficiency, how else would you do it? The SMs aren't same, but that's the point, efficiency wouldn't come just from the node.
M2B
those transistors aren't there for nothing.
Guess what buddy, some of them are for nothing, I'd say about 8-9 billion give or take.

Let's face reality, they couldn't enable the entire chip because of power constraints. Making a chip like that isn't desirable, it's painfully obvious they've missed their target by miles.
Posted on Reply
#40
M2B
Vya Domus
Why is almost 20% of the chip disabled then ? That's great design, right ?



Comparing SM counts and power is a totally legit way of inferring efficiency, how else would you do it, smart ass ? The SMs aren't same, but that's the point, efficiency wouldn't come just from the node.



Guess what buddy, some of them are for nothing, I'd say about 9 billion give or take.
Look at this clueless person acting like he really knows how to design GPUs better than a 200$ billion company which have been designing GPUs for ages.
So, based on your logic the Vega 56 is a more efficient GPU than AMD's latest and greatest 5700 XT, because it has more TFLOPS and much more compute units, and consumes similar amounts of power, right?
Based on the density figures, I think Nvidia is using TSMC's high-density version of their 7nm node, not the high-performance one, and that was not the case with previous generations.
They could just use the normal high performance version and scale up the GV100 chip, but they clearly needed more density for their design goals.
What I'm saying is that you have to see how the chip performs in applications that actully matter and base efficiency figures upon that, not just some raw numbers.
Posted on Reply
#41
Vya Domus
M2B
Look at this clueless person acting like he really knows how to design GPUs better a 200$ billion company which have been designing GPUs for ages.
So, based on your logic the Vega 56 is a more efficient GPU than AMD's latest and greatest 5700 XT, because it has more TFLOPS and much more compute units, and consumes similiar amounts of power,
:roll:

Because you do know how to design a GPU, right ? Sorry your GPU architect badge must have fallen off.
M2B
So, based on your logic the Vega 56 is a more efficient GPU than AMD's latest and greatest 5700 XT, because it has more TFLOPS and much more compute units, and consumes similiar amounts of power,
Nope, that's based on your logic. Your understanding of what I said was obviously severely limited.

First of all Vega 56 uses more power, and runs at lower clocks. A legendary GPU architect like yourself would know that a larger processor at lower clocks runs more efficiently because shaders scale relativity linearly with power whereas a change in clocks incurs a change in voltage which isn't linear. In other words if let's say we have a GPU with N/2 shaders at 2 Ghz it will generally consume more power than a GPU with N shaders at 1 Ghz.

Let's compile that with how Navi works : RX 5700XT runs at a considerably higher voltages and clocks and has way less shaders and yet it generates a similar amount of FP32 compute with less power. It's obviously way more efficient architecturally, but as I already mentioned I am sure a world renowned GPU architect as yourself knew all that.

On the other hand, Volta and Ampere run at pretty much the same frequency and likely similar voltages since TSMC's 7nm doesn't seem to change that in any significant manner (in fact all 7nm CPU/GPU up until know seem to run at the same or even higher voltages), GA100 has 20% more shaders compared to V100 but also consumes 60% more power. It doesn't take much to see that efficiency isn't that great. It's not that hard to infer these things, don't overestimate their complexity.

Yes, I am sure when you factor in Nvidia's novel floating point formats it looks great, but if you look just at FP32, it's doesn't look great. It's rather mediocre. Do you not find it strange that our boy Jensen never once mentioned FP32 performance ?

I never said I knew how to design it better, stop projecting made up staff onto me. I said it was obvious they failed to do what they originally set out to do, hence why a considerable porton of the chip is fused off. They've done it in the past too.
Posted on Reply
#42
Caring1
Vya Domus
By the way I've just noticed the power :), 400W, that's 150W over V100. Ouch, 7nm hasn't been kind, I was right that this is a power hungry monster.
Plot twist.
Jensen wasn't baking it in his oven, he used them to heat his oven.
Posted on Reply
#43
chodaboy19
I think this is exactly what data center customers want and have been asking for, these will sell like hot cakes to the big cloud operators.
Posted on Reply
#44
Dante Uchiha
RH92
I'm afraid you are wrong . The myth that larger die sized are correlated to fixed function hardware has been already debunked , im trying to find the source , might be TPU , Anandtech , or Youtube but it might take time until i find it so i will link it here ASAP .

There is no real correlation between die size increase and fixed function as the latter eats relatively very low die space , more likely than not the higher die size in Turing is explained by the fact that it has more SMs .

This is further backed up by GA100 which has increased dies size compared to GV100 ( 826mm^2 vs 815mm^2 ) but significantly lower TensorCore count ( 432 vs 640 ) . So it is pretty obvious that fixed function hardware is not responsible for the die size expansion !



This was exactly my point , the only tangible argument that justifies higher prices for Turing ( other than the increased silicon size ) is because the lack of competition allows them to do so .
How do these huge tensor cores do not take up space and increase the die size ? Maybe this will help to understand the relationship between die size, yields and GPU cost.

nvidia/comments/99r2x3caly-technologies.com/die-yield-calculator/

Posted on Reply
#45
MuhammedAbdo
Vya Domus
Very unimpressive FP32 and FP64 performance, I was way off in my estimations. Again, it's a case of optimizing for way too many things. So much silicon is dedicated to non traditional performance metrics that I wonder if it makes sense trying to shove everything in one package.
GA 100 is 20X faster than V100 in AI workloads and 2.5X in FP64 workloads, that's a generational leap like no other. This is an AI optimized chip, it has no RT cores, no encoders and no display connectors, it's focus is mainly on AI training and inference, for which it provides stellar performance that crushes any hope of competition in the near future. And you are comparing regular crap like FP32 and FP64?

Alright A100 provides 156 TF FP32 compared to only 15 TF in V100. That alone is 10X increase in FP32 compute power without the need to change any code. They can extend that lead to 20X through sparse network optimizations to 312 TF of FP32 without code change.

In FP16 the increase is also 2.5X in non optimzied code, and 6X in optimized code, same for INT8 and INT4 numbers, so A100 is really several orders of magnitude faster than V100 in any AI workload.



Also the 400w of power consumption is nothing relative to the size of this monster, you have 40GB of HBM2, loads of NVLink connections, loads of tensor cores that take up die area, heat and power, the chip is also cut down (which means lost power consumption), also the trend in data centers and AI is to open power consumption up to allow for more comfortable performance, V100 reached 350W in it's second iteration and 450W in its third iteration.

You seem to lack any ounce of data center experience, so I just suggest you stick to the of analysis consumer GPUs. This isn't your area.
Posted on Reply
#46
Vya Domus
MuhammedAbdo
You seem to lack any ounce of data center experience, so I just suggest you stick to the of analysis consumer GPUs.
I'll stick with whatever the hell I want, thanks. You, copy pasting boiler plate from Nvidia's website can be considered anything but an "analysis". What are you, a sales man ? You're barking at the wrong tree buddy.
MuhammedAbdo
GA 100 is 20X faster than V100 in AI workloads and 2.5X in FP64 workloads
It turns out I overestimated your ability to copy paste information, you can't even do that :



9.7 / 7.8 = 1.24X (FP64)

Or maybe Jensen did a good job deceiving the less tech literate with their fine print by mixing together FP64 with FP64 TF.

Nice paint skills by the way.
Posted on Reply
#47
Bytales
Mark Little
Figuring out how they get 40 GB from 6 HBM stacks is a little confusing.
They cheaped out, instead of offering 6 modules of 8 gb, for a total of 48 gb, they wen for higher margins. Will offer a better improved version with full 48 gb memory, 25 mhz more, on core and memory, for 5000 dollar more. Dunn worry about it.

Is so typical of nvidia.
Mark Little
Figuring out how they get 40 GB from 6 HBM stacks is a little confusing.
They cheaped out, instead of offering 6 modules of 8 gb, for a total of 48 gb, they wen for higher margins. Will offer a better improved version with full 48 gb memory, 25 mhz more, on core and memory, for 5000 dollar more. Dunn worry about it.

Is so typical of nvidia.
Posted on Reply
#48
MuhammedAbdo
Vya Domus
9.7 / 7.8 = 1.24X (FP64)

Or maybe Jensen did a good job deceiving the less tech literate with their fine print.
Hey genius, I already provided you with a chart explaining all the metrics, good to know you can't read.

FP64 from Tensor cores is 19.5TF. Which is a 2.5X increase over V100. FP64 from CUDA cores is 9.7TF. If you can use both at the same time you will get about 30TF of FP64 for AI actually.
Vya Domus
You, copy pasting boiler plate from Nvidia's website can be considered anything but an "analysis"
It's much more meaningful than the ignorant job you did, analysing regular FP32/FP64 in an AI GPU. Talk about an extreme case of stuff that are way over your head.
Posted on Reply
#49
Vya Domus
MuhammedAbdo
Hey genius, I already provided you with a chart explaining all the metrics, good to know you can't read.

FP64 from Tensor cores is 19.5TF. Which is a 2.5X increase over V100. FP64 from CUDA cores is 9.7TF. If you can use both at the same time you will get about 30TF of FP64 for AI actually.
You're so cute when you try to explain your utter lack of understanding about these metrics.

You wrote "FP64 workloads", you genius. That's pure FP64 not tensor ops, you're clueless and stubborn.
Posted on Reply
#50
MuhammedAbdo
Vya Domus
You wrote "FP64 workloads", you genius. That's pure FP64 not tensor ops, you're clueless and stubborn.
If you spent half a minute reading about AI stuff you wouldn't say that, the new FP64 tensor ops apply to any FP64 workload, the user don't even have to change the code. Go spend sometime reading the Ampere whitepaper and deepdives and then come back when you have a better understanding of things.

Sheesh.
Posted on Reply
Add your own comment