
NVIDIA Ampere A100 Has 54 Billion Transistors, World's Largest 7nm Chip

btarunr

Editor & Senior Moderator
Staff member
Joined
Oct 9, 2007
Messages
43,067 (8.00/day)
Location
Hyderabad, India
System Name RBMK-1000
Processor AMD Ryzen 7 5700G
Motherboard ASUS ROG Strix B450-E Gaming
Cooling DeepCool Gammax L240 V2
Memory 2x 8GB G.Skill Sniper X
Video Card(s) Palit GeForce RTX 2080 SUPER GameRock
Storage Western Digital Black NVMe 512GB
Display(s) BenQ 1440p 60 Hz 27-inch
Case Corsair Carbide 100R
Audio Device(s) ASUS SupremeFX S1220A
Power Supply Cooler Master MWE Gold 650W
Mouse ASUS ROG Strix Impact
Keyboard Gamdias Hermes E2
Software Windows 11 Pro
Not long ago, Intel's Raja Koduri claimed that the Xe HP "Ponte Vecchio" silicon was the "big daddy" of Xe GPUs, and the "largest chip co-developed in India," larger than the 35 billion-transistor Xilinx VU19P FPGA co-developed in the country. It turns out that NVIDIA is in the mood for setting records. The "Ampere" A100 silicon has 54 billion transistors crammed into a single 7 nm die (not counting transistor counts of the HBM2E memory stacks).

NVIDIA claims a 20x boost in both AI inference and single-precision (FP32) performance over its "Volta" based predecessor, the Tesla V100. The chip also offers a 2.5x gain in FP64 performance over "Volta." NVIDIA has also invented a new number format for AI compute, called TF32 (tensor float 32). TF32 uses the 10-bit mantissa of FP16, and the 8-bit exponent of FP32, resulting in a new, efficient format. NVIDIA attributes its 20x performance gains over "Volta" to this. The 3rd generation tensor core introduced with Ampere supports FP64 natively. Another key design focus for NVIDIA is to leverage the "sparsity" phenomenon in neural nets, to reduce their size, and improve performance.



A new HPC-relevant feature being introduced with A100 is multi-instance GPU, which allows multiple complex applications to run on the same GPU without sharing resources such as memory bandwidth. The user can now partition a physical A100 into up to 7 virtual GPUs of varying specs, and ensure that an application running on one of the vGPUs doesn't eat into the resources of the others. As for real-world performance, NVIDIA claims that the A100 beat the V100 by a factor of 7 at BERT.

The DGX-A100 system crams 5 petaflops of compute performance onto a single "graphics card" (a single node), and starts at $199,000 apiece.

 
Joined
Dec 22, 2011
Messages
3,613 (0.94/day)
Processor AMD Ryzen 7 3700X
Motherboard MSI MAG B550 TOMAHAWK
Cooling AMD Wraith Prism
Memory Team Group Dark Pro 8Pack Edition 3600Mhz CL16
Video Card(s) Palit GTX 980 Ti Super JetStream
Storage Kingston A2000 1TB + Seagate HDD workhorse
Display(s) Samsung 50" QN94A Neo QLED + 50" QN90A Neo QLED
Case Antec 1200
Audio Device(s) Don't be silly
Power Supply XFX 650W Core
Mouse Razer Deathadder Chroma
Keyboard Logitech UltraX
Software Windows 11
Benchmark Scores Epic
Wowzers, this bad boy is going to get a lot of love in HPC market.
 
Joined
Jan 8, 2017
Messages
7,190 (3.59/day)
System Name Good enough
Processor AMD Ryzen R7 1700X - 4.0 Ghz / 1.350V
Motherboard ASRock B450M Pro4
Cooling Deepcool Gammaxx L240 V2
Memory 16GB - Corsair Vengeance LPX - 3333 Mhz CL16
Video Card(s) OEM Dell GTX 1080 with Kraken G12 + Water 3.0 Performer C
Storage 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) 4K Samsung TV
Case Deepcool Matrexx 70
Power Supply GPS-750C
NVIDIA has also invented a new number format for AI compute, called TF32 (tensor float 32). TF32 uses the 10-bit mantissa of FP16, and the 8-bit exponent of FP32

Wouldn't that be ... TF19 ? 10 + 8 + 1 (sign) bits
 
Joined
Jul 10, 2015
Messages
685 (0.27/day)
Location
Sokovia
System Name Alienation from family
Processor i7 7700k
Motherboard Hero VIII
Cooling Macho revB
Memory 16gb Hyperx
Video Card(s) Asus 1080ti Strix OC
Storage 960evo 500gb
Display(s) AOC 4k
Case Define R2 XL
Power Supply Be f*ing Quiet 600W M Gold
Mouse NoName
Keyboard NoNameless HP
Software You have nothing on me
Benchmark Scores Personal record 100m sprint: 60m
Age of Amperes!
 
Joined
Apr 26, 2008
Messages
220 (0.04/day)
System Name 3950X Workstation
Processor AMD Ryzen 9 3950X
Motherboard ASUS Crosshair VIII Impact
Cooling Cryorig C1 with Noctua NF-A12x15
Memory G.Skill F4-3600C16D-32GTZNC
Video Card(s) ASUS GTX 1650 LP OC
Storage 2 x Corsair MP510 1920GB M.2 SSD
Case Realan E-i7
Power Supply G-Unique 400W
Software Win 10 Pro
Benchmark Scores https://smallformfactor.net/forum/threads/the-saga-of-the-little-gem-continues.12877/
Joined
Feb 3, 2017
Messages
3,306 (1.67/day)
Processor R5 5600X
Motherboard ASUS ROG STRIX B550-I GAMING
Cooling Alpenföhn Black Ridge
Memory 2*16GB DDR4-2666 VLP @3800
Video Card(s) EVGA Geforce RTX 3080 XC3
Storage 1TB Samsung 970 Pro, 2TB Intel 660p
Display(s) ASUS PG279Q, Eizo EV2736W
Case Dan Cases A4-SFX
Power Supply Corsair SF600
Mouse Corsair Ironclaw Wireless RGB
Keyboard Corsair K60
VR HMD HTC Vive
54B transistors is three times as much as TU102 (18.6B, Titan RTX, RTX2080Ti). Volta has 21.1B.
 
Joined
Jan 8, 2017
Messages
7,190 (3.59/day)
So what's its exact FP32/FP64 performance?

My guess is around 30 TFLOPS FP32; this probably has about twice the shaders of V100, but probably nowhere near the clock speed.
 
Joined
Dec 25, 2019
Messages
143 (0.16/day)
System Name Mirkwood
Processor AMD RYZEN 7 3800X
Motherboard ASUS ROG Crosshair VII Hero (Wi-Fi) AM4 AMD X470
Cooling Noctua D15S with additional Noctua NF-A12x25 FLX fan
Memory G.SKILL Flare X Series CL16 3200Mhz 16GB (4 x 8GB)
Video Card(s) GIGABYTE Radeon RX 570 DirectX 12 GV-RX570GAMING-4GD 4GB
Storage Crucial MX500 M.2 2280 500GB SATA III; WD Black 1TB Performance Desktop Hard Disk Drive
Display(s) Philips 246E9QDSB 24" Frameless Monitor, Full HD IPS, 129% sRGB, 75Hz, FreeSync
Case Corsair Graphite Series 780T
Audio Device(s) Klipsch R-41PM powered monitors and SVS SB-2000 sub
Power Supply Corsair HX650
Mouse Logitech Wireless Performance Mouse MX
Keyboard Old Logitech keyboard
Software Windows 10 Pro 64Bit
Ampere to Ponte Vecchio, “who’s your daddy now?”
 
Joined
Jun 15, 2015
Messages
91 (0.04/day)
Wouldn't that be ... TF19 ? 10 + 8 + 1 (sign) bits

Yar? Yar!

Good catch! I noticed it too.

However, what's really going on is that they are creating their own alternative to BF16 (brainfloat16) and calling it TF32: keeping the 8 bits for exponent from fp32 but using only 10 bits for the fraction (precision) from fp16. This keeps the approximate range of FP32 while keeping the precision of FP16 (half precision). This is different than BF16 since BF16 only keeps 7 bits for precision. So, you can get better (or what some of my students say..."gooder") approximations with TF32 when converting back to FP32 (by padding the last 13 bits of precision on FP32 with zeroes) instead of with BF16 (where you would pad the last 16 bits with zeroes).
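The bit-padding comparison above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`truncate_mantissa`, `to_tf32`, `to_bf16`) are made up for this example, and the code simply zeroes the dropped fraction bits (round toward zero), which may not match the rounding real hardware applies.

```python
import struct

def float_to_bits(x: float) -> int:
    """Return the IEEE 754 single-precision (FP32) bit pattern of x."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_float(b: int) -> float:
    """Interpret a 32-bit integer as an FP32 value."""
    return struct.unpack("<f", struct.pack("<I", b))[0]

def truncate_mantissa(x: float, kept_bits: int) -> float:
    """Keep sign, 8 exponent bits and the top `kept_bits` of the 23-bit
    FP32 fraction; pad the remaining low bits with zeroes.
    Assumption: plain truncation, not round-to-nearest."""
    dropped = 23 - kept_bits
    mask = (0xFFFFFFFF >> dropped) << dropped
    return bits_to_float(float_to_bits(x) & mask)

def to_tf32(x: float) -> float:
    # 1 sign + 8 exponent + 10 fraction = 19 significant bits, 13 padded
    return truncate_mantissa(x, 10)

def to_bf16(x: float) -> float:
    # 1 sign + 8 exponent + 7 fraction = 16 significant bits, 16 padded
    return truncate_mantissa(x, 7)

if __name__ == "__main__":
    x = 0.1
    fp32 = bits_to_float(float_to_bits(x))  # x rounded to FP32
    for name, value in (("FP32", fp32), ("TF32", to_tf32(x)), ("BF16", to_bf16(x))):
        print(f"{name}: bits=0x{float_to_bits(value):08X} value={value!r} "
              f"abs error vs FP32={abs(value - fp32):.3e}")
```

Both formats keep the full FP32 exponent, so the range is the same; the only difference in this sketch is how many fraction bits survive.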

Does it really matter that much in the end? I'm not a professional with experience in AIs or DNNs, but I suppose that FP32 approximations from TF32 are better, and only ever so slightly slower, than FP32 approximations from BF16. What they did is somewhat clever.

But, back to the whole TF19 bit (no pun intended!): I think it's a marketing move as TF32 "sounds" better. It's really TF19 with 1 bit sign, 8 bit exponent, and 10 bit fraction but, hey, "TF32" FTW.

My guess is around 30 TFLOPS FP32; this probably has about twice the shaders of V100, but probably nowhere near the clock speed.

7FF+ supposedly clocks on par with or slightly better than 12FF. If it is indeed 8192 cores, then we'd have around 29.5 TFLOPS to 32.8 TFLOPS (1.8 GHz to 2 GHz, respectively). "TF32" could be as high as 655.4 TFLOPS, or, if one has a cool $200K lying around, you can get that monstrosity that JSH has been baking and get 8x 655.4 TFLOPS = 5.243 PFLOPS of "TF32" performance. I mean...saying PFLOPS like "Pee-Flops" is just ridiculous...
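For anyone who wants to check that arithmetic, here is a rough sketch. It assumes 2 FLOPs per core per clock (one fused multiply-add), takes the speculated 8192 cores and 1.8 to 2 GHz clocks at face value, and assumes the 655.4 figure comes from applying NVIDIA's claimed 20x to the upper FP32 estimate. The function and variable names are made up for illustration.

```python
def peak_fp32_tflops(cores: int, clock_ghz: float, flops_per_clock: int = 2) -> float:
    """Theoretical peak: cores * FLOPs per clock * clock (GHz) -> TFLOPS.
    Assumption: one FMA (2 FLOPs) per core per clock."""
    return cores * flops_per_clock * clock_ghz / 1000.0

CORES = 8192          # speculated core count, not a confirmed spec
CLAIMED_SPEEDUP = 20  # NVIDIA's marketing multiplier, applied as-is
GPUS_PER_DGX = 8

low = peak_fp32_tflops(CORES, 1.8)
high = peak_fp32_tflops(CORES, 2.0)
tf32 = high * CLAIMED_SPEEDUP
dgx_pflops = GPUS_PER_DGX * tf32 / 1000.0

if __name__ == "__main__":
    print(f"FP32 peak: {low:.1f} to {high:.1f} TFLOPS")
    print(f'"TF32" per GPU: {tf32:.1f} TFLOPS')
    print(f'"TF32" per DGX: {dgx_pflops:.3f} PFLOPS')
```

These are theoretical peaks under guessed inputs, not measured numbers.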

Aaaaaaand...I'm getting off topic.
 

Aquinus

Resident Wat-man
Joined
Jan 28, 2012
Messages
12,702 (3.33/day)
Location
Concord, NH
System Name Apollo
Processor Intel Core i9 9880H
Motherboard Some proprietary Apple thing.
Memory 64GB DDR4-2667
Video Card(s) AMD Radeon Pro 5600M, 8GB HBM2
Storage 1TB Apple NVMe, 4TB External
Display(s) Laptop @ 3072x1920 + 2x LG 5k Ultrafine TB3 displays
Case MacBook Pro (16", 2019)
Audio Device(s) AirPods Pro, Sennheiser HD 380s w/ FIIO Alpen 2, or Logitech 2.1 Speakers
Power Supply 96w Power Adapter
Mouse Logitech MX Master 3
Keyboard Logitech G915, GL Clicky
Software MacOS 12.1
Hah! Yields are probably as good as winning the lottery. Does nVidia really think that going this direction is a good idea? Huge monolithic dies are such a waste of resources because yields are abysmal. I guess that's okay when a business is willing to spend whatever it takes to have a leg up. Personally, I think that until we start seeing MCM solutions to compute scaling, I'm reluctant to believe that any gains are going to be substantial or long lasting in this market.

tl;dr: Big dies are not the answer.
 
Joined
Jun 15, 2015
Messages
91 (0.04/day)
Hah! Yields are probably as good as winning the lottery. Does nVidia really think that going this direction is a good idea? Huge monolithic dies are such a waste of resources because yields are abysmal. I guess that's okay when a business is willing to spend whatever it takes to have a leg up. Personally, I think that until we start seeing MCM solutions to compute scaling, I'm reluctant to believe that any gains are going to be substantial or long lasting in this market.

tl;dr: Big dies are not the answer.

Their [Nvidia] next architecture is Hopper and it is confirmed to be their first using MCMs. Hell, it might be the only thing confirmed about Hopper.
 
Joined
Mar 6, 2011
Messages
155 (0.04/day)
Wowzers, this bad boy is going to get a lot of love in HPC market.

You're forgetting that the big boys in HPC already know pretty much what AMD / NVIDIA / Intel have in the pipeline for the next 2 gens.

AMD have been scooping most of the biggest contracts lately, and most of the really big contracts in the last 18 months have been aiming at CDNA2 / Hopper, not Ampere or CDNA1.

That 20x figure is pure fantasy land.
 
Joined
Jan 8, 2017
Messages
7,190 (3.59/day)
That 20x figure is pure fantasy land.

I wouldn't necessarily say that. They're changing the metric; they've done it before with "gigarays". It's the good old "We're faster!" *with an asterisk*.

You're forgetting that the big boys in HPC already know pretty much what AMD / NVIDIA / Intel have in the pipeline for the next 2 gens.

That's true, and it's telling: if they still decided to go with AMD, it shows that maybe they weren't as impressed with what Nvidia was about to have.

7FF+ supposedly clocks on par with or slightly better than 12FF.

The problem isn't that it couldn't clock, it's power. V100 runs at 1400 MHz, but this has more than twice the transistors; if they want to maintain the 250 W power envelope, it's just not possible to have it run at the same clock speed.

Hah! Yields are probably as good as winning the lottery. Does nVidia really think that going this direction is a good idea? Huge monolithic dies are such a waste of resources because yields are abysmal. I guess that's okay when a business is willing to spend whatever it takes to have a leg up. Personally, I think that until we start seeing MCM solutions to compute scaling, I'm reluctant to believe that any gains are going to be substantial or long lasting in this market.

tl;dr: Big dies are not the answer.

I find this exceedingly strange too. These 800 mm^2 dies are going to be insanely expensive, and so will the actual products; I think businesses are willing to spend only up to a point. With more and more dedicated silicon out there for inferencing/training, I don't think Nvidia is in a position to ask even more money for something that people can get elsewhere for a fraction of the cost. They are optimizing their chips for too many things; some time ago that used to be an advantage, but it's slowly becoming an Achilles heel.
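To put a rough shape on the big-die cost argument, here is a sketch using the simple Poisson yield model, Y = exp(-A * D0). Everything numeric in it is an assumption: the defect density of 0.1 defects/cm^2 is a placeholder, not a known figure for this 7 nm process, and the 200 mm^2 comparison die is hypothetical. The model also ignores salvaging partially defective dies by disabling units.

```python
import math

def poisson_yield(die_area_mm2: float, defects_per_cm2: float) -> float:
    """Fraction of defect-free dies under the Poisson model Y = exp(-A * D0).
    die_area_mm2 is converted to cm^2 before use."""
    area_cm2 = die_area_mm2 / 100.0
    return math.exp(-area_cm2 * defects_per_cm2)

D0 = 0.1  # hypothetical defect density (defects per cm^2), illustration only

if __name__ == "__main__":
    for area in (200.0, 400.0, 800.0):  # 800 mm^2 as quoted in the post above
        print(f"{area:5.0f} mm^2 die: {poisson_yield(area, D0):.1%} defect-free")
```

Under this model the defect-free fraction falls off exponentially with area, which is the intuition behind the MCM argument; how bad it actually is depends entirely on the real defect density.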
 
Joined
Mar 18, 2008
Messages
5,716 (1.09/day)
System Name Virtual Reality / Bioinformatics
Processor Undead CPU
Motherboard Undead TUF X99
Cooling Noctua NH-D15
Memory GSkill 128GB DDR4-3000
Video Card(s) EVGA RTX 3090 FTW3 Ultra
Storage Samsung 960 Pro 1TB + 860 EVO 2TB + WD Black 5TB
Display(s) 32'' 4K Dell
Case Fractal Design R5
Audio Device(s) BOSE 2.0
Power Supply Seasonic 850watt
Mouse Logitech Master MX
Keyboard Corsair K70 Cherry MX Blue
VR HMD HTC Vive + Oculus Quest 2
Software Windows 10 P
Wowzers, this bad boy is going to get a lot of love in HPC market.

Exactly.

Also, I find it really funny that so many "home grown HPC experts" suddenly show up here claiming CDNA2 or whatever made-up crap.

HPC is all about the ecosystem, where both software and hardware need to be in excellent shape. Pay attention to how much Nvidia's CEO acknowledged the software developers. Without a thriving software ecosystem, the hardware by itself is nothing. In the field of AI, nobody is currently able to compete with Nvidia's software and hardware integration.

Computing hardware is only half (Hell, actually 1/3) of the deal. There is software which is a HUGE part, as well as inter-connecting hardware.
 

Yosar

New Member
Joined
Apr 16, 2019
Messages
4 (0.00/day)
HPC is all about the ecosystem, where both software and hardware need to be in excellent shape. Pay attention to how much Nvidia's CEO acknowledged the software developers. Without a thriving software ecosystem, the hardware by itself is nothing. In the field of AI, nobody is currently able to compete with Nvidia's software and hardware integration.

Computing hardware is only half (Hell, actually 1/3) of the deal. There is software which is a HUGE part, as well as inter-connecting hardware.

If you pay millions of dollars for hardware, you also pay for _specialized_ software for this hardware. You surely don't buy hardware for 200 000 dollars (for 1 card) to run 3DS MAX Studio. Your eco-system doesn't matter. This whole hardware is your eco-system. The deal is for the eco-system.
 
Joined
Aug 20, 2007
Messages
17,875 (3.29/day)
System Name Pioneer
Processor Ryzen R9 5950X
Motherboard GIGABYTE X570 Aorus Elite
Cooling Noctua NH-D15 + A whole lotta Sunon and Corsair Maglev blower fans...
Memory Crucial Ballistix 64GB (4 x 16GB) @ DDR4-3600 (Micron E-Die, dual rank sticks)
Video Card(s) EVGA GeForce RTX 3090 Ti FTW3
Storage 2x Crucial P5 Plus 2TB PCIe 4.0 NVMe SSDs
Display(s) 55" LG 55" B9 OLED 4K Display
Case Thermaltake Core X31
Audio Device(s) TOSLINK->Schiit Modi MB->Asgard 2 DAC Amp->AKG Pro K712 Headphones or HDMI->B9 OLED
Power Supply EVGA SuperNova T2 Titanium 850W
Mouse Razer Deathadder v2
Keyboard WASD CODE Mechanical KB w/ Cherry MX Green switches
Software Windows 11 Enterprise (yes, it's legit)
Hah! Yields are probably as good as winning the lottery. Does nVidia really think that going this direction is a good idea? Huge monolithic dies are such a waste of resources because yields are abysmal. I guess that's okay when a business is willing to spend whatever it takes to have a leg up. Personally, I think that until we start seeing MCM solutions to compute scaling, I'm reluctant to believe that any gains are going to be substantial or long lasting in this market.

tl;dr: Big dies are not the answer.

They are when the customer is willing to pay top dollar, like HPC.
 

Aquinus

Resident Wat-man
Joined
Jan 28, 2012
Messages
12,702 (3.33/day)
They are when the customer is willing to pay top dollar, like HPC.
I think that depends on what the alternatives cost and how easy or hard it is to build the software to work on any of the HPC solutions a business might be considering. None of this changes the fact though that a massive die like this is going to put a huge premium on the hardware, which means less money for the other things that also matter. What good is great hardware if you have to skimp on the software side of things?
 