
NVIDIA Announces Double-Precision (FP64) Tensor Cores with "Ampere"

btarunr

Editor & Senior Moderator
Staff member
Joined
Oct 9, 2007
Messages
46,390 (7.67/day)
Location
Hyderabad, India
What you can see, you can understand. Simulations help us understand the mysteries of black holes and see how a protein spike on the coronavirus causes COVID-19. They also let designers create everything from sleek cars to jet engines. But simulations are also among the most demanding computer applications on the planet because they require lots of the most advanced math. Simulations make numeric models visual with calculations that use a double-precision floating-point format called FP64. Each number in the format takes up 64 bits inside a computer, making it one of the most computationally intensive of the many math formats today's GPUs support.
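As a rough illustration of why the 64-bit format matters, the sketch below uses only Python's standard library to compare FP64 with the 32-bit single-precision format. The helper name to_fp32 is made up for this example and is not part of any NVIDIA library.

```python
import struct

def to_fp32(x: float) -> float:
    """Round a Python float (which is FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# Storage size of each format, in bits.
fp64_bits = struct.calcsize("<d") * 8
fp32_bits = struct.calcsize("<f") * 8

# A tiny increment that FP64 can still represent next to 1.0 ...
small_step_fp64 = 1.0 + 1e-10
# ... but that is rounded away once the value is squeezed into FP32.
small_step_fp32 = to_fp32(small_step_fp64)

print(fp64_bits, fp32_bits)
print(small_step_fp64 != 1.0, small_step_fp32 == 1.0)
```

In a long-running simulation, increments this small are added millions of times, which is one reason many codes stay in FP64 throughout.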

As the next big step in our efforts to accelerate high performance computing, the NVIDIA Ampere architecture defines third-generation Tensor Cores that accelerate FP64 math by 2.5x compared to last-generation GPUs. That means simulations that kept researchers and designers waiting overnight can be viewed in a few hours when run on the latest A100 GPUs.



Science Puts AI in the Loop
The speed gains open a door for combining AI with simulations and experiments, creating a positive-feedback loop that saves time.

First, a simulation creates a dataset that trains an AI model. Then the AI and simulation models run together, feeding off each other's strengths until the AI model is ready to deliver real-time results through inference. The trained AI model also can take in data from an experiment or sensor, further refining its insights.

Using this technique, AI can define a few areas of interest for conducting high-resolution simulations. By narrowing the field, AI can slash by orders of magnitude the need for thousands of time-consuming simulations. And the simulations that need to be run will run 2.5x faster on an A100 GPU.
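The loop described above can be sketched in a few lines of Python. Everything here is a toy stand-in: simulate is a hypothetical cheap function playing the role of an expensive FP64 simulation, and the "AI model" is plain piecewise-linear interpolation rather than a trained neural network.

```python
import math

def simulate(x: float) -> float:
    """Hypothetical stand-in for an expensive high-resolution simulation."""
    return math.exp(-((x - 0.62) ** 2) / 0.005)

def train_surrogate(samples):
    """Toy 'AI model': piecewise-linear interpolation over the samples."""
    xs, ys = zip(*samples)

    def predict(x: float) -> float:
        for i in range(len(xs) - 1):
            if xs[i] <= x <= xs[i + 1]:
                t = (x - xs[i]) / (xs[i + 1] - xs[i])
                return ys[i] + t * (ys[i + 1] - ys[i])
        raise ValueError("x is outside the sampled range")

    return predict

# Step 1: a coarse batch of simulations creates the training dataset.
coarse_xs = [i / 20 for i in range(21)]
dataset = [(x, simulate(x)) for x in coarse_xs]
surrogate = train_surrogate(dataset)

# Step 2: the cheap surrogate screens a fine grid for areas of interest.
fine_xs = [i / 1000 for i in range(1001)]
interesting = [x for x in fine_xs if surrogate(x) > 0.5]

# Step 3: only the flagged points get the expensive simulation.
refined = {x: simulate(x) for x in interesting}
best_x = max(refined, key=refined.get)

simulations_run = len(coarse_xs) + len(interesting)
simulations_avoided = len(fine_xs) - simulations_run
print(best_x, simulations_run, simulations_avoided)
```

The savings in this one-dimensional toy are modest. The orders-of-magnitude reduction mentioned above would come from much larger search spaces, where screening with a cheap model rules out most candidate runs.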

With FP64 and other new features, the A100 GPUs based on the NVIDIA Ampere architecture become a flexible platform for simulations, as well as AI inference and training — the entire workflow for modern HPC. That capability will drive developers to migrate simulation codes to the A100.

Users can call new CUDA-X libraries to access FP64 acceleration in the A100. Under the hood, these GPUs are packed with third-generation Tensor Cores that support DMMA, a new mode that accelerates double-precision matrix multiply-accumulate operations.

Accelerating Matrix Math

A single DMMA job uses one computer instruction to replace eight traditional FP64 instructions. As a result, the A100 crunches FP64 math faster than other chips with less work, saving not only time and power but precious memory and I/O bandwidth as well.

We refer to this new capability as Double-Precision Tensor Cores. It delivers the power of Tensor Cores to HPC applications, accelerating matrix math in full FP64 precision.

Beyond simulations, HPC apps called iterative solvers — algorithms with repetitive matrix-math calculations — will benefit from this new capability. These apps include a wide range of jobs in earth science, fluid dynamics, healthcare, material science and nuclear energy as well as oil and gas exploration.
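For readers unfamiliar with iterative solvers, here is a minimal sketch of one classic example, the Jacobi method, in plain Python. It is not taken from any NVIDIA library. The 3x3 system and the tolerance are arbitrary choices for illustration, and a production solver would run the matrix products on the GPU rather than in Python loops.

```python
def jacobi(a, b, tol=1e-12, max_iters=1000):
    """Solve a x = b by Jacobi iteration; a must be diagonally dominant."""
    n = len(b)
    x = [0.0] * n
    for iteration in range(1, max_iters + 1):
        # Each sweep is essentially one matrix-vector product in FP64.
        x_new = [
            (b[i] - sum(a[i][j] * x[j] for j in range(n) if j != i)) / a[i][i]
            for i in range(n)
        ]
        change = max(abs(x_new[i] - x[i]) for i in range(n))
        x = x_new
        if change < tol:
            return x, iteration
    raise RuntimeError("Jacobi iteration did not converge")

# Small diagonally dominant test system whose exact solution is [1, 2, 3].
a = [[4.0, -1.0, 0.0],
     [-1.0, 4.0, -1.0],
     [0.0, -1.0, 4.0]]
b = [2.0, 4.0, 10.0]

solution, iterations = jacobi(a, b)
print(solution, iterations)
```

The same matrix math repeats on every sweep until the answer stops changing, which is why faster FP64 matrix hardware pays off directly for this class of application.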

To serve the world's most demanding applications, Double-Precision Tensor Cores arrive inside the largest and most powerful GPU we've ever made. The A100 also packs more memory and bandwidth than any GPU on the planet.

The third-generation Tensor Cores in the NVIDIA Ampere architecture are beefier than prior versions. They support a larger matrix size — 8x8x4, compared to 4x4x4 for Volta — that lets users tackle tougher problems.
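To make the matrix shapes concrete, the sketch below computes one multiply-accumulate tile, D = A x B + C, in plain Python and counts the multiply-adds involved. It models only the arithmetic, not how the hardware schedules it. The 32-thread warp size is a general property of NVIDIA GPUs rather than something stated in this post, and linking it to the "eight instructions" figure is one plausible reading, not a confirmed explanation.

```python
def mma(a, b, c):
    """Return D = A x B + C for matrices stored as lists of rows."""
    m, k, n = len(a), len(b), len(b[0])
    return [
        [c[i][j] + sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
        for i in range(m)
    ]

M, N, K = 8, 8, 4  # the 8x8x4 shape quoted for Ampere

# Illustrative inputs: A is 8x4, B is 4x8, C is 8x8.
a = [[float(i + p) for p in range(K)] for i in range(M)]
b = [[1.0 if p == j % K else 0.0 for j in range(N)] for p in range(K)]
c = [[0.5] * N for _ in range(M)]

d = mma(a, b, c)  # 8x8 result tile

fma_ampere = M * N * K   # multiply-adds in one 8x8x4 tile
fma_volta = 4 * 4 * 4    # multiply-adds in one 4x4x4 tile

WARP_SIZE = 32           # assumption: threads cooperating on one tile
fma_per_thread = fma_ampere // WARP_SIZE

print(fma_ampere, fma_volta, fma_per_thread)
```

Under these assumptions one 8x8x4 tile covers 256 multiply-adds, four times the 64 of a 4x4x4 tile, and spreading them over 32 threads gives eight per thread.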

That's one reason why an A100 with a total of 432 Tensor Cores delivers up to 19.5 FP64 TFLOPS, more than double the performance of a Volta V100.
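The headline numbers can be cross-checked with a little arithmetic. The sketch below uses only the 19.5 TFLOPS and 2.5x figures quoted in this post; the 10-hour "overnight" job is a hypothetical example, not a measured workload.

```python
A100_FP64_TFLOPS = 19.5  # peak figure quoted in the post
SPEEDUP = 2.5            # generational FP64 speedup quoted in the post

# Previous-generation peak implied by the two quoted figures.
implied_previous_tflops = A100_FP64_TFLOPS / SPEEDUP

# Hypothetical overnight job, assuming it scales with peak FP64 throughput.
overnight_hours = 10.0
accelerated_hours = overnight_hours / SPEEDUP

print(implied_previous_tflops, accelerated_hours)
```

This works out to an implied baseline of 7.8 TFLOPS and a 4-hour run for the hypothetical job. Real applications rarely track peak throughput exactly, so treat the second number as a best case.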

Where to Go to Learn More

To get the big picture on the role of FP64 in our latest GPUs, watch the keynote with NVIDIA founder and CEO Jensen Huang. To learn more, register for the webinar or read a detailed article that takes a deep dive into the NVIDIA Ampere architecture.

Double-Precision Tensor Cores are among a battery of new capabilities in the NVIDIA Ampere architecture, driving HPC performance as well as AI training and inference to new heights. For more details, check out our blogs on:
  • Multi-Instance GPU (MIG), supporting up to 7x in GPU productivity gains.
  • TensorFloat-32 (TF32), a format that speeds up AI training and certain HPC jobs by up to 20x.
  • Our support for sparsity, accelerating math throughput 2x for AI inference.
  • Or, see the web page describing the A100 GPU.
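As a side note on the TF32 item above, the sketch below approximates what the format keeps of a number. It relies on details not given in this post: TF32 is generally described as keeping FP32's 8 exponent bits but only 10 of its 23 mantissa bits. The helper to_tf32 is a made-up name, and it truncates the extra bits for simplicity, whereas real hardware may round differently.

```python
import struct

def to_tf32(x: float) -> float:
    """Approximate TF32 by clearing the low 13 mantissa bits of an FP32."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000  # keep sign, 8 exponent bits, top 10 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# A step of 2**-10 above 1.0 survives in the 10-bit mantissa ...
kept = to_tf32(1.0 + 2.0 ** -10)
# ... while a step of 2**-11 is dropped.
dropped = to_tf32(1.0 + 2.0 ** -11)

print(kept, dropped)
```

The reduced mantissa is why TF32 targets AI training and only certain HPC jobs, while the double-precision work described in this post stays in FP64.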

 
Joined
Mar 24, 2012
Messages
528 (0.12/day)
This thing is complicated. It's definitely not like NVIDIA's previous compute cards. In the past, the spec was just simple FP16/32/64.
 
Joined
May 3, 2018
Messages
2,292 (1.05/day)
This has always been the case, so why has it taken so long to get decent double-precision hardware acceleration? In my line of work I can only use double precision; single precision is useless.
 
Joined
Apr 6, 2015
Messages
246 (0.07/day)
Location
Japan
"This has always been the case, so why has it taken so long to get decent double-precision hardware acceleration? In my line of work I can only use double precision; single precision is useless."

Just wondering, how do you find the new Radeon Pro VII? It rocks 6 TFLOPS of DP performance at a relatively low price.
 