
Intel Ponte Vecchio Early Silicon Puts Out 45 TFLOPs FP32 at 1.37 GHz, Already Beats NVIDIA A100 and AMD MI100

NVidia A100 (the $10,000 server card) is only 19.5 FP32 TFlops: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet.pdf

And only 9.7 FP64 TFlops.

The Tensor-flops are an inflated number that only deep-learning folk care about (and apparently not all deep-learning folk are even using those tensor cores). Achieving ~20 FP32 TFlops on general-purpose code is basically the best you can do today (the MI100 is a little bit faster, but without as much of that NVLink connectivity going on).
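For what it's worth, the datasheet figures fall straight out of the peak-rate arithmetic (assuming the published GA100 configuration of 108 SMs, 64 FP32 / 32 FP64 lanes per SM, and a ~1.41 GHz boost clock): 108 × 64 × 2 FLOPs per FMA × 1.41 GHz ≈ 19.5 TFLOPS FP32, and 108 × 32 × 2 × 1.41 GHz ≈ 9.7 TFLOPS FP64. Those are peak FMA rates, not what real kernels sustain.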

So 45 TFlops of FP32 is pretty huge by today's standards. However, Intel is going to be competing against next-generation products, not the A100. I'm sure NVidia is going to grow, but 45 TFlops per card is probably going to be competitive.

Nope, the RTX 3090 has ~36 TFLOPS of FP32; Tensor TFLOPS is something like INT4 or INT8. Obviously the A100 is designed for a different type of workload that doesn't depend on FP32 or FP64 so much. The workstation Ampere A6000 has 40 TFLOPS of FP32. I guess NVidia doesn't care about FP64 performance anymore after the Titan X Maxwell.
 
Nice to see that Intel's finest wine only needs watercooling... but hey, any news of new GPU cards is good news :)
 
Obviously the A100 is designed for a different type of workload that doesn't depend on FP32 or FP64 so much.


The A100 beats the 3090 in pretty much every benchmark across the board by a significant margin. The A100 is the top-of-the-line, $10,000+ NVidia GPU for serious work. It's NVidia's best card.

I don't know where you're getting the bullshit numbers that the 3090 is faster than an A100, but... it's just not true. Under any reasonable benchmark, like Linpack, the A100 does something like 10-ish TFlops double-precision and 20-ish TFlops single-precision.
 
The A100 beats the 3090 in pretty much every benchmark across the board by a significant margin. The A100 is the top-of-the-line, $10,000+ NVidia GPU for serious work. It's NVidia's best card.

Yeah, you just proved my point: the A100 is designed for deep learning, not just FP32/FP64 performance.
 
Yeah, you just proved my point: the A100 is designed for deep learning, not just FP32/FP64 performance.

The A100 has 108 SMs (128 on the full GA100 die, but it seems like NVidia expects some area of the die to be defective); the 3090 only has 82 SMs.

The A100 is in a completely different class than the 3090, in FP32 performance even. It's not even close before you factor in the 2 TB/s of 80 GB HBM2e sitting on the package.

------

Note that the 3090 is lol 1/64th-rate FP64. It's terrible at scientific compute. The A100 is full-speed (well, 1/2 speed, ~10 TFlops) double precision.
 
The A100 has 108 SMs (128 on the full GA100 die, but it seems like NVidia expects some area of the die to be defective); the 3090 only has 82 SMs.

The A100 is in a completely different class than the 3090, in FP32 performance even. It's not even close before you factor in the 2 TB/s of 80 GB HBM2e sitting on the package.

------

Note that the 3090 is lol 1/64th-rate FP64. It's terrible at scientific compute. The A100 is full-speed (well, 1/2 speed, ~10 TFlops) double precision.
[Attached: gpgpu.png — GPGPU benchmark results]

My overclocked 3090 is getting 40.6 TFLOPS of single precision (FP32) and only 660 GFLOPS of FP64.
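That FP64 figure lines up with the 1/64 rate mentioned above (assuming GA102's commonly cited 1:64 FP64:FP32 ratio): 40.6 TFLOPS / 64 ≈ 0.63 TFLOPS, i.e. roughly the ~660 GFLOPS the benchmark reports.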

The A100 SM is different from the GA102 SM, as the A100 is more focused on Tensor performance.
[Attached: GA102 SM vs A100 SM block diagrams (ga102.png, a100.png)]
 
Nope, the RTX 3090 has ~36 TFLOPS of FP32; Tensor TFLOPS is something like INT4 or INT8. Obviously the A100 is designed for a different type of workload that doesn't depend on FP32 or FP64 so much. The workstation Ampere A6000 has 40 TFLOPS of FP32. I guess NVidia doesn't care about FP64 performance anymore after the Titan X Maxwell.
Please stop writing fake info! If the 3090 really had 36 TF, it would have to deliver more than double the FPS in games compared to the RX 6800 (16.2 TF). That is not the case. Please stop spreading green fakery!
 
45 TFLOPs is a lot but hardly impressive for something the size of a slice of ham.

Please stop writing fake info! If the 3090 really had 36 TF, it would have to deliver more than double the FPS in games compared to the RX 6800 (16.2 TF).

It can peak at 36 TF, but in practice, for real-world workloads, that number is probably closer to half of that because of the way each SM works, which is why it's not much faster than a 6900 XT.
 
Please stop writing fake info! If the 3090 really had 36 TF, it would have to deliver more than double the FPS in games compared to the RX 6800 (16.2 TF). That is not the case. Please stop spreading green fakery!
To be fair, an RTX 3090 would get those numbers on the same clpeak benchmark Intel is using. So the numbers are apples to apples; the product audiences are just wholly different, so comparing them is largely academic.
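For anyone wondering what clpeak-style peak numbers actually measure, here's a rough CUDA sketch of the same idea (not clpeak's actual code; the grid size, iteration count, and 8-accumulator unroll are arbitrary choices): each thread grinds through independent FMA chains and the host divides total FLOPs by elapsed time.

Code:
// Rough sketch of a clpeak-style FP32 peak-throughput microbenchmark.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fp32_fma_kernel(float seed, float *out, int iters)
{
    // Eight independent accumulators so the FMA pipelines stay busy.
    float a0 = seed + threadIdx.x, a1 = a0 + 1.f, a2 = a0 + 2.f, a3 = a0 + 3.f;
    float a4 = a0 + 4.f, a5 = a0 + 5.f, a6 = a0 + 6.f, a7 = a0 + 7.f;
    const float m = 0.9999f, c = 0.0001f;

    for (int i = 0; i < iters; ++i) {
        a0 = fmaf(a0, m, c); a1 = fmaf(a1, m, c);
        a2 = fmaf(a2, m, c); a3 = fmaf(a3, m, c);
        a4 = fmaf(a4, m, c); a5 = fmaf(a5, m, c);
        a6 = fmaf(a6, m, c); a7 = fmaf(a7, m, c);
    }
    // Write the result so the compiler cannot discard the loop.
    out[blockIdx.x * blockDim.x + threadIdx.x] =
        a0 + a1 + a2 + a3 + a4 + a5 + a6 + a7;
}

int main()
{
    const int blocks = 4096, threads = 256, iters = 100000;
    float *d_out;
    cudaMalloc(&d_out, blocks * threads * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);

    cudaEventRecord(start);
    fp32_fma_kernel<<<blocks, threads>>>(1.0f, d_out, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);

    // 8 FMAs per iteration per thread, 2 FLOPs per FMA.
    double flops = 2.0 * 8.0 * (double)iters * blocks * threads;
    printf("~%.2f TFLOPS FP32\n", flops / (ms * 1e-3) / 1e12);

    cudaFree(d_out);
    return 0;
}

Because the kernel is pure math with no memory traffic, the result lands near the paper peak, which is exactly why these numbers say little about game performance.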
 
True competition will be Hopper and MI200.
I mean, it's a marvel of engineering and 45 TFlops is a lot, but the A100 was announced in May 2020. Beating a 2-year-old card when this finally gets released sometime in 2022 sounds somewhat less impressive...
 
Off the record, coming from RedGamingTech's Ngreedia insider leaks... Jensen & Co are really worried about Intel, as its investment suggests it is serious about competing on the GPU front in the long run. Allegedly Intel now invests several times more in dGPU R&D than NVIDIA and AMD do combined. All this money should show some decent results, if not yet in Arc then maybe down the road in next-gen products.
 
Off the record, coming from RedGamingTech's Ngreedia insider leaks... Jensen & Co are really worried about Intel, as its investment suggests it is serious about competing on the GPU front in the long run. Allegedly Intel now invests several times more in dGPU R&D than NVIDIA and AMD do combined. All this money should show some decent results, if not yet in Arc then maybe down the road in next-gen products.

It's like the old joke goes: what an engineer can accomplish in 1 month, two engineers can accomplish in 2 months.

The primary constraint of engineering is time. Money allows for a bigger product to be built, but not necessarily a better product. Given the timeline here, I'm sure the Intel GPU won't be the best. Something weird will happen and not work as expected.

What I'm looking for is a "good first step", not necessarily the best, but a product that shows that Intel knows why NVidia and AMD GPUs have done so well in supercomputing circles. Maybe generation 2 or 3 will be actually competitive.
 
Specs and PowerPoint slides are great, but let's see what happens when Intel actually ships the silicon...
 
The A100 SM is different from the GA102 SM, as the A100 is more focused on Tensor performance.

Point. I accept your benchmark, but note that it's somewhat impractical. I forgot that the consumer Ampere (GA102) cores can issue two FP32 instructions per clock tick (similar to the Pentium's dual-pipeline design way back in the day). That offers a paper gain of 2x peak FP32 flops, but in practice most code can't take advantage of it entirely.
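To put numbers on that paper gain (assuming the 3090's 82 SMs, 128 FP32 lanes per SM, and a ~1.7 GHz boost clock): 82 × 128 × 2 FLOPs per FMA × 1.7 GHz ≈ 35.7 TFLOPS. But only 64 of those 128 lanes per SM are dedicated FP32; the other 64 are shared with INT32, so every integer instruction (addressing, loop counters, compares) eats into that doubled figure in real kernels.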

Though in practice I still assert that the A100 is superior (again, 80 GB of 2 TB/s HBM2e RAM, 10 TFlops of double-precision performance, etc.). Almost any GPU programmer would rather have twice the SMs than double the resources spent per SM.

In any case, the A100 is still king of NVidia's lineup. It's a few years old, however.
 
AMD's Instinct MI200 'Aldebaran' is already being shipped to customers. LOL. The MI200 does more than 50 TFlops of FP32.

45 TFLOPs is a lot but hardly impressive for something the size of a slice of ham.



It can peak at 36 TF, but in practice, for real-world workloads, that number is probably closer to half of that because of the way each SM works, which is why it's not much faster than a 6900 XT.
The ray-tracing denoise compute shader runs on the CUDA cores, hence the RTX 3090's TFLOPS advantage shows up there.

DirectStorage decompression on PC is done via the compute shader (GPGPU) path.

Mesh shaders (similar to compute shaders) also run on the CUDA cores, hence the RTX 3090's advantage shows up there as well.

Please stop writing fake info! If the 3090 really had 36 TF, it would have to deliver more than double the FPS in games compared to the RX 6800 (16.2 TF). That is not the case. Please stop spreading green fakery!
The 3090's TFLOPS are real via the compute shader (GPGPU) path; the pixel shader path is bottlenecked by the raster hardware.

Nope, the RTX 3090 has ~36 TFLOPS of FP32; Tensor TFLOPS is something like INT4 or INT8. Obviously the A100 is designed for a different type of workload that doesn't depend on FP32 or FP64 so much. The workstation Ampere A6000 has 40 TFLOPS of FP32. I guess NVidia doesn't care about FP64 performance anymore after the Titan X Maxwell.
Tensor cores are for packed math (INT4, INT8, and FP16 with FP32 accumulate) and are less flexible than CUDA cores.
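For reference, this is roughly what "FP16 in, FP32 out" looks like through CUDA's wmma API: a minimal single-warp sketch (the 16x16x16 tile, leading dimensions, and zero-filled inputs are just placeholder choices; needs sm_70 or newer to compile).

Code:
// Minimal single-warp wmma kernel: C(16x16, FP32) = A(16x16, FP16) * B(16x16, FP16)
#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

__global__ void wmma_16x16x16(const half *A, const half *B, float *C)
{
    // Tile fragments: A and B hold FP16 inputs, the accumulator holds FP32 results.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);               // C = 0
    wmma::load_matrix_sync(a_frag, A, 16);           // leading dimension 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // C += A * B on the tensor cores
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}

int main()
{
    half *dA, *dB; float *dC;
    cudaMalloc(&dA, 16 * 16 * sizeof(half));
    cudaMalloc(&dB, 16 * 16 * sizeof(half));
    cudaMalloc(&dC, 16 * 16 * sizeof(float));
    // Real input data is omitted; zero-filled buffers are enough to exercise the path.
    cudaMemset(dA, 0, 16 * 16 * sizeof(half));
    cudaMemset(dB, 0, 16 * 16 * sizeof(half));

    wmma_16x16x16<<<1, 32>>>(dA, dB, dC);  // one full warp cooperatively owns the tile
    cudaDeviceSynchronize();

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}

The warp cooperatively owns the whole fixed-shape tile, and only a handful of data types are supported, which is basically the "less flexible than CUDA cores" part.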
 
45 TFLOPs is a lot but hardly impressive for something the size of a slice of ham.



It can peak at 36 TF, but in practice, for real-world workloads, that number is probably closer to half of that because of the way each SM works, which is why it's not much faster than a 6900 XT.

Well put :D Raja definitely has the biggest one now.
 
While it's no doubt impressive that Intel is actually looking competitive (considering they literally started from nothing a few years ago), the PR spin in this needs some correcting. "Our yet-to-be-launched product at 600W beats competitors' already launched 300W products by ~1.9-2.1x" is hardly that impressive, even if this is at preproduction clocks, especially when said competitors are likely to launch significantly faster solutions in the same time frame. If clocks go up, power does too, and 600W is already insane for a single package, mostly negating the compute density advantage due to the size and complexity of cooling needed.

Still, competition is always good. It'll be interesting to see how this plays out.
 
While it's no doubt impressive that Intel is actually looking competitive (considering they literally started from nothing a few years ago), the PR spin in this needs some correcting. "Our yet-to-be-launched product at 600W beats competitors' already launched 300W products by ~1.9-2.1x" is hardly that impressive, even if this is at preproduction clocks, especially when said competitors are likely to launch significantly faster solutions in the same time frame. If clocks go up, power does too, and 600W is already insane for a single package, mostly negating the compute density advantage due to the size and complexity of cooling needed.

Still, competition is always good. It'll be interesting to see how this plays out.
I think that's the point. Intel just releasing a competitive product is good news for the market. Trying to beat everyone in every metric on their first go is a recipe for unending delays. Worry about power on the next iteration.

I'm hoping Intel DG2 delivers 3070-level performance at a 15% lower price, even if it uses more power. It will sell well if they can build reasonable supply and the drivers are stable and updated regularly.
 
While it's no doubt impressive that Intel is actually looking competitive (considering they literally started from nothing a few years ago), the PR spin in this needs some correcting. "Our yet-to-be-launched product at 600W beats competitors' already launched 300W products by ~1.9-2.1x" is hardly that impressive, even if this is at preproduction clocks, especially when said competitors are likely to launch significantly faster solutions in the same time frame. If clocks go up, power does too, and 600W is already insane for a single package, mostly negating the compute density advantage due to the size and complexity of cooling needed.

Still, competition is always good. It'll be interesting to see how this plays out.
Intel can reach 500 teraflops by stacking even more... NVIDIA has no chance if they don't go MCM or stacking.
 
Intel can reach 500 teraflops by stacking even more... NVIDIA has no chance if they don't go MCM or stacking.
Lol, at what power draw? I mean, sure, we'll probably get there eventually, but that's years and years in the future. And everyone is going MCM in the next generation or two.
 
Re: comparing FP32 on GA100 (TSMC, A100) vs GA102 (Samsung, 3090/3080 Ti): the folks who buy A100s for single-precision work would/should be using the 156 TFLOPS from its matrix engines. That 19.5 FP32 TFLOPS (on the A100) comes from its "legacy" (general-purpose) FP32 compute cores.

The 3090 runs its FP32 cores at higher clock speeds and can use its 32-bit integer path to double up the old (Turing) FP32 rate, hence that 36 TFLOPS. But as dragontamer5788 notes, to feed these cores you need bandwidth. The A100 gets this from its magical but pricey HBM2e; the 3090 brute-forces it with GDDR6X on a 384-bit bus.
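A quick roofline-style sanity check of that point (assuming ~2 TB/s for the 80 GB A100 and ~936 GB/s for the 3090): to stay compute-bound the A100 only needs about 19.5e12 / 2e12 ≈ 10 FLOPs of work per byte fetched, while the 3090 needs roughly 36e12 / 0.936e12 ≈ 38 FLOPs per byte, so bandwidth-hungry kernels fall off the 3090's paper peak much sooner.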

PS: if you think 3090s cost a lot, try finding the price of an A100!
 
the folks who buy A100s for single-precision work would/should be using the 156 TFLOPS from its matrix engines. That 19.5 FP32 TFLOPS (on the A100) comes from its "legacy" (general-purpose) FP32 compute cores.

"Would/should be using" doesn't make sense, there are many workloads that cannot be accelerated by tensor ops.
 