
Why include FP64

Hi guys. I have a question, well, it's a rant actually. GK104 can do
FP64 at 1/24 the FP32 rate. GM107 can do FP64 at 1/32 the FP32 rate.
My question is, why doesn't NVIDIA just drop FP64 support and save the die space? What is the point of supporting FP64 at such abysmal speeds?
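To put rough numbers on "abysmal" (theoretical peaks at reference clocks, so ballpark only): a full GK104 (GTX 680) is good for roughly 3.1 TFLOPS FP32, so 1/24 of that is only about 130 GFLOPS FP64; GM107 (GTX 750 Ti) is good for roughly 1.3 TFLOPS FP32, so 1/32 of that is only about 41 GFLOPS FP64.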
 
GK104 and GM107 are way more capable at Double-Precision Floating Point than nVidia allows in the desktop parts. They include it because the ability is there, they use the same dies for their workstation cards. It is cheaper to produce a single die and use it in multiple cards than it is to produce a purpose built die for each different product. So the ability is in the die no matter what, but they purposely limit the performance of the desktop cards so people that actually need the higher performance will pay the outrageous prices for workstation class cards.
 
Because scientific applications especially require 64-bit floating point operations. Some CAD programs probably do as well. If they eliminate it, they basically give AMD (maybe Intel too with Xeon Phi) the market for high performance computing.
 
GK104 and GM107 are way more capable at Double-Precision Floating Point than nVidia allows in the desktop parts. They include it because the ability is there, they use the same dies for their workstation cards. It is cheaper to produce a single die and use it in multiple cards than it is to produce a purpose built die for each different product. So the ability is in the die no matter what, but they purposely limit the performance of the desktop cards so people that actually need the higher performance will pay the outrageous prices for workstation class cards.

By GK104 I meant the fully unlocked GK104. It's a fact that a completely unlocked GK104 processes FP64 at 1/24 the FP32 rate.

Because scientific applications especially require 64-bit floating point operations. Some CAD programs probably do as well. If they eliminate it, they basically give AMD (maybe Intel too with Xeon Phi) the market for high performance computing.

They have the TITAN series for that.
 
By GK104 I meant the fully unlocked GK104. It's a fact that a completely unlocked GK104 processes FP64 at 1/24 the FP32 rate.

Again, it is a by-product of trying to use the same GPU die in two different markets and balancing things. GK104 was designed primarily for the desktop market, and its poor FP64 performance is a result of that, but they couldn't completely remove FP64 support because they wanted to use the die in the workstation market as well. GK110, on the other hand, was designed for the workstation market, and its great FP64 performance is a result of that; it was only brought to the desktop market when it was needed.
 
They couldn't completely remove FP64 support because they wanted to use the die in the workstation market as well.

What kind of workstation market would want to use 190 GFLOPS of peak FP64 performance? A $150 CPU can do more than that.
 
You're right in that it doesn't make sense to use a mid-range GPU over a good CPU for those tasks. A Core i7-4770K can do ~224 GFLOPS DP, while a GTX 680 can only churn out ~128 GFLOPS DP.
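For reference, the back-of-the-envelope math behind those two figures (theoretical peaks, with an FMA counted as two operations, so treat them as ballpark):

Core i7-4770K: 4 cores x 3.5 GHz x 16 DP FLOPs per cycle (two 256-bit AVX2 FMA units per core) ≈ 224 GFLOPS
GTX 680: 8 SMXes x 8 dedicated FP64 cores x 2 FLOPs x ~1.0 GHz ≈ 128 GFLOPS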

The real reason why FP64 blocks are not eliminated from GPUs is to ensure that all code works on all GPUs in a series, even if it runs more slowly on some. This also allows developers to create and test their programs on any GPU and then deploy them on GPUs, like Tesla cards, that are much faster at DP.
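To illustrate, here's a minimal double-precision CUDA sketch (a toy example of my own, using only standard CUDA runtime calls, not anything from NVIDIA's docs): the exact same source builds and runs on a GTX 680 or a Tesla K20; the Tesla just gets through the FP64 math far faster.

Code:
// minimal FP64 sketch: the same source runs on any CUDA GPU with double
// support, from a rate-limited GeForce to a Tesla; only the throughput differs
#include <cstdio>
#include <cuda_runtime.h>

__global__ void axpy64(int n, double a, const double* x, double* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];          // double-precision multiply-add
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(double);
    double* hx = new double[n];
    double* hy = new double[n];
    for (int i = 0; i < n; ++i) { hx[i] = 1.0; hy[i] = 2.0; }

    double *dx, *dy;
    cudaMalloc((void**)&dx, bytes);
    cudaMalloc((void**)&dy, bytes);
    cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

    axpy64<<<(n + 255) / 256, 256>>>(n, 3.0, dx, dy);
    cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost);

    printf("y[0] = %f (expect 5.0)\n", hy[0]);   // identical answer on any card
    cudaFree(dx); cudaFree(dy);
    delete[] hx; delete[] hy;
    return 0;
}

Compile it with nvcc (e.g. -arch=sm_30 for GK104, sm_35 for GK110) and the result is the same everywhere; only the time it takes changes.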
 
Are we sure GK104 actually has hardware FP64 and it isn't just emulated FP64 support? If GK104 were emulating FP64, that would explain why its performance is so bad, and it also wouldn't be wasting any silicon.
 
Are we sure GK104 actually has hardware FP64 and it isn't just emulated FP64 support?

Yes, there is a special block (module) that is not shown in the diagrams which has 8 fat CUDA cores that can do FP64 and only FP64 (that's why it's not in the diagrams). I believe Titan has those cores mixed among the regular ones in every SMX.

The CUDA FP64 block contains 8 special CUDA cores that are not part of the general CUDA core count and are not in any of NVIDIA’s diagrams.
from http://www.anandtech.com/show/5699/nvidia-geforce-gtx-680-review/2
 
Do you even use it???
 
GK104 and GM107 are way more capable at Double-Precision Floating Point than nVidia allows in the desktop parts. They include it because the ability is there, they use the same dies for their workstation cards. It is cheaper to produce a single die and use it in multiple cards than it is to produce a purpose built die for each different product. So the ability is in the die no matter what, but they purposely limit the performance of the desktop cards so people that actually need the higher performance will pay the outrageous prices for workstation class cards.

That's like creating an awesome V12 bi-turbo engine, purposely limiting it to 4 cylinders with the turbos disabled, and stuffing it into a Mercedes A-Class, just so they can sell more of those SL65 Mercs with the same full-blown engine. That sucks a bit...
 
Every chip manufacturer out there, bar memory manufacturers, does this.
They also frequently put circuitry on the die just to see how it works out in yields, circuitry the customer who buys the end chip will never see or use because it will be fused off.
It's just the way it is.
As for dropping all FP64 on the lower cards, that would not be wise: I doubt it would play Crysis at all then, some games would really struggle emulating FP64, and I'd wager PhysX uses it quite a bit
 
That appears to be the real reason. But they could leave GK104 alone and create a cheapo TITAN card.

A lot of DirectCompute performance comes from the cache on the GPU, not just FP64. NVIDIA started stripping down the cache on their GeForce cards after first-gen Fermi (GTX 470/480).

Cache can make a GPU die very large and hot.
 
As for dropping all FP64 on the lower cards, that would not be wise: I doubt it would play Crysis at all then, some games would really struggle emulating FP64, and I'd wager PhysX uses it quite a bit

FP64 is used for more precise effects, not for more effects. There isn't a single consumer application that uses GPU FP64, and that includes PhysX.

A lot of DirectCompute performance comes from the cache on the GPU, not just FP64. NVIDIA started stripping down the cache on their GeForce cards after first-gen Fermi (GTX 470/480).

Cache can make a GPU die very large and hot.

Yet Maxwell went with a 2 MB cache for GM107.
 
Can't nVidia's higher-end cards (like Titan) and workstation cards run in a DP mode where DP performance isn't crap, at the expense of SP performance? I can't seem to find where I read that, but I distinctly remember nVidia giving that option to particular GPUs like Titan and their workstation cards. I think the point was that most games use single precision, so it makes sense for SP to be faster than DP on consumer graphics cards.

Edit: Yes! There is a driver switch that changes how the GPU handles DP and SP. Apparently there are side effects like Boost getting disabled, but it lifts DP numbers that would otherwise be mediocre.

[Image: sandra-gp-processing.png – Sandra GP Processing benchmark results]

source
 
The problem with IEEE 754 binary64 (double precision) is that it is not bitwise backwards compatible with binary32 (single precision):
http://kipirvine.com/asm/workbook/floating_tut.htm

If you throw a binary32 at a binary64 processor, it has to convert it (or have separate hardware) before it can process it. There's no reason why GPUs couldn't be made entirely binary64, but doing so means either no backwards compatibility or severely limited performance when handling binary32. This is why binary64 GPUs are marketed separately to a different audience. As demand for binary64 increases, GPUs will grow increasingly biased towards binary64, but likely at the cost of binary32 performance.


Pi in binary32: 3.1415927410125732421875
Pi in binary64: 3.141592653589793115997963468544185161590576171875

Pi is a small number. Imagine if you were dealing with a number like 1 billion and some change. The bigger the whole number, the less precise the fraction.
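If anyone wants to see that loss of precision first-hand, here's a tiny host-side snippet I threw together (plain C-style code; the constants are just pi and an arbitrary value near one billion):

Code:
// how much of pi survives in each format, and what happens
// once the whole-number part gets big
#include <cstdio>

int main()
{
    float  pi32 = 3.14159265358979323846f;   // rounded to a 24-bit significand
    double pi64 = 3.14159265358979323846;    // rounded to a 53-bit significand
    printf("binary32 pi: %.25f\n", (double)pi32);
    printf("binary64 pi: %.25f\n", pi64);

    // near 1e9 a binary32 value can only step in increments of 64,
    // so the fractional part vanishes entirely; binary64 keeps it
    float  big32 = 1.0e9f + 0.1415927f;
    double big64 = 1.0e9  + 0.1415927;
    printf("binary32 1e9 + fraction: %.7f\n", (double)big32);
    printf("binary64 1e9 + fraction: %.7f\n", big64);
    return 0;
}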
 
The problem with IEEE 754 binary64 (double precision) is that it is not bitwise backwards compatible with binary32 (single precision):
http://kipirvine.com/asm/workbook/floating_tut.htm

If you throw a binary32 at a binary64 processor, it has to convert it (or have separate hardware) before it can process it. There's no reason why GPUs couldn't be made entirely binary64, but doing so means either no backwards compatibility or severely limited performance when handling binary32. This is why binary64 GPUs are marketed separately to a different audience. As demand for binary64 increases, GPUs will grow increasingly biased towards binary64, but likely at the cost of binary32 performance.

This may explain why NVIDIA had entirely separate FP64 CUDA cores in Kepler. But why did Fermi have FP32-capable FP64 cores?
 
FP64 is used for more precise effects, not for more effects. There isn't a single consumer application that uses GPU FP64, and that includes PhysX.



Yet Maxwell went with a 2 MB cache for GM107.

There are plenty of uses for FP64, and I did not imply more effects; I know what double precision means. Regardless, it is what it is, and you are bickering about BS. Not everyone uses a GeForce card's sound controller, and that uses die space too; should they chop that out for one more shader?
No.
And how is something you don't use, appreciate, or like worth a rant?
 
This may explain why NVIDIA had entirely separate FP64 CUDA cores in Kepler. But why did Fermi have FP32-capable FP64 cores?
Every FPU in each CUDA core in Fermi has hardware to handle both binary32 and binary64, not unlike the FPU in a CPU.

CUDA cores dedicated to binary64 could be used directly by programmers to perform high precision calculations.
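To make "used directly by programmers" concrete, here's a small sketch of my own (plain CUDA C, nothing vendor-specific beyond the runtime API): which units end up doing the work simply falls out of the types in the kernel, and the classic gotcha is that an unsuffixed literal like 0.1 is a double, which silently drags a float expression onto the scarce FP64 hardware of a GeForce card.

Code:
#include <cstdio>
#include <cuda_runtime.h>

// the unsuffixed 0.1 is a double, so float * double is computed in double
// precision; on a GeForce that means the slow, dedicated FP64 path
__global__ void scale_slow(int n, float* x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 0.1;     // promoted to FP64 math
}

// suffixing the literal keeps everything on the plentiful FP32 cores
__global__ void scale_fast(int n, float* x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 0.1f;    // pure FP32
}

int main()
{
    const int n = 1 << 20;
    float* x;
    cudaMalloc((void**)&x, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));

    scale_slow<<<(n + 255) / 256, 256>>>(n, x);
    scale_fast<<<(n + 255) / 256, 256>>>(n, x);
    cudaDeviceSynchronize();

    printf("both kernels run anywhere; the FP32 one is just far faster on GeForce\n");
    cudaFree(x);
    return 0;
}

The flip side is the same mechanism in reverse: declare your data and literals as double and the compiler targets the dedicated binary64 FPUs for you.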
 
The architecture has to evolve with our uses. Are you just working on post count?
 
I know that. But why didn't they do the same for Kepler?
They wanted to boost binary64 by having cores dedicated to it, most likely for scientific applications. It sounds to me like, if Fermi had 32 FPUs capable of binary32 and binary64, Kepler would have 64 plus some more binary64-only FPUs. This way, they're getting equal binary32 capability and equal or much more binary64 performance (for software that uses the dedicated binary64 FPUs).

Nope. What's the benefit? I made this account way back in 2011. It won't affect my posts/day anyway. But you didn't answer my question. What is the advantage of separating FP32 and FP64 cores?
The advantage is that you can vastly simplify the FPU by removing the backwards compatibility for binary32. This means, in turn, they can pack more binary64 performance into less die space.
 
The advantage is that you can vastly simplify the FPU by removing the backwards compatibility for binary32. This means, in turn, they can pack more binary64 performance into less die space.

That appears to be the reason.

It sounds to me like, if Fermi had 32 FPUs capable of binary32 and binary64, Kepler would have 64 plus some more binary64-only FPUs. This way, they're getting equal binary32 capability and equal or much more binary64 performance (for software that uses the dedicated binary64 FPUs).

Nope.

Anandtech said:
In GK104 none of the regular CUDA core blocks are FP64 capable; in its place we have what we’re calling the CUDA FP64 block. The CUDA FP64 block contains 8 special CUDA cores that are not part of the general CUDA core count and are not in any of NVIDIA’s diagrams. These CUDA cores can only do and are only used for FP64 math.
 