Why include FP64

Joined
Dec 17, 2011
Messages
359 (0.08/day)
Hi guys. I have a question; well, it's a rant actually. GK104 can do
FP64 at 1/24 the FP32 rate. GM107 can do FP64 at 1/32 the FP32 rate.
My question is, why doesn't NVIDIA just drop FP64 support and save
die space? What is the point of supporting FP64 at such abysmal
speeds?
 

newtekie1

Semi-Retired Folder
Joined
Nov 22, 2005
Messages
28,472 (4.23/day)
Location
Indiana, USA
Processor Intel Core i7 10850K@5.2GHz
Motherboard AsRock Z470 Taichi
Cooling Corsair H115i Pro w/ Noctua NF-A14 Fans
Memory 32GB DDR4-3600
Video Card(s) RTX 2070 Super
Storage 500GB SX8200 Pro + 8TB with 1TB SSD Cache
Display(s) Acer Nitro VG280K 4K 28"
Case Fractal Design Define S
Audio Device(s) Onboard is good enough for me
Power Supply eVGA SuperNOVA 1000w G3
Software Windows 10 Pro x64
GK104 and GM107 are way more capable at double-precision floating point than NVIDIA allows in the desktop parts. They include it because the ability is there; they use the same dies for their workstation cards. It is cheaper to produce a single die and use it in multiple cards than it is to produce a purpose-built die for each different product. So the ability is in the die no matter what, but they purposely limit the performance of the desktop cards so that people who actually need the higher performance will pay the outrageous prices for workstation-class cards.
 

FordGT90Concept

"I go fast!1!11!1!"
Joined
Oct 13, 2008
Messages
26,259 (4.63/day)
Location
IA, USA
System Name BY-2021
Processor AMD Ryzen 7 5800X (65w eco profile)
Motherboard MSI B550 Gaming Plus
Cooling Scythe Mugen (rev 5)
Memory 2 x Kingston HyperX DDR4-3200 32 GiB
Video Card(s) AMD Radeon RX 7900 XT
Storage Samsung 980 Pro, Seagate Exos X20 TB 7200 RPM
Display(s) Nixeus NX-EDG274K (3840x2160@144 DP) + Samsung SyncMaster 906BW (1440x900@60 HDMI-DVI)
Case Coolermaster HAF 932 w/ USB 3.0 5.25" bay + USB 3.2 (A+C) 3.5" bay
Audio Device(s) Realtek ALC1150, Micca OriGen+
Power Supply Enermax Platimax 850w
Mouse Nixeus REVEL-X
Keyboard Tesoro Excalibur
Software Windows 10 Home 64-bit
Benchmark Scores Faster than the tortoise; slower than the hare.
Because scientific applications especially require 64-bit floating point operations. Some CAD programs probably do as well. If they eliminate it, they basically give AMD (maybe Intel too with Xeon Phi) the market for high performance computing.
 
Joined
Dec 17, 2011
Messages
359 (0.08/day)
newtekie1 said:
GK104 and GM107 are way more capable at double-precision floating point than NVIDIA allows in the desktop parts. They include it because the ability is there; they use the same dies for their workstation cards. It is cheaper to produce a single die and use it in multiple cards than it is to produce a purpose-built die for each different product. So the ability is in the die no matter what, but they purposely limit the performance of the desktop cards so that people who actually need the higher performance will pay the outrageous prices for workstation-class cards.

By GK104 I meant the fully unlocked GK104. It's a fact that a completely unlocked GK104 processes FP64 at 1/24 FP32 rate.

FordGT90Concept said:
Because scientific applications especially require 64-bit floating point operations. Some CAD programs probably do as well. If they eliminate it, they basically give AMD (maybe Intel too with Xeon Phi) the market for high performance computing.

They have the TITAN series for that.
 

newtekie1

Semi-Retired Folder
Joined
Nov 22, 2005
Messages
28,472 (4.23/day)
By GK104 I meant the fully unlocked GK104. It's a fact that a completely unlocked GK104 processes FP64 at 1/24 FP32 rate.

Again, it is a by-product of trying to use the same GPU die in two different markets and balancing things. GK104 was designed primarily for the desktop market, and its poor FP64 performance is a result, but they couldn't completely remove FP64 support because they wanted to use it in the workstation market as well. GK110, on the other hand, was designed for the workstation market, and its great FP64 performance is a result; it was only brought to the desktop market when it was needed.
 
Joined
Dec 17, 2011
Messages
359 (0.08/day)
newtekie1 said:
They couldn't completely remove FP64 support because they wanted to use it in the workstation market as well.

What kind of workstation market would want 190 GFLOPS of peak FP64 performance? A $150 CPU can do more than that.
 
Joined
Dec 16, 2010
Messages
1,662 (0.34/day)
Location
State College, PA, US
System Name My Surround PC
Processor AMD Ryzen 9 7950X3D
Motherboard ASUS STRIX X670E-F
Cooling Swiftech MCP35X / EK Quantum CPU / Alphacool GPU / XSPC 480mm w/ Corsair Fans
Memory 96GB (2 x 48 GB) G.Skill DDR5-6000 CL30
Video Card(s) MSI NVIDIA GeForce RTX 4090 Suprim X 24GB
Storage WD SN850 2TB, 2 x 512GB Samsung PM981a, 4 x 4TB HGST NAS HDD for Windows Storage Spaces
Display(s) 2 x Viotek GFI27QXA 27" 4K 120Hz + LG UH850 4K 60Hz + HMD
Case NZXT Source 530
Audio Device(s) Sony MDR-7506 / Logitech Z-5500 5.1
Power Supply Corsair RM1000x 1 kW
Mouse Patriot Viper V560
Keyboard Corsair K100
VR HMD HP Reverb G2
Software Windows 11 Pro x64
Benchmark Scores Mellanox ConnectX-3 10 Gb/s Fiber Network Card
You're right in that it doesn't make sense to use a mid range GPU over a good CPU for those tasks. A Core i7 4770K can do ~224 GFLOPS DP, while a GTX 680 can only churn out ~128 GFLOPS DP.

The real reason why FP64 blocks are not eliminated from GPUs is to ensure that all code works on all GPUs in a series, even if it does run more slowly on some. This also allows developers to create and test their programs on any GPU then deploy it on GPUs like Tesla cards that are much faster at DP.
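
The peak figures quoted above can be roughly reproduced with a back-of-the-envelope calculation. This is only a sketch of theoretical peaks, not a benchmark. The core counts, clocks, and per-cycle FLOP figures are assumptions taken from public spec sheets rather than from this thread (GTX 680: 1536 CUDA cores at a 1006 MHz base clock, one FMA per core per cycle, FP64 at 1/24 rate; Core i7 4770K: 4 cores at 3.5 GHz, two 256-bit FMA units per core), and `peak_gflops` is a hypothetical helper name.

```python
def peak_gflops(units, flops_per_unit_per_cycle, clock_ghz):
    """Theoretical peak = execution units x FLOPs per unit per cycle x clock (GHz)."""
    return units * flops_per_unit_per_cycle * clock_ghz


# GTX 680 (GK104), assumed specs: 1536 cores, 1 FMA (= 2 FLOPs) per cycle, 1.006 GHz.
gtx680_fp32 = peak_gflops(1536, 2, 1.006)

# FP64 runs at 1/24 of the FP32 rate on GK104, as stated in the opening post.
gtx680_fp64 = gtx680_fp32 / 24

# Core i7 4770K, assumed specs: 4 cores at 3.5 GHz. Each core has two 256-bit
# FMA units: 2 units x 4 doubles x 2 FLOPs = 16 FP64 FLOPs per cycle.
i7_4770k_fp64 = peak_gflops(4, 16, 3.5)

print(f"GTX 680 FP32:     {gtx680_fp32:7.1f} GFLOPS")
print(f"GTX 680 FP64:     {gtx680_fp64:7.1f} GFLOPS")
print(f"i7 4770K FP64:    {i7_4770k_fp64:7.1f} GFLOPS")
```

Under these assumptions the GPU lands near the ~128 GFLOPS figure and the CPU at 224 GFLOPS. Real workloads rarely reach either peak.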
 

newtekie1

Semi-Retired Folder
Joined
Nov 22, 2005
Messages
28,472 (4.23/day)
Are we sure GK104 actually has hardware FP64 and it isn't just emulated FP64 support? If GK104 were emulating FP64, that would explain why its performance is so bad, and it wouldn't be wasting any silicon.
 
Joined
Feb 8, 2012
Messages
3,013 (0.68/day)
Location
Zagreb, Croatia
System Name Windows 10 64-bit Core i7 6700
Processor Intel Core i7 6700
Motherboard Asus Z170M-PLUS
Cooling Corsair AIO
Memory 2 x 8 GB Kingston DDR4 2666
Video Card(s) Gigabyte NVIDIA GeForce GTX 1060 6GB
Storage Western Digital Caviar Blue 1 TB, Seagate Baracuda 1 TB
Display(s) Dell P2414H
Case Corsair Carbide Air 540
Audio Device(s) Realtek HD Audio
Power Supply Corsair TX v2 650W
Mouse Steelseries Sensei
Keyboard CM Storm Quickfire Pro, Cherry MX Reds
Software MS Windows 10 Pro 64-bit
newtekie1 said:
Are we sure GK104 actually has hardware FP64 and it isn't just emulated FP64 support?

Yes, there is a special block (module), not shown in the diagrams, with 8 fat CUDA cores that can do FP64 and only FP64 (that's why it's not in the diagrams). I believe Titan has those cores mixed in among the regular ones in every SMX.

The CUDA FP64 block contains 8 special CUDA cores that are not part of the general CUDA core count and are not in any of NVIDIA’s diagrams.
from http://www.anandtech.com/show/5699/nvidia-geforce-gtx-680-review/2
 

eidairaman1

The Exiled Airman
Joined
Jul 2, 2007
Messages
40,435 (6.59/day)
Location
Republic of Texas (True Patriot)
System Name PCGOD
Processor AMD FX 8350@ 5.0GHz
Motherboard Asus TUF 990FX Sabertooth R2 2901 Bios
Cooling Scythe Ashura, 2×BitFenix 230mm Spectre Pro LED (Blue,Green), 2x BitFenix 140mm Spectre Pro LED
Memory 16 GB Gskill Ripjaws X 2133 (2400 OC, 10-10-12-20-20, 1T, 1.65V)
Video Card(s) AMD Radeon 290 Sapphire Vapor-X
Storage Samsung 840 Pro 256GB, WD Velociraptor 1TB
Display(s) NEC Multisync LCD 1700V (Display Port Adapter)
Case AeroCool Xpredator Evil Blue Edition
Audio Device(s) Creative Labs Sound Blaster ZxR
Power Supply Seasonic 1250 XM2 Series (XP3)
Mouse Roccat Kone XTD
Keyboard Roccat Ryos MK Pro
Software Windows 7 Pro 64
Do you even use it???
 
Joined
Oct 2, 2004
Messages
13,791 (1.93/day)
newtekie1 said:
GK104 and GM107 are way more capable at double-precision floating point than NVIDIA allows in the desktop parts. They include it because the ability is there; they use the same dies for their workstation cards. It is cheaper to produce a single die and use it in multiple cards than it is to produce a purpose-built die for each different product. So the ability is in the die no matter what, but they purposely limit the performance of the desktop cards so that people who actually need the higher performance will pay the outrageous prices for workstation-class cards.

That's like creating an awesome V12 bi-turbo engine, purposely limiting it to 4 cylinders with the turbos disabled, and stuffing it in a Mercedes A-Class, just so they can sell more of those SL65 Mercs with the same engine at full power. That sucks a bit...
 
Joined
Mar 10, 2010
Messages
11,878 (2.30/day)
Location
Manchester uk
System Name RyzenGtEvo/ Asus strix scar II
Processor Amd R5 5900X/ Intel 8750H
Motherboard Crosshair hero8 impact/Asus
Cooling 360EK extreme rad+ 360$EK slim all push, cpu ek suprim Gpu full cover all EK
Memory Corsair Vengeance Rgb pro 3600cas14 16Gb in four sticks./16Gb/16GB
Video Card(s) Powercolour RX7900XT Reference/Rtx 2060
Storage Silicon power 2TB nvme/8Tb external/1Tb samsung Evo nvme 2Tb sata ssd/1Tb nvme
Display(s) Samsung UAE28"850R 4k freesync.dell shiter
Case Lianli 011 dynamic/strix scar2
Audio Device(s) Xfi creative 7.1 on board ,Yamaha dts av setup, corsair void pro headset
Power Supply corsair 1200Hxi/Asus stock
Mouse Roccat Kova/ Logitech G wireless
Keyboard Roccat Aimo 120
VR HMD Oculus rift
Software Win 10 Pro
Benchmark Scores 8726 vega 3dmark timespy/ laptop Timespy 6506
Every chip manufacturer out there, bar memory manufacturers, does this.
They also frequently put circuitry on the die just to see how it works out in yield and use terms, circuitry that the customer who buys the end chip will never see or use because it will be fused off.
It's just the way.
As for dropping all FP64 on lower cards, that would not be wise. I doubt it would play Crysis at all then, some games would really struggle emulating FP64, and I'd wager PhysX uses it quite a bit.
 

MxPhenom 216

ASIC Engineer
Joined
Aug 31, 2010
Messages
12,945 (2.60/day)
Location
Loveland, CO
System Name Ryzen Reflection
Processor AMD Ryzen 9 5900x
Motherboard Gigabyte X570S Aorus Master
Cooling 2x EK PE360 | TechN AM4 AMD Block Black | EK Quantum Vector Trinity GPU Nickel + Plexi
Memory Teamgroup T-Force Xtreem 2x16GB B-Die 3600 @ 14-14-14-28-42-288-2T 1.45v
Video Card(s) Zotac AMP HoloBlack RTX 3080Ti 12G | 950mV 1950Mhz
Storage WD SN850 500GB (OS) | Samsung 980 Pro 1TB (Games_1) | Samsung 970 Evo 1TB (Games_2)
Display(s) Asus XG27AQM 240Hz G-Sync Fast-IPS | Gigabyte M27Q-P 165Hz 1440P IPS | Asus 24" IPS (portrait mode)
Case Lian Li PC-011D XL | Custom cables by Cablemodz
Audio Device(s) FiiO K7 | Sennheiser HD650 + Beyerdynamic FOX Mic
Power Supply Seasonic Prime Ultra Platinum 850
Mouse Razer Viper v2 Pro
Keyboard Razer Huntsman Tournament Edition
Software Windows 11 Pro 64-Bit
That appears to be the real reason. But they could leave GK104 alone and create a cheapo TITAN card.

A lot of DirectCompute performance comes from the cache on the GPU, not just FP64. NVIDIA started stripping down the cache on their GeForce cards after first-gen Fermi (GTX 470/480).

Cache can make a gpu die very large and hot.
 
Joined
Dec 17, 2011
Messages
359 (0.08/day)
As for dropping all FP64 on lower cards, that would not be wise. I doubt it would play Crysis at all then, some games would really struggle emulating FP64, and I'd wager PhysX uses it quite a bit.

FP64 is used for more precise effects not for more effects. There isn't a single consumer application that uses GPU FP64 and that includes PhysX.

MxPhenom 216 said:
A lot of DirectCompute performance comes from the cache on the GPU, not just FP64. NVIDIA started stripping down the cache on their GeForce cards after first-gen Fermi (GTX 470/480).

Cache can make a gpu die very large and hot.

Yet Maxwell went with a 2 MB cache for GM107.
 

Aquinus

Resident Wat-man
Joined
Jan 28, 2012
Messages
13,147 (2.94/day)
Location
Concord, NH, USA
System Name Apollo
Processor Intel Core i9 9880H
Motherboard Some proprietary Apple thing.
Memory 64GB DDR4-2667
Video Card(s) AMD Radeon Pro 5600M, 8GB HBM2
Storage 1TB Apple NVMe, 4TB External
Display(s) Laptop @ 3072x1920 + 2x LG 5k Ultrafine TB3 displays
Case MacBook Pro (16", 2019)
Audio Device(s) AirPods Pro, Sennheiser HD 380s w/ FIIO Alpen 2, or Logitech 2.1 Speakers
Power Supply 96w Power Adapter
Mouse Logitech MX Master 3
Keyboard Logitech G915, GL Clicky
Software MacOS 12.1
Can't nVidia's higher end cards (like Titan) and workstation cards run in a DP mode where DP performance isn't crap at the expense of SP performance? I can't seem to find where I read that but I distinctly remember nVidia giving that option to particular GPUs like Titan and their workstation cards. I think the point was that most games use single precision, so it makes sense for SP to be faster than DP for consumer graphics cards.

Edit: Yes! There is a driver switch that changes how the GPU performs with DP and SP. Apparently there are side effects like Boost getting disabled, but it unlocks DP numbers that would otherwise be mediocre.


source
 

FordGT90Concept

"I go fast!1!11!1!"
Joined
Oct 13, 2008
Messages
26,259 (4.63/day)
The problem with IEEE 754-1985 binary64 (double precision) is it is not bitwise backwards compatible with binary32 (single precision):
http://kipirvine.com/asm/workbook/floating_tut.htm

If you throw a binary32 at a binary64 processor, it has to convert it (or have separate hardware) before it can process it. There's no reason why GPUs couldn't be made entirely binary64 but to do so means no backwards compatibility or severely limited performance when doing so. This is why binary64 GPUs are marketed separately to a different audience. As demand increases for binary64, GPUs will grow increasingly biased towards binary64 but likely at the cost of binary32 performance.
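
To make the layout difference concrete, here is a minimal sketch that splits the same value into its sign, exponent, and fraction fields in both formats. It uses the standard field widths (binary32: 1/8/23 bits with exponent bias 127; binary64: 1/11/52 bits with bias 1023). The helper names `fields32` and `fields64` and the sample value are made up for illustration.

```python
import struct


def fields32(x):
    """Return (sign, biased exponent, fraction) of x encoded as binary32."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


def fields64(x):
    """Return (sign, biased exponent, fraction) of x encoded as binary64."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    return bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)


x = 0.15625  # 1.25 * 2**-3, exactly representable in both formats
s32, e32, f32 = fields32(x)
s64, e64, f64 = fields64(x)

print("binary32:", s32, e32, hex(f32))  # biased exponent 124 = -3 + 127
print("binary64:", s64, e64, hex(f64))  # biased exponent 1020 = -3 + 1023

# The raw bit patterns differ, but for a normal value like this one the
# widening step is a re-bias of the exponent plus a 29-bit left shift of
# the fraction, so the conversion itself is exact.
assert e32 - 127 == e64 - 1023
assert f32 << 29 == f64
```

The shortcut in the last two lines only holds for normal numbers; subnormals, infinities, and NaNs need extra handling, which is part of why mixed-precision hardware is more than a wider datapath.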


Pi in binary32: 3.1415927410125732421875
Pi in binary64: 3.141592653589793115997963468544185161590576171875

Pi is a small number. Imagine if you were dealing with a number like 1 billion and some change. The bigger the whole number, the less precise the fraction.
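
The two expansions of pi above, and the point about large numbers, can be checked with a short script. This is a sketch using only the Python standard library; `to_binary32` and `next_up_binary32` are hypothetical helper names, and `math.ulp` assumes Python 3.9 or newer.

```python
import math
import struct
from decimal import Decimal


def to_binary32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]


def next_up_binary32(x):
    """Next representable binary32 above a positive, finite binary32 value."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits + 1))[0]


# Decimal(float) prints the exact value stored, with no rounding.
pi32 = to_binary32(math.pi)
print("pi in binary32:", Decimal(pi32))
print("pi in binary64:", Decimal(math.pi))

# Gap between neighbouring representable values near one billion.
gap32 = next_up_binary32(1e9) - 1e9
gap64 = math.ulp(1e9)
print("binary32 spacing near 1e9:", gap32)
print("binary64 spacing near 1e9:", gap64)
```

Near one billion, binary32 can only step in increments of 64, so the fractional part is lost entirely, while binary64 still resolves roughly a ten-millionth.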
 
Joined
Dec 17, 2011
Messages
359 (0.08/day)
FordGT90Concept said:
The problem with IEEE 754-1985 binary64 (double precision) is it is not bitwise backwards compatible with binary32 (single precision):
http://kipirvine.com/asm/workbook/floating_tut.htm

If you throw a binary32 at a binary64 processor, it has to convert it (or have separate hardware) before it can process it. There's no reason why GPUs couldn't be made entirely binary64 but to do so means no backwards compatibility or severely limited performance when doing so. This is why binary64 GPUs are marketed separately to a different audience. As demand increases for binary64, GPUs will grow increasingly biased towards binary64 but likely at the cost of binary32 performance.

This may explain why NVIDIA had entirely different FP64 CUDA cores in Kepler. But why did Fermi have FP32-capable FP64 cores?
 
Joined
Mar 10, 2010
Messages
11,878 (2.30/day)
FP64 is used for more precise effects not for more effects. There isn't a single consumer application that uses GPU FP64 and that includes PhysX.



Yet Maxwell went with a 2 MB cache for GM107.

There are plenty of uses for FP64, and I did not imply more effects; I know what double precision means. Regardless, it is what it is, and you are bickering about BS. Not everyone uses a GeForce card's sound controller, and that takes up space too. Should they chop that out for one more shader?
No.
And how is something you don't use, appreciate, or like worth a rant?
 

FordGT90Concept

"I go fast!1!11!1!"
Joined
Oct 13, 2008
Messages
26,259 (4.63/day)
This may explain why NVIDIA had entirely different FP64 CUDA cores in Kepler. But why did Fermi have FP32-capable FP64 cores?
Every FPU in each CUDA core in Fermi has hardware to handle both binary32 and binary64 not unlike the FPU in a CPU.

CUDA cores dedicated to binary64 could be used directly by programmers to perform high precision calculations.
 
Joined
Mar 10, 2010
Messages
11,878 (2.30/day)
Architecture has to evolve with our uses. Are you just working on post count?
 

FordGT90Concept

"I go fast!1!11!1!"
Joined
Oct 13, 2008
Messages
26,259 (4.63/day)
I know that. But why didn't they do the same for Kepler?
They wanted to boost binary64 by having cores dedicated to it, most likely for scientific applications. It sounds to me like, if Fermi had 32 FPUs capable of binary32 and binary64, Kepler would have 64 + some more binary64-only FPUs. This way, they're getting equal binary32 capability and they get equal or much more (for software that uses the dedicated binary64 FPUs) binary64 performance.

Nope. What's the benefit? I made this account way back in 2011. It won't affect my posts/day anyway. But you didn't answer my question. What is the advantage of separating FP32 and FP64 cores?
The advantage is that you can vastly simplify the FPU by removing the backwards compatibility for binary32. This means, in turn, they can pack more binary64 performance into less die space.
 
Joined
Dec 17, 2011
Messages
359 (0.08/day)
FordGT90Concept said:
The advantage is that you can vastly simplify the FPU by removing the backwards compatibility for binary32. This means, in turn, they can pack more binary64 performance into less die space.

That appears to be the reason.

FordGT90Concept said:
It sounds to me like, if Fermi had 32 FPUs capable of binary32 and binary64, Kepler would have 64 + some more binary64-only FPUs. This way, they're getting equal binary32 capability and they get equal or much more (for software that uses the dedicated binary64 FPUs) binary64 performance.

Nope.

Anandtech said:
In GK104 none of the regular CUDA core blocks are FP64 capable; in its place we have what we’re calling the CUDA FP64 block. The CUDA FP64 block contains 8 special CUDA cores that are not part of the general CUDA core count and are not in any of NVIDIA’s diagrams. These CUDA cores can only do and are only used for FP64 math.
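
The Anandtech description lines up with the 1/24 figure from the opening post. A quick check, assuming 192 FP32 CUDA cores per SMX and 8 SMX units on a full GK104 (both taken from NVIDIA's public Kepler specs, not stated in this thread); the variable names are illustrative only:

```python
from fractions import Fraction

# Assumed GK104 layout (public specs, not from this thread).
fp32_cores_per_smx = 192
smx_count = 8

# From the Anandtech quote: 8 dedicated FP64 cores per block (one block per SMX).
fp64_cores_per_smx = 8

# With one operation per core per cycle, the throughput ratio is the unit ratio.
fp64_to_fp32 = Fraction(fp64_cores_per_smx, fp32_cores_per_smx)

total_fp32 = fp32_cores_per_smx * smx_count
total_fp64 = fp64_cores_per_smx * smx_count

print("FP64 : FP32 rate =", fp64_to_fp32)
print("FP32 cores:", total_fp32, "| dedicated FP64 cores:", total_fp64)
```

So under these assumptions the low FP64 rate on GK104 is a property of how few FP64 units are on the die, not a driver cap.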
 