Monday, April 8th 2019

NVIDIA RTX Logic Increases TPC Area by 22% Compared to Non-RTX Turing

Public perception of NVIDIA's new RTX series of graphics cards was sometimes marred by an impression that NVIDIA had misallocated resources. The argument went that NVIDIA had greatly increased chip area by adding RTX functionality (in both its Tensor and RT cores) that could have been better used for performance gains in shader-based, non-raytracing workloads. While the merits of ray tracing as it stands (in terms of uptake from developers) are certainly worthy of discussion, it seems that NVIDIA didn't dedicate that much more die area to RTX functionality - at least not to the tune suggested by public perception.

Working from full, high-resolution die shots of NVIDIA's TU106 and TU116 chips, Reddit user @Qesa analyzed the TPC structure of NVIDIA's Turing chips and concluded that the difference between NVIDIA's RTX-capable TU106 and the RTX-stripped TU116 amounts to a mere 1.95 mm² of additional logic per TPC - a 22% area increase. Of this, 1.25 mm² is reserved for the Tensor logic (which accelerates both DLSS and de-noising in ray-traced workloads), while only 0.7 mm² is used by the RT cores.
According to the math, this means that a TU102 chip used for the RTX 2080 Ti, which in its full configuration has a 754 mm² area, could have made do with a 684 mm² chip instead. It seems that most of the area increase compared to the Pascal architecture actually comes from the increased performance (and size) of caches and larger instruction sets on Turing rather than from RTX functionality. Not accounting for density gains from the transition from 16 nm to 12 nm, the TU106 chip powering the RTX 2060 delivers around the same performance as the GP104 chip powering the GTX 1080 (445 mm² for TU106 against 314 mm² for GP104), whilst carrying only 75% of the CUDA core count (1,920 versus 2,560). Source: Reddit @ User Qesa
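The headline arithmetic is easy to verify: a full TU102 carries 36 TPCs, so stripping 1.95 mm² of RTX logic from each recovers roughly the 70 mm² difference quoted above. A quick sketch (per-TPC figures from Qesa's analysis; the TPC count is TU102's well-known full configuration):

```python
# Rough check of the die-area math, using Qesa's per-TPC figures
tensor_area = 1.25          # mm² of Tensor-core logic per TPC
rt_area     = 0.70          # mm² of RT-core logic per TPC
rtx_per_tpc = tensor_area + rt_area   # 1.95 mm² of extra logic per TPC

tu102_tpcs = 36             # TPC count of a full TU102
tu102_die  = 754.0          # mm²

savings = tu102_tpcs * rtx_per_tpc    # ≈ 70 mm² across the whole chip
print(f"hypothetical RTX-less TU102: {tu102_die - savings:.0f} mm²")
# → hypothetical RTX-less TU102: 684 mm²
```

This ignores any knock-on savings (smaller caches or interconnect), so it is a lower bound on what a hypothetical RTX-less TU102 could shed.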

22 Comments on NVIDIA RTX Logic Increases TPC Area by 22% Compared to Non-RTX Turing

#1
_Flare
Did @Qesa mention anything about the area for the dedicated FP16-Cores in the TU116?
Posted on Reply
#2
londiste
Source: hardware/comments/baajes
I believe in another comment he mentioned FP16 logic in TU116 is smaller than Tensor cores in TU106 but not negligible.
Posted on Reply
#3
sergionography
I remember this coming up in the comments previously. I pointed out how a GTX 1660 Ti at 284 mm² is about 10-15% slower than a GTX 1080 at 314 mm² (also 10-15% smaller in size).
It makes me question whether Turing actually improved on Pascal at all. Similar performance for a given footprint. I'm guessing where it comes in handy is when you scale it to larger chips, where having a smaller number of more powerful cores is more manageable and scales better. Perhaps nvidia's architecture also has a core count cap, similar to how GCN caps at 4096?

And as for tensor cores; they are the most useless things in a gaming card as far as I am concerned.
Posted on Reply
#4
cucker tarlson
I'm more interested in the power that the RTX logic requires.
Posted on Reply
#5
londiste
cucker tarlson
I'm more interested in the power that the RTX logic requires.
If it is not used, it gets power gated and does not use any significant amount of power.
Posted on Reply
#6
theoneandonlymrk
Only 22%? How is that an "only"? For a start, that's a lot of shader space.

That simple figure doesn't account for the extra caches to feed them - those are likely to have increased in size to accommodate them - so 25-27% if we throw in the FP16 hardware too, I would think.

Seems odd anyway. I like chip pics as much as anyone, but

it's a bit late to be pushing this angle IMHO.
Posted on Reply
#7
danbert2000
The big change for Turing was being able to do INT and FP operations at the same time. I'm sure that cost some transistors. The perf per clock and clock speeds are higher too, so I'm not surprised at all that the chips are bigger. The same thing happened with Maxwell, I believe: the GPUs got bigger to support the higher clocks.
Posted on Reply
#8
bug
That is actually good news.
Posted on Reply
#9
Vya Domus
A lot of people think shaders take up a lot of space and that the relationship between how many shaders a GPU has and its size is linear. Really, it isn't: for example, GP106 had 1280 shaders and was 200 mm², while GP104 had twice the shaders and was only ~60% bigger. That's because caches, crossbars, memory controllers, etc. don't scale at the same rate. That being said, 22% is a lot when everything is put into perspective.

Even more concerning is how much of the power budget these things take, in percentage terms. 12 nm didn't bring any notable efficiency gains, and Turing only uses slightly more power than its Pascal equivalents do. But we now know RT and Tensor cores use about a fifth of the silicon, and I have a suspicion that when this silicon is in use it doesn't just cause the GPU to use more power, but may actually eat away at the power budget the other parts of the chip would otherwise get for traditional shading.
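The non-linear scaling point is easy to check with the Pascal numbers cited above; the two-point linear fit below is only an illustrative sketch, not a real area model:

```python
# Shader count vs die area for the two Pascal chips cited above
gp106_shaders, gp106_area = 1280, 200.0   # mm²
gp104_shaders, gp104_area = 2560, 314.0   # mm²

# Twice the shaders, but nowhere near twice the area
area_ratio = gp104_area / gp106_area      # ≈ 1.57x

# Fitting area = fixed + k * shaders to the two points gives a rough
# estimate of the logic that doesn't scale with shader count
k     = (gp104_area - gp106_area) / (gp104_shaders - gp106_shaders)
fixed = gp106_area - k * gp106_shaders    # ≈ 86 mm² of non-scaling area
print(area_ratio, fixed)
```

Under that crude fit, roughly 86 mm² of a Pascal die (memory controllers, crossbars, display logic, etc.) is there regardless of shader count, which is why doubling shaders doesn't double area.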
Posted on Reply
#10
Fiendish
Don't those numbers add up to like a 10% increase, where is 22% coming from?

Edit: I see now.
Posted on Reply
#11
Sora
damnit, now i need to go on reddit and beat this information over the heads of the fools that kept repeating the RT is 1/3 the die crap.
Posted on Reply
#12
Crackong
22% is A LOT.
With a slight overclock, we could have the performance of 1080ti with the price of a 1080.
All of that was lost because some RTX holy grail "Just Works".
It works so well that, six months after launch, nobody can utilize all of the RTX features in a single game.
And DLSS is such a gimmick of a feature that you have to beg the Leather Jacket himself to fire up his multi-billion-dollar AI computer to train it for your multi-million-dollar AAAAA loot-box micro-transaction "game" in the first place.


Thank you Leather Jacket.
Praise the Leather Jacket.
Posted on Reply
#13
MuhammedAbdo
Crackong
22% is A LOT.
With a slight overclock, we could have the performance of 1080ti with the price of a 1080.
All of that was lost because some RTX holy grail "Just Works".
It works so well that, six months after launch, nobody can utilize all of the RTX features in a single game.
And DLSS is such a gimmick of a feature that you have to beg the Leather Jacket himself to fire up his multi-billion-dollar AI computer to train it for your multi-million-dollar AAAAA loot-box micro-transaction "game" in the first place.
It only amounts to 10% die increase
According to the math, this means that a TU102 chip used for the RTX 2080 Ti, which in its full configuration, has a 754 mm² area, could have done with a 684 mm² chip instead.
Posted on Reply
#14
Crackong
MuhammedAbdo
It only amounts to 10% die increase
TPC units do not populate the whole die - just like an Intel CPU has almost half of its die taken up by the iGPU.

Therefore, decreasing TPC size by 22% only results in about a 10% reduction in overall die size.
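The dilution argument works out numerically from the article's own figures. The TPC count and die size below are for a full TU102, and treating 1.95 mm² as exactly 22% of a TPC is an approximation:

```python
# How a 22% per-TPC increase dilutes to ~10% of the whole die
rtx_extra   = 1.95                  # mm² of RTX logic per TPC (Qesa)
non_rtx_tpc = rtx_extra / 0.22      # ≈ 8.9 mm²: 1.95 mm² is a 22% bump
rtx_tpc     = non_rtx_tpc + rtx_extra

tpcs, die = 36, 754.0               # full TU102
tpc_fraction = tpcs * rtx_tpc / die     # ≈ 0.52: TPCs cover ~half the die
die_saving   = tpcs * rtx_extra / die   # ≈ 0.093: ~10% smaller die
print(tpc_fraction, die_saving)
```

Since the 36 TPCs cover only about half the die, a 22% increase inside them shows up as roughly a 9-10% increase in total die area, matching the 754 mm² vs 684 mm² figures in the article.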
Posted on Reply
#15
kastriot
Maybe RTX cores work in tandem with normal cores so it's some kind of compromise?
Posted on Reply
#16
Kaapstad
It does not matter how it is dressed up, RTX cards are NVidia's worst products in a very long time.

NVidia have totally screwed up on price/performance and there is no getting away from that.

It would be nice if NVidia could give us some cards that "just work for the price" next time they launch a new architecture.

Epic fail NVidia, it is enough to make Alan Turing eat an apple.
Posted on Reply
#17
londiste
Vya Domus
Even more concerning is how much in percentages these things take from the power budget. 12nm didn't bring any notable efficiency gains and Turing only uses slightly more power than their Pascal equivalent does. But we now know RT and Tensor cores uses about a fifth of the silicon, I have a suspicion that when this silicon is in use it doesn't just cause the GPU to use more power but it may actually eat away at the power budget the other parts of the chip would otherwise get performing traditional shading.
While RT cores are in use, they definitely eat into the power budget, but the extent of it is unknown. I have a feeling it is less than we would expect, though: using DXR in games like SoTR - which is otherwise a heavy load and causes somewhat lower clocks than normal due to the power limit - does not decrease the clocks noticeably. Unfortunately, I don't think anyone (besides Nvidia) even has a good idea of how to test this. RT cores, while separate units, always have the rest of the chip feeding data into them, effectively being just another ALU in the CUDA core.
Posted on Reply
#18
Vayra86
MuhammedAbdo
It only amounts to 10% die increase
People forget that the shader itself and the L2 cache were also expanded a bit. Turing always contains part of the RTX logic, even in the 1660 Ti.

This news article says nothing against the claim that about 17-20% of the die is needed for RTRT with Turing, which, given the die schematic, still looks plausible to me. All things considered, the architecture does not perform much better (per watt) than Pascal in regular loads; it's hit or miss in that sense. The real comparison here is die size vs. absolute performance on non-RT workloads for Pascal versus Turing. The rest only serves to make matters complicated for no benefit.

cucker tarlson
I'm more interested in the power that the RTX logic requires.
That makes two of us (and probably many more). It's a pretty complicated test, I think, but the best way to get a handle on it is to put Turing RTX and non-RTX next to Pascal, test at a fixed FPS that all cards can manage, then measure power consumption. That is with the assumption that we accept Pascal and Turing to have ballpark-equal perf/watt; you could probably apply a correction for any deviation as well, but the problem is that it's not linear per game.
Posted on Reply
#19
londiste
The comparison is made based on TU106 vs TU116, focused on RT cores and Tensor cores. The shader changes from Pascal (to Volta) to Turing are not being looked at; that is where a considerable number of the additional transistors went.
Posted on Reply
#20
bug
Vayra86
That makes two of us (and probably many more). It's a pretty complicated test, I think, but the best way to get a handle on it is to put Turing RTX and non-RTX next to Pascal, test at a fixed FPS that all cards can manage, then measure power consumption. That is with the assumption that we accept Pascal and Turing to have ballpark-equal perf/watt; you could probably apply a correction for any deviation as well, but the problem is that it's not linear per game.
I think this one is tricky to measure (but I'd still like to know). Turn off RTX and the card will draw more frames; draw more frames, and the power draw goes up :(
You can mitigate that by locking the FPS to a set value, but then you're not stressing the hardware enough :(
Posted on Reply
#21
londiste
Vayra86
That makes two of us (and probably many more). It's a pretty complicated test, I think, but the best way to get a handle on it is to put Turing RTX and non-RTX next to Pascal, test at a fixed FPS that all cards can manage, then measure power consumption. That is with the assumption that we accept Pascal and Turing to have ballpark-equal perf/watt; you could probably apply a correction for any deviation as well, but the problem is that it's not linear per game.
It is not that simple. There are no cards that are directly comparable in terms of resources.
- RTX2070 (2304:144:64 and 8GB GDDR6 on 256-bit bus) vs GTX1070Ti (2432:125:64 and 8GB GDDR5 on 256-bit bus) is the closest comparison we can make. And even here, shaders, TMUs and memory type are different and we cannot make an exact correction for it. Memory we could account for roughly but more shaders and less TMUs on GTX is tough.
- RTX2080 (2944:184:64 and 8GB GDDR6 on 256-bit bus) vs GTX1080Ti (3584:224:88 and 11GB GDDR5X on 352-bit bus) is definitely closer in performance at stock but discrepancy in individual resources is larger, mostly due to larger memory controller along with associated ROPs.

The other question is which games/tests to run. Anything that is able to utilize new features in Turing either inherently (some concurrent INT+FP) or with dev support (RPM) will do better and is likely to better justify the additional cost.

Tom's Hardware Germany did try running RTX2080Ti at lower power limits and comparison in Metro Last Light: Redux. It ties to GTX 1080Ti (2GHz and 280W) at around 160W.
https://www.tomshw.de/2019/04/04/nvidia-geforce-rtx-2080-ti-im-grossen-effizienz-test-von-140-bis-340-watt-igorslab/
However, this is a very flawed comparison as RTX2080Ti has 21% more shaders and TMUs along with 27% more memory bandwidth. This guy overclocked the memory on 2080Ti from 1750MHz to 2150MHz making the memory bandwidth difference 56%. Lowering the power limit lowers the core clock (slightly above 1GHz at 160W) but does not reduce memory bandwidth.

Edit: Actually, in that Tom's Hardware Germany comparison, the RTX 2080 Ti runs at roughly 2 GHz at 340 W. Considering that it has 21% more shaders than the GTX 1080 Ti, we can roughly calculate the power consumption for comparison: 340 W / 1.21 ≈ 281 W, which is very close to the GTX 1080 Ti's 280 W number. This means the shaders are consuming roughly the same amount of power. At the same time, performance is up 47% in average FPS and 57% in minimum FPS. Turing does appear to be more efficient even in old games, but not by much.
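That normalization can be written out with the exact shader counts (4,352 vs 3,584 CUDA cores); the linear shaders-to-power assumption is the simplification doing all the work here:

```python
# Per-shader power normalization for the two cards at ~2 GHz
rtx2080ti_power = 340.0        # W (Tom's Hardware Germany test)
gtx1080ti_power = 280.0        # W
shader_ratio    = 4352 / 3584  # ≈ 1.21: RTX 2080 Ti vs GTX 1080 Ti shaders

# Assume power scales linearly with active shaders (a simplification)
normalized = rtx2080ti_power / shader_ratio
print(f"{normalized:.0f} W vs {gtx1080ti_power:.0f} W")  # → 280 W vs 280 W
```

With the exact shader ratio the normalized figures land almost on top of each other, supporting the conclusion that per-shader power is essentially unchanged from Pascal to Turing at the same clocks.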
Posted on Reply
#22
ppn
445 vs. 284 mm² means the die is about 157% the size, and the 2070 is about 150% of the 1660 Ti's performance, so by that logic the RT core only takes about 4% of the space. Not much.
It would be nice to mark the different functional parts on the infrared photo. I think the RT logic is still there; there's no way RT is only 4%.
Posted on Reply