NVIDIA Turing may be the company's best-kept secret, if it has indeed been 15 years in the making. The architecture introduces a feature NVIDIA considers possibly the biggest innovation in real-time 3D graphics since programmable shaders arrived early last decade. Real-time ray tracing has long been regarded as the holy grail of 3D graphics because of the sheer amount of computation it requires. The new GeForce RTX family of graphics cards promises to put a semblance of ray tracing in the hands of gamers. We call it a semblance because NVIDIA has adopted some very clever tricks to make it work, and the resulting 3D scenes do tend to resemble renders that have undergone hours of ray tracing.
Around this time last year, when we first heard the codename "Turing," we dismissed it as silicon NVIDIA might develop to cash in on the blockchain boom of the time, since the architecture is named after the mathematician who saved millions of lives by cracking the Nazi "Enigma" cipher, which helped bring World War II to a speedy end. Little did we know that NVIDIA's tribute to Alan Turing would honor not merely his achievements in cryptography, but his overall reputation as the father of artificial intelligence and theoretical computing.
Over the past five years, NVIDIA has invested heavily in AI, developing the first deep-learning neural network models that leverage its CUDA technology and powerful GPUs. Building and training neural nets initially proved very time-consuming for even the most powerful GPUs, which called for hardware that accelerates tensor operations. NVIDIA thus built its first fixed-function component for tensor ops, called simply "Tensor Cores." These are large, specialized components that compute 4x4 matrix multiply-accumulate operations. Tensor Cores debuted with the "Volta" architecture, which we thought at the time would be the natural successor to "Pascal." However, NVIDIA decided the time was ripe to bring the RTX technology out of the oven.
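To make the tensor operation concrete, here is a minimal software sketch of a multiply-accumulate on small square tiles, D = A x B + C. It is only an illustration of the arithmetic, not of how the hardware is built. The function name tile_mma is hypothetical, the tile size is left as a parameter, and the 4x4 tile used in the usage lines, like the accumulate term C, is taken from NVIDIA's public description of Volta rather than from this article.

```python
from typing import List

Matrix = List[List[float]]


def tile_mma(a: Matrix, b: Matrix, c: Matrix) -> Matrix:
    """Return D = A x B + C for square tiles of equal size (hypothetical helper)."""
    n = len(a)
    # All three tiles must be n x n for the operation to be defined here.
    for m in (a, b, c):
        if len(m) != n or any(len(row) != n for row in m):
            raise ValueError("all tiles must be square and the same size")
    # Each output element is one dot product plus the matching element of C.
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) + c[i][j] for j in range(n)]
        for i in range(n)
    ]


# Usage: a 4x4 tile (assumed size); multiplying by the identity leaves B
# unchanged, so every element of D is the matching element of B plus 1.
n = 4
identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
b = [[float(i * n + j) for j in range(n)] for i in range(n)]
c = [[1.0] * n for _ in range(n)]
d = tile_mma(identity, b, c)
```

A single 4x4 tile of this kind involves 64 multiplications, which gives a sense of why dedicated hardware for the operation pays off when neural nets chain millions of them.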
The Turing graphics architecture introduces the third (and final) piece of the hardware puzzle that makes NVIDIA's ambitious consumer ray-tracing plans work: RT cores. An RT core is a fixed-function hardware block that does what the spiritual ancestor of RTX, NVIDIA OptiX, did on CUDA cores. You input the mathematical representation of a ray, and it traverses the scene to calculate the point of intersection with any triangle in the scene.
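The question an RT core answers (given a ray, which triangle does it hit, and where?) can be sketched in software. The example below is only an illustration under stated assumptions: it uses the well-known Möller-Trumbore ray-triangle test and a brute-force loop over every triangle, whereas the article does not say which algorithm the hardware uses, and real-time implementations rely on an acceleration structure rather than testing each triangle. The names ray_triangle and closest_hit are hypothetical.

```python
from typing import Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Triangle = Tuple[Vec3, Vec3, Vec3]


def sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a: Vec3, b: Vec3) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def ray_triangle(origin: Vec3, direction: Vec3, tri: Triangle,
                 eps: float = 1e-9) -> Optional[float]:
    """Moller-Trumbore test: distance t along the ray to the hit, or None."""
    v0, v1, v2 = tri
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(direction, e2)
    det = dot(e1, p)
    if abs(det) < eps:          # ray is parallel to the triangle's plane
        return None
    inv = 1.0 / det
    s = sub(origin, v0)
    u = dot(s, p) * inv         # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return None
    q = cross(s, e1)
    v = dot(direction, q) * inv  # second barycentric coordinate
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(e2, q) * inv
    return t if t > eps else None  # reject hits behind the ray origin


def closest_hit(origin: Vec3, direction: Vec3,
                triangles: Sequence[Triangle]):
    """Brute-force scan: return (index, t, point) of the nearest hit, or None."""
    best = None
    for i, tri in enumerate(triangles):
        t = ray_triangle(origin, direction, tri)
        if t is not None and (best is None or t < best[1]):
            best = (i, t)
    if best is None:
        return None
    i, t = best
    point = tuple(origin[k] + t * direction[k] for k in range(3))
    return i, t, point


# Usage: two triangles facing the camera, one at z = 5 and one at z = 3.
far = ((-1.0, -1.0, 5.0), (1.0, -1.0, 5.0), (0.0, 1.0, 5.0))
near = ((-1.0, -1.0, 3.0), (1.0, -1.0, 3.0), (0.0, 1.0, 3.0))
hit = closest_hit((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), [far, near])
```

The brute-force loop costs one test per triangle per ray, which is exactly the kind of workload that becomes impractical on general-purpose cores once scenes hold millions of triangles.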
NVIDIA RTX is an all-encompassing, highly flexible, real-time ray-tracing model for consumer graphics. It seeks to minimize the toolset and learning curve for today's 3D graphics programmers, and to have as tangible an impact on realism as anti-aliasing, programmable shaders, and tessellation (all of which triggered leaps in GPU compute power). On Turing, the latest-generation CUDA cores work alongside a new component called the RT core, and Tensor Cores, to make RTX work.
NVIDIA debuted RTX with the Quadro RTX line of professional graphics cards first, at SIGGRAPH 2018, not only because the event precedes Gamescom, but also because it gives content creators a head start on the technology. The GeForce RTX family is the first in a decade to lack "GTX" in its branding, which speaks to just how much weight rests on RTX to succeed.
In this article, we dive deep into the inner workings of the NVIDIA RTX technology and Turing GPU architecture, and how the two are put together in the first three GeForce RTX 20-series graphics cards you'll be able to purchase later this month.
Very soon, when the NVIDIA review embargo is lifted, we'll also provide our own review with Turing performance results in lots of games.
When NVIDIA released the first die shot of Turing, it looked unlike any GPU die we'd seen from NVIDIA in a decade. Block diagrams reveal that the first two Turing implementations, TU102 and TU104, more or less retain the component hierarchy of their predecessors, but pack big changes to the streaming multiprocessor (SM), the indivisible sub-unit of the GPU: the introduction of RT cores and Tensor Cores, and a new warp scheduler that allows concurrent INT32 and FP32 execution, a feature that could improve the chip's asynchronous compute performance. We will dive deeper into the mechanics of Turing's CUDA cores, SMs, RT cores, and Tensor Cores on subsequent pages.
The Turing TU102 is one of the biggest single pieces of non-storage silicon ever built. Made on the 12 nm silicon fabrication process like the rest of the Turing family, this chip packs a whopping 18 billion transistors and promises a TDP of just 250 W. The chip is armed with six GPCs (graphics processing clusters), each of which packs twelve streaming multiprocessors (SMs). There are hence 72 SMs across the TU102.
The addition of Tensor Cores and RT cores leaves room for just 64 CUDA cores per SM, or 768 per GPC and 4,608 across the die. Unlike on the GV100, a Tensor Core here takes up approximately eight times the die area of a CUDA core. Each SM has 8 Tensor Cores, which works out to 96 per GPC and 576 across the whole GPU. The RT core is the largest indivisible component in an SM, and there is only one per SM, so 12 per GPC and 72 across the die. A 384-bit GDDR6 memory bus handles up to 24 GB of memory.
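The per-GPC and per-die totals above follow directly from the per-SM figures, and a few lines of arithmetic make the roll-up easy to check. This is a sketch only: the names SMConfig and roll_up are hypothetical, and the per-SM counts are the ones quoted in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SMConfig:
    """Per-SM unit counts as quoted in the article for Turing."""
    cuda_cores: int = 64
    tensor_cores: int = 8
    rt_cores: int = 1


def roll_up(gpcs: int, sms_per_gpc: int, sm: SMConfig = SMConfig()) -> dict:
    """Multiply per-SM counts up to per-GPC and per-die totals."""
    def scaled(sm_count: int) -> dict:
        return {
            "cuda_cores": sm.cuda_cores * sm_count,
            "tensor_cores": sm.tensor_cores * sm_count,
            "rt_cores": sm.rt_cores * sm_count,
        }

    total_sms = gpcs * sms_per_gpc
    return {
        "sms": total_sms,
        "per_gpc": scaled(sms_per_gpc),
        "per_die": scaled(total_sms),
    }


# Usage: the full TU102 has six GPCs of twelve SMs each.
tu102 = roll_up(gpcs=6, sms_per_gpc=12)
```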
Alas, the GeForce RTX 2080 Ti, like its predecessor, the GTX 1080 Ti, does not max out its silicon. Only 68 of the TU102's 72 SMs are enabled, which works out to 4,352 CUDA cores, 544 Tensor Cores, and 68 RT cores. The memory bus, too, is narrowed to 352-bit, driving 11 GB of memory. With Turing, NVIDIA put the brakes on doubling memory amounts with each passing generation. Perhaps the state of the DRAM industry is to blame here, along with the lack of a use case for twice the memory of the previous generation.
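The link between bus width and memory size is simple arithmetic once you assume how the memory chips attach. The sketch below assumes each GDDR6 chip occupies 32 bits of the bus, and that chips hold 1 GB or 2 GB each; neither figure is stated in this article, and both are labeled as assumptions in the code. The function name memory_config is hypothetical.

```python
# Assumption (not from the article): each GDDR6 chip uses 32 bits of the bus.
CHIP_INTERFACE_BITS = 32


def memory_config(bus_width_bits: int, gb_per_chip: int) -> tuple:
    """Return (chip_count, total_gb) for a given bus width and chip density."""
    if bus_width_bits % CHIP_INTERFACE_BITS != 0:
        raise ValueError("bus width must be a multiple of the chip interface")
    chips = bus_width_bits // CHIP_INTERFACE_BITS
    return chips, chips * gb_per_chip


# Usage: chip densities of 2 GB and 1 GB are assumptions chosen to match
# the capacities quoted in the article.
full_tu102 = memory_config(384, gb_per_chip=2)   # full 384-bit bus
rtx_2080_ti = memory_config(352, gb_per_chip=1)  # narrowed 352-bit bus
```

Under these assumptions, narrowing the bus from 384-bit to 352-bit amounts to leaving one chip position unpopulated, which is why the capacity lands on 11 GB rather than a rounder number.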
The Turing TU104 will go down as the poster boy of the Turing architecture, much like its long line of predecessors, such as the GK104 "Kepler" (GTX 680). This chip powers the GeForce RTX 2080, which, interestingly, does not max out all components physically present on the chip. Perhaps the rumored RTX 2080+ could do just that.
Another key difference between the TU104 and its predecessors is its GPC count relative to its bigger sibling. The TU104 has exactly the same number of GPCs as the TU102, six, with the only difference being the SM count per GPC: the TU102 has 12 SMs per GPC, while the TU104 has 8. These SMs have the same configuration as those of the TU102.
A maxed-out TU104 would hence have 3,072 CUDA cores, 384 Tensor Cores, and 48 RT cores. The RTX 2080 does not max this chip out, and instead has two of its SMs disabled. This translates to 2,944 CUDA cores, 368 Tensor Cores, and 46 RT cores. Luckily, the memory interface is untouched: the 256-bit wide GDDR6 memory bus drives 8 GB of memory on the RTX 2080.
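The shipping cards' unit counts can be derived the same way for both chips: start from the full die and subtract the disabled SMs. This is a rough sketch with a hypothetical function name, enabled_units, using only the per-SM counts and SM totals quoted in this article.

```python
# Per-SM unit counts quoted in the article for Turing.
CUDA_PER_SM, TENSOR_PER_SM, RT_PER_SM = 64, 8, 1


def enabled_units(gpcs: int, sms_per_gpc: int, disabled_sms: int = 0) -> dict:
    """Unit counts for a card that ships with some SMs switched off."""
    total_sms = gpcs * sms_per_gpc
    if not 0 <= disabled_sms <= total_sms:
        raise ValueError("disabled_sms must be between 0 and the SM total")
    active = total_sms - disabled_sms
    return {
        "sms": active,
        "cuda_cores": active * CUDA_PER_SM,
        "tensor_cores": active * TENSOR_PER_SM,
        "rt_cores": active * RT_PER_SM,
    }


# Usage: full TU104, then the two GeForce cards discussed above.
tu104_full = enabled_units(gpcs=6, sms_per_gpc=8)
rtx_2080 = enabled_units(gpcs=6, sms_per_gpc=8, disabled_sms=2)
rtx_2080_ti = enabled_units(gpcs=6, sms_per_gpc=12, disabled_sms=4)
```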