NVIDIA Turing GeForce RTX Technology & Architecture


Shader Improvements

As much as NVIDIA talked up ray tracing specifically when presenting its Turing cards, the company did not shy away from improving the age-old shading engine and techniques for its new architecture. Nor could it: shading techniques will, for a while yet, remain the kings of adoption. And with so much GPU real estate devoted to Tensor and RT cores, NVIDIA could not simply look to improve shading performance via the addition of yet more execution units; the dies would grow prohibitively expensive. Performance improvements, thus, had to be achieved elsewhere.

This is where a number of techniques come in, and the first one we are going to talk about is Mesh Shading. With traditional shading workloads, each object to be rendered (think trees and their leaves, or multiple buildings in the distance) requires its own draw call from the CPU. Due to shader model limitations, these draw calls are issued from a single CPU thread, limiting the speed and algorithms that can be used; there is only so much parallelism to be extracted here.



Mesh Shading improves performance by reducing the number of draw calls required for objects in a scene, flexibly combining multiple stages of the traditional graphics pipeline. With Mesh Shading, NVIDIA introduces a Task Shader stage that incorporates both vertex and hull shading, and a Mesh Shader stage that brings together both domain and geometry shading.
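
As a rough CPU-side illustration of what changes, here is a minimal C++ sketch contrasting the two submission models: one driver call per object versus a single submission of a whole object list that task and mesh shader workgroups then process in parallel on the GPU. All identifiers are made up for illustration and do not correspond to NVIDIA's or any graphics API's actual interface.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical scene object; names are illustrative, not a real API.
struct Object { int meshId; };

// Traditional model: the CPU issues one draw call per visible object,
// so submission cost scales linearly with object count.
void drawPerObject(const std::vector<Object>& scene) {
    for (const Object& o : scene)
        std::printf("drawCall(mesh=%d)\n", o.meshId); // serialized on the CPU
}

// Mesh-shading model: the CPU hands over the whole list once; task and
// mesh shader workgroups cull, pick LODs, and emit geometry in parallel.
void drawObjectList(const std::vector<Object>& scene) {
    std::printf("dispatchTaskShaders(objectCount=%zu)\n", scene.size());
}

int main() {
    std::vector<Object> scene{{0}, {1}, {2}};
    drawPerObject(scene);  // three driver round-trips
    drawObjectList(scene); // one submission for the whole list
}
```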



The Mesh Shader in particular can now produce triangles for the rasterizer in a cooperative, parallelized thread model that took compute shading as its inspiration. The Task Shader, meanwhile, removes draw call requirements from the CPU by allowing the GPU to work on lists of objects instead of single objects. NVIDIA says this improves GPU utilization by removing CPU bottlenecks, enabling an increase of "more than an order of magnitude" in the number of objects that can be drawn. The Task Shader is particularly interesting in that it supports LOD (Level of Detail) tables for objects according to their distance, allowing the GPU to select on the fly exactly which version of an object to use (more detail if close, proportionally less if far). It then sends this information to the Mesh Shader, which can either simply process the geometry, if the detail level is low, or apply tessellation in cases of higher LOD.
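
The LOD-table idea also lends itself to a small sketch. Below is a C++ approximation of the selection step a task shader might perform, assuming a hypothetical three-entry table with made-up distance thresholds:

```cpp
#include <cstdio>
#include <initializer_list>

// Hypothetical LOD table entry: below maxDistance, use meshVersion.
struct LodEntry { float maxDistance; int meshVersion; };

constexpr LodEntry kLodTable[] = {
    {10.0f, 0},  // close: full-detail mesh, tessellation applied downstream
    {50.0f, 1},  // mid-range: reduced geometry
    {1e9f,  2},  // far: lowest-detail version
};

// Pick the first entry whose distance threshold the object satisfies.
int selectLod(float distance) {
    for (const LodEntry& e : kLodTable)
        if (distance <= e.maxDistance) return e.meshVersion;
    return 2; // fall back to the coarsest version
}

int main() {
    for (float d : {3.0f, 25.0f, 400.0f})
        std::printf("distance %.0f -> LOD %d\n", d, selectLod(d));
}
```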



Variable Rate Shading

Variable Rate Shading is another NVIDIA-introduced technology for Turing, one that looks to balance image quality and performance by distributing shading workload and GPU processing time according to an object's importance to any given scene. Basically, it allows developers to reduce shading workload and level of detail in areas they deem less important to the overall experience, and this approach is fine-grained, with every 16x16-pixel region of the screen adaptable on the fly. As such, developers can now opt to shade objects at the center of the viewport, or those that are static relative to the camera (such as cars and weapons), at full per-pixel rate, while using a single shading result to color four or more separate pixels in other areas of the screen; seven shading rates are available in total. Three particular techniques, covered in the sections below, drive these decisions.
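
Before getting to those, the mechanism they all share, picking one of the seven shading rates for each 16x16 tile, can be sketched in a few lines of C++. The enum names, the developer-supplied importance score, and the thresholds below are assumptions for illustration, not NVIDIA's API or heuristics:

```cpp
#include <cstdio>

// The seven coarse shading rates Turing's VRS supports, written as
// pixels covered per shading result (horizontal x vertical).
enum class ShadingRate { R1x1, R1x2, R2x1, R2x2, R2x4, R4x2, R4x4 };

// Hypothetical per-tile decision: an importance score in [0, 1],
// supplied by the developer, maps each 16x16 tile to a rate.
ShadingRate rateForTile(float importance) {
    if (importance > 0.75f) return ShadingRate::R1x1; // full per-pixel rate
    if (importance > 0.50f) return ShadingRate::R2x1; // one result, 2 pixels
    if (importance > 0.25f) return ShadingRate::R2x2; // one result, 4 pixels
    return ShadingRate::R4x4;                         // one result, 16 pixels
}

int main() {
    std::printf("rate index for importance 0.6: %d\n",
                static_cast<int>(rateForTile(0.6f)));
}
```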




Content Adaptive Shading

One way NVIDIA found of reducing the graphics workload for less-than-essential objects is to take temporal and spatial outputs from previous frames and analyze where additional (or reduced) processing is required. Content Adaptive Shading is a post-process step that looks at a rendered frame's output and determines quality requirements for the next one: low-detail areas such as skies, flat walls, or even shadowed portions of objects require less shading detail, and their shading rate can thus be reduced from one shading result per pixel to a single result shared by four pixels.
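
A minimal sketch of this idea, assuming a simple per-tile luminance-variance test (NVIDIA has not published its actual error metric), could look like the following:

```cpp
#include <cstdio>
#include <vector>

// Content-adaptive sketch: if last frame's 16x16 tile was visually flat
// (low luminance variance), shade it coarser next frame. The 0.001
// threshold is an assumption for illustration.
bool canCoarsenTile(const std::vector<float>& tileLuma) {
    float mean = 0.f;
    for (float l : tileLuma) mean += l;
    mean /= tileLuma.size();
    float var = 0.f;
    for (float l : tileLuma) var += (l - mean) * (l - mean);
    var /= tileLuma.size();
    return var < 0.001f; // flat sky/wall/shadow: drop from 1x1 toward 2x2
}

int main() {
    std::vector<float> skyTile(256, 0.8f); // perfectly uniform 16x16 tile
    std::printf("coarsen sky tile: %d\n", canCoarsenTile(skyTile) ? 1 : 0);
}
```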



Motion Adaptive Shading

This technique looks at a given object's speed within a frame to calculate the amount of shading required for it to present an adequate level of detail to our perception. Fast-moving objects are rendered with blur techniques, such as Motion Blur and DoF (Depth of Field), which by their nature make shading differences less noticeable; processing grunt can thus be saved by reducing shading resolution. Motion Adaptive Shading looks at the result of the various motion blur algorithms to calculate, in inverse proportion, the amount of shading work that is justified (lower speeds = higher shading requirements).
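
In code form, that inverse relationship might be sketched as below, with made-up velocity thresholds standing in for whatever heuristics a real implementation derives from its motion blur pass:

```cpp
#include <cstdio>
#include <initializer_list>

// Motion-adaptive sketch: shading detail falls as screen-space velocity
// rises, because motion blur hides the difference anyway.
int pixelsPerShadingResult(float velocityPixelsPerFrame) {
    if (velocityPixelsPerFrame < 2.f) return 1;  // near-static: full 1x1 rate
    if (velocityPixelsPerFrame < 8.f) return 4;  // moderate motion: 2x2
    return 16;                                   // fast motion: 4x4
}

int main() {
    for (float v : {0.5f, 5.0f, 30.0f})
        std::printf("velocity %.1f px/frame -> %d px per result\n",
                    v, pixelsPerShadingResult(v));
}
```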



Foveated Rendering

Foveated rendering is a technique we have seen applied to VR headsets to ease processing requirements for GPUs, which have to render at 90 FPS in that particular scenario. However, it applies to more mundane output options as well, such as monitors. Basically, it takes advantage of the fact that our visual acuity is highest for objects at the center of our field of view, while visual resolution for objects in the periphery is lower (you could say our brain applies foveated rendering by itself as well). As such, NVIDIA is banking on some way of knowing what our eyes are looking at (through eye tracking) to dynamically increase rendering resolution at the center of the user's field of vision while reducing it toward the periphery.
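
A minimal sketch of that rate decision, assuming a tracked gaze point and made-up eccentricity radii, could look like this:

```cpp
#include <cmath>
#include <cstdio>

// Foveated sketch: full shading rate near the tracked gaze point,
// progressively coarser toward the periphery. Radii are assumptions.
int pixelsPerShadingResult(float px, float py, float gazeX, float gazeY) {
    float dist = std::hypot(px - gazeX, py - gazeY); // pixels from gaze point
    if (dist < 200.f) return 1;  // fovea: full 1x1 rate
    if (dist < 600.f) return 4;  // near periphery: 2x2
    return 16;                   // far periphery: 4x4
}

int main() {
    // Assumed 1920x1080 display with the gaze resting at its center.
    std::printf("corner pixel: %d px per result\n",
                pixelsPerShadingResult(0.f, 0.f, 960.f, 540.f));
}
```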

Texture Space Shading

Texture Space Shading essentially allows Turing to reuse previously computed and shaded pixels on a frame-by-frame basis, allowing for both improved image quality and performance. These shading results are dynamically saved as texels in a texture space. As such, if a given pixel was previously shaded at full resolution, saved as a texel, and is rendered again in the next frame, there is no need to go over it again; its texel result is simply loaded from the texture space. This saves processing grunt because only new texels have to be processed, and the GPU can look at adjacent texels and cut workload by extrapolating from already-saved results. This is of particular importance in VR scenarios, where two slightly differing images have to be computed for every frame. With TSS, there is no longer any need to compute two full scenes, one per eye: developers can simply process the full-resolution image for the left eye and apply TSS to the right eye, rendering only the new texels that can't be reused from the left eye's results (because they were culled from view, or obscured).
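
The reuse mechanism boils down to a cache keyed by texel: shade on a miss, look up on a hit. Here is a deliberately simplified C++ sketch; the cache structure, keys, and stand-in shading function are assumptions for illustration, not Turing's actual mechanism.

```cpp
#include <cstdio>
#include <unordered_map>

// Texture-space-shading sketch: shaded results live in a texel-keyed
// store and survive across frames and views.
struct TexelCache {
    std::unordered_map<long long, float> shaded; // texel id -> stored result

    // Stand-in for an expensive full-resolution shading computation.
    static float shadeExpensively(long long id) { return float(id) * 0.5f; }

    float get(long long id) {
        auto it = shaded.find(id);
        if (it != shaded.end()) return it->second; // hit: reuse, no shading work
        float v = shadeExpensively(id);            // miss: shade once...
        shaded[id] = v;                            // ...and store for later reuse
        return v;
    }
};

int main() {
    TexelCache cache;
    cache.get(42);           // left eye shades and stores this texel
    float r = cache.get(42); // right eye reuses it without re-shading
    std::printf("reused result: %f\n", r);
}
```

The second lookup hitting the first one's stored entry is the whole point: only texels unique to the second view pay the shading cost.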

