Wednesday, September 21st 2022

NVIDIA Ada AD102 Block Diagram and New Architectural Features Detailed

Sep 21st, 2022 23:01

At the heart of the GeForce RTX 4090 is the gigantic AD102 silicon, which we broadly detailed in an older article. Built on the 4 nm silicon fabrication process, this chip measures 608 mm² in die-area, and crams in 76.3 billion transistors. We now have our first look into the silicon-level block diagram of the AD102, including the introduction of several new components.

The AD102 features a PCI-Express 4.0 x16 host interface and a 384-bit GDDR6X memory interface. The GigaThread Engine acts as the main resource-allocation component of the silicon. Ada introduces the Optical Flow Accelerator, a component crucial for DLSS 3 to generate entire frames without involving the graphics rendering machinery. The chip features twice as many media-encoding hardware engines as "Ampere," including hardware-accelerated AV1 encode/decode. Multiple accelerators mean that multiple video streams can be transcoded at once (helpful in a media production environment), or that a single stream can be transcoded at twice the frame rate, with the encoders taking turns on successive frames.

The main graphics rendering components of the AD102 are the GPCs (graphics processing clusters). There are 12 of these, compared to 7 on the previous-generation GA102. Each GPC pairs a raster engine and render backends with six TPCs (texture processing clusters). Each TPC packs two SMs (streaming multiprocessors), the indivisible number-crunching machinery of the NVIDIA GPU. The SM is where NVIDIA concentrates most of its architectural innovation. Each SM packs a 3rd generation RT core, a 128 KB L1 cache, and four TMUs, alongside four clusters that each pack 16 FP32 CUDA cores, 16 concurrent FP32+INT32 CUDA cores, 4 load/store units, a tiny L0 cache with warp scheduler and thread dispatch, a register file, and the all-important 4th generation Tensor core.

Each SM hence packs a total of 128 CUDA cores, 4 Tensor cores, and an RT core. There are 12 SMs per GPC, which works out to 1,536 CUDA cores, 48 Tensor cores, and 12 RT cores per GPC. Twelve GPCs hence add up to 18,432 CUDA cores, 576 Tensor cores, and 144 RT cores. Each GPC contributes 16 ROPs, so there are a mammoth 192 ROPs on the silicon. An L2 cache serves as the town square where the various GPCs, memory controllers, and the PCIe host interface exchange data. NVIDIA didn't mention the size of this L2 cache, but it is said to be significantly larger than the previous generation's, and plays a major role in lubricating the memory sub-system enough that NVIDIA can retain the same 21 Gbps @ 384-bit data-rate as the previous generation.
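The unit counts above follow directly from the GPC > TPC > SM hierarchy, and a few lines of Python can check the arithmetic. This is only a sketch built from the numbers quoted in this article; the constant and function names are made up for illustration, and the bandwidth figure assumes the usual "per-pin data rate times bus width" formula rather than anything NVIDIA stated.

```python
# Figures quoted in the article; names are illustrative, not NVIDIA terminology.
GPCS = 12
TPCS_PER_GPC = 6
SMS_PER_TPC = 2
CLUSTERS_PER_SM = 4
CUDA_PER_CLUSTER = 16 + 16  # 16 FP32 + 16 concurrent FP32/INT32
ROPS_PER_GPC = 16


def ad102_totals():
    """Multiply out the hierarchy to get chip-wide unit counts."""
    sms = GPCS * TPCS_PER_GPC * SMS_PER_TPC
    return {
        "sms": sms,
        "cuda_cores": sms * CLUSTERS_PER_SM * CUDA_PER_CLUSTER,
        "tensor_cores": sms * CLUSTERS_PER_SM,  # one Tensor core per cluster
        "rt_cores": sms,                        # one RT core per SM
        "rops": GPCS * ROPS_PER_GPC,
    }


def memory_bandwidth_gb_s(data_rate_gbps=21, bus_width_bits=384):
    # Assumption: bandwidth = per-pin data rate * bus width / 8 bits per byte.
    return data_rate_gbps * bus_width_bits / 8


print(ad102_totals())
print(memory_bandwidth_gb_s())  # 1008.0
```

Under that bandwidth assumption, 21 Gbps on a 384-bit bus comes to roughly 1 TB/s, which is the figure the larger L2 cache is helping to stretch.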

NVIDIA is introducing shader execution reordering (SER), a new technology that regroups math workloads so that the work handed to each set of worker threads is alike, and is processed more efficiently by the SIMD components. This is expected to have a particularly big impact on rendering games with ray tracing. A GPU works best when the same operation can be executed on multiple targets. For example, when rendering a triangle, each pixel runs the same shader in parallel. With ray tracing, each ray can end up executing a completely different piece of code, because it goes in a slightly different direction. With SER, the GPU will "sort" the operations, to create chunks of identical tasks and execute them in parallel. In Cyberpunk 2077 with its new Overdrive graphics preset, which significantly dials up RT calculations per pixel, SER improves performance by up to 44 percent. NVIDIA is developing Portal RTX, a mod for the original game with RTX effects added. Here, SER improves performance by 29 percent. It is also said to have a 20 percent performance impact on the Racer RTX interactive tech-demo we'll see this November. NVIDIA commented that there are various SER approaches and that the best choice varies by game, so it exposes the shader reordering functionality to game developers as an API, giving them control over how the sorting algorithm works so they can best optimize performance.
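The "sorting" idea can be illustrated with a toy model. The Python sketch below is not NVIDIA's algorithm or API; it simply assumes that a group of 32 threads (a warp) has to run once for every distinct shader found among its rays, and counts how many such passes are needed before and after the rays are grouped by shader. The function names and the cost model are assumptions made for this example.

```python
WARP_SIZE = 32  # threads that execute in lockstep


def simd_passes(shader_ids, warp_size=WARP_SIZE):
    """Toy cost model: each warp runs one pass per distinct shader it contains."""
    passes = 0
    for start in range(0, len(shader_ids), warp_size):
        warp = shader_ids[start:start + warp_size]
        passes += len(set(warp))
    return passes


def reorder(shader_ids):
    """Group rays that need the same shader next to each other."""
    return sorted(shader_ids)


# 128 rays whose hits cycle through 4 different shaders.
hits = [i % 4 for i in range(128)]
print(simd_passes(hits))           # 16: every warp contains all 4 shaders
print(simd_passes(reorder(hits)))  # 4: every warp contains a single shader
```

In this contrived case the reordered workload needs a quarter of the passes. Real gains depend on how divergent the rays are and on the cost of the reordering itself, which is presumably why the sorting behavior is left to developers.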

The displaced micro-mesh engine is a revolutionary component introduced with the new 3rd generation RT core, which accelerates the displaced micro-mesh (DMM) feature. Just as mesh shaders and tessellation have had a profound impact on performance with complex raster geometry, allowing game developers to significantly increase geometric complexity, DMMs are a method to reduce the complexity of the bounding-volume hierarchy (BVH) data structure, which is used to determine where a ray hits geometry. Previously, the BVH had to capture even the smallest details to properly determine the intersection point.

The BVH now needn't have data for every single triangle on an object, but can represent objects with complex geometry as a coarse mesh of base triangles, which greatly simplifies the BVH data structure. A simpler BVH means less memory consumed, and it greatly reduces the CPU load of ray tracing, because the CPU only has to generate a smaller structure. With the older "Ampere" and "Turing" RT cores, each triangle on an object had to be sampled at high overhead, so the RT core could precisely calculate ray intersection for each triangle. With Ada, the simpler BVH plus the displacement maps can be sent to the RT core, which is now able to figure out the exact hit point on its own. NVIDIA has seen 11:1 to 28:1 compression in total triangle counts. This speeds up BVH builds by 7.6x to over 15x compared to the older RT core, and shrinks the BVH storage footprint by anywhere between 6.5 and 20 times. DMMs could reduce disk and memory bandwidth utilization, utilization of the PCIe bus, and CPU utilization. NVIDIA worked with Simplygon and Adobe to add DMM support to their tool chains.
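To get a feel for how a coarse base triangle plus a displacement map can stand in for many small triangles, here is a simplified Python sketch. It assumes a uniform barycentric subdivision (each level splits every edge in two) and a scalar displacement pushed along a single normal; the actual DMM format is not described in this article, so treat the function names and the layout as illustrative only.

```python
def micro_triangle_count(level):
    """Each subdivision level splits a triangle into 4 smaller ones."""
    return 4 ** level


def displaced_micro_vertices(v0, v1, v2, normal, displacement, level):
    """Return micro-vertices of one base triangle after displacement.

    displacement(u, v) is a hypothetical lookup into a displacement map,
    returning a scalar offset along `normal` at barycentric coords (u, v).
    """
    n = 2 ** level  # segments per base-triangle edge
    verts = []
    for i in range(n + 1):
        for j in range(n + 1 - i):
            u, v = i / n, j / n
            w = 1.0 - u - v
            d = displacement(u, v)
            verts.append(tuple(
                w * a + u * b + v * c + d * nrm
                for a, b, c, nrm in zip(v0, v1, v2, normal)
            ))
    return verts


# One base triangle in the XY plane, bumped by a constant 0.5 along +Z.
verts = displaced_micro_vertices(
    (0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), lambda u, v: 0.5, level=3
)
print(len(verts), micro_triangle_count(3))  # 45 micro-vertices, 64 micro-triangles
```

In this sketch the BVH would only need to store the single base triangle, while the 64 micro-triangles are reconstructed from the displacement values when a ray actually reaches it.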

Opacity Micro Meshes (OMM) is a new feature introduced with Ada to improve rasterization performance, particularly with objects that have alpha (transparency) data. Many low-priority objects in a 3D scene, such as leaves on a tree, are essentially rectangles with a texture, where the transparency (alpha) channel creates the shape of the leaf. RT cores have a hard time intersecting rays with such objects, because they're not really in the shape that they appear (they're really just rectangles with textures that give you the illusion of shape). Previous-generation RT cores needed multiple interactions with the rendering stage to figure out the shape of a transparent object, because they couldn't test for alpha by themselves.

This has been solved by using OMMs. Just as DMMs simplify geometry by creating meshes of micro-triangles, OMMs create meshes of rectangular textures that align with the parts of the texture that aren't alpha, so the RT core has a better understanding of the object's geometry and can correctly calculate ray intersections. This has a significant impact on shading performance in non-RT applications, too. Practical applications of OMMs aren't limited to low-priority objects such as vegetation, but also include smoke sprites and localized fog. Traditionally there was a lot of overdraw for such effects, because they layered multiple textures on top of each other, all of which had to be fully processed by the shaders. Now only the non-opaque pixels get executed. OMMs provide a 30 percent speedup with graphics buffer fill-rates, and a 10 percent impact on frame-rates.
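A rough way to picture the idea: pre-classify small regions of an alpha texture so that most hits can be resolved without running a shader. The Python sketch below assumes a three-state classification (fully transparent, fully opaque, or mixed) over square cells. That layout and the function names are simplifications chosen for this example, not details taken from NVIDIA's material.

```python
TRANSPARENT, OPAQUE, UNKNOWN = "transparent", "opaque", "unknown"


def build_opacity_map(alpha, cell):
    """Classify each cell x cell block of an alpha texture.

    alpha is a 2D list of values in [0, 1] whose dimensions are assumed
    to be multiples of `cell`.
    """
    states = []
    for r in range(0, len(alpha), cell):
        row = []
        for c in range(0, len(alpha[0]), cell):
            texels = [alpha[y][x]
                      for y in range(r, r + cell)
                      for x in range(c, c + cell)]
            if max(texels) == 0:
                row.append(TRANSPARENT)  # ray passes through, no shader needed
            elif min(texels) == 1:
                row.append(OPAQUE)       # ray hits, no shader needed
            else:
                row.append(UNKNOWN)      # mixed block: fall back to an alpha test
        states.append(row)
    return states


def shader_calls_needed(states):
    """Only the mixed cells still need a shader to decide hit or miss."""
    return sum(row.count(UNKNOWN) for row in states)


alpha = [
    [0, 0,   1, 1],
    [0, 0,   1, 1],
    [0, 0.5, 1, 1],
    [0, 0,   1, 1],
]
omm = build_opacity_map(alpha, cell=2)
print(omm)
print(shader_calls_needed(omm))  # 1 of the 4 cells is mixed
```

In this toy texture three of the four cells are resolved from the precomputed states alone, which is the kind of shader work the feature is meant to avoid.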

DLSS 3 introduces a revolutionary new feature that promises a doubling in frame-rate at comparable quality: AI frame-generation. While it has all the features of DLSS 2 and its AI super-resolution (scaling up a lower-resolution frame to native resolution with minimal quality loss), DLSS 3 can also generate entire frames using AI, without involving the graphics rendering pipeline. Every other frame with DLSS 3 is hence AI-generated, rather than being a replica of the previous rendered frame.

This is possible only on the Ada graphics architecture, because of a hardware component called the optical flow accelerator (OFA), which assists in predicting what the next frame could look like by creating what NVIDIA calls an optical flow-field. The OFA ensures that the DLSS 3 algorithm isn't confused by static objects in a rapidly-changing 3D scene (such as a race sim). The process relies heavily on the performance uplift introduced by the FP8 math format of the 4th generation Tensor core.

A third key ingredient of DLSS 3 is Reflex. By reducing the rendering queue to zero, Reflex plays a vital role in ensuring the frame-times with DLSS 3 are at an acceptable level, and a render-queue doesn't confuse the upscaler. A combination of OFA and 4th gen Tensor core is why the Ada architecture is required to use DLSS 3, and why it won't work on older architectures.

21 Comments on NVIDIA Ada AD102 Block Diagram and New Architectural Features Detailed

dgianstefani

TPU Proofreader

Interesting stuff. Cool to see RTRT really being refined so much, and all that dedicated tensor hardware being put to work.

mouacyk

Neat, Ada is a VGPU.

HisDivineOrder

According to Nvidia, the Optical Flow Accelerator has existed in cards since Turing. It's not new. It's just improved.

https://twitter.com/i/web/status/1572334396145012738

AusWolf

So the GPU is essentially Ampere (in other words: Turing Refresh 2), just bigger. How innovative! :sleep:

The technological innovations are cool, though. I hope at least the RT-related ones won't be Nvidia exclusive.

Bwaze

I see lots of proprietary new architecture that also demands the game to have these features to show its potential. New and improved raytracing? It has to be supported in game. New and improved DLSS 3.0? Also only for games that support it.

Will we see the push in reviews and benchmarks to include as much of these new games and architecture as possible, to really skew the results against the Ampere and AMD cards?

The way I see it, 2x and 3x claims from Nvidia keynote presentation are all achieved by using such architecture and game changes, not by actually being 2x, 3x faster...

Unregistered

Lots of waste of silicon, instead of wasting silicon find a way to make DLSS work in all games, as for RT cores leave them for a Titan card and once we achieve full path tracing then bring it to the masses. The rich/fans...etc can buy the Titan to get the latest things, and the rest of us get fast relatively cheap GPUs.

AusWolf

Xex360 wrote: "Lots of waste of silicon, instead of wasting silicon find a way to make DLSS work in all games, as for RT cores leave them for a Titan card and once we achieve full path tracing then bring it to the masses. The rich/fans...etc can buy the Titan to get the latest things, and the rest of us get fast relatively cheap GPUs."

Alternatively, they could release another RT-free affordable GPU range like the Turing GTX 16-series.

Jimmy_

Tricky launch and pretty interesting launch tbh - Turing refresh (a bigger version)

Vayra86

@bug
"NVIDIA is introducing shader execution reordering, (SER), a new technology that reorganizes math workloads to be relevant to each worker thread, so it is more efficiently processed by the SIMD components. This is expected to have a particularly big impact on rendering games with ray tracing."

;)
Nvidia already picked all the low hanging fruit on raster

#10

Unregistered

AusWolf wrote: "Alternatively, they could release another RT-free affordable GPU range like the Turing GTX 16-series."

Yeah, maybe something like a GTX3070 would be great.

#11

bug

Vayra86 wrote: "@bug
'NVIDIA is introducing shader execution reordering, (SER), a new technology that reorganizes math workloads to be relevant to each worker thread, so it is more efficiently processed by the SIMD components. This is expected to have a particularly big impact on rendering games with ray tracing.'

;)
Nvidia already picked all the low hanging fruit on raster"

Everybody did, by now. And yet, it seems SER still benefits raster, even if RT can be more fitting for instruction reordering.

How about waiting for benchmarks before declaring IPC has/hasn't improved? Does that work for you?

#12

Anymal

Xex360 wrote: "Lots of waste of silicon, instead of wasting silicon find a way to make DLSS work in all games, as for RT cores leave them for a Titan card and once we achieve full path tracing then bring it to the masses. The rich/fans...etc can buy the Titan to get the latest things, and the rest of us get fast relatively cheap GPUs."

It's for the sake of their ecosystem... AMD will be wasting silicon for traditional rasterization, you have to wait till November 3rd or buy 6000 series, 6950xt is the fastest in TPU benchmarks.

#13

Vayra86

bug wrote: "Everybody did, by now. And yet, it seems SER still benefits raster, even if RT can be more fitting for instruction reordering.

"How about waiting for benchmarks before declaring IPC has/hasn't improved? Does that work for you?"

Absolutely, just pointing out the fun that marketing is :)

#14

Daven

So it's for sure 192 ROPs. Other news outlets continue to report 384. The number of pipelines has a huge impact on game performance.

#15

Denver

The chips would be about 30% smaller without the dedicated Tensor cores and RT trinket.

Poll: Would you buy a RTX 4xxx, without RT, but 15-20% cheaper ?

#16

Dirt Chip

Bwaze wrote: "I see lots of proprietary new architecture that also demands the game to have these features to show its potential. New and improved raytracing? It has to be supported in game. New and improved DLSS 3.0? Also only for games that support it.

"Will we see the push in reviews and benchmarks to include as much of these new games and architecture as possible, to really skew the results against the Ampere and AMD cards?

"The way I see it, 2x and 3x claims from Nvidia keynote presentation are all achieved by using such architecture and game changes, not by actually being 2x, 3x faster..."

Yep, NV goes all-in on AI to do the real performance uplift from gen to gen at the same $$ level.
And as you said, all the RT-DLSS3 stuff is just a gimmick until every game supports it from day 1 (or is automatically backward compatible - which will not happen of course).
I just want every game with DLSS2-FSR2 with no RTX at all. I live happily with 'baked scenes' just as I know that every game is fictional.

#17

bug

Vayra86 wrote: "Absolutely, just pointing out the fun that marketing is :)"

To me it's not fun, it's painful. I tend to read only the major announcements and take the very basic information from them. Too much b to waste my time on it.

#18

ModEl4

Daven wrote: "So it's for sure 192 ROPs. Other news outlets continue to report 384. The number of pipelines has a huge impact on game performance."

My original assumption was 192 due to memory bandwidth limitation, but when I saw news outlets reporting 384 ROPs, I thought that maybe it was a mixed solution like the 128 RB+ of RDNA2, with no improvement in 64b pixel color write/cycle and pixel color blend/cycle (Navi 10 can also do 64 despite having 64RB+)

#19

THU31

Denver wrote: "The chips would be about 30% smaller without the dedicated Tensor cores and RT trinket.

"Poll: Would you buy a RTX 4xxx, without RT, but 15-20% cheaper ?"

I would not. The 4080-12 will be about 20% faster than the 3080 in rasterized games. If it was 20% cheaper, it would still be $720, which is a terrible price for 20% more performance. And I do not even need that performance, I can run all rasterized games almost maxed out.

I actually would want to pay just for more RT performance. And I absolutely do not want to give up tensor cores. DLSS is one of the best things ever invented for games.

GPUs have gotten too big when manufacturing them was still cheap. If high-end cards had stayed below 400 mm², we would not be having this problem. Currently new processes are very expensive, and GPUs still have to be big to get any performance gains. I will just wait out this transition period until mid-range cards can offer a performance increase for me.

#20

Denver

THU31 wrote: "I would not. The 4080-12 will be about 20% faster than the 3080 in rasterized games. If it was 20% cheaper, it would still be $720, which is a terrible price for 20% more performance. And I do not even need that performance, I can run all rasterized games almost maxed out.

"I actually would want to pay just for more RT performance. And I absolutely do not want to give up tensor cores. DLSS is one of the best things ever invented for games.

"GPUs have gotten too big when manufacturing them was still cheap. If high-end cards had stayed below 400 mm², we would not be having this problem. Currently new processes are very expensive, and GPUs still have to be big to get any performance gains. I will just wait out this transition period until mid-range cards can offer a performance increase for me."

AMD has already proven that you can get the same effect of DLSS via Software. I don't know what magic you see in RT, running CP2077 at 30fps with a 3090ti looks like a bad joke.

Realistic reflections do not make a game realistic when everything else is not.

#21

THU31

Two games have impressed me with ray tracing - Metro: Exodus EE and Dying Light 2.
Why? Because they feature RTGI on top of other effects. And lighting is the primary factor affecting visual realism (and do not confuse realistic light behavior with artistic design, two completely different things).

But both games are difficult to run on a 3080 even with DLSS Performance when you turn RT on.
I usually do not bother with RT in games that only use it for shadows or reflections, unless I do not have to sacrifice anything to turn those on.

RTGI is incredible. Just look at Lumen in UE5, the Matrix demo for example.

The worst thing about devs implementing RTGI is that the non-RT lighting model suffers greatly. Without RTGI, both Metro and DL2 look much worse compared to beautiful games like Horizon Forbidden West or even Far Cry 6. The Matrix demo without Lumen also looked bad.
You can fake really good GI with rasterization, but they do not bother doing both if they can use RT. Crysis 3 had some impressive voxel-based GI, but that is difficult to do and has a high performance cost as well, and it is nowhere near as accurate as path tracing.

RTGI is very uncommon right now, because of performance reasons. Ada GPUs can change that, but if nobody can afford those cards, nobody will play with RTGI anyway.

As for DLSS, you could run it without tensor cores, but performance would be lower. Could that be balanced by adding more CUDA cores instead? Possibly. Did they lock it to tensor cores only to sell more RTX cards? Possibly, but they could have locked it to RTX cards even without tensor cores, so why do it this way? We will probably never know.
