Thursday, November 21st 2019

NVIDIA Develops Tile-based Multi-GPU Rendering Technique Called CFR

NVIDIA is invested in the development of multi-GPU rendering, specifically SLI over NVLink, and has developed a new multi-GPU rendering technique that appears to be inspired by tile-based rendering. Implemented at the single-GPU level, tile-based rendering has been one of NVIDIA's secret sauces that have improved performance since its "Maxwell" family of GPUs. 3DCenter.org discovered that NVIDIA is working on a multi-GPU adaptation of it, called CFR, which could be short for "checkerboard frame rendering" or "checkered frame rendering." The method is already quietly deployed in current NVIDIA drivers, although it is not documented for developers to implement.

In CFR, the frame is divided into tiny square tiles, like a checkerboard. Odd-numbered tiles are rendered by one GPU, and even-numbered ones by the other. Unlike AFR (alternate frame rendering), in which each GPU's dedicated memory holds a copy of all of the resources needed to render the frame, methods like CFR and SFR (split frame rendering) optimize resource allocation. CFR also purportedly exhibits less micro-stutter than AFR. 3DCenter also detailed the features and requirements of CFR. To begin with, the method is only compatible with DirectX (including DirectX 12, 11, and 10), not OpenGL or Vulkan. For now it's "Turing"-exclusive, since NVLink is required (probably because its bandwidth is needed to virtualize the tile buffer). Tools like NVIDIA Profile Inspector let you force CFR on, provided the other hardware and API requirements are met. It still has many compatibility problems, and remains practically undocumented by NVIDIA.
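The checkerboard split can be pictured with a minimal sketch. The 16-pixel tile size, the two-GPU setup, and the odd/even assignment rule here are illustrative assumptions; NVIDIA has not documented CFR's actual parameters:

```python
def assign_tiles(frame_w, frame_h, tile=16, gpus=2):
    """Map each tile coordinate (tx, ty) to a GPU index, alternating
    like a checkerboard so neighbouring tiles go to different GPUs."""
    assignment = {}
    for ty in range(frame_h // tile):
        for tx in range(frame_w // tile):
            assignment[(tx, ty)] = (tx + ty) % gpus
    return assignment

# A 64x64 frame with 16px tiles gives a 4x4 grid, split 8 tiles per GPU
tiles = assign_tiles(64, 64)
```

Because horizontally and vertically adjacent tiles always land on different GPUs, the per-frame load each GPU sees is roughly even regardless of where the expensive parts of the scene are, which is the usual argument for a checkerboard over a simple top/bottom split.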
Source: 3DCenter.org

33 Comments on NVIDIA Develops Tile-based Multi-GPU Rendering Technique Called CFR

#1
R-T-B
This was explained back when Crossfire and SLI were making their debut, IIRC. It is not exactly new? Or am I missing something.

In all cases I recall, these techniques sucked because they mandated each gpu still had to render the complete scene geometry, and only helped with fill rate.
Posted on Reply
#2
btarunr
Editor & Senior Moderator
R-T-B
This was explained back when Crossfire and SLI were making their debut, IIRC. It is not exactly new? Or am I missing something.

In all cases I recall, these techniques sucked because they mandated each gpu still had to render the complete scene geometry, and only helped with fill rate.
Yeah, I too had a lot of deja vu writing this, and had a long chat with W1zzard. Maybe it's some kind of TBR extrapolation for multi-GPU which they finally got right.
Posted on Reply
#3
R-T-B
btarunr
Yeah, I too had a lot of deja vu writing this, and had a long chat with W1zzard. Maybe it's some kind of TBR extrapolation for multi-GPU which they finally got right.
I sometimes swear they are selling us the same darn tech with new buzzwords...

Maybe the matrix is just glitching again...
Posted on Reply
#4
W1zzard
It seems they are leveraging their (single-GPU) tiled-rendering hardware in the silicon to split up the image for CFR, possibly with non-50/50 splits that could dynamically change during runtime to spread the load better.
Posted on Reply
#5
londiste
Did either AMD or Nvidia manage to get dynamic splitting to work reliably? As far as I remember, all the attempts were eventually abandoned because the solutions came with their own set of problems, primarily uneven frame times and stuttering.

Single-GPU tiled-rendering hardware would use tiles of a static size, but playing around with the tile count per GPU might work?
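As a toy sketch of that idea (the inverse-frame-time heuristic below is purely illustrative, not anything NVIDIA or AMD has documented): keep the tile size static and redistribute how many tiles each GPU gets based on how long it took to finish the previous frame.

```python
def rebalance(tile_count, frame_times_ms):
    """Redistribute a fixed number of fixed-size tiles: each GPU's share
    is proportional to its measured throughput (1 / last frame time)."""
    speeds = [1.0 / t for t in frame_times_ms]
    total = sum(speeds)
    shares = [round(tile_count * s / total) for s in speeds]
    shares[-1] = tile_count - sum(shares[:-1])  # absorb rounding error
    return shares

# GPU 0 finished in 8 ms, GPU 1 in 12 ms -> GPU 0 gets 60 of 100 tiles
```

This sidesteps the hard problem of resizing tiles on the fly: the tile grid never changes, only the boundary between each GPU's slice of it moves from frame to frame.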
Posted on Reply
#6
silentbogo
I'm wondering if that new tile-based technique will introduce artifacts in the picture, just like with tearing in SFR?
Posted on Reply
#7
notb
silentbogo
I'm wondering if that new tile-based technique will introduce artifacts in the picture, just like with tearing in SFR?
Not when doing RTRT, which is likely the reason they're developing this (and mostly for game streaming services, not local GPUs).
londiste
Did either AMD or Nvidia manage to get dynamic splitting to work reliably? As far as I remember, all the attempts were eventually abandoned because the solutions came with their own set of problems, primarily uneven frame times and stuttering.
Well, actually this is a problem that RTRT solves automatically.
In legacy game rendering techniques the input consists of instructions that must all be run. There's little control over time: the GPU has to complete (almost) everything or there's no image at all.
So the rendering time is a result (not a parameter), and each frame has to wait for the last tile.

In RTRT, frame rendering time (i.e. the number of rays) is a primary input parameter, so it's not really relevant how you split the frame.
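A minimal sketch of that point, where `trace_one_ray` is a hypothetical stand-in for the actual per-ray work: the frame deadline is the input, and the ray count is simply whatever fits before it expires.

```python
import time

def trace_frame(deadline_ms, trace_one_ray):
    """Keep tracing rays until the frame deadline expires, then hand the
    result to the denoiser. Ray count is an input knob, not a consequence:
    the frame always finishes on time, only the image quality varies."""
    rays = 0
    start = time.perf_counter()
    while (time.perf_counter() - start) * 1000.0 < deadline_ms:
        trace_one_ray()
        rays += 1
    return rays
```

Each GPU can run this loop independently on its own region, and no GPU ever holds up the frame waiting for a slow tile; a slower GPU just contributes fewer samples to its region.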

Posted on Reply
#8
The Quim Reaper
So, Nvidia are now using the technique that UK based PowerVR developed, a company that Nvidia effectively forced out of the PC GPU market with their dirty tricks, in the early 2000's... :rolleyes:
Posted on Reply
#9
notb
The Quim Reaper
So, Nvidia are now using the technique that UK based PowerVR developed, a company that Nvidia effectively forced out of the PC GPU market with their dirty tricks, in the early 2000's... :rolleyes:
Tile-based rendering is a straightforward, natural approach. It's commonly used in non-gaming rendering engines (you see it happening in Cinebench). PowerVR didn't invent it. They may have just been the first to implement it in hardware.
Posted on Reply
#11
mtcn77
Don't overthink it. We needed SLI in DirectX 12, and now we have it. The trick is running render targets separately; having to apply equal post-process weights across the whole screen is why it has been difficult to scale up SFR performance. Since there is no unified dynamic lighting load in RTX mode, this might work.
Posted on Reply
#12
kapone32
A question for the community; Would a VBIOS update be enough to enable crossfire on the 5700 cards?
Posted on Reply
#13
3rold
Here's a crazy idea: why not work with M$/AMD to optimize DX12/Vulkan? Hell, Vulkan has an open-source SDK; it doesn't even need special cooperation with anyone.
Also, back when DX12 was launched there was a lot of hype about how well it would perform with multi-GPU setups using async technologies (independent chips & manufacturers) https://wccftech.com/dx12-nvidia-amd-asynchronous-multigpu/
Seems like everyone forgot about it...
Posted on Reply
#14
AnarchoPrimitiv
Would this have anything to do with MCM GPUs? I hope AMD beats Nvidia to an MCM GPU (multiple GPU chiplets, not just a GPU and HBM). I'm not an AMD fanboy in the least; I just dislike Nvidia and want them cut down to size like Intel has been, because Intel getting their ass whooped has benefitted consumers, and the same happening to Nvidia would probably benefit us all.
Posted on Reply
#15
TheLostSwede
The Quim Reaper
So, Nvidia are now using the technique that UK based PowerVR developed, a company that Nvidia effectively forced out of the PC GPU market with their dirty tricks, in the early 2000's... :rolleyes:
Wow, I thought people had forgotten about them. Nvidia was also trashing them back then and saying tile based rendering sucked...
Posted on Reply
#16
RH92
AnarchoPrimitiv
Would this have anything to do with MCM GPUs?
That's exactly what I thought! I don't see them revamping SLI after killing it; to me this has more to do with future MCM designs, and it might be a good indication that an MCM-based gaming GPU from Nvidia is closer than most of us believe.
Posted on Reply
#17
londiste
TheLostSwede
Nvidia was also trashing them back then and saying tile based rendering sucked...
:)
http://dumpster.hardwaretidende.dk/dokumenter/nvidia_on_kyro.pdf

Edit:
For a bit of background, this was a presentation to OEMs.
Kyro was technically new and interesting, but as an actual gaming GPU on desktop cards it sucked, due both to spotty support and lackluster performance. It definitely had its bright moments, but they were too few and far between. PowerVR could not develop their tech fast enough to compete with Nvidia and ATi at the time.
PowerVR itself went along just fine, the same architecture series was (or is) a strong contender in mobile GPUs.
Posted on Reply
#18
64K
I see MCM as the way forward, not just another version of SLI. For one thing, buying 2 cards must be more expensive than a single MCM card that could rival the performance of multi-GPU. Granted, the MCM card will cost more than a regular GPU, but with SLI you have to buy 2 of everything: 2 PCBs, 2 GPUs, 2 sets of VRAM, 2 of all the components on the PCB, 2 shrouds, 2 coolers, 2 boxes, etc.
Posted on Reply
#19
eidairaman1
The Exiled Airman
They are trying to justify the rtx lineup lol
Posted on Reply
#20
Flanker
3rold
Also, back when DX12 was launched there was a lot of hype about how well it would perform with multi-GPU setups using async technologies (independent chips & manufacturers) https://wccftech.com/dx12-nvidia-amd-asynchronous-multigpu/
Seems like everyone forgot about it...
With good reason. For it to really work, the programmer would need to optimize for each specific system. If I'm writing GPGPU software for a solution I'm selling bundled with a computer (whose hardware I get to specify), it could be worth the effort. For games that can run on any combination of rendering hardware? Eh, no thanks. The world is just much simpler when we only have to think about two identical GPUs to balance the workload equally. Even then we get cans of worms thrown at our faces from time to time.
Posted on Reply
#21
Steevo
W1zzard
It seems they are leveraging their (single-GPU) tiled-rendering hardware in the silicon to split up the image for CFR, possibly with non-50/50 splits that could dynamically change during runtime to spread the load better.
A good idea, with complexity: how is full-screen AA processed if only half the resources are on each card?

Or could this possibly be a Zen-like chiplet design to save money and reduce losses on the newest node?

With NVLink as the fabric for communication, and if only half the resources are actually required, then (maybe I'm out in left field) put 12GB or 6GB on each chiplet and interleave the memory.
Posted on Reply
#22
londiste
Steevo
A good idea with complexity, how is full screen AA processed if only half the resources are on each card?
Or could this possibly be a Zen like chiplet design to save money and loss on the newest node?
With NVLink as the fabric for communication, and if only half the resources are actually required, then (maybe I'm out in left field) put 12GB or 6GB on each chiplet and interleave the memory.
AA would probably be one of the postprocessing methods done at the end of rendering a frame.

You can't get away with shared memory like that. You are still going to need a sizable part of the assets accessible by both/all GPUs. Any memory far away from a GPU is evil, and even a fast interconnect like NVLink won't replace local memory. GPUs are very bandwidth-constrained, so sharing memory access through something like Zen 2's IO die is not likely to work on GPUs at this time. With a big HBM cache for each GPU, maybe, but that is effectively still each GPU having its own VRAM :)

Chiplet design has been the end goal for a while, and all the GPU makers have been trying their hand at it. So far, unsuccessfully. As @Apocalypsee already noted, even tiled distribution of work is not new.
Posted on Reply
#23
TheLostSwede
londiste
:)
http://dumpster.hardwaretidende.dk/dokumenter/nvidia_on_kyro.pdf

Edit:
For a bit of background, this was a presentation to OEMs.
Kyro was technically new and interesting, but as an actual gaming GPU on desktop cards it sucked, due both to spotty support and lackluster performance. It definitely had its bright moments, but they were too few and far between. PowerVR could not develop their tech fast enough to compete with Nvidia and ATi at the time.
PowerVR itself went along just fine, the same architecture series was (or is) a strong contender in mobile GPUs.
It wasn't that bad. I tested the cards myself at the time; in fact, I'm still on good terms with Imagination Technologies' PR director, who used to come by the office with things to test. But yes, they did have driver issues, which was one of the big flaws, though performance wasn't as terrible as that old Nvidia presentation makes it out to be.
Posted on Reply
#24
eidairaman1
The Exiled Airman
TheLostSwede
It wasn't that bad. I tested the cards myself at the time; in fact, I'm still on good terms with Imagination Technologies' PR director, who used to come by the office with things to test. But yes, they did have driver issues, which was one of the big flaws, though performance wasn't as terrible as that old Nvidia presentation makes it out to be.
Of course nv will commit libel just like intel.
Posted on Reply
#25
Steevo
londiste
AA would probably be one of the postprocessing methods done at the end of rendering a frame.

You can't get away with shared memory like that. You are still going to need a sizable part of the assets accessible by both/all GPUs. Any memory far away from a GPU is evil, and even a fast interconnect like NVLink won't replace local memory. GPUs are very bandwidth-constrained, so sharing memory access through something like Zen 2's IO die is not likely to work on GPUs at this time. With a big HBM cache for each GPU, maybe, but that is effectively still each GPU having its own VRAM :)

Chiplet design has been the end goal for a while, and all the GPU makers have been trying their hand at it. So far, unsuccessfully. As @Apocalypsee already noted, even tiled distribution of work is not new.
I know NVlink won't replace memory, it's merely the protocol for interdie communication.


I am saying the IO die could handle memory interleaving between two sets of 6GB VRAM and assign shared and dedicated memory and resources. It's the same sort of memory management already in use, but with the ability to share resources across multiple dies, which would also make them a good shared workstation card, allowing hardware management of users and resource allocation.
Posted on Reply