Thursday, September 3rd 2020

NVIDIA RTX IO Detailed: GPU-assisted Storage Stack Here to Stay Until CPU Core-counts Rise

NVIDIA at its GeForce "Ampere" launch event announced the RTX IO technology. Storage is the weakest link in a modern computer from a performance standpoint, and SSDs have had a transformational impact. With modern SSDs leveraging PCIe, consumer storage speeds are now set to grow with each new PCIe generation, which doubles per-lane IO bandwidth. PCI-Express Gen 4 enables 64 Gbps of bandwidth per direction on x4 M.2 NVMe SSDs; AMD has already implemented it across its Ryzen desktop platform, while Intel has it on its latest mobile platforms and is expected to bring it to the desktop with "Rocket Lake." While more storage bandwidth is always welcome, the storage processing stack (the task of shepherding data between the application and the physical layer) is still handled by the CPU. As storage bandwidth rises, the IO load on the CPU rises proportionally, to the point where it can begin to impact performance. Microsoft sought to address this emerging challenge with the DirectStorage API, and NVIDIA wants to build on it.

According to tests by NVIDIA, reading uncompressed data from an SSD at 7 GB/s (the typical maximum sequential read speed of client-segment PCIe Gen 4 M.2 NVMe SSDs) requires the full utilization of two CPU cores. The OS typically spreads this workload across all available CPU cores/threads on a modern multi-core CPU. Things change dramatically when compressed data (such as game resources) is being read in a gaming scenario, with a high number of IO requests. Modern AAA games cram hundreds of thousands of individual resources into compressed resource-pack files.
Although at the disk IO level, data still moves at up to 7 GB/s, the decompressed data stream at the CPU level can be as high as 14 GB/s (best-case compression). On top of this, each IO request carries its own overhead: a set of instructions for the CPU to fetch resource x from file y and deliver it to buffer z, along with instructions to decompress or decrypt the resource. At high IO throughput this can consume an enormous amount of CPU muscle, and NVIDIA pegs the number of CPU cores required at as high as 24. As we explained earlier, DirectStorage provides a path for devices to process the storage stack directly and access the resources they need. The Microsoft API was originally developed for the Xbox Series X, but is now making its debut on the PC platform.
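NVIDIA's core-count figure is easy to sanity-check with back-of-envelope arithmetic. The sketch below assumes a per-core lossless decompression rate of about 0.6 GB/s; that number is an illustrative assumption, not an NVIDIA-published figure:

```python
# Back-of-envelope estimate of the CPU cores needed to decompress a
# saturated PCIe Gen 4 NVMe stream. The per-core decompression rate
# is an assumed, illustrative figure.

DISK_READ_GBPS = 7.0          # client Gen 4 M.2 sequential read, GB/s
COMPRESSION_RATIO = 2.0       # best-case ratio cited in the article
PER_CORE_DECOMP_GBPS = 0.6    # assumed lossless decompression rate per core

# The CPU sees the decompressed stream, not the raw disk stream.
output_gbps = DISK_READ_GBPS * COMPRESSION_RATIO
cores_needed = output_gbps / PER_CORE_DECOMP_GBPS

print(f"Decompressed stream: {output_gbps:.0f} GB/s")
print(f"Estimated cores required: {cores_needed:.0f}")
```

With these assumed inputs the estimate lands in the low twenties, which is the same ballpark as NVIDIA's 24-core claim.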

NVIDIA RTX IO is a layer built on top of DirectStorage, optimized further for gaming and NVIDIA's GPU architecture. RTX IO brings to the table GPU-accelerated lossless data decompression, which means data remains compressed, and bunched into fewer IO requests, as it moves from the disk to the GPU via DirectStorage. NVIDIA claims this alone doubles IO performance. NVIDIA further claims that GeForce RTX GPUs, thanks to their high CUDA core counts, can offload the equivalent of "dozens" of CPU cores, pushing decompression performance beyond even the compressed data loads PCIe Gen 4 SSDs can throw at them.
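The effective-bandwidth idea can be illustrated with a toy round trip, with zlib standing in for the GPU's lossless decompression stage. The payload below is contrived to compress at roughly 2:1; real game assets and NVIDIA's actual codec will behave differently:

```python
import os
import zlib

# Contrived payload: half incompressible, half trivially compressible,
# so it compresses at roughly 2:1 overall.
payload = os.urandom(512 * 1024) + bytes(512 * 1024)

compressed = zlib.compress(payload)
ratio = len(payload) / len(compressed)

# Lossless round trip: the destination recovers the original bytes.
assert zlib.decompress(compressed) == payload

# For a fixed link speed, usable throughput scales with the ratio.
link_gbps = 7.0                      # raw disk bandwidth, GB/s
effective_gbps = link_gbps * ratio   # what the application sees

print(f"compression ratio: {ratio:.2f}x")
print(f"effective bandwidth: {effective_gbps:.1f} GB/s")
```

The point of moving decompression to the GPU is that this multiplication happens after the data crosses the disk link, without burning CPU cycles.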

There is, however, a tiny wrinkle: games need to be optimized for DirectStorage. Since the API debuted on the Xbox Series X, most AAA Xbox games with PC versions already have some awareness of the tech, but the PC versions will need to be patched to use it. Games will further need NVIDIA RTX IO awareness, and NVIDIA needs to add support on a per-game basis via GeForce driver updates. NVIDIA didn't detail which GPUs will support the tech, but given its wording, and the use of "RTX" in the feature's branding, NVIDIA could release it for the RTX 20-series "Turing" and RTX 30-series "Ampere." The GTX 16-series probably misses out, as what NVIDIA hopes to accomplish with RTX IO is likely too heavy for those GPUs, and this may have been a purely performance-impact-based decision.

51 Comments on NVIDIA RTX IO Detailed: GPU-assisted Storage Stack Here to Stay Until CPU Core-counts Rise

#1
Ferrum Master
Basically same stuff as Sony did in PS5 presentation.

Just a fancy new name, and a claim that they did it first and it's theirs.
Posted on Reply
#2
thesmokingman
This seems redundant with 8, 12 and 16 core cpus.
Posted on Reply
#3
Xzibit
Shouldn't they be adding more RAM to GPUs?
Posted on Reply
#4
btarunr
Editor & Senior Moderator
thesmokingman
This seems redundant with 8, 12 and 16 core cpus.
Apparently NVIDIA thinks not. A compressed data stream of game resources can bog down up to 24 cores.
Posted on Reply
#5
biffzinker
btarunr
A compressed data stream of game resources can bog down up to 24 cores.
No benefit from SMT? I’m going off the performance uplift Ryzen gets from SMT.
Posted on Reply
#6
Valantar
thesmokingman
This seems redundant with 8, 12 and 16 core cpus.
Given that MS and Sony claim decompressing NVMe-bandwidth amounts of game data can consume the equivalent of >6 Zen 2 CPU cores, I would say no. Remember, that is on paper; these workloads don't scale that well in real life, and would inevitably become a bottleneck. There's a reason MS is moving DirectStorage from the XSX to Windows.
Posted on Reply
#7
BoboOOZ
btarunr
Apparently NVIDIA thinks not. A compressed data stream of game resources can bog down up to 24 cores.
That's most probably a theoretical worst-case scenario that has 0 chance of really happening. The PS5 guys had said 12 cores on the same issue; that's probably exaggerated a bit, too. And it was about requiring 12 cores, not using all of them.
Posted on Reply
#8
Valantar
BoboOOZ
That's most probably a theoretical worst-case scenario that has 0 chance of really happening. The PS5 guys had said 12 cores on the same issue; that's probably exaggerated a bit, too. And it was about requiring 12 cores, not using all of them.
No, it was about requiring the equivalent computational power of 12 Zen 2 cores at PS5 speeds to decompress that amount of data at the required rate. It was theoretical in as much as few games will be streaming in 5.5GB/s of entirely compressed data, but beyond that it's entirely real.
Posted on Reply
#9
Legacy-ZA
Xzibit
Shouldn't they be adding more RAM to GPUs?
This is also my concern. When I still used my GTX 1070, I came close to the 8GB limit on several occasions, though it was enough for the time... I am not so sure for the future, with everything getting as big as it does these days. I don't like it, I don't like it one bit.
Posted on Reply
#10
BoboOOZ
Valantar
No, it was about requiring the equivalent computational power of 12 Zen 2 cores at PS5 speeds to decompress that amount of data at the required rate. It was theoretical in as much as few games will be streaming in 5.5GB/s of entirely compressed data, but beyond that it's entirely real.
I was replying to btarunr, who was quoting Nvidia's 24 cores claim, you missed that.

Anyway, besides the marketing quoting worst-case scenarios, that's definitely a much more efficient way of doing these transfers, and AMD will most probably be doing the same thing, with a different name.

Edit to add: To the OP and title, I don't see any reason for this kind of optimisation to disappear even when core counts increase. Doing it this way is much more efficient, just like DMA for disk drives is much more efficient; these will be replaced by other technologies eventually, but it makes no sense to funnel all this data through the CPU only for decompression.
Posted on Reply
#11
ebivan
SSDs are huge and cheap, so why not just put uncompressed (or less compressed) data there? Even if a game were to use maybe 200 or 300 GB, I would prefer that to the load times of Wasteland 3... I don't have 10 games installed at any given moment, so I could allocate SSD space for the 1-3 games that I am actually playing.

Ages ago, every game would let you choose how much of the installation you wanted to put on the HDD and how much would be left on the CD/DVD. Why not add an option to choose the compression level of stored data?
Posted on Reply
#12
Bruno_O
Legacy-ZA
This is also my concern. When I still used my GTX 1070, I came close to the 8GB limit on several occasions, though it was enough for the time... I am not so sure for the future, with everything getting as big as it does these days. I don't like it, I don't like it one bit.
Same here, even at 1080p. 8/10GB is a bad joke, and the great performance these cards have makes it even more stupid.
Posted on Reply
#13
Xex360
Would AMD and Intel dedicate some silicon to data decompression in their CPUs? Both the PS5 and Series X use some silicon to do that.
Posted on Reply
#14
SetsunaFZero
ebivan
SSDs are huge and cheap, so why not just put uncompressed (or less compressed) data there? Even if a game were to use maybe 200 or 300 GB, I would prefer that to the load times of Wasteland 3... I don't have 10 games installed at any given moment, so I could allocate SSD space for the 1-3 games that I am actually playing.

Ages ago, every game would let you choose how much of the installation you wanted to put on the HDD and how much would be left on the CD/DVD. Why not add an option to choose the compression level of stored data?
Short and sweet: server bandwidth, limited GPU RAM, and SSDs aren't cheap :)
Posted on Reply
#15
Valantar
BoboOOZ
I was replying to btarunr, who was quoting Nvidia's 24 cores claim, you missed that.

Anyway, besides the marketing quoting worst-case scenarios, that's definitely a much more efficient way of doing these transfers, and AMD will most probably be doing the same thing, with a different name.
I didn't miss that, I was responding both to that and to your specific mention of Sony's claim of the equivalent of 12 cores of decompression for the PS5.

As for these numbers being a worst case scenario, I disagree, mainly as the scaling is most likely calculated with 100% scaling, i.e. 1 core working 100% with decompression = X, 12 cores = 12X, despite scaling never really being 100% in the real world. As such this is a favorable comparison, not a worst-case scenario, and saying "would require the equivalent of n cores" could just as well end up requiring more than this to account for imperfect scaling. I sincerely hope AMD also adds a decompression accelerator to RDNA2, which would make a lot of sense given that they designed those for both MS and Sony in the first place.
BoboOOZ
Edit to add: To the OP and title, I don't see any reason for this kind of optimisation to disappear even when core counts increase. Doing it this way is much more efficient, just like DMA for disk drives is much more efficient; these will be replaced by other technologies eventually, but it makes no sense to funnel all this data through the CPU only for decompression.
Here I entirely agree with you. There's no reason to move this back to the CPU in the future - it's a workload that only really benefits the GPU (nothing but the GPU really uses compressed game assets, and in the edge cases where the CPU might need some it should be able to handle that), thus alleviating load on the PCIe link by bypassing the CPU, and given that GPUs are more frequently replaced than CPUs it also allows for more flexibility in terms of upgrades, adding new compression standards, etc. Keeping this functionality as a dedicated acceleration block on the GPU makes a ton of sense.
ebivan
SSDs are huge and cheap, so why not just put uncompressed (or less compressed) data there? Even if a game were to use maybe 200 or 300 GB, I would prefer that to the load times of Wasteland 3... I don't have 10 games installed at any given moment, so I could allocate SSD space for the 1-3 games that I am actually playing.

Ages ago, every game would let you choose how much of the installation you wanted to put on the HDD and how much would be left on the CD/DVD. Why not add an option to choose the compression level of stored data?
Sorry, but what world do you live in? NVMe SSDs have come down a lot in price, but cheap? No. Especially not in capacities like what would be needed for even three games with your 2-300GB install sizes. And remember, even with compressed assets games are now hitting 150-200GB. Not to mention the effect removing compression would have on download times, or install times if data was downloaded and then decompressed directly. Compressing game assets is the only logical way of moving forward.
Posted on Reply
#16
ebivan
SetsunaFZero
Short and sweet: server bandwidth, limited GPU RAM, and SSDs aren't cheap :)
Server bandwidth would stay the same, since decompressing would take place at installation at the client, the amount of downloaded data during installation would not change. Installation time would rise a bit though.
GPU ram requirements would not change either, because only decompressed data is stored there, so no change there.
How cheap ssds are is for the user to decide. If you wanna save on ssd volume, you can opt for longer loading times, if you have ssd space to spare, you can opt for the uncompressed installation.
Posted on Reply
#17
Caring1
Valantar
Here I entirely agree with you. There's no reason to move this back to the CPU in the future - it's a workload that only really benefits the GPU (nothing but the GPU really uses compressed game assets, and in the edge cases where the CPU might need some it should be able to handle that), thus alleviating load on the PCIe link by bypassing the CPU, and given that GPUs are more frequently replaced than CPUs it also allows for more flexibility in terms of upgrades, adding new compression standards, etc. Keeping this functionality as a dedicated acceleration block on the GPU makes a ton of sense.
Why not an independent co-processor using NVLink that offloads directly, negating the need to use the CPU at all?
Posted on Reply
#18
BoboOOZ
Valantar
I didn't miss that, I was responding both to that and to your specific mention of Sony's claim of the equivalent of 12 cores of decompression for the PS5.
Let me make it plainer then:
BoboOOZ
btarunr
Apparently NVIDIA thinks not. A compressed data stream of game resources can bog down up to 24 cores.
That's most probably a theoretical worst-case scenario that has 0 chance of really happening.
The theoretical worst case is about the 24-core Nvidia claim.
BoboOOZ
The PS5 guys had said 12 cores on the same issue; that's probably exaggerated a bit, too. And it was about requiring 12 cores, not using all of them.
The PS5 claim is probably exaggerated a bit, in that there might be remaining computing capacity on those 12 cores; basically the same thing you are saying.
Valantar
I sincerely hope AMD also adds a decompression accelerator to RDNA2, which would make a lot of sense given that they designed those for both MS and Sony in the first place.
The trouble here is that the teams working on RDNA2 discrete graphics cards and those working on consoles are different, and under NDAs for 2 years. So it's not clear exactly what could trickle down from the consoles to this generation, though I definitely hope the same as you do; otherwise RDNA2 cards will have trouble keeping up with next-gen games.
Posted on Reply
#19
nguyen
ebivan
Server bandwidth would stay the same, since decompressing would take place at installation at the client, the amount of downloaded data during installation would not change. Installation time would rise a bit though.
GPU ram requirements would not change either, because only decompressed data is stored there, so no change there.
How cheap ssds are is for the user to decide. If you wanna save on ssd volume, you can opt for longer loading times, if you have ssd space to spare, you can opt for the uncompressed installation.
The point of compressing/decompressing data is to increase the "effective" bandwidth; meaning if you compress a file to 1/2 the size and send it over a network, you are effectively doubling the network bandwidth.
Nvidia is saying they could get 2x the effective bandwidth out of a PCIe Gen 4 x4 NVMe drive, that is 14 GB/s of effective bandwidth. Imagine no loading times and no texture pop-in in open-world games.
Posted on Reply
#20
ebivan
Valantar
Sorry, but what world do you live in? NVMe SSDs have come down a lot in price, but cheap? No. Especially not in capacities like what would be needed for even three games with your 2-300GB install sizes. And remember, even with compressed assets games are now hitting 150-200GB. Not to mention the effect removing compression would have on download times, or install times if data was downloaded and then decompressed directly. Compressing game assets is the only logical way of moving forward.
What games hit 150GB? Oh yeah, Flight Simulator, so you're right of course, there is one big game now! Most AAA games are still in the 50GB range!
Good NVMe SSDs cost about $150 per TB.
I don't know what the compression factor of these assets is, but let's say it's 1:6, so a 50GB game comes to 300GB of uncompressed data (not counting that not all of the assets are even GPU-related, like sound assets or pre-rendered videos). That would mean you could store three uncompressed games on a 1TB drive.
Most games aren't AAA titles this big, and a lot of games do fine the way they are now, so only a fraction of games would even need an option for an uncompressed install. Which means you could probably store even more games on that 1TB drive.

A lot of enthusiasts spend high three digits or even four digits on GPUs, so why not spend another $150 on an additional SSD to immensely speed up those data-intensive AAA games?
Posted on Reply
#21
Mouth of Sauron
At this point, my confidence in anything Mr. Leather Jacket says is at a historical minimum. There is a steaming pile of... lies about supposed ray-tracing, where he degraded the official term to 'heavily approximated, always partial ray-tracing with destructive compression-like algorithm, which will work on <1% games' (and none that I play - yes, I'm not interested in Cyberpunk until I see it in real life). He will charge 1400 for it, it won't be available to buy, and I'll be forced to watch ALL benchmarks done on that bloody thing, which nobody except reviewers will have - oh yes, all 5 games that support that fake-RT will become a standard part of benchmark suites, to my great enjoyment, and provide highly skewed results for, say, CPUs, for anyone who doesn't have that card and doesn't play those games. Great news!

To clear things up, I consider 'heavily approximated, always partial ray-tracing with destructive compression-like algorithm, which will work on <1% games' an advancement and a generally nice feature - except it's just a distant cousin of scene or camera ray-tracing and not the 'ultimate dream'. Yeah, RTX looks great on Minecraft and Quake 2 - but please find an easier example for a derivative of ray-tracing and you'll be rewarded tenfold. Low poly count, flat surfaces... Why not even Quake 3? Why not the new Wolfenstein? Errr...

So, now he is speeding up M.2 storage? Yes, sure, why not?

I'll believe in all those things *WHEN* I see them in work, on real (and normal) system with more general benchmarks, not by-NVIDIA-for-NVIDIA set of 2...
Posted on Reply
#22
Bruno Vieira
They could have just said, "Guys, the new GPUs support Microsoft DirectStorage!", but they had to do their own proprietary stuff on top of it, and no one (without being paid for it) will implement it.
Posted on Reply
#23
Valantar
ebivan
What games hit 150GB? Oh yeah, Flight Simulator, so you're right of course, there is one big game now! Most AAA games are still in the 50GB range!
Good NVMe SSDs cost about $150 per TB.
I don't know what the compression factor of these assets is, but let's say it's 1:6, so a 50GB game comes to 300GB of uncompressed data (not counting that not all of the assets are even GPU-related, like sound assets or pre-rendered videos). That would mean you could store three uncompressed games on a 1TB drive.
Most games aren't AAA titles this big, and a lot of games do fine the way they are now, so only a fraction of games would even need an option for an uncompressed install. Which means you could probably store even more games on that 1TB drive.

A lot of enthusiasts spend high three digits or even four digits on GPUs, so why not spend another $150 on an additional SSD to immensely speed up those data-intensive AAA games?
Here's a list from back in January. Since then we've had Flight Simulator, that CoD BR thing, and a handful of others. Sure, most AAA games are still below 100GB, but install sizes are growing at an alarming rate.
BoboOOZ
Let me make it plainer then:

The theoretical worst case is about the 24-core Nvidia claim.

The PS5 is probably exaggerated a bit, in that there might be remaining computing capacity on those 12 cores, basically the same thing you are saying.


The trouble here is that the teams working on RDNA2 discrete graphics cards and those working on consoles are different, and under NDAs for 2 years. So it's not clear exactly what could trickle down from the consoles to this generation, though I definitely hope the same as you do; otherwise RDNA2 cards will have trouble keeping up with next-gen games.
I think you're also misreading "the Nvidia claim" - all they said is that to decompress a theoretical maximum throughput PCIe 4.0 SSD you would "need" a theoretical 24 CPU cores, which is the equivalent level of decompression performance of their RTX IO decompression block. I don't see this as them saying "this is how much performance you will need in the real world", as no game has ever required that kind of throughput, no such SSD exists, and in general nobody would design a game in that way - at least for another decade.

Also, I think your claim about the RDNA teams is fundamentally flawed. AMD post-RTG is a much more integrated company than previously. And while there are obviously things worked on within some parts of the company that the other parts don't know about, a new major cross-platform storage API provided by an outside vendor (Microsoft) is not likely to be one of these things.
Caring1
Why not an independent co-processor using NVLink that offloads directly, negating the need to use the CPU at all?
Because that would limit support to boards with free PCIe slots, excluding ITX entirely, require an expensive NVLink bridge, limit support to the 3090, etc. This would likely work just as "well" over a PCIe 4.0 x16 slot, but that would of course limit support to HEDT platforms. Besides, we saw how well dedicated coprocessor AICs worked in the market back when PhysX launched. I.e. not at all.
Bruno Vieira
They could have just said, "Guys, the new GPUs support Microsoft DirectStorage!", but they had to do their own proprietary stuff on top of it, and no one (without being paid for it) will implement it.
Is there actually anything proprietary here though? Isn't this just a hardware implementation of DirectStorage? Nvidia doesn't like to say "we support standards", after all, they have to give them a new name, presumably to look cooler somehow.
Mouth of Sauron
At this point, my confidence in anything Mr. Leather Jacket says is at a historical minimum. There is a steaming pile of... lies about supposed ray-tracing, where he degraded the official term to 'heavily approximated, always partial ray-tracing with destructive compression-like algorithm, which will work on <1% games' (and none that I play - yes, I'm not interested in Cyberpunk until I see it in real life). He will charge 1400 for it, it won't be available to buy, and I'll be forced to watch ALL benchmarks done on that bloody thing, which nobody except reviewers will have - oh yes, all 5 games that support that fake-RT will become a standard part of benchmark suites, to my great enjoyment, and provide highly skewed results for, say, CPUs, for anyone who doesn't have that card and doesn't play those games. Great news!

To clear things up, I consider 'heavily approximated, always partial ray-tracing with destructive compression-like algorithm, which will work on <1% games' an advancement and a generally nice feature - except it's just a distant cousin of scene or camera ray-tracing and not the 'ultimate dream'. Yeah, RTX looks great on Minecraft and Quake 2 - but please find an easier example for a derivative of ray-tracing and you'll be rewarded tenfold. Low poly count, flat surfaces... Why not even Quake 3? Why not the new Wolfenstein? Errr...

So, now he is speeding up M.2 storage? Yes, sure, why not?

I'll believe in all those things *WHEN* I see them in work, on real (and normal) system with more general benchmarks, not by-NVIDIA-for-NVIDIA set of 2...
What review sites do you know of that systematically test games only in RT mode? Sure, RT benchmarks will become more of a thing this generation, but I would be shocked if that didn't mean additional testing on top of RT-off testing. And comparing RT-on vs. RT-off is obviously not going to happen (that would make the RT-on GPUs look terrible!).
Posted on Reply
#24
ebivan
Death Stranding: 64 GB
Horizon Zero Dawn: 72 GB
Mount & Blade 2: 51 GB
Red Dead Redemption 2: 110 GB
Star Citizen: 60 GB
Posted on Reply
#25
Chrispy_
Will this play nice with DirectStorage or is it going to be another Nvidia black box that only Nvidia 3000-series customers get to beta test for Jensen?
Posted on Reply