
Evaluating DirectStorage performance

Joined
Sep 21, 2020
Messages
1,500 (1.14/day)
Processor 5800X3D -30 CO
Motherboard MSI B550 Tomahawk
Cooling DeepCool Assassin III
Memory 32GB G.SKILL Ripjaws V @ 3800 CL14
Video Card(s) ASRock MBA 7900XTX
Storage 1TB WD SN850X + 1TB ADATA SX8200 Pro
Display(s) Dell S2721QS 4K60
Case Cooler Master CM690 II Advanced USB 3.0
Audio Device(s) Audiotrak Prodigy Cube Black (JRC MUSES 8820D) + CAL (recabled)
Power Supply Seasonic Prime TX-750
Mouse Logitech Cordless Desktop Wave
Keyboard Logitech Cordless Desktop Wave
Software Windows 10 Pro
After watching this video I decided to do my own testing of DirectStorage to see how it affects performance:


This benchmark uses the newest DirectStorage 1.2 sample by GPUOpen. It lets you compare loading times of different assets with or without the aid of DS. You can also specify how the textures will be decompressed: on the CPU -- which is the default implementation -- or leveraging the GPU. The test uses four models of increasing complexity, with a proportionally larger texture set:

[Images: boombox.jpg, plane.jpg, shuttle.jpg, module.jpg]

Model | Texture size compressed [MB] | Texture size uncompressed [MB]
BoomBox | 10.86 | 85.37
X1 | 29.37 | 170.69
SpaceShuttle | 926.78 | 2475.79
CommandModule | 915.30 | 2758.72

I tested it on the system in my profile, a powerful 4K machine with a 5800X3D, 7900XTX, and tweaked 3800CL14 dual-rank DDR4 RAM. I assessed both a Gen3 and a Gen4 NVMe SSD, using some of the fastest drives of their generation -- an ADATA SX8200 Pro and a WD SN850X.

First, let's see how DS affects the time spent by the CPU from the moment the first texture request is made, until the time the transfer to the GPU is complete. This is represented by the I/O time metric, where lower values mean faster completion of the process:

Model | CPU texture decompression [ms] | GPU texture decompression + Gen3 SSD [ms] | GPU texture decompression + Gen4 SSD [ms]
BoomBox | 27 | 12 | 13
X1 | 68 | 22 | 18
SpaceShuttle | 865 | 402 | 150
CommandModule | 879 | 392 | 143

Even with the simplest asset, GPU decompression cut the CPU time by more than half compared to using the CPU alone. A faster drive initially made no difference, but with the larger texture sets the Gen4 SSD comes into its own, rapidly separating itself from the older model. Using the SX8200 Pro, the I/O time decreased by 56%, 68%, 54% and 55% respectively, compared with pure CPU decompression. With the SN850X the reductions were 52%, 74%, 83% and 84%, allowing the CPU to complete all these tasks nearly six times faster -- in 324 rather than 1839 ms! With the ADATA drive, the processor finished over two times faster (in 828 ms) once the GPU relieved it of the burden of decompressing the textures.
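For anyone who wants to re-check the percentages, here is a minimal Python sketch of the arithmetic. The numbers are copied from the I/O time table above; the dictionary layout and the helper name `reduction_pct` are just my own choices for illustration, not anything from the DS sample.

```python
# I/O times in ms, copied from the table above.
# Keys: CPU decompression, GPU decompression on the Gen3 and Gen4 SSD.
io_ms = {
    "BoomBox":       {"cpu": 27,  "gpu_gen3": 12,  "gpu_gen4": 13},
    "X1":            {"cpu": 68,  "gpu_gen3": 22,  "gpu_gen4": 18},
    "SpaceShuttle":  {"cpu": 865, "gpu_gen3": 402, "gpu_gen4": 150},
    "CommandModule": {"cpu": 879, "gpu_gen3": 392, "gpu_gen4": 143},
}


def reduction_pct(baseline_ms, new_ms):
    """Percentage by which new_ms is lower than baseline_ms."""
    return 100.0 * (baseline_ms - new_ms) / baseline_ms


# Per-model reduction relative to CPU decompression.
for model, t in io_ms.items():
    gen3 = reduction_pct(t["cpu"], t["gpu_gen3"])
    gen4 = reduction_pct(t["cpu"], t["gpu_gen4"])
    print(f"{model}: Gen3 -{gen3:.0f}%, Gen4 -{gen4:.0f}%")

# Summed I/O time over all four models, per configuration.
totals = {
    key: sum(t[key] for t in io_ms.values())
    for key in ("cpu", "gpu_gen3", "gpu_gen4")
}
print(totals)
print(f"Gen4 vs. CPU: {totals['cpu'] / totals['gpu_gen4']:.2f}x")
```

The totals are where the 1839, 828 and 324 ms figures come from.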

Now let's analyze the impact of DS on the time it takes the CPU to completely load the scene. Again, lower CPU load time means the model will be presented quicker:

Model | DS disabled [ms] | DS enabled, CPU texture decompression [ms] | DS enabled, GPU texture decompression + Gen3 SSD [ms] | DS enabled, GPU texture decompression + Gen4 SSD [ms]
BoomBox | 147 | 46 | 30 | 31
X1 | 497 | 116 | 65 | 59
SpaceShuttle | 1252 | 955 | 486 | 239
CommandModule | 1508 | 995 | 514 | 262

We can already see a similar pattern here, but the differences are even more pronounced. Enabling DS with CPU decompression (standard game implementation) reduces the loading times significantly: the simpler scenes load 3.2x and 4.3x faster, and the complex models 1.3 to 1.5 times faster. But the real advantage of DS lies in GPU decompression. Even with a Gen3 drive, the scenes load 4.90x, 7.65x, 2.58x and 2.93x faster than with DS disabled! And a Gen4 SSD allows for an even more incredible 4.74x, 8.42x, 5.24x and 5.76x speed up.

When we evaluate total loading time for all four scenes, we see the following gains:

Metric | DS disabled [ms] | DS enabled, CPU texture decompression [ms] | DS enabled, GPU texture decompression + Gen3 SSD [ms] | DS enabled, GPU texture decompression + Gen4 SSD [ms]
All scenes | 3404 | 2112 | 1095 | 591
Speed up factor | - | 1.61x | 3.11x | 5.76x
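The speed up factors are simply the DS-disabled time divided by the time for each configuration. A small sketch, using the load times from the tables above (the tuple layout and the `speedups` helper are my own naming, purely illustrative):

```python
# CPU load times in ms, copied from the per-scene table above.
# Column order: DS off, DS + CPU decompression,
#               DS + GPU decompression (Gen3), DS + GPU decompression (Gen4).
load_ms = {
    "BoomBox":       (147, 46, 30, 31),
    "X1":            (497, 116, 65, 59),
    "SpaceShuttle":  (1252, 955, 486, 239),
    "CommandModule": (1508, 995, 514, 262),
}


def speedups(times):
    """Speed up of every column relative to the first (DS off) column."""
    baseline = times[0]
    return [round(baseline / t, 2) for t in times]


# Per-scene factors.
for model, times in load_ms.items():
    print(model, speedups(times))

# Column-wise totals over all four scenes, then the overall factors.
totals = tuple(sum(column) for column in zip(*load_ms.values()))
print("All scenes", totals, speedups(totals))
```

Note that the overall factor is computed from the summed times, so the two large scenes dominate it.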

DirectStorage appears to be a very capable technology, with potentially amazing benefits. It should enable greatly reduced loading times and a smoother gameplay experience. As games get more complex visually and virtual worlds more expansive, the advantage of DS will likely become clear, especially in open world titles, which constantly stream in textures. And when implemented properly, GPU decompression could become the real game changer (pun intended). In these tests, loading with a Gen3 drive was nearly twice as fast with GPU decompression as with exclusive CPU texture decompression (1095 vs. 2112 ms in total). With a Gen4 SSD, the gap between the 5800X3D and the 7900XTX grew to roughly 3.6x (591 vs. 2112 ms).

Lastly, let's look at the average data bandwidth when decompressing the textures using all these different techniques, as indicated by Data Rate:

Model | CPU texture decompression, disk only vs. DS amplified [GB/s] | GPU texture decompression + Gen3 SSD, disk only vs. DS amplified [GB/s] | GPU texture decompression + Gen4 SSD, disk only vs. DS amplified [GB/s]
BoomBox | 0.4 vs. 3.2 | 0.9 vs. 7.3 | 0.9 vs. 6.7
X1 | 0.4 vs. 2.6 | 1.4 vs. 8.0 | 1.7 vs. 9.7
SpaceShuttle | 1.1 vs. 2.9 | 2.4 vs. 6.3 | 6.3 vs. 16.9
CommandModule | 1.1 vs. 3.2 | 2.4 vs. 7.2 | 6.6 vs. 19.8
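As a rough sanity check, the two Data Rate figures look like the compressed and the uncompressed texture size divided by the I/O time. That is my assumption about how the sample derives them, not something I have verified in its source, and the results only approximately match the table (rounded timings and MB vs. MiB conventions would both shift them a little). A sketch for the CPU decompression column:

```python
# Texture sizes in MB (compressed, uncompressed) and CPU-decompression
# I/O times in ms, all copied from the tables earlier in this post.
textures_mb = {
    "BoomBox":       (10.86, 85.37),
    "X1":            (29.37, 170.69),
    "SpaceShuttle":  (926.78, 2475.79),
    "CommandModule": (915.30, 2758.72),
}
io_ms_cpu = {"BoomBox": 27, "X1": 68, "SpaceShuttle": 865, "CommandModule": 879}


def data_rate_gb_s(size_mb, time_ms):
    """MB per ms is numerically the same as GB per s (decimal units)."""
    return size_mb / time_ms


for model, (compressed, uncompressed) in textures_mb.items():
    t = io_ms_cpu[model]
    disk = data_rate_gb_s(compressed, t)        # assumed "disk only" rate
    amplified = data_rate_gb_s(uncompressed, t)  # assumed "DS amplified" rate
    ratio = uncompressed / compressed            # compression ratio
    print(f"{model}: {disk:.1f} vs. {amplified:.1f} GB/s (ratio {ratio:.2f})")
```

If that assumption holds, the amplification is just the compression ratio of each texture set, which would explain why the small, highly compressible sets show the biggest jump.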

For reference, this is how the storage solutions of current generation consoles stack up. Both utilize a custom Gen4 SSD and additional dedicated hardware to assist with asset decompression:

Console | Maximum raw SSD throughput [GB/s] | Typical storage throughput - decompressing [GB/s] | Maximum storage throughput - decompressing [GB/s]
Xbox Series S/X | 3.9 | 4.8 | 6.5
PlayStation 5 | 5.5 | 8.5 | 22.0

And if you would like to check out DS performance for yourself, the video at the top has a link to a downloadable build of the DS sample. I used the same settings in my tests as in this video. You can create these batch files to start the benchmark, and find detailed statistics for the run in the corresponding *.csv file in the \bin subfolder:

DS disabled
Code:
setlocal
pushd bin
FOR /f "tokens=* delims=" %%A in ('timestamp') do @set "ds_ts=%%A"
DirectStorageSample_DX12.exe {"iotiming":true, "stagingbuffersize":268435456, "profile": true, "profileOutputPath":"DS_off.csv"}
popd

DS enabled, CPU decompression
Code:
setlocal
pushd bin
FOR /f "tokens=* delims=" %%A in ('timestamp') do @set "ds_ts=%%A"
DirectStorageSample_DX12.exe {"directstorage":true, "iotiming":true, "disablegpudecompression":true, "stagingbuffersize":268435456, "profile": true, "profileOutputPath":"CPU.csv"}
popd

DS enabled, GPU decompression
Code:
setlocal
pushd bin
FOR /f "tokens=* delims=" %%A in ('timestamp') do @set "ds_ts=%%A"
DirectStorageSample_DX12.exe {"directstorage":true, "iotiming":true, "stagingbuffersize":268435456, "profile": true, "profileOutputPath":"GPU.csv"}
popd
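
The three batch files differ only in the JSON options passed to the sample. If you prefer to generate them from one place, here is a small Python sketch that builds the same option sets. It only produces the strings and does not launch anything; the function name `sample_args` is mine, and I have not checked whether the sample cares about key order or spacing, so treat the output as a starting point.

```python
import json


def sample_args(directstorage, gpu_decompression, csv_name):
    """Build the JSON option string used in the batch files above."""
    opts = {}
    if directstorage:
        opts["directstorage"] = True
    opts["iotiming"] = True
    # GPU decompression is on by default once DS is enabled,
    # so the CPU run has to switch it off explicitly.
    if directstorage and not gpu_decompression:
        opts["disablegpudecompression"] = True
    opts["stagingbuffersize"] = 256 * 1024 * 1024  # 268435456 bytes
    opts["profile"] = True
    opts["profileOutputPath"] = csv_name
    return json.dumps(opts)


runs = {
    "DS disabled": sample_args(False, False, "DS_off.csv"),
    "DS enabled, CPU decompression": sample_args(True, False, "CPU.csv"),
    "DS enabled, GPU decompression": sample_args(True, True, "GPU.csv"),
}
for name, args in runs.items():
    print(f"{name}: DirectStorageSample_DX12.exe {args}")
```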
:lovetpu:
 
In my previous testing of DirectStorage I observed greatly accelerated loading times of game assets, especially when GPU texture decompression was utilized. The biggest difference could be seen with the largest texture sets (2.5 - 2.8 GB) on a fast Gen4 SSD. With DS enabled and the 7900XTX decompressing the data, the most complex scene loaded 5.76x faster than with DS disabled, and 3.80x faster than when the 5800X3D was doing the decompression.

But do you need a high-end GPU, a 16-threaded CPU, and a top-performing Gen4 drive to profit from DS? Would an entry-level 1080p gaming rig see any benefits? I did my next round of benchmarks on a 4c/8t Ryzen 3 3300X with a static overclock of 4.5 GHz, coupled with an oc'd RX6600XT and 3733CL16 single-rank DDR4 memory in dual channel. I used the same Gen3 SSD as before, the ADATA SX8200 Pro. Let's look at the results.

First, the time spent by the CPU from the moment the first texture request is made, until the transfer to the GPU is complete:

Model | CPU texture decompression [ms] | GPU texture decompression [ms]
BoomBox | 44 | 17
X1 | 97 | 30
SpaceShuttle | 1797 | 379
CommandModule | 1806 | 368
Total I/O time | 3744 | 794

Next, the time it takes the CPU to completely load the scene:

Model | DS disabled [ms] | DS enabled, CPU texture decompression [ms] | DS enabled, GPU texture decompression [ms]
BoomBox | 162 | 70 | 43
X1 | 531 | 151 | 84
SpaceShuttle | 1994 | 1918 | 498
CommandModule | 2781 | 1962 | 529
Total loading time | 5468 | 4101 | 1154
Average speed up factor | - | 1.33x | 4.74x

Wow! Even on a humble 4c/8t processor, DS combined with CPU decompression sped up loading by 33% on average. The difference was most pronounced with the small texture sets. But with GPU decompression on an entry-level 1080p GPU, the whole process completed nearly five times faster than with DS disabled!
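To see where the CPU-decompression gain actually comes from on this system, it helps to break the speed up down per scene. A quick sketch with the 3300X load times from the table above (the helper name `speedup_vs_ds_off` is just for illustration):

```python
# CPU load times in ms on the 3300X + RX6600XT system.
# Column order: DS off, DS + CPU decompression, DS + GPU decompression.
load_ms_3300x = {
    "BoomBox":       (162, 70, 43),
    "X1":            (531, 151, 84),
    "SpaceShuttle":  (1994, 1918, 498),
    "CommandModule": (2781, 1962, 529),
}


def speedup_vs_ds_off(times):
    """Return (CPU decompression, GPU decompression) speed up vs. DS off."""
    ds_off, cpu, gpu = times
    return round(ds_off / cpu, 2), round(ds_off / gpu, 2)


for model, times in load_ms_3300x.items():
    cpu_x, gpu_x = speedup_vs_ds_off(times)
    print(f"{model}: CPU {cpu_x}x, GPU {gpu_x}x")
```

With CPU decompression the two small scenes gain a lot while SpaceShuttle barely moves, whereas GPU decompression gives a large gain on every scene.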

And finally, let's see how much DS boosts the data rate when loading each asset:

Model | CPU texture decompression, disk only vs. DS amplified [GB/s] | GPU texture decompression, disk only vs. DS amplified [GB/s]
BoomBox | 0.3 vs. 2.0 | 0.7 vs. 5.1
X1 | 0.3 vs. 1.8 | 1.0 vs. 5.8
SpaceShuttle | 0.5 vs. 1.4 | 2.5 vs. 6.7
CommandModule | 0.5 vs. 1.6 | 2.6 vs. 7.7

The above results clearly demonstrate how much of a bottleneck standard file I/O and the CPU itself are for loading assets. Even a base system could potentially see huge benefits when games start utilizing DirectStorage with GPU decompression.