Wednesday, April 21st 2021

DirectStorage API Works Even with PCIe Gen3 NVMe SSDs

Microsoft on Tuesday, in a developer presentation, confirmed that the DirectStorage API, designed to speed up the storage sub-system, is compatible even with NVMe SSDs that use the PCI-Express Gen 3 host interface. It also confirmed that all GPUs compatible with DirectX 12 support the feature. A feature making its way to the PC from consoles, DirectStorage enables the GPU to directly access an NVMe storage device, paving the way for GPU-accelerated decompression of game assets.

This works to reduce latencies at the storage sub-system level and offload the CPU. Any DirectX 12-compatible GPU technically supports DirectStorage, according to Microsoft. The company, however, recommends DirectX 12 Ultimate GPUs "for the best experience." The GPU-accelerated game asset decompression is handled via compute shaders. In addition to reducing latencies, DirectStorage is said to accelerate the Sampler Feedback feature in DirectX 12 Ultimate.
More slides from the presentation follow.

Source: NEPBB (Reddit)

75 Comments on DirectStorage API Works Even with PCIe Gen3 NVMe SSDs

#26
Mussels
Moderprator
Zubasa
But then is it that much to ask for a DX12 compatible GPU? Everything since Fermi and GCN is "compatible" with DX12.
Older GPUs don't even get driver updates anymore, so even if M$ makes them work somehow they won't get the driver needed.
difference is compatible vs compliant

compatible means some features are emulated, or missing for optional ones - and direct storage may use those optional ones
My guess is games designed for this tech will have faster than average load times anyway, since they're being optimised for SSD's and not 4,200RPM laptop drives in consoles
Posted on Reply
#27
Zubasa
Mussels
difference is compatible vs compliant

compatible means some features are emulated, or missing for optional ones - and direct storage may use those optional ones
My guess is games designed for this tech will have faster than average load times anyway, since they're being optimised for SSD's and not 4,200RPM laptop drives in consoles
The article just says it requires a compatible GPU, nothing about the feature level required or what not.
Hopefully 'it just works'.
Posted on Reply
#28
Mussels
Moderprator
Zubasa
The article just says it requires a compatible GPU, nothing about the feature level required or what not.
Hopefully 'it just works'.
Yes but what is required to be compatible?

I would not be shocked to find out it's a DX12 Ultimate feature
(I'd be happy if it wasn't)
Posted on Reply
#29
Tartaros
Mussels
Yes but what is required to be compatible?

I would not be shocked to find out it's a DX12 Ultimate feature
(I'd be happy if it wasn't)
New RTX 4090TI with double nvme slot on the back of the card.
Posted on Reply
#30
Mussels
Moderprator
Tartaros
New RTX 4090TI with double nvme slot on the back of the card.
Wait... NVME is almost fast enough to work as a VRAM supplement. brb making a patent.
Posted on Reply
#31
Tartaros
Mussels
Wait... NVME is almost fast enough to work as a VRAM supplement. brb making a patent.
I said it as a joke, but if you think about what DirectStorage is, it actually makes sense as the next step in the process. There were Matrox VGAs in the past that could be expanded with SO-DIMMs.
Posted on Reply
#32
Mussels
Moderprator
Tartaros
I said it as a joke, but if you think about what DirectStorage is, it actually makes sense as the next step in the process. There were Matrox VGAs in the past that could be expanded with SO-DIMMs.
I could see something like Optane working as a slot-in GPU booster
Maybe APUs would benefit the most from something like that, as they're dealing with slower system RAM anyway

What we really need is a giant GPU in the ATX standard with a PCI-E card to slot in the CPU
Posted on Reply
#33
1d10t
Panther_Seraphin
Ever wondered why the BAR size suddenly got brought up even though it's been in the PCI-e spec for years?

I can honestly see the theory of this working on DX11 cards, but as with anything in the tech world, unless it's pushing something new or shiny you will pretty much never see it back-ported en masse.
Simple, you need hardware to support it, "cache" in this case. Both parties support Resizable BAR, but one in particular already came with it in the box and the other is just following the hype train. As discussed earlier in this thread

GPU Memory Latency Tested on AMD's RDNA 2 and NVIDIA's Ampere Architecture

and backed up with this video


In theory it could be stretched to DirectX 11 titles by simply removing the performance target or some kind of limitation (in this case the transfer rate), but I don't think the impact would be as significant as the Variable Rate Shading and Resource Binding features that are embedded in DirectX 12.
Posted on Reply
#34
Zubasa
Mussels
Wait... NVME is almost fast enough to work as a VRAM supplement. brb making a patent.
Radeon Pro SSG was a thing :D
I guess the software overhead etc was the reason it never took off.

Posted on Reply
#35
efikkan
Mussels
So it should let textures stream across fast, get decompressed and smashed open fast, and generally make load times and texture pop in go away
Faster load times are certainly possible, especially with accelerated decompression.
But texture popping is caused by a missing resource, and the only real way to avoid that is prefetching. So this technology by itself will not solve that problem, but you can certainly build a game engine which prefetches textures combined with this technology.
Vanny
Well, the only NVMe that I have is my 500 GB boot drive. Not putting any games there. Gonna wait and see if it's worth buying an extra 2 TB NVMe just for this.
I would highly recommend having a separate boot drive. The OS will cause a lot of wear on TLC/QLC SSDs, so you better have your files somewhere else.
If your motherboard has a free PCIe x4 slot, you can buy an M.2 adapter for it.
Posted on Reply
#36
Vanny
efikkan
I would highly recommend having a separate boot drive. The OS will cause a lot of wear on TLC/QLC SSDs, so you better have your files somewhere else.
If your motherboard has a free PCIe x4 slot, you can buy an M.2 adapter for it.
Not gonna reinstall/clone my entire OS for insignificant writes on a 300TBW drive. Had it for almost a year and it's only at 6 TB. If it dies I'll go back to my previous boot drive.
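
As a rough sanity check of the endurance figures in this post, the sketch below projects how long it would take to reach the rated TBW. It assumes the write rate stays constant at about 6 TB per year (the "almost a year" is rounded up to one year here) and that the drive fails exactly at its rating, neither of which is guaranteed; the function name is made up for illustration.

```python
def years_until_rated_tbw(rated_tbw: float, written_tb: float, elapsed_years: float) -> float:
    """Estimate the years left before total writes reach the rated TBW.

    Assumes the average write rate observed so far continues unchanged.
    """
    if written_tb <= 0 or elapsed_years <= 0:
        raise ValueError("written_tb and elapsed_years must be positive")
    write_rate_tb_per_year = written_tb / elapsed_years
    return (rated_tbw - written_tb) / write_rate_tb_per_year


# Figures from the post: 300 TBW rating, 6 TB written in roughly one year.
remaining_years = years_until_rated_tbw(rated_tbw=300, written_tb=6, elapsed_years=1)
print(f"Roughly {remaining_years:.0f} more years at the current write rate")
```

At that pace the rating is decades away, which is the point being made about OS writes being insignificant for this drive.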

All the files that I care about amount to only ~30GB and are backed up monthly on my Dropbox. Everything else can be downloaded & reinstalled easily, and I can also reconfigure my OS easily because I keep my .reg files, redists, and other necessary tweaks also on my Dropbox.

The NVMe is solely for OS files and programs I can redownload.

Also nobody gonna answer if the NVMe needs to be attached to CPU for DirectStorage to work?
Posted on Reply
#37
jermando
Vanny
Also nobody gonna answer if the NVMe needs to be attached to CPU for DirectStorage to work?
It does need the NVMe to be directly connected to the SoC/uncore (northbridge). The GPU also connects to the PCIe root complex (SoC). It's the only way to have CPU <-> GPU direct communication for instant data transfers.

That's how next-gen consoles are. The SSD connects directly to the APU. Last-gen consoles had the HDD connected through the southbridge (which isn't a big deal for HDDs or SATA SSDs).

So, if you have an old Intel PC (Rocket Lake introduced 4 dedicated lanes for NVMe, just like AMD Ryzen/AM4 since 2017) where the NVMe is attached to the PCH (southbridge), you're screwed. There's no benefit, so expect to upgrade your platform.

PCH connection is usually 4 lanes (4 GB/s) and any NVMe worth its salt is going to saturate that bus (let alone the fact you may also have SATA HDDs, Gigabit Ethernet, TV tuner card etc.)
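
The "4 lanes (4 GB/s)" figure can be checked with a quick calculation. The sketch below assumes a PCIe 3.0-class link (8 GT/s per lane with 128b/130b line encoding) and ignores packet and protocol overhead, so real-world throughput will be somewhat lower; the function name is made up for illustration.

```python
def pcie_link_bandwidth_gb_per_s(
    transfer_rate_gt: float,
    lanes: int,
    payload_bits: int = 128,
    encoded_bits: int = 130,
) -> float:
    """Return the theoretical one-direction link bandwidth in GB/s.

    transfer_rate_gt is the per-lane rate in GT/s; the encoding ratio
    accounts for 128b/130b line coding used by PCIe 3.0 and later.
    """
    usable_gbit_per_lane = transfer_rate_gt * payload_bits / encoded_bits
    return usable_gbit_per_lane / 8 * lanes


# PCIe 3.0 x4, the link width assumed in the post for the PCH uplink.
x4_gen3 = pcie_link_bandwidth_gb_per_s(transfer_rate_gt=8.0, lanes=4)
print(f"PCIe 3.0 x4: {x4_gen3:.2f} GB/s")
```

That comes out just under 4 GB/s, so a fast PCIe 3.0 x4 SSD on its own can approach the limit of a shared x4 uplink on paper.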
Posted on Reply
#38
Vanny
jermando
It does need the NVMe to be directly connected to the SoC/uncore (northbridge). The GPU also connects to the PCIe root complex (SoC). It's the only way to have CPU <-> GPU direct communication for instant data transfers.

That's how next-gen consoles are. The SSD connects directly to the APU. Last-gen consoles had the HDD connected through the southbridge (which isn't a big deal for HDDs or SATA SSDs).

So, if you have an old Intel PC (Rocket Lake introduced 4 dedicated lanes for NVMe, just like AMD Ryzen/AM4 since 2017) where the NVMe is attached to the PCH (southbridge), you're screwed. There's no benefit, so expect to upgrade your platform.

PCH connection is usually 4 lanes (4 GB/s) and any NVMe worth its salt is going to saturate that bus (let alone the fact you may also have SATA HDDs, Gigabit Ethernet, TV tuner card etc.)
Then shit, guess I'm not using this. Whatever
Posted on Reply
#39
mechtech
windwhirl
Smartphones are not comparable. They're controlled with what is the equivalent of a giant mouse pointer. They need massive pixel density so that a large amount of content can fit in a tiny 6 inch screen at best, if not smaller, and massive scaling for the user to control the UI and other stuff relatively easily, like tapping on links or selecting content. Take out the pixel density and everything will be big. Take out the scaling and you can try to fine control things when your finger is area bombardment for your touch screen (and provided I have somewhat average fingers for the sample there, I know people with way thicker fingers than mine, and the scaling on my phone can't go to 100%, 125% is the minimum, which is what I used here).

For the reference: 1080p screen at 21.5 inches (so, around 102 PPI, a bit above the 92 PPI of a 1080p 24 inch screen), my phone's 1280x720p 5.7 inch screen scaled to match in real-world size against my display at the upper right and Firefox Responsive Design mode on the lower right to show what it would be like if phones didn't have high pixel density displays. The giant black/white circles are the size of my finger tip on the screen.
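
The pixel densities quoted above follow from the diagonal resolution divided by the diagonal size. A minimal sketch of that arithmetic, assuming square pixels and the nominal diagonal sizes given in the post (the function name is made up for illustration):

```python
import math


def pixels_per_inch(width_px: int, height_px: int, diagonal_inches: float) -> float:
    """Return pixel density as diagonal pixel count divided by diagonal size."""
    if diagonal_inches <= 0:
        raise ValueError("diagonal_inches must be positive")
    return math.hypot(width_px, height_px) / diagonal_inches


# Displays mentioned in the post.
monitor_21_5 = pixels_per_inch(1920, 1080, 21.5)
monitor_24 = pixels_per_inch(1920, 1080, 24.0)
phone_5_7 = pixels_per_inch(1280, 720, 5.7)

print(f'21.5" 1080p: {monitor_21_5:.0f} PPI')
print(f'24" 1080p:   {monitor_24:.0f} PPI')
print(f'5.7" 720p:   {phone_5_7:.0f} PPI')
```

The two monitor values match the roughly 102 and 92 PPI figures in the post, while the phone lands at about two and a half times the density of the 21.5 inch monitor.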


Your issue is that you need or want to be able to see a massive amount of content (otherwise you wouldn't have a 4K screen), have a small desk and want to use normal size scales. You can't have all three. Something's gotta give. You gotta step down your resolution, or get a bigger desk or get used to scaling.
No. My issue is I’m far sighted. So at 18” away from a 24” 1080p screen it looks like a screen door and I can see individual pixels and worse if I sit any closer. I want a screen that appears as sharp as my smartphone. The anti glare coatings don’t help but I don’t think I would want a full gloss monitor either. I know not an apples to apples comparison nor does it really require ‘retina’ resolution. I had a 23.8” 2560x1440 before the 27” 4K and it was decent also and I may go back to that someday also who knows. I think I had 125% scaling on the 2560x1440 and 150% on the 4K. I guess I misspoke a bit. 27” no issue for desk on its own but dual would be a bit overwhelming. Dual 23-24” screens would fit better. That’s the goal one day. Dual monitors.
Posted on Reply
#40
jermando
Vanny
Then shit, guess I'm not using this. Whatever
You'll be forced to use DirectStorage as soon as next-gen games arrive (i.e. something like PS5 Ratchet with its portals).
Posted on Reply
#41
Zubasa
Vanny
Then shit, guess I'm not using this. Whatever
The point of this new API is to let your 980 Pro make an actual difference compared to SATA drives.
The current software stack is from a time when HDDs were common, so the overhead didn't matter.
At most you just need to install whatever game supports DirectStorage on your 980 Pro.
Posted on Reply
#42
Vanny
Zubasa
At most you just need to install whatever game supports DirectStorage on your 980 Pro.
games on a 500gb drive that's already ~40% full? gl
jermando
You'll be forced to use DirectStorage as soon as next-gen games arrive (i.e. something like PS5 Ratchet with its portals).
or, instead of pulling my hair out trying to swap NVMe drives with everything in the way (including a GPU that is stuck to my board), I'll continue using classic SATA SSDs like a caveman. I can't even put my 980 Pro on the southbridge.
Posted on Reply
#43
chrcoluk
Not sure why people are surprised it will work on PCIe 3.0. I also see no reason why it wouldn't work on chipset NVMe ports.
Posted on Reply
#44
jermando
chrcoluk
Not sure why people are surprised it will work on PCIe 3.0. I also see no reason why it wouldn't work on chipset NVMe ports.
Because the chipset (PCH/southbridge): 1) will be saturated with a decent PCIe 3.0 NVMe, 2) there is no direct connection to the GPU.
Posted on Reply
#45
chrcoluk
jermando
Because the chipset (PCH/southbridge): 1) will be saturated with a decent PCIe 3.0 NVMe, 2) there is no direct connection to the GPU.
There isn't a direct connection to the GPU on CPU-attached NVMe either.

The reality is the PCH is mostly idle, which is why the current system works. It's oversubscribed in theory, but the vast majority of people are not fully utilising several PCH-connected devices at the same time. Also, I expect that in most games a PCIe 3.0 drive would not typically be maxed out either. The benefit of DirectStorage is the extra I/O operations per second, not so much the overall burst bandwidth. It will work just fine, like how a 3080 can work fine on PCIe 3.0 x8.

On the Xbox it works via the PCH.
Posted on Reply
#46
jermando
chrcoluk
1) There isn't a direct connection to the GPU on CPU-attached NVMe either.

2) The reality is the PCH is mostly idle, which is why the current system works. It's oversubscribed in theory, but the vast majority of people are not fully utilising several PCH-connected devices at the same time. Also, I expect that in most games a PCIe 3.0 drive would not typically be maxed out either. The benefit of DirectStorage is the extra I/O operations per second, not so much the overall burst bandwidth. It will work just fine, like how a 3080 can work fine on PCIe 3.0 x8.

3) On the Xbox it works via the PCH.
1) Have you studied the PCIe root complex architecture? It's located in the SoC/uncore (previously called northbridge), so I'm afraid you're misinformed.

That's where the GPU is attached, along with NVMe (only for AM4/AMD Zen so far and some recent Intel platforms).

2) Nope. When Ratchet gets ported to PC, you'll understand what I'm talking about. You need raw bandwidth too for instant portal switching.

3) Have you studied the XBOX Series architecture? The NVMe is connected directly to the APU (SoC/uncore part), not the southbridge (that's a separate chip).

Pretty sure you haven't even seen XBOX Series PCB pics (there are 2 PCBs).

Come on guys, there's tons of info out there, educate yourselves! :)
Posted on Reply
#47
cmoney619
jermando
1) Have you studied the PCIe root complex architecture? It's located in the SoC/uncore (previously called northbridge), so I'm afraid you're misinformed.

That's where the GPU is attached, along with NVMe (only for AM4/AMD Zen so far and some recent Intel platforms).

2) Nope. When Ratchet gets ported to PC, you'll understand what I'm talking about. You need raw bandwidth too for instant portal switching.

3) Have you studied the XBOX Series architecture? The NVMe is connected directly to the APU (SoC/uncore part), not the southbridge (that's a separate chip).

Pretty sure you haven't even seen XBOX Series PCB pics (there are 2 PCBs).

Come on guys, there's tons of info out there, educate yourselves! :)
The root complex, which is what the PCH is, can contain PCIe switches. Given that the PCH is identical to the I/O die on newer Ryzens, I believe we can assume support for this example.
Of course, those who use Linux are aware that it has always been ahead of the curve in terms of storage. I would advise reading up on FIO and elbencho, although the former also works on Windows. Billy Tallis of AnandTech has been posting a lot of Linux and storage-related content on Reddit, including on async I/O.
One area is P2PDMA (or P2P DMA), which is basically what we have here and is associated with other technologies such as GPUDirect.
Notably, P2PDMA is compatible with a wide variety of chipsets, for example, "all AMD Zen chipsets."
Posted on Reply
#48
jermando
cmoney619
1) The root complex, which is what the PCH is, can contain PCIe switches.

2) Given that the PCH is identical to the I/O die on newer Ryzens, I believe we can assume support for this example.

3) Of course, those who use Linux are aware that it has always been ahead of the curve in terms of storage. I would advise reading up on FIO and elbencho, although the former also works on Windows. Billy Tallis of AnandTech has been posting a lot of Linux and storage-related content on Reddit, including on async I/O.
One area is P2PDMA (or P2P DMA), which is basically what we have here and is associated with other technologies such as GPUDirect.
Notably, P2PDMA is compatible with a wide variety of chipsets, for example, "all AMD Zen chipsets."
1) en.wikipedia.org/wiki/Root_complex

"Root complex functionality may be implemented as a discrete device (northbridge chip), or may be integrated in the CPU."

It's a matter of pure geography: both the GPU and the SSD need to be as close as possible.

If you care to study console motherboards/PCBs, you'll notice that the SSD lanes (4 of them) lead straight to the APU chip, not the PCH.

I don't know why you have to confuse all these things.

Even if you have ample PCH bandwidth (like on TRX40), you're going to experience more latency if the SSD is not connected directly to the GPU via the SoC (PCIe root complex).

Game devs want guaranteed things: this means that if my PC has 6 x SATA HDDs in RAID0 seeding torrents, a TV tuner card recording stuff and a Gigabit Ethernet connection, saturation is inevitable.

The only way to guarantee (via DirectStorage API) zero saturation is by enforcing direct GPU <-> SSD communication via the SoC/northbridge. There's no other way.

There's a reason AMD dedicated 4 lanes to the NVMe since 2017. Intel was late in the game (Rocket Lake supports it, but only if the mobo has the actual PCB traces obviously).

2) X570 is not a normal chipset/southbridge, it's a hack.

B450/X470/B550 are the equivalent of southbridge for AMD.

Again: why do you have to confuse all these things?

AMD has 4 dedicated lanes since 2017. X570 is not needed and in fact many people avoid it (due to active cooling and a certain SATA bug).

3) Linux is a server-oriented OS, so of course it would have a more advanced I/O stack (among other things).
Posted on Reply
#49
Makaveli
jermando
1) en.wikipedia.org/wiki/Root_complex

"Root complex functionality may be implemented as a discrete device (northbridge chip), or may be integrated in the CPU."

It's a matter of pure geography: both the GPU and the SSD need to be as close as possible.

If you care to study console motherboards/PCBs, you'll notice that the SSD lanes (4 of them) lead straight to the APU chip, not the PCH.

I don't know why you have to confuse all these things.

Even if you have ample PCH bandwidth (like on TRX40), you're going to experience more latency if the SSD is not connected directly to the GPU via the SoC (PCIe root complex).

Game devs want guaranteed things: this means that if my PC has 6 x SATA HDDs in RAID0 seeding torrents, a TV tuner card recording stuff and a Gigabit Ethernet connection, saturation is inevitable.

The only way to guarantee (via DirectStorage API) zero saturation is by enforcing direct GPU <-> SSD communication via the SoC/northbridge. There's no other way.

There's a reason AMD dedicated 4 lanes to the NVMe since 2017. Intel was late in the game (Rocket Lake supports it, but only if the mobo has the actual PCB traces obviously).

2) X570 is not a normal chipset/southbridge, it's a hack.

B450/X470/B550 are the equivalent of southbridge for AMD.

Again: why do you have to confuse all these things?

AMD has 4 dedicated lanes since 2017. X570 is not needed and in fact many people avoid it (due to active cooling and a certain SATA bug).

3) Linux is a server-oriented OS, so of course it would have a more advanced I/O stack (among other things).
What is the SATA bug on X570? Curious, I don't think I've heard of it.
Posted on Reply