DirectX 12 API New Feature Set Introduces GPU Upload Heaps, Enables Simultaneous Access to VRAM for CPU and GPU

I want a GPU with an M.2 cache device like they were planning.
 
I want a GPU with an M.2 cache device like they were planning.
I don't. The latencies would be awful. Last time I checked, caches are supposed to be fast, which flash really isn't (in GPU terms).
 
One writes an article about DX12.

Puts some GPU board picture on it.

A GTX 285, which supports DX 11.1 at most.

Stupid artists ®
The GTX 285 supports DirectX 10.1.
 
Blame the TPU database for that.

But in reality it had partial DX11 support.

DX11 has Shader Model 5.0/5.1, tessellation with hull & domain shaders, DirectCompute (CS 5.0/5.1), 16K textures, BC6H/BC7 texture compression, extended pixel formats, and all 10_1 features.
 
I am not sure the CPU could use the huge bandwidth of a graphics card. The latency alone would kill it.
Works fine in current gen consoles.

I want a GPU with a M.2 cache device like they were planning.
That's just DirectStorage. The textures/files are pre-compiled to run instantly, so it's literally an NVMe cache for the GPU.
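For illustration, a minimal DirectStorage request sketch against the public DirectStorage SDK headers; the file name, sizes, and destination buffer are placeholders, and error handling plus fence signaling are omitted:

```cpp
#include <dstorage.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Queue one NVMe -> GPU-buffer read. "asset.bin" and `size` are placeholders.
void LoadAssetToVram(ID3D12Device* device, ID3D12Resource* gpuBuffer, UINT32 size)
{
    ComPtr<IDStorageFactory> factory;
    DStorageGetFactory(IID_PPV_ARGS(&factory));

    DSTORAGE_QUEUE_DESC queueDesc{};
    queueDesc.Capacity   = DSTORAGE_MAX_QUEUE_CAPACITY;
    queueDesc.Priority   = DSTORAGE_PRIORITY_NORMAL;
    queueDesc.SourceType = DSTORAGE_REQUEST_SOURCE_FILE;
    queueDesc.Device     = device;

    ComPtr<IDStorageQueue> queue;
    factory->CreateQueue(&queueDesc, IID_PPV_ARGS(&queue));

    ComPtr<IDStorageFile> file;
    factory->OpenFile(L"asset.bin", IID_PPV_ARGS(&file));

    DSTORAGE_REQUEST request{};
    request.Options.SourceType      = DSTORAGE_REQUEST_SOURCE_FILE;
    request.Options.DestinationType = DSTORAGE_REQUEST_DESTINATION_BUFFER;
    request.Source.File.Source      = file.Get();
    request.Source.File.Size        = size;
    request.Destination.Buffer.Resource = gpuBuffer;
    request.Destination.Buffer.Size     = size;
    request.UncompressedSize            = size;   // uncompressed payload

    queue->EnqueueRequest(&request);
    queue->Submit();   // completion is normally tracked with an ID3D12Fence
}
```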
 
Works fine in current gen consoles.
Current-gen console CPUs aren't the pinnacle of CPU processing power. Yes, they have access to much more memory bandwidth, but that doesn't mean they can do anything with it.

Also, they are directly connected to that memory; they do not need to go via the PCIe bus. It doesn't matter if the GPU has 1 TB/s of bandwidth when you have to go through a PCIe x16 bus that is limited to 32 GB/s with PCIe 4.0 or 64 GB/s with PCIe 5.0.

You're mostly just saving a copy there.
 
CPU + GPU unified memory architecture is nothing new; it just hasn't been done for consumer-level software. You can get unified memory with CUDA or HIP on Linux right now.
Or PS5 or Xbox; it's just levelling the PC up to the console.
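To make the unified-memory point concrete, a minimal CUDA managed-memory sketch: `cudaMallocManaged` gives one pointer that both the CPU and GPU dereference, with the driver migrating pages on demand.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float* data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;   // GPU touches the same allocation
}

int main()
{
    const int n = 1 << 20;
    float* data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float)); // visible to CPU and GPU
    for (int i = 0; i < n; ++i) data[i] = 1.0f;  // CPU writes directly

    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);
    cudaDeviceSynchronize();                     // wait before the CPU reads

    printf("data[0] = %f\n", data[0]);           // no explicit copy anywhere
    cudaFree(data);
}
```

HIP offers the same pattern via `hipMallocManaged`.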
 
I still suspect that by the end of this decade we will have chips targeting high performance that are a single SoC à la MI300 on PC. At that point, having the GPU and CPU use the same memory will make way more sense and will allow another level of performance.

But with a dedicated GPU, I think it will only be used marginally.
 
Current-gen console CPUs aren't the pinnacle of CPU processing power. Yes, they have access to much more memory bandwidth, but that doesn't mean they can do anything with it.

Also, they are directly connected to that memory; they do not need to go via the PCIe bus. It doesn't matter if the GPU has 1 TB/s of bandwidth when you have to go through a PCIe x16 bus that is limited to 32 GB/s with PCIe 4.0 or 64 GB/s with PCIe 5.0.

You're mostly just saving a copy there.
They're still x86-64 Zen hardware, meaning AMD could definitely make a Zen 4/Zen 5 variant available in the PC market.


32 GB/s is far faster than any current NVMe drive by a large amount, even though that bandwidth is shared with other traffic at the same time. And it's definitely going to be a lot faster than staging through system RAM, since every step removed takes out latency, and that latency is the killer.
 
Current-gen console CPUs aren't the pinnacle of CPU processing power. Yes, they have access to much more memory bandwidth, but that doesn't mean they can do anything with it.

Also, they are directly connected to that memory; they do not need to go via the PCIe bus. It doesn't matter if the GPU has 1 TB/s of bandwidth when you have to go through a PCIe x16 bus that is limited to 32 GB/s with PCIe 4.0 or 64 GB/s with PCIe 5.0.

You're mostly just saving a copy there.
The consoles' Zen 2 CPU cluster is limited by internal Infinity Fabric links despite the high 256-bit or 320-bit GDDR6-14000 memory bandwidth. Only the iGPU can fully exploit system memory bandwidth.

PCIe 4.0 x16's 32 GB/s in the read direction is slightly above Xbox 360's 22.4 GB/s, or about half of Xbox One's 68 GB/s texture memory bandwidth. A PC iGPU is not limited by a PCIe 4.0 x16 link.
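A quick back-of-envelope check of those per-direction figures (not from the thread, just PCIe line-rate arithmetic):

```cpp
#include <cstdio>

int main()
{
    // PCIe 4.0: 16 GT/s per lane with 128b/130b line coding, 16 lanes.
    constexpr double perLaneGBps = 16.0 * (128.0 / 130.0) / 8.0; // ~1.97 GB/s
    constexpr double x16GBps     = perLaneGBps * 16.0;           // ~31.5 GB/s
    // PCIe 5.0 doubles the signaling rate to 32 GT/s.
    printf("PCIe 4.0 x16: %.1f GB/s per direction\n", x16GBps);
    printf("PCIe 5.0 x16: %.1f GB/s per direction\n", x16GBps * 2.0);
}
```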
 
The consoles' Zen 2 CPU cluster is limited by internal Infinity Fabric links despite the high 256-bit or 320-bit GDDR6-14000 memory bandwidth. Only the iGPU can fully exploit system memory bandwidth.

PCIe 4.0 x16's 32 GB/s in the read direction is slightly above Xbox 360's 22.4 GB/s, or about half of Xbox One's 68 GB/s texture memory bandwidth. A PC iGPU is not limited by a PCIe 4.0 x16 link.
Good point about the Infinity Fabric limits, and that would still be true in our case (traffic to the CPU die would have to compete with the traffic from memory). But anyway, I don't really see a scenario where that bandwidth would be used up to that point.

Also, I wonder how the caches will handle this, since they cache memory lines from main memory.


They're still x86-64 Zen hardware, meaning AMD could definitely make a Zen 4/Zen 5 variant available in the PC market.


32 GB/s is far faster than any current NVMe drive by a large amount, even though that bandwidth is shared with other traffic at the same time. And it's definitely going to be a lot faster than staging through system RAM, since every step removed takes out latency, and that latency is the killer.
VRAM can't be compared to NVMe, so I am not sure what you are trying to describe here.

VRAM is temporary storage, same as main RAM, whereas NVMe is long-term storage. The data you would want to access in VRAM is probably not stored on the SSD anyway.

I see this more as a good way to utilize the GPU more. Things like mesh shaders are super powerful, but you are limited in what you can do right now by having to sync data between the GPU and CPU.

For example, I could see very complex mesh shaders performing complex destruction on meshes that also affects the collision model. With this tech, the CPU could have the collision model in VRAM.

It will really be about exchanging temporary data, not static data.
 
Good point about the Infinity Fabric limits, and that would still be true in our case (traffic to the CPU die would have to compete with the traffic from memory). But anyway, I don't really see a scenario where that bandwidth would be used up to that point.

Also, I wonder how the caches will handle this, since they cache memory lines from main memory.
The AMD 4700S APU is a recycled PS5 APU with 16 GB of GDDR6-14000 memory for the PC market, and it has been benchmarked.

Despite the APU's single-chip design, Infinity Fabric links still exist between the IO and CCX blocks, i.e. AMD's cut-and-paste engineering.

Renoir APU example:

[image: Renoir APU block diagram]

There's a reason some Epyc SKUs have double Infinity Fabric links between the CCD and IO dies.

They're still x86-64 Zen hardware, meaning AMD could definitely make a Zen 4/Zen 5 variant available in the PC market.


32 GB/s is far faster than any current NVMe drive by a large amount, even though that bandwidth is shared with other traffic at the same time. And it's definitely going to be a lot faster than staging through system RAM, since every step removed takes out latency, and that latency is the killer.
FYI, the AMD 4700S APU is a recycled PS5 APU with 16 GB of GDDR6-14000 memory for the PC market, and it has been benchmarked.

AMD supplied "Designed for Windows" ACPI UEFI firmware for the recycled PS5 4700S APU so it can boot ACPI HAL-enabled Windows.

A PS5 APU with "Designed for Windows" ACPI UEFI firmware is an AMD-based x86-64 PC.

My QNAP NAS has an Intel Haswell Core i7-4770T CPU (45 watts), and it can't directly boot Windows since it's missing "Designed for Windows" ACPI-enabled UEFI. That same Core i7-4770T was recycled from a slim office Windows PC.

AMD designed AM4 cooler mounting holes into the 4800S. AMD is not throwing defective console APUs with working Zen 2 CPUs in the bin, i.e. the BOM cost of these chips needs to be recovered.
 
VRAM can't be compared to NVMe, so I am not sure what you are trying to describe here.
Feeding one to the other as a cache system.

It's a ton faster (latency-wise) to go from NVMe to VRAM than any of the current methods, so even with bandwidth lower than VRAM speeds, it's going to massively reduce stuttering on low-VRAM cards, for example.
 
Feeding one to the other as a cache system.

It's a ton faster (latency-wise) to go from NVMe to VRAM than any of the current methods, so even with bandwidth lower than VRAM speeds, it's going to massively reduce stuttering on low-VRAM cards, for example.
Maybe, but system RAM is generally way cheaper, upgradable, and available in greater quantities, so it's better to just cache it there and let main memory handle the caching.

Think more about data that both the CPU and GPU need access to and that needs to be modified on the GPU.

It's quite hard to see the real usage of this technology in gaming, as many use cases don't exist yet because it didn't make sense to use them before.
 
Feeding one to the other as a cache system.

It's a ton faster (latency-wise) to go from NVMe to VRAM than any of the current methods, so even with bandwidth lower than VRAM speeds, it's going to massively reduce stuttering on low-VRAM cards, for example.
Using system RAM as an art-asset cache didn't stop the stuttering mess in recent games that exceeded 8 GB of VRAM. 32 GB/s from PCIe 4.0 x16 is about half of Xbox One's texture bandwidth.

Using system RAM as a landing zone from NVMe adds extra memory-copy latency.

NVIDIA's GPUDirect with CUDA skips the system-memory landing zone for direct NVMe-to-VRAM transfers.

MS's current DX12U DirectStorage build for PC doesn't skip system memory and has a double-data-storage issue. PC's DX12U DirectStorage needs to evolve as DX12U gains AMD Fusion-like features, i.e. this topic's DX12U improvements.
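For reference, the GPUDirect Storage path looks roughly like this with NVIDIA's cuFile API (a sketch assuming a GDS-enabled driver stack; the path and size are placeholders, error handling omitted):

```cpp
#include <cuda_runtime.h>
#include <cufile.h>
#include <fcntl.h>
#include <unistd.h>

int main()
{
    cuFileDriverOpen();

    int fd = open("/mnt/nvme/asset.bin", O_RDONLY | O_DIRECT);
    CUfileDescr_t descr{};
    descr.handle.fd = fd;
    descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
    CUfileHandle_t handle;
    cuFileHandleRegister(&handle, &descr);

    const size_t size = 64 << 20;            // 64 MB placeholder
    void* devPtr = nullptr;
    cudaMalloc(&devPtr, size);
    cuFileBufRegister(devPtr, size, 0);      // pin the VRAM buffer for DMA

    // DMA from storage straight into VRAM; no system-RAM bounce buffer.
    cuFileRead(handle, devPtr, size, /*file_offset=*/0, /*devPtr_offset=*/0);

    cuFileBufDeregister(devPtr);
    cuFileHandleDeregister(handle);
    close(fd);
    cudaFree(devPtr);
    cuFileDriverClose();
}
```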
 
Maybe, but system RAM is generally way cheaper, upgradable, and available in greater quantities, so it's better to just cache it there and let main memory handle the caching.

Think more about data that both the CPU and GPU need access to and that needs to be modified on the GPU.

It's quite hard to see the real usage of this technology in gaming, as many use cases don't exist yet because it didn't make sense to use them before.
You missed the point: latency.
It has to be moved from storage TO RAM, and no, you can't fit everything into RAM. There are multiple AAA titles out there with 100 GB+ installs right now.

If it has to go storage -> RAM -> VRAM, it's got delays every step of the way, versus the GPU just loading what's needed directly.
DxDiag, for example, now shows a mix of VRAM + system RAM; with DirectStorage your NVMe drive becomes part of that setup too, and the GPU is aware of it, instead of the CPU processing all the work prior to that point.
 
You missed the point: latency.
It has to be moved from storage TO RAM, and no, you can't fit everything into RAM. There are multiple AAA titles out there with 100 GB+ installs right now.

If it has to go storage -> RAM -> VRAM, it's got delays every step of the way, versus the GPU just loading what's needed directly.
DxDiag, for example, now shows a mix of VRAM + system RAM; with DirectStorage your NVMe drive becomes part of that setup too, and the GPU is aware of it, instead of the CPU processing all the work prior to that point.
I think you're mixing up DirectStorage and this technology.

What you describe is DirectStorage. What this technology allows is for the CPU to edit things in VRAM without having to copy them to local memory and, very importantly, without the GPU losing access to that data. It's true that you could maybe use it for something like DirectStorage, but that would just be the tip of the iceberg. (And mostly, you don't need this technology at all to achieve what you're describing; DirectStorage alone is enough.) Also, latency-wise, the main source of latency will be the SSD access, which is measured in microseconds, while system RAM latency is measured in nanoseconds. But you do save a big copy in RAM, so you save a lot of CPU cycles and bandwidth by sending data directly where you want it.

This technology is for scenarios where you want to compute on a data set with both the GPU and the CPU, not just for copying stuff into VRAM without passing through system RAM. It has uses outside of games right now, but not much in games, since you can't do this today. The things that are possible with it will appear as the technology is deployed.
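For a feel of what "the CPU edits things in VRAM" means in practice, a hypothetical sketch: assuming `collisionBuffer` lives in one of the new CPU-visible VRAM heaps, the CPU can patch it in place instead of staging a copy.

```cpp
#include <cstring>
#include <d3d12.h>

// Patch a small region of a VRAM-resident buffer directly from the CPU.
// `patch` and `patchSize` are placeholders; app-level fencing must still
// ensure the GPU isn't reading the region mid-update.
void PatchInVram(ID3D12Resource* collisionBuffer, const void* patch, size_t patchSize)
{
    void* mapped = nullptr;
    D3D12_RANGE noRead{0, 0};                // CPU only writes here
    collisionBuffer->Map(0, &noRead, &mapped);
    std::memcpy(mapped, patch, patchSize);   // writes go over the BAR into VRAM
    collisionBuffer->Unmap(0, nullptr);      // no staging buffer, no CopyResource
}
```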
 
I think you're mixing up DirectStorage and this technology.

What you describe is DirectStorage. What this technology allows is for the CPU to edit things in VRAM without having to copy them to local memory and, very importantly, without the GPU losing access to that data. It's true that you could maybe use it for something like DirectStorage, but that would just be the tip of the iceberg. (And mostly, you don't need this technology at all to achieve what you're describing; DirectStorage alone is enough.) Also, latency-wise, the main source of latency will be the SSD access, which is measured in microseconds, while system RAM latency is measured in nanoseconds. But you do save a big copy in RAM, so you save a lot of CPU cycles and bandwidth by sending data directly where you want it.

This technology is for scenarios where you want to compute on a data set with both the GPU and the CPU, not just for copying stuff into VRAM without passing through system RAM. It has uses outside of games right now, but not much in games, since you can't do this today. The things that are possible with it will appear as the technology is deployed.

You argued "But system ram is generally way cheaper, upgradable and available in greater quantities. So better just cache it there and leave the main memory do the caching."

The current PC DirectStorage implementation has system memory as a landing zone.

[image: Microsoft DirectStorage data path]

The current PC DirectStorage implementation has the PC's legacy "double copy" issue.

Before DirectStorage on the Windows-based PC:

[image: legacy storage-to-GPU IO path]

Meanwhile, NVIDIA's GPUDirect in HPC markets:

[image: NVIDIA GPUDirect Storage data path]

NVIDIA's RTX IO with the Ampere generation: https://developer.nvidia.com/rtx-io

[image: NVIDIA RTX IO data path]

PC's current DirectStorage implementation needs middleware evolution toward a direct NVMe-to-GPU path, i.e. this topic's DirectX 12U improvement direction.

PC's current DirectStorage 1.1 implementation is half-baked. The console's DirectStorage model is the destination.
 
You argued ....
Good point!

GPU decompression was added with DirectStorage 1.1.

I was under the impression that DirectStorage 1.0 went directly from NVMe to the GPU, but you are right: it's actually DMA access from main memory. The CPU doesn't have to intervene, and the GPU can communicate directly with the memory controller to get the data.

It's a bit the opposite of the technology described in this news.

In this news, it's the CPU that can directly access GPU memory.
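For completeness, the DirectStorage 1.1 GPU-decompression path is just an extra field on the request; a sketch (sizes are placeholders, GDeflate is the format Microsoft ships):

```cpp
#include <dstorage.h>

// Build a request whose payload the runtime decompresses on the GPU.
DSTORAGE_REQUEST MakeCompressedRequest(IDStorageFile* file, ID3D12Resource* dest,
                                       UINT32 compressedSize, UINT32 uncompressedSize)
{
    DSTORAGE_REQUEST request{};
    request.Options.SourceType        = DSTORAGE_REQUEST_SOURCE_FILE;
    request.Options.DestinationType   = DSTORAGE_REQUEST_DESTINATION_BUFFER;
    request.Options.CompressionFormat = DSTORAGE_COMPRESSION_FORMAT_GDEFLATE;
    request.Source.File.Source        = file;
    request.Source.File.Size          = compressedSize;
    request.Destination.Buffer.Resource = dest;
    request.Destination.Buffer.Size     = uncompressedSize;
    request.UncompressedSize            = uncompressedSize;
    return request;
}
```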
 
Good point!

GPU decompression was added with DirectStorage 1.1.

I was under the impression that DirectStorage 1.0 went directly from NVMe to the GPU, but you are right: it's actually DMA access from main memory. The CPU doesn't have to intervene, and the GPU can communicate directly with the memory controller to get the data.

It's a bit the opposite of the technology described in this news.

In this news, it's the CPU that can directly access GPU memory.
[image: PC's current DirectStorage 1.1 data path]

PC's current DirectStorage 1.1 implementation. From https://devblogs.microsoft.com/directx/directstorage-1-1-now-available/

There are the chipset's DMA functions from NVMe to system memory (we're not in PIO modes), and then there are the GPU's DMA functions from system memory to GPU memory.

---
For this topic: Microsoft has announced a new DirectX 12 GPU optimization feature that works in conjunction with Resizable BAR, called GPU Upload Heaps, which allows the CPU to have direct, simultaneous access to GPU memory. This can increase performance in DX12 titles and decrease system RAM utilization, since the feature circumvents the need to copy data from the CPU to the GPU.
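In code, this surfaces as a new heap type; a minimal sketch assuming Agility SDK headers that expose `D3D12_HEAP_TYPE_GPU_UPLOAD` and the `D3D12_FEATURE_D3D12_OPTIONS16` check (and a system with Resizable BAR enabled):

```cpp
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Create a buffer that lives in VRAM yet stays CPU-mappable.
ComPtr<ID3D12Resource> CreateGpuUploadBuffer(ID3D12Device* device, UINT64 size)
{
    D3D12_FEATURE_DATA_D3D12_OPTIONS16 options16{};
    device->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS16,
                                &options16, sizeof(options16));
    if (!options16.GPUUploadHeapSupported)
        return nullptr;                       // fall back to UPLOAD heap + copy

    D3D12_HEAP_PROPERTIES heapProps{};
    heapProps.Type = D3D12_HEAP_TYPE_GPU_UPLOAD;  // VRAM, CPU-visible via BAR

    D3D12_RESOURCE_DESC desc{};
    desc.Dimension        = D3D12_RESOURCE_DIMENSION_BUFFER;
    desc.Width            = size;
    desc.Height           = 1;
    desc.DepthOrArraySize = 1;
    desc.MipLevels        = 1;
    desc.SampleDesc.Count = 1;
    desc.Layout           = D3D12_TEXTURE_LAYOUT_ROW_MAJOR;

    ComPtr<ID3D12Resource> buffer;
    device->CreateCommittedResource(&heapProps, D3D12_HEAP_FLAG_NONE, &desc,
                                    D3D12_RESOURCE_STATE_GENERIC_READ, // as for UPLOAD heaps
                                    nullptr, IID_PPV_ARGS(&buffer));
    return buffer;   // Map() it and write from the CPU while the GPU reads it
}
```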

CPU ping-pong through GPU VRAM is limited by the PCIe 4.0 x16 link (32 GB/s per direction).

[image: Killzone Shadow Fall CPU/GPU memory map example]


Using PS4's Killzone Shadow Fall as an example, the shared CPU-GPU data storage is usually small, i.e. the bulk of the CPU and GPU data sets doesn't need to be known by the other node, e.g. the CPU shouldn't be interested in the GPU's framebuffer and texture processing activities.
 
Works fine in current gen consoles.


That's just DirectStorage. The textures/files are pre-compiled to run instantly, so it's literally an NVMe cache for the GPU.
It works fine on current consoles because of unified memory. On PC you cannot use VRAM for the CPU as you would RAM, and vice versa; the PCIe link latency is too high.

Using PS4's Killzone Shadow Fall as an example, the shared CPU-GPU data storage is usually small, i.e. the bulk of the CPU and GPU data sets doesn't need to be known by the other node, e.g. the CPU shouldn't be interested in the GPU's framebuffer and texture processing activities.
Yes! That's because:
- the CPU and GPU do very different things with different data, and
- sharing data between the GPU and CPU is very costly even with unified memory because of memory coherency: if the GPU and CPU want to modify the same memory range, they have to be synchronized and their caches have to be flushed, which destroys performance. The same issue occurs with atomic operations inside a multi-core GPU, and it can destroy performance even on a single chip with access to a common cache! (A toy demo of this is sketched below.)
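A toy CPU-only illustration of that coherency cost (my own sketch, not from the thread): eight threads hammering one shared atomic force the cache line to bounce between cores, while padded per-thread counters avoid the ping-pong entirely.

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

struct alignas(64) Padded { std::atomic<long> v{0}; }; // one cache line each

template <typename Fn>
static long long timed_ms(Fn&& fn)
{
    auto t0 = std::chrono::steady_clock::now();
    fn();
    return std::chrono::duration_cast<std::chrono::milliseconds>(
               std::chrono::steady_clock::now() - t0).count();
}

int main()
{
    constexpr int  kThreads = 8;
    constexpr long kIters   = 2'000'000;
    std::atomic<long> shared{0};
    Padded local[kThreads];

    auto run = [&](auto&& body) {
        std::vector<std::thread> pool;
        for (int t = 0; t < kThreads; ++t) pool.emplace_back(body, t);
        for (auto& th : pool) th.join();
    };

    long long contended = timed_ms([&] {
        run([&](int) {
            for (long i = 0; i < kIters; ++i)
                shared.fetch_add(1, std::memory_order_relaxed); // line bounces
        });
    });

    long long partitioned = timed_ms([&] {
        run([&](int t) {
            for (long i = 0; i < kIters; ++i)
                local[t].v.fetch_add(1, std::memory_order_relaxed); // stays local
        });
    });

    printf("contended: %lld ms, partitioned: %lld ms\n", contended, partitioned);
}
```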
 