DirectX 12 API New Feature Set Introduces GPU Upload Heaps, Enables Simultaneous Access to VRAM for CPU and GPU

I want a GPU with an M.2 cache device like they were planning.
 
I want a GPU with an M.2 cache device like they were planning.
I don't. The latencies would be awful. Last time I checked, caches are supposed to be fast, which flash really isn't (in GPU terms).
 
One writes an article about DX12.

Puts some GPU board picture on it.

A GTX 285, which supports DX 11.1 at most.

Stupid artists ®
The GTX 285 supports DirectX 10.1.
 
Blame the TPU database for that.

But in reality it had partial DX11 support.

DX11 has Shader Model 5.0/5.1, tessellation with hull & domain shaders, DirectCompute (CS 5.0/5.1), 16K textures, BC6H/BC7 texture compression, extended pixel formats, and all 10_1 features.
 
I am not sure the CPU could use the huge bandwidth of a graphics card. The latency alone would kill it.
Works fine in current gen consoles.

I want a GPU with a M.2 cache device like they were planning.
That's just DirectStorage. The textures/files are pre-compiled to run instantly, so it's literally an NVMe cache for the GPU.
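For illustration, a minimal DirectStorage request sketch against the public DirectStorage SDK headers; the file name, sizes, and destination buffer are placeholders, and error handling plus fence signaling are omitted:

```cpp
#include <dstorage.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Queue one NVMe -> GPU-buffer read. "asset.bin" and `size` are placeholders.
void LoadAssetToVram(ID3D12Device* device, ID3D12Resource* gpuBuffer, UINT32 size)
{
    ComPtr<IDStorageFactory> factory;
    DStorageGetFactory(IID_PPV_ARGS(&factory));

    DSTORAGE_QUEUE_DESC queueDesc{};
    queueDesc.Capacity   = DSTORAGE_MAX_QUEUE_CAPACITY;
    queueDesc.Priority   = DSTORAGE_PRIORITY_NORMAL;
    queueDesc.SourceType = DSTORAGE_REQUEST_SOURCE_FILE;
    queueDesc.Device     = device;

    ComPtr<IDStorageQueue> queue;
    factory->CreateQueue(&queueDesc, IID_PPV_ARGS(&queue));

    ComPtr<IDStorageFile> file;
    factory->OpenFile(L"asset.bin", IID_PPV_ARGS(&file));

    DSTORAGE_REQUEST request{};
    request.Options.SourceType      = DSTORAGE_REQUEST_SOURCE_FILE;
    request.Options.DestinationType = DSTORAGE_REQUEST_DESTINATION_BUFFER;
    request.Source.File.Source      = file.Get();
    request.Source.File.Size        = size;
    request.Destination.Buffer.Resource = gpuBuffer;
    request.Destination.Buffer.Size     = size;
    request.UncompressedSize            = size;   // uncompressed payload

    queue->EnqueueRequest(&request);
    queue->Submit();   // completion is normally tracked with an ID3D12Fence
}
```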
 
Works fine in current gen consoles.
Current-gen console CPUs aren't the pinnacle of CPU processing power. Yes, they have access to much more memory bandwidth, but that doesn't mean they can do anything with it.

Also, they are directly connected to that memory; they do not need to go via the PCIe bus. It doesn't matter if the GPU has 1 TB/s of bandwidth when you have to go through a PCIe x16 bus that is limited to 32 GB/s with PCIe 4.0 or 64 GB/s with PCIe 5.0.

You're mostly just saving a copy there.
 
CPU + GPU unified memory architecture is nothing new; it just hasn't been done for consumer-level software. You can get unified memory with CUDA or HIP on Linux right now.
Or PS5 or Xbox; it's just levelling the PC up to the console.
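To make the unified-memory point concrete, a minimal CUDA managed-memory sketch: `cudaMallocManaged` gives one pointer that both the CPU and GPU dereference, with the driver migrating pages on demand.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float* data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;   // GPU touches the same allocation
}

int main()
{
    const int n = 1 << 20;
    float* data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float)); // visible to CPU and GPU
    for (int i = 0; i < n; ++i) data[i] = 1.0f;  // CPU writes directly

    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);
    cudaDeviceSynchronize();                     // wait before the CPU reads

    printf("data[0] = %f\n", data[0]);           // no explicit copy anywhere
    cudaFree(data);
}
```

HIP offers the same pattern via `hipMallocManaged`.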
 
I still suspect that by the end of this decade we will have chips targeting high performance that are a single SoC à la MI300 on PC. At that point, having the GPU and CPU use the same memory will make way more sense and will allow another level of performance.

But with a dedicated GPU, I think it will only be used marginally.
 
Current-gen console CPUs aren't the pinnacle of CPU processing power. Yes, they have access to much more memory bandwidth, but that doesn't mean they can do anything with it.

Also, they are directly connected to that memory; they do not need to go via the PCIe bus. It doesn't matter if the GPU has 1 TB/s of bandwidth when you have to go through a PCIe x16 bus that is limited to 32 GB/s with PCIe 4.0 or 64 GB/s with PCIe 5.0.

You're mostly just saving a copy there.
They're still x86-64 Zen hardware, meaning AMD could definitely make a Zen 4/Zen 5 variant available in the PC market.


32 GB/s is far faster than any current NVMe drive by a large amount, even though that bandwidth is shared with other traffic at the same time. And it's definitely going to be a lot faster than staging through system RAM, since every step removed takes out latency, and that latency is the killer.
 
Current-gen console CPUs aren't the pinnacle of CPU processing power. Yes, they have access to much more memory bandwidth, but that doesn't mean they can do anything with it.

Also, they are directly connected to that memory; they do not need to go via the PCIe bus. It doesn't matter if the GPU has 1 TB/s of bandwidth when you have to go through a PCIe x16 bus that is limited to 32 GB/s with PCIe 4.0 or 64 GB/s with PCIe 5.0.

You're mostly just saving a copy there.
The consoles' Zen 2 CPU cluster is limited by internal Infinity Fabric links despite the high 256-bit or 320-bit GDDR6-14000 memory bandwidth. Only the iGPU can fully exploit system memory bandwidth.

PCIe 4.0 x16's 32 GB/s in the read direction is slightly above Xbox 360's 22.4 GB/s, or about half of Xbox One's 68 GB/s texture memory bandwidth. A PC iGPU is not limited by a PCIe 4.0 x16 link.
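A quick back-of-envelope check of those per-direction figures (not from the thread, just PCIe line-rate arithmetic):

```cpp
#include <cstdio>

int main()
{
    // PCIe 4.0: 16 GT/s per lane with 128b/130b line coding, 16 lanes.
    constexpr double perLaneGBps = 16.0 * (128.0 / 130.0) / 8.0; // ~1.97 GB/s
    constexpr double x16GBps     = perLaneGBps * 16.0;           // ~31.5 GB/s
    // PCIe 5.0 doubles the signaling rate to 32 GT/s.
    printf("PCIe 4.0 x16: %.1f GB/s per direction\n", x16GBps);
    printf("PCIe 5.0 x16: %.1f GB/s per direction\n", x16GBps * 2.0);
}
```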
 
The consoles' Zen 2 CPU cluster is limited by internal Infinity Fabric links despite the high 256-bit or 320-bit GDDR6-14000 memory bandwidth. Only the iGPU can fully exploit system memory bandwidth.

PCIe 4.0 x16's 32 GB/s in the read direction is slightly above Xbox 360's 22.4 GB/s, or about half of Xbox One's 68 GB/s texture memory bandwidth. A PC iGPU is not limited by a PCIe 4.0 x16 link.
Good point about the Infinity Fabric limits, and that would still be true in our case (traffic to the CPU die would have to compete with the traffic from memory). But anyway, I don't really see a scenario where that bandwidth would be used up to that point.

Also, I wonder how the caches will handle this, since they cache memory lines from main memory.


They're still x86-64 Zen hardware, meaning AMD could definitely make a Zen 4/Zen 5 variant available in the PC market.


32 GB/s is far faster than any current NVMe drive by a large amount, even though that bandwidth is shared with other traffic at the same time. And it's definitely going to be a lot faster than staging through system RAM, since every step removed takes out latency, and that latency is the killer.
VRAM can't be compared to NVMe, so I am not sure what you are trying to describe here.

VRAM is temporary storage, same as main RAM, whereas NVMe is long-term storage. The data you would want to access in VRAM is probably not stored on the SSD anyway.

I see this more as a good way to utilize the GPU more. Things like mesh shaders are super powerful, but you are limited in what you can do right now by having to sync data between the GPU and CPU.

For example, I could see very complex mesh shaders performing complex destruction on meshes that also affects the collision model. With this tech, the CPU could have the collision model in VRAM.

It will really be about exchanging temporary data, not static data.
 
Good point about the Infinity Fabric limits, and that would still be true in our case (traffic to the CPU die would have to compete with the traffic from memory). But anyway, I don't really see a scenario where that bandwidth would be used up to that point.

Also, I wonder how the caches will handle this, since they cache memory lines from main memory.
The AMD 4700S APU is a recycled PS5 APU with 16 GB of GDDR6-14000 memory for the PC market, and it has been benchmarked.

Despite the APU's single-chip design, Infinity Fabric links still exist between the IO and CCX blocks, i.e. AMD's cut-and-paste engineering.

Renoir APU example:

[image: Renoir APU block diagram]

There's a reason some Epyc SKUs have double Infinity Fabric links between the CCD and IO dies.

They're still x86-64 Zen hardware, meaning AMD could definitely make a Zen 4/Zen 5 variant available in the PC market.


32 GB/s is far faster than any current NVMe drive by a large amount, even though that bandwidth is shared with other traffic at the same time. And it's definitely going to be a lot faster than staging through system RAM, since every step removed takes out latency, and that latency is the killer.
FYI, the AMD 4700S APU is a recycled PS5 APU with 16 GB of GDDR6-14000 memory for the PC market, and it has been benchmarked.

AMD supplied "Designed for Windows" ACPI UEFI firmware for the recycled PS5 4700S APU so it can boot ACPI HAL-enabled Windows.

A PS5 APU with "Designed for Windows" ACPI UEFI firmware is an AMD-based x86-64 PC.

My QNAP NAS has an Intel Haswell Core i7-4770T CPU (45 watts), and it can't directly boot Windows since it's missing "Designed for Windows" ACPI-enabled UEFI. That same Core i7-4770T was recycled from a slim office Windows PC.

AMD designed AM4 cooler mounting holes into the 4800S. AMD is not throwing defective console APUs with working Zen 2 CPUs in the bin, i.e. the BOM cost of these chips needs to be recovered.
 
VRAM can't be compared to NVMe, so I am not sure what you are trying to describe here.
Feeding one to the other as a cache system.

It's a ton faster (latency-wise) to go from NVMe to VRAM than any of the current methods, so even with bandwidth lower than VRAM speeds, it's going to massively reduce stuttering on low-VRAM cards, for example.
 
Feeding one to the other as a cache system.

It's a ton faster (latency-wise) to go from NVMe to VRAM than any of the current methods, so even with bandwidth lower than VRAM speeds, it's going to massively reduce stuttering on low-VRAM cards, for example.
Maybe, but system RAM is generally way cheaper, upgradable, and available in greater quantities, so it's better to just cache it there and let main memory handle the caching.

Think more about data that both the CPU and GPU need access to and that needs to be modified on the GPU.

It's quite hard to see the real usage of this technology in gaming, as many use cases don't exist yet because it didn't make sense to use them before.
 
Feeding one to the other as a cache system.

It's a ton faster (latency-wise) to go from NVMe to VRAM than any of the current methods, so even with bandwidth lower than VRAM speeds, it's going to massively reduce stuttering on low-VRAM cards, for example.
Using system RAM as an art-asset cache didn't stop the stuttering mess in recent games that exceeded 8 GB of VRAM. 32 GB/s from PCIe 4.0 x16 is about half of Xbox One's texture bandwidth.

Using system RAM as a landing zone from NVMe adds extra memory-copy latency.

NVIDIA's GPUDirect with CUDA skips the system-memory landing zone for direct NVMe-to-VRAM transfers.

MS's current DX12U DirectStorage build for PC doesn't skip system memory and has a double-data-storage issue. PC's DX12U DirectStorage needs to evolve as DX12U gains AMD Fusion-like features, i.e. this topic's DX12U improvements.
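For reference, the GPUDirect Storage path looks roughly like this with NVIDIA's cuFile API (a sketch assuming a GDS-enabled driver stack; the path and size are placeholders, error handling omitted):

```cpp
#include <cuda_runtime.h>
#include <cufile.h>
#include <fcntl.h>
#include <unistd.h>

int main()
{
    cuFileDriverOpen();

    int fd = open("/mnt/nvme/asset.bin", O_RDONLY | O_DIRECT);
    CUfileDescr_t descr{};
    descr.handle.fd = fd;
    descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
    CUfileHandle_t handle;
    cuFileHandleRegister(&handle, &descr);

    const size_t size = 64 << 20;            // 64 MB placeholder
    void* devPtr = nullptr;
    cudaMalloc(&devPtr, size);
    cuFileBufRegister(devPtr, size, 0);      // pin the VRAM buffer for DMA

    // DMA from storage straight into VRAM; no system-RAM bounce buffer.
    cuFileRead(handle, devPtr, size, /*file_offset=*/0, /*devPtr_offset=*/0);

    cuFileBufDeregister(devPtr);
    cuFileHandleDeregister(handle);
    close(fd);
    cudaFree(devPtr);
    cuFileDriverClose();
}
```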
 
Maybe, but system RAM is generally way cheaper, upgradable, and available in greater quantities, so it's better to just cache it there and let main memory handle the caching.

Think more about data that both the CPU and GPU need access to and that needs to be modified on the GPU.

It's quite hard to see the real usage of this technology in gaming, as many use cases don't exist yet because it didn't make sense to use them before.
You missed the point: latency.
It has to be moved from storage TO RAM, and no, you can't fit everything into RAM. There are multiple AAA titles out there with 100 GB+ installs right now.

If it has to go storage -> RAM -> VRAM, it's got delays every step of the way, versus the GPU just loading what's needed directly.
DxDiag, for example, now shows a mix of VRAM + system RAM; with DirectStorage your NVMe drive becomes part of that setup too, and the GPU is aware of it, instead of the CPU processing all the work prior to that point.
 
You missed the point: latency.
It has to be moved from storage TO RAM, and no, you can't fit everything into RAM. There are multiple AAA titles out there with 100 GB+ installs right now.

If it has to go storage -> RAM -> VRAM, it's got delays every step of the way, versus the GPU just loading what's needed directly.
DxDiag, for example, now shows a mix of VRAM + system RAM; with DirectStorage your NVMe drive becomes part of that setup too, and the GPU is aware of it, instead of the CPU processing all the work prior to that point.
I think you're mixing up DirectStorage and this technology.

What you describe is DirectStorage. What this technology allows is for the CPU to edit things in VRAM without having to copy them to local memory and, very importantly, without the GPU losing access to that data. It's true that you could maybe use it for something like DirectStorage, but that would just be the tip of the iceberg. (And mostly, you don't need this technology at all to achieve what you're describing; DirectStorage alone is enough.) Also, latency-wise, the main source of latency will be the SSD access, which is measured in microseconds, while system RAM latency is measured in nanoseconds. But you do save a big copy in RAM, so you save a lot of CPU cycles and bandwidth by sending data directly where you want it.

This technology is for scenarios where you want to compute on a data set with both the GPU and the CPU, not just for copying stuff into VRAM without passing through system RAM. It has uses outside of games right now, but not much in games, since you can't do this today. The things that are possible with it will appear as the technology is deployed.
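For a feel of what "the CPU edits things in VRAM" means in practice, a hypothetical sketch: assuming `collisionBuffer` lives in one of the new CPU-visible VRAM heaps, the CPU can patch it in place instead of staging a copy.

```cpp
#include <cstring>
#include <d3d12.h>

// Patch a small region of a VRAM-resident buffer directly from the CPU.
// `patch` and `patchSize` are placeholders; app-level fencing must still
// ensure the GPU isn't reading the region mid-update.
void PatchInVram(ID3D12Resource* collisionBuffer, const void* patch, size_t patchSize)
{
    void* mapped = nullptr;
    D3D12_RANGE noRead{0, 0};                // CPU only writes here
    collisionBuffer->Map(0, &noRead, &mapped);
    std::memcpy(mapped, patch, patchSize);   // writes go over the BAR into VRAM
    collisionBuffer->Unmap(0, nullptr);      // no staging buffer, no CopyResource
}
```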
 
I think you're mixing up DirectStorage and this technology.

What you describe is DirectStorage. What this technology allows is for the CPU to edit things in VRAM without having to copy them to local memory and, very importantly, without the GPU losing access to that data. It's true that you could maybe use it for something like DirectStorage, but that would just be the tip of the iceberg. (And mostly, you don't need this technology at all to achieve what you're describing; DirectStorage alone is enough.) Also, latency-wise, the main source of latency will be the SSD access, which is measured in microseconds, while system RAM latency is measured in nanoseconds. But you do save a big copy in RAM, so you save a lot of CPU cycles and bandwidth by sending data directly where you want it.

This technology is for scenarios where you want to compute on a data set with both the GPU and the CPU, not just for copying stuff into VRAM without passing through system RAM. It has uses outside of games right now, but not much in games, since you can't do this today. The things that are possible with it will appear as the technology is deployed.

You argued "But system ram is generally way cheaper, upgradable and available in greater quantities. So better just cache it there and leave the main memory do the caching."

The current PC DirectStorage implementation has system memory as a landing zone.

[image: Microsoft DirectStorage data path]

The current PC DirectStorage implementation has the PC's legacy "double copy" issue.

Before DirectStorage on the Windows-based PC:

[image: legacy storage-to-GPU IO path]

Meanwhile, NVIDIA's GPUDirect in HPC markets:

[image: NVIDIA GPUDirect Storage data path]

NVIDIA's RTX IO with the Ampere generation: https://developer.nvidia.com/rtx-io

[image: NVIDIA RTX IO data path]

PC's current DirectStorage implementation needs middleware evolution toward a direct NVMe-to-GPU path, i.e. this topic's DirectX 12U improvement direction.

PC's current DirectStorage 1.1 implementation is half-baked. The console's DirectStorage model is the destination.
 
You argued ....
Good point!

GPU decompression was added with DirectStorage 1.1.

I was under the impression that DirectStorage 1.0 went directly from NVMe to the GPU, but you are right: it's actually DMA access from main memory. The CPU doesn't have to intervene, and the GPU can communicate directly with the memory controller to get the data.

It's a bit the opposite of the technology described in this news.

In this news, it's the CPU that can directly access GPU memory.
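For completeness, the DirectStorage 1.1 GPU-decompression path is just an extra field on the request; a sketch (sizes are placeholders, GDeflate is the format Microsoft ships):

```cpp
#include <dstorage.h>

// Build a request whose payload the runtime decompresses on the GPU.
DSTORAGE_REQUEST MakeCompressedRequest(IDStorageFile* file, ID3D12Resource* dest,
                                       UINT32 compressedSize, UINT32 uncompressedSize)
{
    DSTORAGE_REQUEST request{};
    request.Options.SourceType        = DSTORAGE_REQUEST_SOURCE_FILE;
    request.Options.DestinationType   = DSTORAGE_REQUEST_DESTINATION_BUFFER;
    request.Options.CompressionFormat = DSTORAGE_COMPRESSION_FORMAT_GDEFLATE;
    request.Source.File.Source        = file;
    request.Source.File.Size          = compressedSize;
    request.Destination.Buffer.Resource = dest;
    request.Destination.Buffer.Size     = uncompressedSize;
    request.UncompressedSize            = uncompressedSize;
    return request;
}
```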
 
Good point!

GPU decompression was added with DirectStorage 1.1.

I was under the impression that DirectStorage 1.0 went directly from NVMe to the GPU, but you are right: it's actually DMA access from main memory. The CPU doesn't have to intervene, and the GPU can communicate directly with the memory controller to get the data.

It's a bit the opposite of the technology described in this news.

In this news, it's the CPU that can directly access GPU memory.
[image: PC's current DirectStorage 1.1 data path]

PC's current DirectStorage 1.1 implementation. From https://devblogs.microsoft.com/directx/directstorage-1-1-now-available/

There are the chipset's DMA functions from NVMe to system memory (we're not in PIO modes), and then there are the GPU's DMA functions from system memory to GPU memory.

---
For this topic: Microsoft has announced a new DirectX 12 GPU optimization feature that works in conjunction with Resizable BAR, called GPU Upload Heaps, which allows the CPU to have direct, simultaneous access to GPU memory. This can increase performance in DX12 titles and decrease system RAM utilization, since the feature circumvents the need to copy data from the CPU to the GPU.
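In code, this surfaces as a new heap type; a minimal sketch assuming Agility SDK headers that expose `D3D12_HEAP_TYPE_GPU_UPLOAD` and the `D3D12_FEATURE_D3D12_OPTIONS16` check (and a system with Resizable BAR enabled):

```cpp
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Create a buffer that lives in VRAM yet stays CPU-mappable.
ComPtr<ID3D12Resource> CreateGpuUploadBuffer(ID3D12Device* device, UINT64 size)
{
    D3D12_FEATURE_DATA_D3D12_OPTIONS16 options16{};
    device->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS16,
                                &options16, sizeof(options16));
    if (!options16.GPUUploadHeapSupported)
        return nullptr;                       // fall back to UPLOAD heap + copy

    D3D12_HEAP_PROPERTIES heapProps{};
    heapProps.Type = D3D12_HEAP_TYPE_GPU_UPLOAD;  // VRAM, CPU-visible via BAR

    D3D12_RESOURCE_DESC desc{};
    desc.Dimension        = D3D12_RESOURCE_DIMENSION_BUFFER;
    desc.Width            = size;
    desc.Height           = 1;
    desc.DepthOrArraySize = 1;
    desc.MipLevels        = 1;
    desc.SampleDesc.Count = 1;
    desc.Layout           = D3D12_TEXTURE_LAYOUT_ROW_MAJOR;

    ComPtr<ID3D12Resource> buffer;
    device->CreateCommittedResource(&heapProps, D3D12_HEAP_FLAG_NONE, &desc,
                                    D3D12_RESOURCE_STATE_GENERIC_READ, // as for UPLOAD heaps
                                    nullptr, IID_PPV_ARGS(&buffer));
    return buffer;   // Map() it and write from the CPU while the GPU reads it
}
```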

CPU ping-pong through GPU VRAM is limited by the PCIe 4.0 x16 link (32 GB/s per direction).

[image: Killzone Shadow Fall CPU/GPU memory map example]


Using PS4's Killzone Shadow Fall as an example, the shared CPU-GPU data storage is usually small, i.e. the bulk of the CPU and GPU data sets doesn't need to be known by the other node, e.g. the CPU shouldn't be interested in the GPU's framebuffer and texture processing activities.
 
Works fine in current gen consoles.


That's just DirectStorage. The textures/files are pre-compiled to run instantly, so it's literally an NVMe cache for the GPU.
It works fine on current consoles because of unified memory. On PC you cannot use VRAM for the CPU as you would RAM, and vice versa; the PCIe link latency is too high.

Using PS4's Killzone Shadow Fall as an example, the shared CPU-GPU data storage is usually small, i.e. the bulk of the CPU and GPU data sets doesn't need to be known by the other node, e.g. the CPU shouldn't be interested in the GPU's framebuffer and texture processing activities.
Yes! That's because:
- the CPU and GPU do very different things with different data, and
- sharing data between the GPU and CPU is very costly even with unified memory because of memory coherency: if the GPU and CPU want to modify the same memory range, they have to be synchronized and their caches have to be flushed, which destroys performance. The same issue occurs with atomic operations inside a multi-core GPU, and it can destroy performance even on a single chip with access to a common cache! (A toy demo of this is sketched below.)
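A toy CPU-only illustration of that coherency cost (my own sketch, not from the thread): eight threads hammering one shared atomic force the cache line to bounce between cores, while padded per-thread counters avoid the ping-pong entirely.

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

struct alignas(64) Padded { std::atomic<long> v{0}; }; // one cache line each

template <typename Fn>
static long long timed_ms(Fn&& fn)
{
    auto t0 = std::chrono::steady_clock::now();
    fn();
    return std::chrono::duration_cast<std::chrono::milliseconds>(
               std::chrono::steady_clock::now() - t0).count();
}

int main()
{
    constexpr int  kThreads = 8;
    constexpr long kIters   = 2'000'000;
    std::atomic<long> shared{0};
    Padded local[kThreads];

    auto run = [&](auto&& body) {
        std::vector<std::thread> pool;
        for (int t = 0; t < kThreads; ++t) pool.emplace_back(body, t);
        for (auto& th : pool) th.join();
    };

    long long contended = timed_ms([&] {
        run([&](int) {
            for (long i = 0; i < kIters; ++i)
                shared.fetch_add(1, std::memory_order_relaxed); // line bounces
        });
    });

    long long partitioned = timed_ms([&] {
        run([&](int t) {
            for (long i = 0; i < kIters; ++i)
                local[t].v.fetch_add(1, std::memory_order_relaxed); // stays local
        });
    });

    printf("contended: %lld ms, partitioned: %lld ms\n", contended, partitioned);
}
```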
 