
The secret of Doom (2016) performance on AMD GPUs

Aquinus

Resident Wat-man
Joined
Jan 28, 2012
Messages
13,147 (2.94/day)
Location
Concord, NH, USA
System Name Apollo
Processor Intel Core i9 9880H
Motherboard Some proprietary Apple thing.
Memory 64GB DDR4-2667
Video Card(s) AMD Radeon Pro 5600M, 8GB HBM2
Storage 1TB Apple NVMe, 4TB External
Display(s) Laptop @ 3072x1920 + 2x LG 5k Ultrafine TB3 displays
Case MacBook Pro (16", 2019)
Audio Device(s) AirPods Pro, Sennheiser HD 380s w/ FIIO Alpen 2, or Logitech 2.1 Speakers
Power Supply 96w Power Adapter
Mouse Logitech MX Master 3
Keyboard Logitech G915, GL Clicky
Software MacOS 12.1
Funny how on Windows you need a Skylake CPU or newer to use Vulkan, yet Ivy Bridge and newer support DX12, and Vulkan on Linux. We need Zen now...
You mean like how AMDGPU-Pro supports Vulkan on Linux but is limited to 3rd-gen GCN GPUs at the moment?
 
Joined
Oct 2, 2015
Messages
2,991 (0.96/day)
Location
Argentina
System Name Ciel
Processor AMD Ryzen R5 5600X
Motherboard Asus Tuf Gaming B550 Plus
Cooling ID-Cooling 224-XT Basic
Memory 2x 16GB Kingston Fury 3600MHz@3933MHz
Video Card(s) Gainward Ghost 3060 Ti 8GB + Sapphire Pulse RX 6600 8GB
Storage NVMe Kingston KC3000 2TB + NVMe Toshiba KBG40ZNT256G + HDD WD 4TB
Display(s) AOC Q27G3XMN + Samsung S22F350
Case Cougar MX410 Mesh-G
Audio Device(s) Kingston HyperX Cloud Stinger Core 7.1 Wireless PC
Power Supply Aerocool KCAS-500W
Mouse EVGA X15
Keyboard VSG Alnilam
Software Windows 11
You mean like how AMDGPU-Pro supports Vulkan on Linux but is limited to 3rd-gen GCN GPUs at the moment?

https://cgit.freedesktop.org/~agd5f/linux/?h=drm-next-4.8-wip-si
https://cgit.freedesktop.org/~agd5f/linux/?h=drm-next-4.9-si

You can test their progress on GCN 1.0, and to use the driver on GCN 1.1 hardware you only have to enable a kernel flag; both are experimental.
tl;dr: they are working on it.

And the community has its own implementation: https://www.phoronix.com/scan.php?page=news_item&px=RADV-Radeon-Vulkan-Driver
 
Joined
Jun 13, 2012
Messages
1,327 (0.31/day)
Processor i7-13700k
Motherboard Asus Tuf Gaming z790-plus
Cooling Coolermaster Hyper 212 RGB
Memory Corsair Vengeance RGB 32GB DDR5 7000mhz
Video Card(s) Asus Dual Geforce RTX 4070 Super ( 2800mhz @ 1.0volt, ~60mhz overlock -.1volts. 180-190watt draw)
Storage 1x Samsung 980 Pro PCIe4 NVme, 2x Samsung 1tb 850evo SSD, 3x WD drives, 2 seagate
Display(s) Acer Predator XB273u 27inch IPS G-Sync 165hz
Power Supply Corsair RMx Series RM850x (OCZ Z series PSU retired after 13 years of service)
Mouse Logitech G502 hero
Keyboard Logitech G710+
What's interesting is AMD made the API open for NV to use but NV refused to use it, whereas NV refuses to share their APIs. Honestly, open APIs let everyone win, no matter Red or Green.
Um, AMD claimed they would make it open source, but the date they set to release the source came and went; six months went by with no source code. AMD refused to release the source to start with, then they canned the project and turned it over to Khronos. AMD dropped the ball in that matter, not NV. Don't say NV refused to use a closed API they never had access to, because it was never made open source under AMD.
 
Joined
Oct 2, 2015
Messages
2,991 (0.96/day)
Location
Argentina
System Name Ciel
Processor AMD Ryzen R5 5600X
Motherboard Asus Tuf Gaming B550 Plus
Cooling ID-Cooling 224-XT Basic
Memory 2x 16GB Kingston Fury 3600MHz@3933MHz
Video Card(s) Gainward Ghost 3060 Ti 8GB + Sapphire Pulse RX 6600 8GB
Storage NVMe Kingston KC3000 2TB + NVMe Toshiba KBG40ZNT256G + HDD WD 4TB
Display(s) AOC Q27G3XMN + Samsung S22F350
Case Cougar MX410 Mesh-G
Audio Device(s) Kingston HyperX Cloud Stinger Core 7.1 Wireless PC
Power Supply Aerocool KCAS-500W
Mouse EVGA X15
Keyboard VSG Alnilam
Software Windows 11
Well, to be fair, Nvidia took part in the development of Vulkan; that's better than having to just take Mantle and support it.
 
Joined
Nov 5, 2004
Messages
385 (0.05/day)
Location
Belgium, Leuven
Processor I7-6700
Motherboard ASRock Z170 Pro4S
Cooling 2*120mm
Memory G.Skill D416GB 3200-14 Trident Z K2 GSK
Video Card(s) Rx480 Sapphire
Storage SSD Samsung 256GB 850 pro + bunch of TB
Case Antec
Audio Device(s) Creative Sound Blaster Z
Power Supply be quiet! 900W
Mouse Logitech G5
Keyboard Logitech G11
I am failing to find consensus here. What is the middle-ground in this discussion that people agree on? Or is it all interpretation and speculation?

I read that Mantle was dropped because DX12 was doing the same; thus I can conclude that AMD and MS were working together on a level that NV wasn't.

On the other hand, I also read that DX11 was secretly supporting some extras (async) for NV, pointing to an agreement between NV and MS.

So is MS playing both sides, or just opportunistic?


I am not trying to make any tinfoil-hat theories; I am just trying to make sense of what is being said here in one simple-to-understand, big-picture post (and how this relates to the "secret" that the OP mentions).
 
Joined
Jul 9, 2015
Messages
3,413 (1.06/day)
System Name M3401 notebook
Processor 5600H
Motherboard NA
Memory 16GB
Video Card(s) 3050
Storage 500GB SSD
Display(s) 14" OLED screen of the laptop
Software Windows 10
Benchmark Scores 3050 scores a good 15-20% lower than average, despite ASUS's claims that it has uber cooling.
Well, if it wasn't closed, when was it ever open sourced before being turned over to the Khronos Group?
You mean the 435-page programming guide isn't enough?
Yeah, 'cause there are much better ways to expose APIs, right?

Jeez. Name a single nV proprietary thing that then became a standard... pretty much anything, will ya? You know, just for some perspective.

On the other hand I also read that DX11 was secretly supporting some extras(async) for NV, pointing out an agreement between NV and MS.
Me wonders what this speculation is based on.

And how that squares with "nvidia = no async".
 
Joined
Oct 2, 2015
Messages
2,991 (0.96/day)
Location
Argentina
System Name Ciel
Processor AMD Ryzen R5 5600X
Motherboard Asus Tuf Gaming B550 Plus
Cooling ID-Cooling 224-XT Basic
Memory 2x 16GB Kingston Fury 3600MHz@3933MHz
Video Card(s) Gainward Ghost 3060 Ti 8GB + Sapphire Pulse RX 6600 8GB
Storage NVMe Kingston KC3000 2TB + NVMe Toshiba KBG40ZNT256G + HDD WD 4TB
Display(s) AOC Q27G3XMN + Samsung S22F350
Case Cougar MX410 Mesh-G
Audio Device(s) Kingston HyperX Cloud Stinger Core 7.1 Wireless PC
Power Supply Aerocool KCAS-500W
Mouse EVGA X15
Keyboard VSG Alnilam
Software Windows 11
It was said Nvidia spent serious money on reducing the CPU overhead of their DirectX 11 and OpenGL drivers. Maybe they managed to make better multi-threaded use of them, but that doesn't mean their Vulkan and DX12 implementation of async compute is the same.
 
Joined
Nov 3, 2011
Messages
690 (0.15/day)
Location
Australia
System Name Eula
Processor AMD Ryzen 9 7900X PBO
Motherboard ASUS TUF Gaming X670E Plus Wifi
Cooling Corsair H115i Elite Capellix XT
Memory Trident Z5 Neo RGB DDR5-6000 64GB (4x16GB F5-6000J3038F16GX2-TZ5NR) EXPO II, OCCT Tested
Video Card(s) Gigabyte GeForce RTX 4080 GAMING OC
Storage Corsair MP600 XT NVMe 2TB, Samsung 980 Pro NVMe 2TB and Toshiba N300 NAS 10TB HDD
Display(s) 2X LG 27UL600 27in 4K HDR FreeSync/G-Sync DP
Case Phanteks Eclipse P500A D-RGB White
Audio Device(s) Creative Sound Blaster Z
Power Supply Corsair HX1000 Platinum 1000W
Mouse SteelSeries Prime Pro Gaming Mouse
Keyboard SteelSeries Apex 5
Software MS Windows 11 Pro
vk_nv_glsl_shader is for using existing GLSL shaders(2), the ones used in OpenGL (compiled at runtime), on Vulkan, so you don't have to port them to SPIR-V (the new universal format, mostly precompiled). It's purely for easing the porting work; it even makes things run slower.
Good point on the Far Cry 2 example, but you have to remember Nvidia refused to implement DX10.1; they had to do an implementation or they would have looked slower/older than the competition.

OpenGL had vendor specific extensions for decades, not just with the recent console generation(1).
1. So what? AMD's OpenGL didn't have shader intrinsics and related features. The difference shows between AMD's OpenGL and Vulkan frame rates.

2. For NVIDIA GPUs, read https://developer.nvidia.com/reading-between-threads-shader-intrinsics This is applicable to NVidia's Vulkan, OpenGL, DX11, DX12 and NVAPI.

https://www.opengl.org/discussion_b...0-Nvidia-s-OpenGL-extensions-rival-AMD-Mantle
According to Carmack himself, Nvidia's OpenGL extensions can give similar improvements, regarding draw calls, to AMD's Mantle.
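For reference, here is roughly what the vk_nv_glsl_shader path quoted at the top of this post looks like in practice. This is just a minimal sketch, assuming a VkDevice created with the VK_NV_glsl_shader extension enabled; normally pCode holds SPIR-V words, but the extension lets you hand over the GLSL text directly:

```cpp
// Hedged sketch of the VK_NV_glsl_shader path (not a complete Vulkan app):
// feed GLSL source text to vkCreateShaderModule instead of SPIR-V words.
// Assumes `device` was created with the VK_NV_glsl_shader extension enabled.
#include <vulkan/vulkan.h>
#include <cstring>

VkShaderModule createGlslModule(VkDevice device, const char* glslSource)
{
    VkShaderModuleCreateInfo info{};
    info.sType    = VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO;
    info.codeSize = std::strlen(glslSource);                        // byte length of the GLSL text
    info.pCode    = reinterpret_cast<const uint32_t*>(glslSource);  // SPIR-V would normally go here

    VkShaderModule module = VK_NULL_HANDLE;
    vkCreateShaderModule(device, &info, nullptr, &module);          // valid only with the NV extension
    return module;
}
```

The point being: the driver ends up compiling the GLSL at module-creation time, which is exactly the runtime cost SPIR-V is meant to avoid.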

 
Joined
Nov 3, 2011
Messages
690 (0.15/day)
Location
Australia
System Name Eula
Processor AMD Ryzen 9 7900X PBO
Motherboard ASUS TUF Gaming X670E Plus Wifi
Cooling Corsair H115i Elite Capellix XT
Memory Trident Z5 Neo RGB DDR5-6000 64GB (4x16GB F5-6000J3038F16GX2-TZ5NR) EXPO II, OCCT Tested
Video Card(s) Gigabyte GeForce RTX 4080 GAMING OC
Storage Corsair MP600 XT NVMe 2TB, Samsung 980 Pro NVMe 2TB and Toshiba N300 NAS 10TB HDD
Display(s) 2X LG 27UL600 27in 4K HDR FreeSync/G-Sync DP
Case Phanteks Eclipse P500A D-RGB White
Audio Device(s) Creative Sound Blaster Z
Power Supply Corsair HX1000 Platinum 1000W
Mouse SteelSeries Prime Pro Gaming Mouse
Keyboard SteelSeries Apex 5
Software MS Windows 11 Pro
Um, AMD claimed they would make it open source, but the date they set to release the source came and went; six months went by with no source code. AMD refused to release the source to start with, then they canned the project(1) and turned it over to Khronos(2). AMD dropped the ball in that matter, not NV. Don't say NV refused to use a closed API they never had access to, because it was never made open source under AMD.
1. The Mantle API is still listed as a working API in my latest Radeon driver. The Mantle API wasn't completed, and AMD wants to avoid slow/stonewall/filibuster API politics. NVidia has their own OpenGL vendor extensions and NVAPI competing against AMD's Mantle. NVAPI has existed for a long time, i.e. since before Mantle and DX11.

2. What's important is the end result.
 
Joined
Oct 2, 2015
Messages
2,991 (0.96/day)
Location
Argentina
System Name Ciel
Processor AMD Ryzen R5 5600X
Motherboard Asus Tuf Gaming B550 Plus
Cooling ID-Cooling 224-XT Basic
Memory 2x 16GB Kingston Fury 3600MHz@3933MHz
Video Card(s) Gainward Ghost 3060 Ti 8GB + Sapphire Pulse RX 6600 8GB
Storage NVMe Kingston KC3000 2TB + NVMe Toshiba KBG40ZNT256G + HDD WD 4TB
Display(s) AOC Q27G3XMN + Samsung S22F350
Case Cougar MX410 Mesh-G
Audio Device(s) Kingston HyperX Cloud Stinger Core 7.1 Wireless PC
Power Supply Aerocool KCAS-500W
Mouse EVGA X15
Keyboard VSG Alnilam
Software Windows 11
What Carmack refers to is Nvidia's own NV_command_list OpenGL extension; it brings OpenGL overhead down to almost Vulkan levels: https://www.opengl.org/registry/specs/NV/command_list.txt
It works as intended on Kepler and newer, and is poorly implemented on Fermi.

Just for a quick example, AMD has GCN_shader in OpenGL, among others like pinned memory. I'm currently on Fedora, so I can give you only what the open driver offers, but it has enough examples:

GL_AMD_conservative_depth, GL_AMD_draw_buffers_blend,
GL_AMD_performance_monitor, GL_AMD_pinned_memory,
GL_AMD_seamless_cubemap_per_texture, GL_AMD_shader_stencil_export,
GL_AMD_shader_trinary_minmax, GL_AMD_vertex_shader_layer,
GL_AMD_vertex_shader_viewport_index

It even supports other vendor's extensions:

GL_NVX_gpu_memory_info,
GL_NV_conditional_render, GL_NV_depth_clamp, GL_NV_packed_depth_stencil,
GL_NV_texture_barrier, GL_NV_vdpau_interop

In Windows this list is more extensive.

Nvidia isn't the only one with their own optimizations, and not all of them are useful in gaming scenarios.
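If anyone wants to see what their own driver exposes at runtime rather than trusting a copy-pasted list, here's a small C++ sketch; it assumes an already-current OpenGL 3.0+ context and a loader such as GLEW, which are not shown:

```cpp
// Minimal sketch: enumerate the driver's extension strings and check whether a
// given vendor extension (e.g. GL_AMD_pinned_memory or GL_NV_command_list) is exposed.
// Assumes an OpenGL 3.0+ context is current and functions are loaded (e.g. via GLEW).
#include <GL/glew.h>
#include <cstring>

bool hasExtension(const char* name)
{
    GLint count = 0;
    glGetIntegerv(GL_NUM_EXTENSIONS, &count);
    for (GLint i = 0; i < count; ++i) {
        const char* ext = reinterpret_cast<const char*>(glGetStringi(GL_EXTENSIONS, i));
        if (ext && std::strcmp(ext, name) == 0)
            return true;
    }
    return false;
}

// Usage, once a context is current:
//   hasExtension("GL_AMD_pinned_memory");
//   hasExtension("GL_NV_command_list");
```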
 

Aquinus

Resident Wat-man
Joined
Jan 28, 2012
Messages
13,147 (2.94/day)
Location
Concord, NH, USA
System Name Apollo
Processor Intel Core i9 9880H
Motherboard Some proprietary Apple thing.
Memory 64GB DDR4-2667
Video Card(s) AMD Radeon Pro 5600M, 8GB HBM2
Storage 1TB Apple NVMe, 4TB External
Display(s) Laptop @ 3072x1920 + 2x LG 5k Ultrafine TB3 displays
Case MacBook Pro (16", 2019)
Audio Device(s) AirPods Pro, Sennheiser HD 380s w/ FIIO Alpen 2, or Logitech 2.1 Speakers
Power Supply 96w Power Adapter
Mouse Logitech MX Master 3
Keyboard Logitech G915, GL Clicky
Software MacOS 12.1
According to Carmack himself, Nvidia's OpenGL extensions can give similar improvements, regarding draw calls, to AMD's Mantle.
That's probably because nVidia does something similar at the driver level. I speculated before that nVidia's implementation might keep track of the OpenGL calls and might use a queue to buffer things like draw calls, which are processed independently and joined back up when an OpenGL call comes through that requires the draw calls to all be completed.

In fact, for system integration at work I recently implemented a library (unpolished) that takes in a stream of data with associated data dependencies. It blocks subsequent data from proceeding if something is using a resource that must be handled serially, but allows the rest to continue to be processed in order to improve parallel throughput. I refer to this as "queue re-ordering based on data dependencies," and I wouldn't be surprised if nVidia did something similar at the driver level in order to give the appearance that certain calls are executed quickly, when in reality they very well might be happening asynchronously in another thread or process after being put on a queue in memory.

Simply put, you don't need async compute to use a queue to accelerate certain kinds of driver workloads (and in all seriousness, many other kinds of workloads as well). The nice thing about queues is that they decouple the what from the when: instead of waiting for a draw call to complete, it returns immediately with the understanding that the draw was queued up and will eventually be executed before another OpenGL call is made that requires it to be complete. It's possible that certain OpenGL calls might tell nVidia's driver "look, you need to finish processing everything in the queue (or maybe even just the stuff that call cares about) before continuing."
What Carmack refers to is Nvidia's own NV_command_list OpenGL extension; it brings OpenGL overhead down to almost Vulkan levels: https://www.opengl.org/registry/specs/NV/command_list.txt
It works as intended on Kepler and newer, and is poorly implemented on Fermi.

Just for a quick example, AMD has GCN_shader in OpenGL, among others like pinned memory. I'm currently on Fedora, so I can give you only what the open driver offers, but it has enough examples:

GL_AMD_conservative_depth, GL_AMD_draw_buffers_blend,
GL_AMD_performance_monitor, GL_AMD_pinned_memory,
GL_AMD_seamless_cubemap_per_texture, GL_AMD_shader_stencil_export,
GL_AMD_shader_trinary_minmax, GL_AMD_vertex_shader_layer,
GL_AMD_vertex_shader_viewport_index

It even supports other vendor's extensions:

GL_NVX_gpu_memory_info,
GL_NV_conditional_render, GL_NV_depth_clamp, GL_NV_packed_depth_stencil,
GL_NV_texture_barrier, GL_NV_vdpau_interop

In Windows this list is more extensive.

Nvidia isn't the only one with their own optimizations, and not all of them are useful in gaming scenarios.
...and that doesn't even touch on the implementation of the calls that are in the OpenGL (insert version you care about here) spec. Just because you need to implement a spec that has functions x, y, and z doesn't mean that the implementations themselves are the same or even remotely similar.

tl;dr: I wouldn't be surprised if some of the things Vulkan is spec'ed out to do are merely done implicitly by nVidia's OpenGL drivers already, under the hood. Sometimes it's faster to put something on a queue and do it later than to do it on the spot, so long as the cost of managing the queue doesn't outstrip the benefit.
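To make that concrete, here's a toy sketch of the kind of queue I'm describing. It is not nVidia's driver code, just an illustration of decoupling the "what" from the "when": submit() returns immediately, a worker thread drains the queue, and flush() models a call that can't return until everything queued before it has finished.

```cpp
// Toy sketch of a command queue that decouples submission from execution.
// Not driver code; just illustrates the "queue now, execute later" idea.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

class CommandQueue {
public:
    CommandQueue() : worker_([this] { run(); }) {}
    ~CommandQueue() {
        { std::lock_guard<std::mutex> lock(m_); done_ = true; }
        cv_.notify_all();
        worker_.join();
    }

    // e.g. a buffered draw call: returns immediately.
    void submit(std::function<void()> cmd) {
        { std::lock_guard<std::mutex> lock(m_); q_.push(std::move(cmd)); }
        cv_.notify_all();
    }

    // e.g. a call that needs all prior draws finished: blocks until the queue drains.
    void flush() {
        std::unique_lock<std::mutex> lock(m_);
        idle_.wait(lock, [this] { return q_.empty() && !busy_; });
    }

private:
    void run() {
        std::unique_lock<std::mutex> lock(m_);
        for (;;) {
            cv_.wait(lock, [this] { return done_ || !q_.empty(); });
            if (done_ && q_.empty()) return;
            auto cmd = std::move(q_.front());
            q_.pop();
            busy_ = true;
            lock.unlock();
            cmd();              // executed asynchronously, off the caller's thread
            lock.lock();
            busy_ = false;
            if (q_.empty()) idle_.notify_all();
        }
    }

    std::mutex m_;
    std::condition_variable cv_, idle_;
    std::queue<std::function<void()>> q_;
    bool done_ = false;
    bool busy_ = false;
    std::thread worker_;  // declared last so the other members are ready before it starts
};
```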
 
Joined
Nov 3, 2011
Messages
690 (0.15/day)
Location
Australia
System Name Eula
Processor AMD Ryzen 9 7900X PBO
Motherboard ASUS TUF Gaming X670E Plus Wifi
Cooling Corsair H115i Elite Capellix XT
Memory Trident Z5 Neo RGB DDR5-6000 64GB (4x16GB F5-6000J3038F16GX2-TZ5NR) EXPO II, OCCT Tested
Video Card(s) Gigabyte GeForce RTX 4080 GAMING OC
Storage Corsair MP600 XT NVMe 2TB, Samsung 980 Pro NVMe 2TB and Toshiba N300 NAS 10TB HDD
Display(s) 2X LG 27UL600 27in 4K HDR FreeSync/G-Sync DP
Case Phanteks Eclipse P500A D-RGB White
Audio Device(s) Creative Sound Blaster Z
Power Supply Corsair HX1000 Platinum 1000W
Mouse SteelSeries Prime Pro Gaming Mouse
Keyboard SteelSeries Apex 5
Software MS Windows 11 Pro
That's probably because nVidia does something similar at the driver level. I speculated before that nVidia's implementation might keep track of the OpenGL calls and might use a queue to buffer things like draw calls, which are processed independently and joined back up when an OpenGL call comes through that requires the draw calls to all be completed.

In fact, for system integration at work I recently implemented a library (unpolished) that takes in a stream of data with associated data dependencies. It blocks subsequent data from proceeding if something is using a resource that must be handled serially, but allows the rest to continue to be processed in order to improve parallel throughput. I refer to this as "queue re-ordering based on data dependencies," and I wouldn't be surprised if nVidia did something similar at the driver level in order to give the appearance that certain calls are executed quickly, when in reality they very well might be happening asynchronously in another thread or process after being put on a queue in memory.

Simply put, you don't need async compute to use a queue to accelerate certain kinds of driver workloads (and in all seriousness, many other kinds of workloads as well). The nice thing about queues is that they decouple the what from the when: instead of waiting for a draw call to complete, it returns immediately with the understanding that the draw was queued up and will eventually be executed before another OpenGL call is made that requires it to be complete. It's possible that certain OpenGL calls might tell nVidia's driver "look, you need to finish processing everything in the queue (or maybe even just the stuff that call cares about) before continuing."
From https://developer.nvidia.com/dx12-dos-and-donts

"On DX11, the driver does farm off asynchronous tasks to driver worker threads where possible".



What Carmack refers to is Nvidia's own NV_command_list OpenGL extension; it brings OpenGL overhead down to almost Vulkan levels: https://www.opengl.org/registry/specs/NV/command_list.txt
It works as intended on Kepler and newer, and is poorly implemented on Fermi.

Just for a quick example, AMD has GCN_shader in OpenGL, among others like pinned memory. I'm currently on Fedora, so I can give you only what the open driver offers, but it has enough examples:

GL_AMD_conservative_depth, GL_AMD_draw_buffers_blend,
GL_AMD_performance_monitor, GL_AMD_pinned_memory,
GL_AMD_seamless_cubemap_per_texture, GL_AMD_shader_stencil_export,
GL_AMD_shader_trinary_minmax, GL_AMD_vertex_shader_layer,
GL_AMD_vertex_shader_viewport_index

It even supports other vendor's extensions:

GL_NVX_gpu_memory_info,
GL_NV_conditional_render, GL_NV_depth_clamp, GL_NV_packed_depth_stencil,
GL_NV_texture_barrier, GL_NV_vdpau_interop

In Windows this list is more extensive.

Nvidia isn't the only one with their own optimizations, and not all of them are useful in gaming scenarios.
Nearly useless post, since it doesn't specifically address the performance difference between AMD's Vulkan and OpenGL frame-rate results. Furthermore, the majority of PC games are written with Direct3D, not OpenGL.


AMD has recently enabled GCN's Shader Intrinsic Functions with Vulkan, DirectX 11 and DirectX 12, while NVIDIA has had Shader Intrinsic Functions with Direct3D via NVAPI for a long time.

https://developer.nvidia.com/unlocking-gpu-intrinsics-hlsl

"None of the intrinsics are possible in standard DirectX or OpenGL. But they have been supported and well-documented in CUDA for years. A mechanism to support them in DirectX has been available for a while but not widely documented. I happen to have an old NVAPI version 343 on my system from October 2014 and the intrinsics are supported in DirectX by that version and probably earlier versions. This blog explains the mechanism for using them in DirectX.

Unlike OpenGL or Vulkan, DirectX unfortunately doesn't have a native mechanism for vendor-specific extensions. But there is still a way to make all this functionality available in DirectX 11 or 12 through custom intrinsics. That mechanism is implemented in our graphics driver and accessible through the NVAPI library."
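As I read it, the mechanism that blog post describes boils down to telling the driver, through NVAPI, which UAV slot the HLSL side will use as the extension hook. A rough host-side sketch, assuming the NVAPI SDK is available (the HLSL side, slot choice and error handling are all omitted):

```cpp
// Rough sketch of enabling NVIDIA's HLSL extension mechanism on D3D11 via NVAPI.
// Assumes the NVAPI SDK (nvapi.h) and an already-created ID3D11Device.
#include <d3d11.h>
#include <nvapi.h>

bool enableNvIntrinsics(ID3D11Device* device, NvU32 fakeUavSlot)
{
    if (NvAPI_Initialize() != NVAPI_OK)
        return false;
    // Tell the driver which UAV slot the shaders will treat as the extension hook;
    // the HLSL side must reference the same slot.
    return NvAPI_D3D11_SetNvShaderExtnSlot(device, fakeUavSlot) == NVAPI_OK;
}
```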

 

Aquinus

Resident Wat-man
Joined
Jan 28, 2012
Messages
13,147 (2.94/day)
Location
Concord, NH, USA
System Name Apollo
Processor Intel Core i9 9880H
Motherboard Some proprietary Apple thing.
Memory 64GB DDR4-2667
Video Card(s) AMD Radeon Pro 5600M, 8GB HBM2
Storage 1TB Apple NVMe, 4TB External
Display(s) Laptop @ 3072x1920 + 2x LG 5k Ultrafine TB3 displays
Case MacBook Pro (16", 2019)
Audio Device(s) AirPods Pro, Sennheiser HD 380s w/ FIIO Alpen 2, or Logitech 2.1 Speakers
Power Supply 96w Power Adapter
Mouse Logitech MX Master 3
Keyboard Logitech G915, GL Clicky
Software MacOS 12.1
From https://developer.nvidia.com/dx12-dos-and-donts

"On DX11, the driver does farm off asynchronous tasks to driver worker threads where possible".

What's more interesting from that link you provided:
Don’ts
  • Don’t rely on the driver to parallelize any Direct3D12 works in driver threads
    • On DX11 the driver does farm off asynchronous tasks to driver worker threads where possible – this doesn’t happen anymore under DX12
    • While the total cost of work submission in DX12 has been reduced, the amount of work measured on the application’s thread may be larger due to the loss of driver threading. The more efficiently one can use parallel hardware cores of the CPU to submit work in parallel, the more benefit in terms of draw call submission performance can be expected.

After reading that, it makes me think that under DX12 it very well might be too difficult for the driver to accelerate, in the same way, what would normally behave like a serial workload but gets delegated to the driver as I surmised (even for DX11, which isn't entirely surprising). I see this as a move to put more power in the hands of game developers and less in the hands of driver developers should extra performance be demanded. It takes driver developers off the hook for making up for poor engine implementations, which nVidia has done exceptionally well in my opinion. In general, I would call this a good thing, but it very well might mean we're not going to see the same kind of driver advantage nVidia has had over AMD going forward. I think that will depend highly on how game devs implement and utilize their engines, and it will be less about what kind of optimizations driver devs can make.
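For anyone wondering what "using parallel hardware cores of the CPU to submit work" looks like from the application side, here's a rough DX12 sketch (device and queue assumed to exist, actual draw recording and error handling omitted): each thread records into its own command list from its own allocator, and submission stays on one thread.

```cpp
// Rough sketch: record D3D12 command lists on several threads, submit once.
// Assumes an existing ID3D12Device* and ID3D12CommandQueue*; draws omitted.
#include <d3d12.h>
#include <wrl/client.h>
#include <thread>
#include <vector>

using Microsoft::WRL::ComPtr;

void recordAndSubmit(ID3D12Device* device, ID3D12CommandQueue* queue, unsigned workers)
{
    std::vector<ComPtr<ID3D12CommandAllocator>>    allocs(workers);
    std::vector<ComPtr<ID3D12GraphicsCommandList>> lists(workers);
    std::vector<std::thread> threads;

    for (unsigned i = 0; i < workers; ++i) {
        device->CreateCommandAllocator(D3D12_COMMAND_LIST_TYPE_DIRECT, IID_PPV_ARGS(&allocs[i]));
        device->CreateCommandList(0, D3D12_COMMAND_LIST_TYPE_DIRECT,
                                  allocs[i].Get(), nullptr, IID_PPV_ARGS(&lists[i]));
        threads.emplace_back([&lists, i] {
            // ... record this thread's share of draw calls into lists[i] ...
            lists[i]->Close();   // recording is now per-thread application work
        });
    }
    for (auto& t : threads) t.join();

    std::vector<ID3D12CommandList*> raw;
    for (auto& l : lists) raw.push_back(l.Get());
    queue->ExecuteCommandLists(static_cast<UINT>(raw.size()), raw.data());  // single submission point
}
```

The driver no longer does this fan-out for you under DX12, which is exactly the "loss of driver threading" the Don'ts list above is warning about.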
 