The secret of Doom (2016) performance on AMD GPUs

Funny how in Windows you need a Skylake CPU or newer to use Vulkan, but Ivy Bridge and newer support DX12, and Vulkan on Linux. We need Zen now...
You mean like how AMDGPU-Pro supports Vulkan in Linux but is limited to 3rd gen GCN GPUs at the moment?
 
You mean like how AMDGPU-Pro supports Vulkan in Linux but is limited to 3rd gen GCN GPUs at the moment?

https://cgit.freedesktop.org/~agd5f/linux/?h=drm-next-4.8-wip-si
https://cgit.freedesktop.org/~agd5f/linux/?h=drm-next-4.9-si

You can test their progress on GCN 1.0, and to use the driver on GCN 1.1 hardware you only have to enable a kernel flag; both are experimental.
tl;dr: they are working on it.
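For reference, in the mainline driver that switch is done with module parameters on the kernel command line. This is only a hedged example, since the exact names and defaults depend on the kernel build (the experimental branches above may differ):

amdgpu.si_support=1 radeon.si_support=0    # GCN 1.0 (Southern Islands): hand the card to amdgpu instead of radeon
amdgpu.cik_support=1 radeon.cik_support=0  # GCN 1.1 (Sea Islands): the flag referred to above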

And, the community has its own implementation: https://www.phoronix.com/scan.php?page=news_item&px=RADV-Radeon-Vulkan-Driver
 
What's interesting is that AMD made the API open for NV to use but NV refused to use it, whereas NV refuses to share their APIs. Honestly, open APIs allow everyone to win, no matter Red or Green.
Um, AMD claimed they would make it open source, but the date they set to release the source came and went; six months went by and no source code. AMD refused to release the source to start with, then they canned the project and turned it over to Khronos. AMD dropped the ball, not NV, in that matter. Don't say NV refused to use a closed API they never had access to, because it was never made open source under AMD.
 
Well, to be fair, Nvidia took part in the development of Vulkan; that's better than having to just take Mantle and support it.
 
I am failing to find consensus here. What is the middle-ground in this discussion that people agree on? Or is it all interpretation and speculation?

I read that Mantle was dropped because DX12 was doing the same; thus I can conclude that AMD and MS were working together on a level that NV wasn't.

On the other hand I also read that DX11 was secretly supporting some extras (async) for NV, pointing to an agreement between NV and MS.

So is MS playing both sides, or just opportunistic?


I am not trying to make any tinfoil-hat theories; I am just trying to make sense of what is being said here, in one simple-to-understand big-picture post (and how this relates to the "secret" that the OP mentions).
 
Well, if it wasn't closed, when was it ever open sourced before being turned over to the Khronos Group?
You mean the 435 page programming guide isn't enough?
Yeah, cause there are much better ways to expose APIs, right?

Jeez. Name a single nV proprietary technology that then became a standard... pretty much anything, will ya? You know, just for a perspective on things.

On the other hand I also read that DX11 was secretly supporting some extras (async) for NV, pointing to an agreement between NV and MS.
Me wonders what this speculation is based on.

And how that squares with "nvidia = no async".
 
It was said Nvidia spent serious money on reducing the CPU overhead of their DirectX 11 and OpenGL drivers. Maybe they managed to make better multi-threaded use of them, but that doesn't mean the Vulkan and DX12 implementation of async compute is the same.
 
VK_NV_glsl_shader is for using existing GLSL shaders(2), the ones used in OpenGL (they are compiled at runtime), on Vulkan, so you don't have to port them to SPIR-V (the new universal format, mostly precompiled). It's purely for easing the porting work; it even makes things run slower.
Good point on the Far Cry 2 example, but you have to remember Nvidia refused to implement DX10.1; they had to do an implementation or they would have looked slower/older than the competition.

OpenGL has had vendor-specific extensions for decades, not just with the recent console generation(1).
1. So what? AMD's OpenGL driver didn't have shader intrinsics and similar features. The difference shows between AMD's OpenGL and Vulkan frame rates.

2. For NVIDIA GPUs, read https://developer.nvidia.com/reading-between-threads-shader-intrinsics This is applicable to NVidia's Vulkan, OpenGL, DX11, DX12 and NVAPI.

https://www.opengl.org/discussion_b...0-Nvidia-s-OpenGL-extensions-rival-AMD-Mantle
According to Carmack himself, Nvidia’s OpenGL extensions can give similar improvements – regarding draw calls – to AMD’s Mantle.

 
Um, AMD claimed they would make it open source, but the date they set to release the source came and went; six months went by and no source code. AMD refused to release the source to start with(1), then they canned the project(1) and turned it over to Khronos(2). AMD dropped the ball, not NV, in that matter. Don't say NV refused to use a closed API they never had access to, because it was never made open source under AMD.
1. The Mantle API is still listed as a working API in my latest Radeon driver. The Mantle API wasn't completed, and AMD wants to avoid slow, stonewalling, filibuster-style API politics. NVidia has their own OpenGL vendor extensions and NVAPI competing against AMD's Mantle. NVAPI has existed for a long time, i.e. since before Mantle and DX11.

2. What's important is the end result.
 
What Carmack refers to is Nvidia's own NV_command_list OpenGL extension; it brings OpenGL's overhead down to almost Vulkan levels: https://www.opengl.org/registry/specs/NV/command_list.txt
It works as intended on Kepler and newer, and is badly implemented on Fermi.

Just for a quick example, AMD has GCN_shader in OpenGL, among others like pinned memory. I'm currently on Fedora, so I can give you only what the open driver offers, but it has enough examples:

GL_AMD_conservative_depth, GL_AMD_draw_buffers_blend,
GL_AMD_performance_monitor, GL_AMD_pinned_memory,
GL_AMD_seamless_cubemap_per_texture, GL_AMD_shader_stencil_export,
GL_AMD_shader_trinary_minmax, GL_AMD_vertex_shader_layer,
GL_AMD_vertex_shader_viewport_index

It even supports other vendor's extensions:

GL_NVX_gpu_memory_info,
GL_NV_conditional_render, GL_NV_depth_clamp, GL_NV_packed_depth_stencil,
GL_NV_texture_barrier, GL_NV_vdpau_interop

In Windows this list is more extensive.

Nvidia isn't the only one with their own optimizations, and not all of them are useful in gaming scenarios.
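If you want to see what your own driver exposes, the list can be dumped with a few GL calls (glxinfo prints the same information). A minimal sketch, assuming GLFW and glad are available for context creation and function loading:

// Enumerate the extension strings the current OpenGL driver reports and
// print the vendor-specific ones. Assumes GLFW + glad; any loader that
// provides glGetStringi on a >= 3.0 context works the same way.
#include <glad/glad.h>
#include <GLFW/glfw3.h>
#include <cstdio>
#include <cstring>

int main() {
    if (!glfwInit()) return 1;
    glfwWindowHint(GLFW_VISIBLE, GLFW_FALSE);            // invisible dummy window, we only need a context
    GLFWwindow* win = glfwCreateWindow(64, 64, "ext-dump", nullptr, nullptr);
    if (!win) return 1;
    glfwMakeContextCurrent(win);
    if (!gladLoadGLLoader((GLADloadproc)glfwGetProcAddress)) return 1;

    GLint count = 0;
    glGetIntegerv(GL_NUM_EXTENSIONS, &count);
    for (GLint i = 0; i < count; ++i) {
        const char* name = reinterpret_cast<const char*>(glGetStringi(GL_EXTENSIONS, static_cast<GLuint>(i)));
        // Vendor-specific extensions carry the vendor's prefix.
        if (!std::strncmp(name, "GL_AMD_", 7) || !std::strncmp(name, "GL_NV_", 6) || !std::strncmp(name, "GL_NVX_", 7))
            std::printf("%s\n", name);
    }
    glfwTerminate();
    return 0;
}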
 
According to Carmack himself, Nvidia’s OpenGL extensions can give similar improvements – regarding draw calls – to AMD’s Mantle.
That's probably because nVidia does something similar at the driver level. I speculated before that nVidia's implementation might keep track of the OpenGL calls and might use a queue to buffer things like draw calls, which are processed independently and joined up on when an OpenGL call comes through that requires all the draw calls to be completed.

In fact, for system integration at work I recently implemented a library (unpolished) that takes in a stream of data with associated data dependencies. It blocks subsequent data from proceeding if something is using a resource that must be handled serially, but allows others to continue to be processed in order to improve parallel throughput. I refer to this as "queue re-ordering based on data dependencies," and I wouldn't be surprised if nVidia did something similar at the driver level in order to give the appearance that certain calls are executed quickly, when in reality they very well might be happening asynchronously in another thread or process after being put on a queue in memory.

Simply put, you don't need async compute to use a queue to accelerate certain kinds of driver workloads (and in all seriousness, many other kinds of workloads as well). The nice thing about queues is that they decouple the what from the when: instead of waiting for a draw call to complete, it returns immediately with the understanding that the draw was queued up and will eventually be executed before another OpenGL call is made that requires it to be complete. It's possible that certain OpenGL calls might tell nVidia's driver "look, you need to finish processing everything in the queue (or maybe even just the stuff that call cares about) before continuing."
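To make that concrete, here is a minimal C++ sketch of the pattern being described (my own illustration, not NVIDIA's driver code): enqueue() returns immediately, a worker thread executes the commands later, and finish() plays the role of a call that cannot return until the backlog has drained.

// A command queue that decouples the "what" from the "when": submitting is
// cheap and asynchronous, and only finish() forces the queued work to complete.
#include <condition_variable>
#include <cstdio>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

class CommandQueue {
public:
    CommandQueue() : worker_(&CommandQueue::run, this) {}
    ~CommandQueue() {
        { std::lock_guard<std::mutex> lock(m_); done_ = true; }
        cv_.notify_all();
        worker_.join();
    }
    void enqueue(std::function<void()> cmd) {          // returns immediately, like a buffered draw call
        { std::lock_guard<std::mutex> lock(m_); q_.push(std::move(cmd)); }
        cv_.notify_all();
    }
    void finish() {                                    // blocks until everything queued so far has executed
        std::unique_lock<std::mutex> lock(m_);
        idle_.wait(lock, [this] { return q_.empty() && !busy_; });
    }
private:
    void run() {
        std::unique_lock<std::mutex> lock(m_);
        for (;;) {
            cv_.wait(lock, [this] { return done_ || !q_.empty(); });
            if (done_ && q_.empty()) return;
            std::function<void()> cmd = std::move(q_.front());
            q_.pop();
            busy_ = true;
            lock.unlock();
            cmd();                                     // execute the deferred work outside the lock
            lock.lock();
            busy_ = false;
            if (q_.empty()) idle_.notify_all();
        }
    }
    std::queue<std::function<void()>> q_;
    std::mutex m_;
    std::condition_variable cv_, idle_;
    bool done_ = false, busy_ = false;
    std::thread worker_;
};

int main() {
    CommandQueue driver;
    for (int i = 0; i < 8; ++i)
        driver.enqueue([i] { std::printf("draw %d actually executed later\n", i); });
    driver.finish();                                   // analogous to a call that needs all prior draws done
    return 0;
}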
What Carmack refers to is Nvidia's own NV_command_list OpenGL extension; it brings OpenGL's overhead down to almost Vulkan levels: https://www.opengl.org/registry/specs/NV/command_list.txt
[...]
Nvidia isn't the only one with their own optimizations, and not all of them are useful in gaming scenarios.
...and that doesn't even touch on the implementation of the calls that are in the OpenGL (insert version you care about here) spec. Just because you need to implement a spec that has functions x, y, and z doesn't mean that the implementations themselves are the same or even remotely similar.

tl;dr: I wouldn't be surprised if some of the things Vulkan is spec'ed out to do are merely done implicitly by nVidia's OpenGL drivers already, under the hood. Sometimes it's faster to put something on a queue and do it later than to do it on the spot, so long as managing the queue doesn't outstrip the benefit.
 
That's probably because nVidia does something similar at the driver level. I speculated before that nVidia's implementation might keep track of the OpenGL calls and might use a queue to buffer things like draw calls, which are processed independently and joined up on when an OpenGL call comes through that requires all the draw calls to be completed.
[...]
From https://developer.nvidia.com/dx12-dos-and-donts

"On DX11, the driver does farm off asynchronous tasks to driver worker threads where possible".



What Carmack refers to is Nvidia's own NV_command_list OpenGL extension; it brings OpenGL's overhead down to almost Vulkan levels: https://www.opengl.org/registry/specs/NV/command_list.txt
[...]
Nvidia isn't the only one with their own optimizations, and not all of them are useful in gaming scenarios.
Nearly useless post, since it doesn't specifically address the performance difference between AMD's Vulkan and OpenGL frame rate results. Furthermore, the majority of PC games are written with Direct3D, not OpenGL, APIs.


AMD has recently enabled GCN's Shader Intrinsic Functions with Vulkan, DirectX 11 and DirectX 12, while NVIDIA has had Shader Intrinsic Functions with Direct3D and NVAPI for a long time.

https://developer.nvidia.com/unlocking-gpu-intrinsics-hlsl

"None of the intrinsics are possible in standard DirectX or OpenGL. But they have been supported and well-documented in CUDA for years. A mechanism to support them in DirectX has been available for a while but not widely documented. I happen to have an old NVAPI version 343 on my system from October 2014 and the intrinsics are supported in DirectX by that version and probably earlier versions. This blog explains the mechanism for using them in DirectX.

Unlike OpenGL or Vulkan, DirectX unfortunately doesn't have a native mechanism for vendor-specific extensions. But there is still a way to make all this functionality available in DirectX 11 or 12 through custom intrinsics. That mechanism is implemented in our graphics driver and accessible through the NVAPI library."
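To make that mechanism a bit more concrete, here is a rough host-side sketch for the DX11 path, assuming the NVAPI SDK headers and libraries are available; the HLSL side (which includes nvHLSLExtns.h and reserves the same fake UAV slot) is omitted, and the slot number is an arbitrary choice for illustration.

// Host-side setup for NVAPI custom shader intrinsics on D3D11.
// Link against d3d11.lib and the NVAPI static library; only works on NVIDIA drivers.
#include <d3d11.h>
#include <nvapi.h>
#include <cstdio>

int main() {
    ID3D11Device*        device  = nullptr;
    ID3D11DeviceContext* context = nullptr;
    if (FAILED(D3D11CreateDevice(nullptr, D3D_DRIVER_TYPE_HARDWARE, nullptr, 0,
                                 nullptr, 0, D3D11_SDK_VERSION,
                                 &device, nullptr, &context)))
        return 1;

    if (NvAPI_Initialize() != NVAPI_OK) return 1;       // NVAPI is only available with NVIDIA drivers

    const NvU32 extensionSlot = 7;                      // arbitrary UAV slot; must match what the HLSL declares
    if (NvAPI_D3D11_SetNvShaderExtnSlot(device, extensionSlot) == NVAPI_OK)
        std::printf("intrinsics slot enabled; shaders created from here on may use them\n");

    // ... compile/create the shaders that use the intrinsics here ...

    NvAPI_Unload();
    context->Release();
    device->Release();
    return 0;
}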

 
From https://developer.nvidia.com/dx12-dos-and-donts

"On DX11, the driver does farm off asynchronous tasks to driver worker threads where possible".

What's more interesting from that link you provided:
Don’ts
  • Don’t rely on the driver to parallelize any Direct3D12 works in driver threads
    • On DX11 the driver does farm off asynchronous tasks to driver worker threads where possible – this doesn’t happen anymore under DX12
    • While the total cost of work submission in DX12 has been reduced, the amount of work measured on the application’s thread may be larger due to the loss of driver threading. The more efficiently one can use parallel hardware cores of the CPU to submit work in parallel, the more benefit in terms of draw call submission performance can be expected.

After reading that, it makes me think that DX12 very well might be too difficult to handle in the same way, i.e. accelerating what would normally behave like a serial workload by delegating it to the driver as I surmised (which the quote confirms did happen under DX11, not entirely surprisingly). I see this as a move to put more power in the hands of game developers and less in the hands of driver developers should extra performance be demanded. It takes driver developers off the hook for making up for poor engine implementations, which nVidia has done exceptionally well in my opinion. In general I would call this a good thing, but it very well might mean we're not going to see the same kind of driver advantage nVidia has had over AMD going forward; that will depend highly on how game devs implement and utilize their engines, and less on what kind of optimizations driver devs can make.
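For what that shift looks like in code, here is a minimal sketch of app-side parallel recording under D3D12 (my own illustration under the usual Windows/D3D12 assumptions; the command lists record nothing here, while a real renderer would issue its draws on each thread and wait on a fence before tearing anything down):

// One allocator/command list per thread, recorded in parallel by the app,
// then submitted with a single ExecuteCommandLists call. Link d3d12.lib.
#include <windows.h>
#include <d3d12.h>
#include <wrl/client.h>
#include <thread>
#include <vector>

using Microsoft::WRL::ComPtr;

int main() {
    ComPtr<ID3D12Device> device;
    if (FAILED(D3D12CreateDevice(nullptr, D3D_FEATURE_LEVEL_11_0, IID_PPV_ARGS(&device))))
        return 1;

    D3D12_COMMAND_QUEUE_DESC queueDesc = {};
    queueDesc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;
    ComPtr<ID3D12CommandQueue> queue;
    device->CreateCommandQueue(&queueDesc, IID_PPV_ARGS(&queue));

    const int threadCount = 4;
    std::vector<ComPtr<ID3D12CommandAllocator>>    allocators(threadCount);
    std::vector<ComPtr<ID3D12GraphicsCommandList>> lists(threadCount);
    for (int i = 0; i < threadCount; ++i) {
        device->CreateCommandAllocator(D3D12_COMMAND_LIST_TYPE_DIRECT, IID_PPV_ARGS(&allocators[i]));
        device->CreateCommandList(0, D3D12_COMMAND_LIST_TYPE_DIRECT,
                                  allocators[i].Get(), nullptr, IID_PPV_ARGS(&lists[i]));
    }

    // The app, not the driver, spreads recording across CPU cores.
    std::vector<std::thread> workers;
    for (int i = 0; i < threadCount; ++i)
        workers.emplace_back([&lists, i] {
            // ... per-thread recording (barriers, draws, etc.) would go here ...
            lists[i]->Close();
        });
    for (std::thread& t : workers) t.join();

    // A single submission of all lists from one thread.
    std::vector<ID3D12CommandList*> raw;
    for (auto& l : lists) raw.push_back(l.Get());
    queue->ExecuteCommandLists(static_cast<UINT>(raw.size()), raw.data());
    return 0;
}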
 