Monday, August 31st 2015

Lack of Async Compute on Maxwell Makes AMD GCN Better Prepared for DirectX 12

It turns out that NVIDIA's "Maxwell" architecture has an Achilles' heel after all, one that tilts the scales in favor of the competing AMD Graphics CoreNext architecture as the better prepared for DirectX 12. "Maxwell" lacks support for async compute, one of the three highlight features of Direct3D 12, even as the GeForce driver "exposes" the feature's presence to apps. This came to light when game developer Oxide Games alleged that it was pressured by NVIDIA's marketing department to remove certain features in its "Ashes of the Singularity" DirectX 12 benchmark.

Async Compute is a standardized API-level feature added to Direct3D by Microsoft, which allows an app to better exploit the number-crunching resources of a GPU, by breaking down its graphics rendering tasks. Since the NVIDIA driver tells apps that "Maxwell" GPUs support it, Oxide Games simply created its benchmark with async compute support, but when it attempted to use it on Maxwell, it was an "unmitigated disaster." During the course of its developer correspondence with NVIDIA to try and fix this issue, it learned that "Maxwell" doesn't really support async compute at the bare-metal level, and that the NVIDIA driver bluffs its support to apps. NVIDIA instead started pressuring Oxide to remove parts of its code that use async compute altogether, it alleges.

"Personally, I think one could just as easily make the claim that we were biased toward NVIDIA as the only "vendor" specific-code is for NVIDIA where we had to shutdown async compute. By vendor specific, I mean a case where we look at the Vendor ID and make changes to our rendering path. Curiously, their driver reported this feature was functional but attempting to use it was an unmitigated disaster in terms of performance and conformance so we shut it down on their hardware. As far as I know, Maxwell doesn't really have Async Compute so I don't know why their driver was trying to expose that. The only other thing that is different between them is that NVIDIA does fall into Tier 2 class binding hardware instead of Tier 3 like AMD which requires a little bit more CPU overhead in D3D12, but I don't think it ended up being very significant. This isn't a vendor specific path, as it's responding to capabilities the driver reports," writes Oxide, in a statement disputing NVIDIA's "misinformation" about the "Ashes of the Singularity" benchmark in its press communications (presumably to VGA reviewers).

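As a rough illustration of the distinction Oxide draws between a vendor-specific path and one that simply follows reported capabilities, here is a minimal Python sketch. The function name and the policy are hypothetical and are not Oxide's code; only the PCI vendor IDs (0x10DE for NVIDIA, 0x1002 for AMD) are real values.

```python
# Well-known PCI vendor IDs.
VENDOR_NVIDIA = 0x10DE
VENDOR_AMD = 0x1002


def use_async_compute(vendor_id: int, driver_reports_async: bool) -> bool:
    """Hypothetical policy: follow the driver's report, with one vendor override.

    The capability check alone is vendor-neutral. The vendor ID test is the
    "vendor specific" part: it disables the feature on hardware where the
    reported capability turned out not to be usable.
    """
    if not driver_reports_async:
        # Capability-driven path: no report, no async compute, for any vendor.
        return False
    # Vendor-specific override, analogous to the shutdown Oxide describes.
    return vendor_id != VENDOR_NVIDIA
```

Under this assumed policy, `use_async_compute(VENDOR_NVIDIA, True)` returns `False` even though the driver reports support, while an AMD adapter with the same report keeps the async path.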
Given its growing market share, NVIDIA could use similar tactics to keep game developers away from industry-standard API features that it doesn't support, and which rival AMD does. NVIDIA drivers tell Windows that its GPUs support DirectX 12 feature-level 12_1. We wonder how much of that support is faked at the driver level, like async compute. The company is already drawing flak for using borderline anti-competitive practices with GameWorks, which effectively creates a walled garden of visual effects that only users of NVIDIA hardware can experience for the same $59 everyone spends on a particular game.

Sources: DSOGaming, WCCFTech

196 Comments on Lack of Async Compute on Maxwell Makes AMD GCN Better Prepared for DirectX 12

#177
FordGT90Concept
"I go fast!1!11!1!"
We kind of already gathered that, no? Async on AMD cards is executed asynchronously while async on NVIDIA cards is executed synchronously.

Interesting that on both architectures, 100 threads appears to be the sweet spot.
#178
Xzibit
truth teller
turns out nvidia async implementation is just a wrapper


source: ka_rf @ beyond3d forum

their marketing department should have a full schedule on the next days
Someone has to find out what kind of performance will be expected if the developer codes for Async and has multi-gpu scalability to the game engine.
#179
BiggieShady
Those nvidia cons managed to have a gpu architecture that is good for dx12 while it is good for dx11 at the same time, in the middle of a transition from dx11 to dx12 ... damn them.
Those bastards may even engineer Pascal completely with dx12 in mind.

But seriously, Maxwell architecture seems to handle async task concurrency between themselves just fine (latencies are in accordance with 32 queue depth)... problem is graphics workload being synchronous against async compute workload - if there is no architectural reason to be that way, this could be solved through a driver update. Troubling thing is, if nvidia knew they could fix it in driver update, they'd be faster with their response. Maybe Jen-Hsun Huang is writing a heartwarming letter.
#180
qubit
Overclocked quantum bit
Sony Xperia S
Why am I not surprised at all by this too ?

Nvidia is so dirty like pigs in the mud.

I keep telling you that this company is not good but how many listen to me ?
The world will become a better place when we get rid of nvidia.

Monopoly of AMD (with good hearts) will be better than monopoly of nvidia (who only look how to screw technological progress).
Say whut?! :eek: :laugh: Especially the bold bit. With idiotic statements like that, no wonder you're getting criticized by everyone.
#181
GC_PaNzerFIN
What a huge pile of dog turd over something which seems to have been a driver issue, completely expectable with Alpha level implementation of first ever DX12 title. NVIDIA DX12 driver does not seem to yet fully support Async Shaders, although Oxide dev thought it does.

Yeah, AMD technical marketing might not be your best source for info about competitor products. Combine that with a meltdown from a game dev... Then we have some good old fashioned NVIDIA bashing.

http://wccftech.com/nvidia-async-compute-directx-12-oxide-games/

"We actually just chatted with Nvidia about Async Compute, indeed the driver hasn’t fully implemented it yet, but it appeared like it was. We are working closely with them as they fully implement Async Compute."
#184
Ikaruga
BiggieShady
@Ikaruga I was wondering what took toothless Spanish Nvidia employee so long
But he has some teeth;)
#185
rtwjunkie
PC Gaming Enthusiast
That comedy team has been running these gags almost 10 years. There's been a few people in other threads that actually think he works for Nvidia, so I figured I would throw this up here:
#186
BiggieShady
Ikaruga
But he has some teeth;)
He has a tooth, you are being generous with the plural there
#187
P-40E
Sony Xperia S
Really, I never knew and actually don't wanna know that this fruit the apple is so divine. :laugh:

Seriously, how would I have known that ? When this is the first time I hear someone speaking like that ?
Because you are supposed to be smart enough to comprehend it. (No offense) But that's how life works.
#188
anubis44
Mr McC
No support, no problem, just pay them to gimp it on AMD cards: welcome to the wonderful world of the nVidia console, sorry, pc gaming, the way it's meant to be paid.
More like the way WE'RE meant to be played.
#189
Vlada011
I don't see a single DX12 game, only some calculations for a possible scenario.
By the time 10 DX12 games are on the market, Pascal will be more than a year old.
Because of that I don't see a reason to panic; a card supporting DX12 is not the same as a card capable of playable fps.
I remember when the 5870 with 2GB showed up, I bought it immediately as the first card with DX11 support.
The card was excellent, but for DX9 and the few DX10 environments; the first card with playable fps, and much better with tessellation and DX11, was the GTX580.
I went through both the ATI5870 and ATI6970, but only with the GTX580 did the situation really get better, and with Tahiti later. In the period between the ATI 5870 and AMD 7970, AMD didn't improve anything in the DX11 field, and people who waited and upgraded to the GTX580 played much better, until the HD7950/HD7970.
Because of that there's no reason to panic, NVIDIA will be ready when the time comes...
Only one other thing is bad, and that's NVIDIA's tendency to write drivers only for the latest architecture.
If they continue to do that, people will turn their backs on them. At least the middle segment.
That's a much bigger reason to worry than Maxwell and DX12. We will not play nice DX12 games for at least 2 years.
Maybe some very rich people with multi GPU. But I'm talking about people who play games, as in the beginning, with a single powerful graphics card.
#190
Ikaruga
Vlada011
Only one other thing is bad, and that's NVIDIA's tendency to write drivers only for the latest architecture.
If they continue to do that, people will turn their backs on them. At least the middle segment.
That's a much bigger reason to worry than Maxwell and DX12. We will not play nice DX12 games for at least 2 years.
Maybe some very rich people with multi GPU. But I'm talking about people who play games, as in the beginning, with a single powerful graphics card.
Nvidia has very good drivers for older generations, even Fermi cards run recent games quite nicely, one just needs to reduce some settings, but that's always the case as time goes by, you get a new card or reduce some settings. I recently helped somebody who has a 560ti + a 2500k (he bought those from me), and most of the games still look and run quite nicely with "medium" settings, and some of them even run fine with "high". I can't imagine my Maxwell2 would need a replacement any time soon because of performance problems; if I replace it, that will only happen because I won't be able to resist the upgrade itch again.
#191
rvalencia
BiggieShady
That is a claim presented at the beginning of the article. Through the end, if you read it, it is proven in benchmark that it is not true (number of queues horizontally and time spent computing vertically - lower is better)

Maxwell is faster than GCN up to 32 queues, and it evens out with GCN to 128 queues, where GCN has same speed up to 128 queues.
It's also shown that with async shaders it's extremely important how they are compiled for each architecture.
Good find @RejZoR
From https://forum.beyond3d.com/posts/1870374/



For pure compute, AMD's compute latency (green color areas) rivals NVIDIA's compute latency (refer to the attached file).

http://www.overclock.net/t/1569897/various-ashes-of-the-singularity-dx12-benchmarks/1710#post_24368195

Here's what I think they did at Beyond3D:
  1. They set the amount of threads, per kernel, to 32 (they're CUDA programmers after-all).
  2. They've bumped the Kernel count to up to 512 (16,384 Threads total).
  3. They're scratching their heads wondering why the results don't make sense when comparing GCN to Maxwell 2
Here's why that's not how you code for GCN


Why?:
  1. Each CU can have 40 Kernels in flight (each made up of 64 threads to form a single Wavefront).
  2. That's 2,560 Threads total PER CU.
  3. An R9 290x has 44 CUs or the capacity to handle 112,640 Threads total.
If you load up GCN with Kernels made up of 32 Threads you're wasting resources. If you're not pushing GCN you're wasting compute potential. In slide number 4, it stipulates that latency is hidden by executing overlapping wavefronts. This is why GCN appears to have a high degree of latency but you can execute a ton of work on GCN without affecting the latency. With Maxwell/2, latency rises up like a staircase with the more work you throw at it. I'm not sure if the folks at Beyond3D are aware of this or not.
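The arithmetic above can be checked with a short Python sketch. The figures (64 threads per wavefront, 40 in flight per CU, 44 CUs on an R9 290X, 512 kernels of 32 threads) are taken from the post as given; the helper names and the lane-utilisation measure are assumptions added for illustration.

```python
import math

WAVEFRONT_SIZE = 64      # threads per GCN wavefront (figure from the post)
WAVEFRONTS_PER_CU = 40   # wavefronts in flight per CU (figure from the post)
CUS_R9_290X = 44         # compute units on an R9 290X (figure from the post)


def threads_in_flight(cus: int) -> int:
    """Total threads the GPU can keep in flight, per the post's numbers."""
    return cus * WAVEFRONTS_PER_CU * WAVEFRONT_SIZE


def lane_utilisation(threads_per_kernel: int) -> float:
    """Fraction of wavefront lanes doing useful work (hypothetical measure).

    A kernel is padded up to a whole number of 64-wide wavefronts, so a
    32-thread kernel leaves half of its single wavefront idle.
    """
    wavefronts = math.ceil(threads_per_kernel / WAVEFRONT_SIZE)
    return threads_per_kernel / (wavefronts * WAVEFRONT_SIZE)


per_cu = WAVEFRONTS_PER_CU * WAVEFRONT_SIZE   # 2,560 threads per CU
total = threads_in_flight(CUS_R9_290X)        # 112,640 threads
test_threads = 512 * 32                       # 16,384 threads in the test
```

With these numbers, the 16,384 threads of the Beyond3D test amount to under 15% of the 112,640 threads the post says an R9 290X can hold, and `lane_utilisation(32)` comes out at 0.5.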


Conclusion:

I think they geared this test towards nVIDIA's CUDA architectures and are wondering why their results don't make sense on GCN. If true... DERP! That's why I said the single Latency results don't matter. This test is only good if you're checking on Async functionality.


GCN was built for Parallelism, not serial workloads like nVIDIA's architectures. This is why you don't see GCN taking a hit with 512 Kernels.

What did Oxide do? They built two paths. One with Shaders Optimized for CUDA and the other with Shaders Optimized for GCN. On top of that GCN has Async working. Therefore it is not hard to determine why GCN performs so well in Oxide's engine. It's a better architecture if you push it and code for it. If you're only using light compute work, nVIDIA's architectures will be superior.

This means that the burden is on developers to ensure they're optimizing for both. In the past, this hasn't been the case. Going forward... I hope they do. As for GameWorks titles, don't count on them being optimized for GCN. That's a given. Oxide played fair, others... might not.
#192
BiggieShady
rvalencia
This test is only good if you're checking on Async functionality.
That's exactly what the test is for ... checking on how much latency Async functionality introduces on both architectures.
GCN has a constant latency, good enough for compute loads made of a small number of async tasks and great for a huge number of async tasks. Additionally, GCN mixes compute async load and graphics load in near perfect parallelism.
Maxwell shows varying latency that is extremely low for a small number of async tasks and surpasses GCN over 128 async tasks. What's really bad is that in current drivers async compute load and graphics load are done serially.
Mind you, every single async compute task is parallel in itself and can occupy 100% of the GPU if the job is suitable (parallelizable), so in most cases the penalty boils down to how many times and how much context switching is done. Maxwell has a nice cache hierarchy to help with that.
GCN should destroy Maxwell in special cases where huge number of async tasks depend on results calculated by huge number of other async tasks that are greatly varying in computational complexity ;)
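The behaviour described in this post can be mimicked with a toy Python model. Everything numeric here is hypothetical: the 1 ms step and 4 ms base are made-up constants chosen only so that the two curves meet at 128 tasks, as the post describes, and the 32-task batch size follows the queue depth mentioned earlier in the thread. None of it is measured data.

```python
import math


def staircase_latency(tasks: int, queue_depth: int = 32, step_ms: float = 1.0) -> float:
    """Toy model: latency steps up once per batch of `queue_depth` tasks."""
    return math.ceil(tasks / queue_depth) * step_ms


def constant_latency(tasks: int, base_ms: float = 4.0) -> float:
    """Toy model: latency stays flat regardless of the task count."""
    return base_ms


def frame_time(graphics_ms: float, compute_ms: float, overlapped: bool) -> float:
    """Serial execution adds the two workloads; ideal overlap hides the shorter one."""
    if overlapped:
        return max(graphics_ms, compute_ms)
    return graphics_ms + compute_ms
```

In this model the staircase is cheaper below 128 tasks, equal at 128 and more expensive above it. Separately, a frame with 10 ms of graphics and 4 ms of compute takes 14 ms when run serially and 10 ms with ideal overlap, which is the gap the post calls "really bad" in current drivers.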
#193
Xzibit
Gears of War Ultimate Will Have Unlocked Frame Rate; Devs Explain How They’re Using DX12 & Async Compute

To begin with, Cam McRae (Technical Director for the Windows 10 PC version) explained how they’re going to use DirectX 12 and even Async Compute in Gears of War Ultimate.
We are still hard at work optimising the game. DirectX 12 allows us much better control over the CPU load with heavily reduced driver overhead. Some of the overhead has been moved to the game where we can have control over it. Our main effort is in parallelising the rendering system to take advantage of multiple CPU cores. Command list creation and D3D resource creation are the big focus here. We’re also pulling in optimisations from UE4 where possible, such as pipeline state object caching. On the GPU side, we’ve converted SSAO to make use of async compute and are exploring the same for other features, like MSAA.
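The idea of recording command lists on several CPU cores and then submitting them in a fixed order can be sketched structurally in Python. This is not Direct3D code and not the developers' implementation: `record_command_list` and `build_frame` are hypothetical stand-ins, and Python threads only show the shape of the approach rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def record_command_list(draw_calls):
    """Stand-in for recording one command list from a slice of the frame's work."""
    return [f"draw {d}" for d in draw_calls]


def build_frame(draw_calls, workers=4, chunk=3):
    """Record chunks of draw calls in parallel, then submit them in order."""
    # Split the frame's work into independent chunks, one command list each.
    chunks = [draw_calls[i:i + chunk] for i in range(0, len(draw_calls), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, so submission order is
        # deterministic even though recording happens concurrently.
        command_lists = list(pool.map(record_command_list, chunks))
    # "Submission": flatten the recorded lists into one ordered stream.
    return [cmd for command_list in command_lists for cmd in command_list]
```

The point of the design is that recording is independent per chunk while submission stays ordered, which is why the result matches what a single-threaded recorder would produce.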
#195
FordGT90Concept
"I go fast!1!11!1!"
I'm sure they did. They did that to the Windows version of Minecraft already. Of course there's no technical reason for doing so.
#196
Xzibit
rtwjunkie
So is MS pulling another Halo 2, and making this W10 exclusive? It sounds like it, but doesn't outright say it.
It's X-Box One & Windows 10. Fable Legends is the same.