
AMD Radeon RX 6000 Series "Big Navi" GPU Features 320 W TGP, 16 Gbps GDDR6 Memory

Performance per watt did go up with Ampere, but that's to be expected given that Nvidia moved from TSMC's 12 nm to Samsung's 8 nm 8LPP, a 10 nm-derived node. What is not impressive is only a 10% performance-per-watt increase over Turing while being built on a roughly 25% denser node. The RDNA 2 architecture, being on 7 nm+, looks to be even worse efficiency-wise given that the density of 7 nm+ is much higher, but let's wait for the actual benchmarks.

Did you literally just completely ignore the chart that was a few posts above you? 100/85 ≈ 117.6%, so that's still a 17.6% improvement in performance per watt over the most efficient Turing GPU.
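Spelled out with the chart's normalized numbers as quoted here (most efficient Turing card = 85, Ampere = 100), the general form is just a ratio of perf-per-watt figures:

perf/W gain = (100 / 85) − 1 ≈ 0.176, i.e. about 17.6%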
 
Really, AMD needs to put out a new optimization guide that contains information like this (they haven't written one since the 7950 series).
Thank you for some very valuable insight. It is all a game to me; however, it is a learning opportunity nonetheless.
If you are shader-launch constrained, it isn't a big deal to have a for(int i=0; i<16; i++){} statement wrapping your shader code. Just loop your shader 16 times before returning.
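In compute-shader terms that could look something like the following minimal HLSL-style sketch; CSMain, DoWork, Output, and the loop count of 16 are illustrative names and values, not from any AMD guide:

// Results buffer; purely illustrative.
RWStructuredBuffer<float> Output;

// Stand-in for the real shader body.
void DoWork(uint3 id, int i)
{
    Output[id.x * 16 + i] = (float)(id.x + i);
}

[numthreads(64, 1, 1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    // Amortize the wave-launch cost: each launched wave runs the
    // body 16 times instead of the driver launching 16x the waves.
    for (int i = 0; i < 16; i++)
    {
        DoWork(id, i);
    }
}

The host-side dispatch would then shrink by the same factor, e.g. Dispatch(groups / 16, 1, 1) instead of Dispatch(groups, 1, 1), so far fewer waves have to be launched to cover the same total work.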
I'm intrigued; this trains up the L2 caches, I presume?
What I find generally lacking is, to put it very simply, an easy demonstration of what the workloads actually are in comparison to what they could have been.
Suppose we say there are 64 CUs; let's just say 80 CUs for the sake of the latest series. According to the 'engine optimisation hot lap' guideline, the CUs start up one by one to be issued work. Calculated via Gauss's summation method, that works out to 40.5 CUs working on average over the first 80 cycles. We could take it either as 50.6% duty across 80 cycles of latency, or as roughly 40 cycles of latency placed statically at the start of every GPU workflow. The issue is what we could do with the hardware if we directed our GPU power budget differently. If we instructed the GPU to 'load' but not do any work, we could keep loading, not just for all 80 CUs, but for each of the 80 CUs times 40 waves per CU. If the GPU is working at 2.5 GHz, that is 2‰ of the GPU time! There is a giant window of opportunity where the GPU can be exempted from any real shader work and can just track the instruction flow to prepare the shaders for operation.
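To put rough numbers on that, a back-of-envelope sketch under the post's own assumptions (one CU brought up per cycle, 80 CUs, 40 waves per CU, 2.5 GHz; what the per-mille share works out to depends on the time window you compare against):

average CUs working during the ramp = (1 + 2 + ... + 80) / 80 = 3240 / 80 = 40.5, i.e. 40.5/80 ≈ 50.6% duty
waves to fill = 80 CUs × 40 waves = 3,200 waves
at one wave launch per cycle: 3,200 cycles ÷ 2.5 GHz ≈ 1.28 µs per ramp-up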
It is crazy, but I think Nvidia won't let AMD rest on its laurels if they don't discover buffered instruction and data flow cycling first. Imagine: the execution mask is off for the whole shader array, the GPU waits for 5 MHz until all waves are loaded, then it releases and off it goes! I know there are kinks. I just don't know any better. :oops:
 
If this is really possible, it would be really awesome.
 
Yes, just power-gate them until they are ready for full operation with no delay, since they note it is already an established problem to keep pipelines full rather than to let them drain. If it helps, turning off the shader array could provide some overclocking-ceiling headroom, which also speeds up the idle recovery.
Funny thing is, the RGP trace looks like a tapered trapezoid at the far end of the timeline, so they ought to work on the retiring speed as well.
I still don't get it. All thread blocks are limited to 1024 threads in size. Even an AI could pattern out all possible permutations of a 1024-unit workgroup. They aren't trying hard enough; have they even played any StarCraft? Build orders are everything. Just 4-pool, gg wp.
 