Friday, March 6th 2020

AMD RDNA2 Graphics Architecture Detailed, Offers +50% Perf-per-Watt over RDNA

With its 7 nm RDNA architecture that debuted in July 2019, AMD achieved a nearly 50% gain in performance/Watt over the previous "Vega" architecture. At its 2020 Financial Analyst Day event, AMD made a big disclosure: its upcoming RDNA2 architecture will offer a similar 50% performance/Watt jump over RDNA. The new RDNA2 graphics architecture is expected to leverage 7 nm+ (7 nm EUV), which offers up to an 18% transistor-density increase over 7 nm DUV, among other process-level improvements. AMD could tap into this to increase price-performance by serving up more compute units at existing price-points, running at higher clock speeds.
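For a rough sense of how these claims compound: two successive ~50% perf/Watt gains would put RDNA2 at roughly 2.25x "Vega" efficiency. A quick illustrative calculation (the 1.5x factors are AMD's claims; the normalization to Vega is ours):

```python
# Compound AMD's claimed generational perf/Watt gains,
# normalized to "Vega" = 1.0 (illustrative arithmetic only).
vega = 1.0
rdna = vega * 1.5    # ~50% perf/Watt over Vega (RDNA, July 2019)
rdna2 = rdna * 1.5   # claimed ~50% perf/Watt over RDNA

print(f"RDNA  vs Vega: {rdna:.2f}x perf/Watt")   # 1.50x
print(f"RDNA2 vs Vega: {rdna2:.2f}x perf/Watt")  # 2.25x
```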

AMD has two key design goals with RDNA2 that help it close the feature-set gap with NVIDIA: real-time ray-tracing and variable-rate shading, both of which have been standardized by Microsoft under the DirectX 12 DXR and VRS APIs. AMD announced that RDNA2 will feature dedicated ray-tracing hardware on die. On the software side, the hardware will leverage the industry-standard DXR 1.1 API. The company is supplying RDNA2 to next-generation game console manufacturers such as Sony and Microsoft, so it's highly likely that AMD's approach to standardized ray-tracing will have more takers than NVIDIA's RTX ecosystem, which tops up DXR with its own proprietary feature-set.
[Slides: AMD GPU Architecture Roadmap (RDNA2/RDNA3); AMD RDNA2 Efficiency Roadmap; AMD RDNA2 Performance per Watt; AMD RDNA2 Raytracing]
Variable-rate shading is another key feature that has been missing on AMD GPUs. It allows a graphics application to apply different rates of shading detail to different areas of the 3D scene being rendered, to conserve system resources. NVIDIA and Intel already implement VRS tier 1 as standardized by Microsoft, and NVIDIA "Turing" goes a step further by supporting VRS tier 2. AMD didn't detail its VRS tier support.

AMD hopes to deploy RDNA2 on everything from desktop discrete client graphics, to professional graphics for creators, to mobile (notebook/tablet) graphics, and lastly cloud graphics (for cloud-based gaming platforms such as Stadia). Its biggest takers, however, will be the next-generation Xbox and PlayStation game consoles, which will also shepherd game developers toward standardized ray-tracing and VRS implementations.

AMD also briefly touched upon the next-generation RDNA3 graphics architecture without revealing any features. All we know about RDNA3 for now is that it will leverage a process node more advanced than 7 nm (likely 6 nm or 5 nm; AMD won't say), and that it will come out some time between 2021 and 2022. RDNA2 will extensively power AMD client graphics products over the next 5-6 calendar quarters, at least.

242 Comments on AMD RDNA2 Graphics Architecture Detailed, Offers +50% Perf-per-Watt over RDNA

#51
HD64G
This improvement in efficiency means that double the RX 5700 XT's performance (80 CU) would consume close to 300 W. And that is the worst-case Navi efficiency. Let's see if that is what AMD will bring to the table.
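A quick sketch of the arithmetic behind that ~300 W estimate (assuming a roughly 225 W RX 5700 XT board power, a flat +50% perf/Watt, and perfect scaling to 80 CUs; all numbers are assumptions, not AMD figures):

```python
# Estimate board power for 2x RX 5700 XT performance at +50% perf/Watt.
base_power_w = 225        # assumed RX 5700 XT board power
perf_target = 2.0         # 2x performance (80 CU vs 40 CU, ideal scaling)
perf_per_watt_gain = 1.5  # claimed +50% perf/Watt

power_needed_w = base_power_w * perf_target / perf_per_watt_gain
print(f"{power_needed_w:.0f} W")  # 300 W
```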
Posted on Reply
#52
gamefoo21
Hmm... 50% more perf per watt than previous-gen Vega.

That's a comparison against Vega 10. Then the slides show 50% more against RDNA 1. Without any process improvements, eh?

I predict we'll see a 384-bit-memory Navi. Navi is bandwidth-starved at the moment.
Posted on Reply
#53
R0H1T
The 50% perf/W improvement includes IPC as well as process improvements. They'd be well ahead of Nvidia if they could pull 2 gens of such improvements without process efficiency!
Posted on Reply
#54
_larry
I'm just glad AMD is finally getting their $hit together GPU-wise again. They have already done VERY well with their CPUs; now if they can get closer to what Nvidia delivers, it's gonna be another game changer. (Pun intended)

When the R9s came out I was stoked. I still have my R9 290 from 2013, and it can still handle most games at 1440p with some settings turned down. I was very disappointed with the Polaris architecture. All they did was make them more power-efficient, with the same performance as my 290. Hell, my 290 still beats the RX 580 in some benchmarks... I am looking forward to getting a 5700 XT when the new cards drop though :)
Posted on Reply
#55
gamefoo21
_larry
I'm just glad AMD is finally getting their $hit together GPU-wise again. They have already done VERY well with their CPUs; now if they can get closer to what Nvidia delivers, it's gonna be another game changer. (Pun intended)

When the R9s came out I was stoked. I still have my R9 290 from 2013, and it can still handle most games at 1440p with some settings turned down. I was very disappointed with the Polaris architecture. All they did was make them more power-efficient, with the same performance as my 290. Hell, my 290 still beats the RX 580 in some benchmarks... I am looking forward to getting a 5700 XT when the new cards drop though :)
The Fury X was so limited by its VRAM, but it was a big GPU that fought with the 980. Then AMD just rode on Polaris, and we haven't had a true high-end GPU for a while. Vega 56/64 were pro GPUs forced into gaming. The Radeon VII was the same: a beast of a workstation card that plays games while arguing with the 2080.

The 5700 XT was... well, a 2070 killer and a 2070 Super fighter.

It'll be nice if AMD can finally field another Radeon that can actually challenge for the performance crown again.

How long has it been since the Fury X came out? :-(

R0H1T
The 50% perf/W improvement includes IPC as well as process improvements. They'd be well ahead of Nvidia if they could pull 2 gens of such improvements without process efficiency!
I really don't see how AMD can get a 50% boost over RDNA 1 without a new and wider memory controller.

The 5700XT is desperately starved for bandwidth.

It's like my modified Fury X. I tightened up the HBM timings, and at stock speed I can get over 300 GB/s in OCLMemBench. Stock, the Fury X gets between 180-220 GB/s of memory bandwidth. At 500 MHz (DDR, so 1000 MT/s effective), its theoretical bandwidth is 512 GB/s.

It's hard for me to compare apples to apples because the mods also undervolted and underclocked the core. Though it's similar with the 5700 XT... you can get nearly the same performance with less power by undervolting and mild underclocking.

Either way, a Fury X at 1000/1000 blows the doors off one at 1050/1000. On the stock BIOS I need to push the volts; it takes 1150/1200 to match.

It burns a lot more power. Tuned up, the Fury X is much happier and gets a significant bump in perf per watt.

So if AMD could just stop pushing their architecture for every last clock, it's possible to get most of the way there.

Which is why I think...

A refined 5700 XT with 384-bit memory that drops even 100-200 MHz of core clock from where it is now, with a matching drop in vcore. That's not adding any other extra transistors to the die. Bump it to 44 CUs from 40, drop the core clocks 200-400 MHz... all the way there.

Look at the 2080 Ti vs the 2080 Super. Bigger silicon, significantly lower clocks, but it still performs.
Posted on Reply
#56
moproblems99
oxrufiioxo
Either way, if they can get close to 50% per watt over a 5700 XT while keeping prices sane, it will be a nice card... If it's $800, it will be another fail. Well, I guess that also depends on what Nvidia does; they could raise prices again, who knows.
Better not be their top. That is not good enough.
Posted on Reply
#57
oxrufiioxo
moproblems99
Better not be their top. That is not good enough.
Well, with AMD at this point it would just make me happy if they could compete with Nvidia's second-best card. You figure whatever Ampere brings, the 3080 will most likely be 10-30% faster than a 2080 Ti, so competing with that would be a step in the right direction. Oh, and also not being 6-12 months late would be nice.
Posted on Reply
#58
moproblems99
oxrufiioxo
Well, with AMD at this point it would just make me happy if they could compete with Nvidia's second-best card. You figure whatever Ampere brings, the 3080 will most likely be 10-30% faster than a 2080 Ti, so competing with that would be a step in the right direction. Oh, and also not being 6-12 months late would be nice.
Agreed, but I am not even sure 2 x 5700 would do that.
Posted on Reply
#59
MrMilli
gamefoo21
It's like my modified Fury X. Tightened up the HBM timings and at stock speed I can get over 300GB/s in OCLMembench. Stock as a rock the Fury X gets between 180-220GB/s for memory bandwidth. At 500mhz or well DDR for 1000 effective, it's theoretical is at 512GB/s.
No surprises there. Historically, ATI has been terrible at making memory controllers.
Even if you go back more than a decade to the times of northbridges, ATI was the worst (while nVidia was the best at maximizing bandwidth). Nothing has changed.
Reviewers often cite that nVidia designs are more memory-bandwidth efficient; while that might be true, my guess is that nVidia simply gets more effective bandwidth out of the memory.
Posted on Reply
#60
Vya Domus
Cheeseball
While it is technically impressive, please note that RDNA is quite different from GCN at the SIMD level, where RDNA works with SIMD32 (native Wave32!) and single-cycle instructions.

GCN (5th gen) used SIMD16, which means it issues instructions every 4 cycles, whereas RDNA issues them every cycle. This inherently makes a 40 CU cluster (RX 5700 XT) faster than the previous 64 CU cards (Vega 64/Radeon VII).

Depending on what you're trying to achieve (raw core performance vs. optimized IPC), GCN5 can still compete well against its younger sibling. However, RDNA can do everything GCN5 can do, except beat it in raw compute loads.
There isn't really anything inherently faster about that if the workload is nontrivial; it's just a different way to schedule work. Over the span of 4 clock cycles, both the GCN CU and the RDNA CU would go through the same number of threads. To be fair, there is nothing SIMD-like anymore about either of these; TeraScale was the last architecture that used a real SIMD configuration, and everything is now executed by scalar units in a SIMT fashion.

Instruction throughput is not indicative of performance, because that's not how GPUs extract performance. Say you want to perform one FMA over 256 threads: with GCN5 you'd need 4 wavefronts, which would take 4 clock cycles within one CU; with RDNA you'd need 8 wavefronts, which would also take the same 4 clock cycles within one CU. The same work got done in the same time; it wasn't faster in either case.

Thing is, it takes more silicon and power to schedule 8 wavefronts instead of 4, so that actually makes GCN more efficient space- and power-wise. If you've ever wondered why AMD would always be able to fit more shaders within the same space and TDP than Nvidia, that's how they did it. And that's also probably why Navi 10 wasn't as impressive power-wise as some expected, and why it had such a high transistor count despite not having any RT or tensor hardware (Navi 10 and TU106 have practically the same transistor count).

But as always there's a trade-off: a larger wavefront means more idle threads when a hazard such as branching occurs. Very few workloads are hazard-free, especially a complex graphics shader, so in practice GCN actually ends up being a lot less efficient per clock cycle on average.
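The equal-time argument above can be sketched as a toy issue model. This is a simplification that assumes a hazard-free ALU op and the commonly cited CU layouts (4x SIMD16 per GCN CU issuing a wave64 over 4 cycles; 2x SIMD32 per RDNA CU issuing a wave32 in a single cycle):

```python
import math

def cu_cycles(threads, wave_size, simd_width, simds_per_cu):
    """Cycles for one CU to issue a simple ALU op over `threads` threads."""
    waves = math.ceil(threads / wave_size)
    cycles_per_wave = wave_size // simd_width  # cycles one SIMD spends per wave
    rounds = math.ceil(waves / simds_per_cu)   # waves issued across the SIMDs
    return rounds * cycles_per_wave

# GCN5 CU: 4x SIMD16, wave64 (one wave takes 4 cycles on a SIMD)
print(cu_cycles(256, wave_size=64, simd_width=16, simds_per_cu=4))  # 4
# RDNA CU: 2x SIMD32, wave32 (single-cycle issue)
print(cu_cycles(256, wave_size=32, simd_width=32, simds_per_cu=2))  # 4
```

Both CUs finish 256 threads in the same 4 cycles; RDNA just gets there with more, smaller wavefronts.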
Posted on Reply
#61
Cheeseball
Not a Potato
Vya Domus
Instruction throughput is not indicative of performance, because that's not how GPUs extract performance. Say you want to perform one FMA over 256 threads: with GCN5 you'd need 4 wavefronts, which would take 4 clock cycles within one CU; with RDNA you'd need 8 wavefronts, which would also take the same 4 clock cycles within one CU. The same work got done in the same time; it wasn't faster in either case.
You're correct about this, though. Any wavefront branching would require cycling through again until it is correctly executed, which can be inefficient.
Posted on Reply
#62
Prime2515102
MrMilli
No surprises there. Historically ATI has been terrible at making memory controllers.
Even if you go back more than a decade to the times of northbridges, ATI was the worst (while nVidia was the best at maximizing bandwidth). Nothing has changed.
Often reviewers site that nVidia designs are more memory bandwidth efficient, but while this might be true, my guess is that nVidia just gets more effective bandwidth out of the memory.
ATi never made northbridges.
Posted on Reply
#64
eidairaman1
The Exiled Airman
gamefoo21
The Fury X was so limited by its VRAM, but it was a big GPU that fought with the 980. Then AMD just rode on Polaris, and we haven't had a true high-end GPU for a while. Vega 56/64 were pro GPUs forced into gaming. The Radeon VII was the same: a beast of a workstation card that plays games while arguing with the 2080.

The 5700 XT was... well, a 2070 killer and a 2070 Super fighter.

It'll be nice if AMD can finally field another Radeon that can actually challenge for the performance crown again.

How long has it been since the Fury X came out? :-(

I really don't see how AMD can get a 50% boost over RDNA 1 without a new and wider memory controller.

The 5700 XT is desperately starved for bandwidth.

It's like my modified Fury X. I tightened up the HBM timings, and at stock speed I can get over 300 GB/s in OCLMemBench. Stock, the Fury X gets between 180-220 GB/s of memory bandwidth. At 500 MHz (DDR, so 1000 MT/s effective), its theoretical bandwidth is 512 GB/s.

It's hard for me to compare apples to apples because the mods also undervolted and underclocked the core. Though it's similar with the 5700 XT... you can get nearly the same performance with less power by undervolting and mild underclocking.

Either way, a Fury X at 1000/1000 blows the doors off one at 1050/1000. On the stock BIOS I need to push the volts; it takes 1150/1200 to match.

It burns a lot more power. Tuned up, the Fury X is much happier and gets a significant bump in perf per watt.

So if AMD could just stop pushing their architecture for every last clock, it's possible to get most of the way there.

Which is why I think...

A refined 5700 XT with 384-bit memory that drops even 100-200 MHz of core clock from where it is now, with a matching drop in vcore. That's not adding any other extra transistors to the die. Bump it to 44 CUs from 40, drop the core clocks 200-400 MHz... all the way there.

Look at the 2080 Ti vs the 2080 Super. Bigger silicon, significantly lower clocks, but it still performs.
Same can be said of a 290X vs a 290.
Posted on Reply
#65
Valantar
efikkan
I feel it's disappointing that there is no major new architecture in sight; just more iterations of Navi.
Uh... You know that Navi is the new major architecture, right? As in RDNA (1) and not GCN? Of which there has been just one generation of chips? Expecting another within even a few years is silly. First come optimizations and revisions. They are probably working on the next arch on a conceptual level already, but it'll be quite a while before we see it.
moproblems99
Better not be their top. That is not good enough.
Why would it be? The main reason for perf/w improvements is to be able to cool a bigger/higher performing die in a PCIe form factor. Also, AMD has explicitly stated (both now and previously) that they will be competing at flagship level with this upcoming generation.
Posted on Reply
#66
Super XP
ratirt
Yeah. The x2 performance uplift for the RDNA2 is in comparison to GCN. This can be confusing for some people.

EDIT: Performance/Watt to be exact.
Umm, nope: 50% performance/Watt over RDNA.

moproblems99
Agreed, but I am not even sure 2 x 5700 would do that.
RDNA2 is targeting Nvidia's next generation GPU, called Ampere or the rumoured RTX 3080 series.
Again, RDNA2 is NOT competing with Nvidia's current-generation graphics. Which is why there were some patents out about a possible RX 5800 XT & 5900 XT based on a revamped RDNA1 as a placeholder until RDNA2 is released by the beginning of Q4 2020. Or these revamps could be RDNA2, despite that gen being called the RX 6000 series.
Posted on Reply
#67
ARF
I hope these new Navi 2x-based cards will receive all the new features, like full hardware acceleration of everything 8K-video-related.
They also need full support for the latest HDMI and DisplayPort interfaces: HDMI 2.1 and DP 2.0.

A 50% performance/Watt improvement is good: it means a 150 W card that renders a game at 100 FPS will now render it at 150 FPS.
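That works out directly from the ratio; a minimal sketch, assuming the full +50% perf/Watt shows up as frame rate at an unchanged 150 W power budget:

```python
# FPS at a fixed power budget under a +50% perf/Watt uplift (illustrative).
power_budget_w = 150      # unchanged board power
base_fps = 100
perf_per_watt_gain = 1.5  # claimed +50% perf/Watt

new_fps = base_fps * perf_per_watt_gain  # same 150 W, 1.5x the work per watt
print(new_fps)  # 150.0
```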
Posted on Reply
#68
efikkan
Valantar
Uh... You know that Navi is the new major architecture, right? As in RDNA (1) and not GCN?
That's just marketing, even though many don't want to hear it. Internally, the driver still refers to Navi as GCN, and the ISA is virtually unchanged. While there are some good improvements in Navi, they are still small compared to the pace Nvidia is innovating at.

Valantar
Of which there has been just one generation of chips? Expecting another within even a few years is silly. First come optimizations and revisions.
Only minor architecture updates for 8 years with the GCN/RDNA family, compared to Nvidia, which seems to alternate minor and major ones. I'm worried that the efficiency gap with Nvidia will increase if they don't keep up.

Valantar
They are probably working on the next arch on a conceptual level already, but it'll be quite a while before we see it.
They'd better be; Nvidia usually has three generations in various stages of development at any time, and designing a new architecture usually takes 3-6 years to reach market.
Posted on Reply
#69
ARF
RDNA 2.0 will be the 100% new micro-architecture.
RDNA 1.0 is just a hybrid, it keeps GCN characteristics.
Posted on Reply
#70
Vya Domus
ARF
RDNA 1.0 is just a hybrid, it keeps GCN characteristics.
RDNA is already worlds apart from GCN; the only real thing in common is that RDNA supports both 32- and 64-thread wavefronts, that's it. Well, that comes with the caveat that GPU architectures in general aren't very different from one another. GPUs have shallow pipelines, no out-of-order execution, and no real branch prediction; they're mostly simple vector processors, so there is just not a whole lot you can tweak and change.

In fact, if you look throughout the history of GPUs, you'll see that most of the performance typically comes from more shaders and higher clock speeds; that's pretty much the number-one driving factor for progress, by far.
Posted on Reply
#71
ARF
RDNA 1.0 is just a heavily modified, rearranged GCN.

RDNA 2.0 will have ray-tracing hardware and variable-rate shading capability, which on their own should rearrange the architecture even further.

VLIW5 - VLIW4 - GCN: [image]

Radeon HD 7870 Pitcairn GCN 1.0 original: [image]

Radeon RX Vega GCN 1.4 vs Radeon RX 5700 XT RDNA 1.0 original: [image]
Posted on Reply
#73
Super XP
ARF
RDNA 2.0 will be the 100% new micro-architecture.
RDNA 1.0 is just a hybrid, it keeps GCN characteristics.
Based on all the data available today, RDNA2 will be a new uArch. One major difference I heard of is that RDNA2 will have a completely redesigned cache system. I think this has to do with the next-generation gaming consoles, because Micro$oft has been working closely with AMD on RDNA2. This is key to the PC gaming market. We are talking about a significant performance uplift over GCN and RDNA1, with great efficiency.
Posted on Reply
#74
medi01
moproblems99
Agreed, but I am not even sure 2 x 5700 would do that.
The 2080 Ti is about 46%/55% faster than the 5700 XT (ref vs ref) at 1440p/4K respectively in TPU benchmarks.
Posted on Reply
#75
sergionography
Valantar
Sorry, but how on earth does anyone see "Navi 2X" in friggin' quotes without any further data and think "Oh, that must mean 2x the performance"? Sorry, but that is a rather extreme leap of the imagination. Also, x as a multiplier is generally lower case, this is upper case, which is generally X as an unknown variable. 2X = 20, 21, 22, etc. is much more reasonable of an assumption than 2X = 2x performance.

2X is the generational code name for all consumer-oriented non-semi custom RDNA 2 silicon, with each piece of silicon then having a distinct second digit. End discussion.
Yes, that's true, but 2x performance is very likely nonetheless. Keep in mind that Navi 10 / the 5700 XT is a small ~250 mm² chip. 50% better performance per watt means AMD can scale up more shaders before running into a performance/power wall. If anything, this gives credit to big Navi being twice the size of Navi 10: a 500+ mm² chip with double the shaders. A 5120-core Radeon chip below the 300 W PCIe limit all of a sudden becomes a possibility.
Posted on Reply