Tuesday, May 21st 2019

AMD "Navi" Features 8 Streaming Engines, Possible ROP Count Doubling?

AMD's 7 nm "Navi 10" silicon may finally address two architectural shortcomings of its performance-segment GPUs: memory bandwidth and a shortage of render backends. The GPU almost certainly features a 256-bit GDDR6 memory interface, bringing about a 50-75 percent increase in memory bandwidth over "Polaris 30." According to a sketch of the GPU's SIMD schematic put out by KOMACHI Ensaka, Navi's main number-crunching machinery is spread across eight shader engines, each with five compute units (CUs).
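
The 50-75 percent figure can be sanity-checked with a rough sketch. The per-pin data rates below are assumptions, not something the article states: 8 Gbps GDDR5 on a 256-bit bus for "Polaris 30," and 12 or 14 Gbps GDDR6 for Navi. The helper name `bandwidth_gbs` is made up for illustration.

```python
def bandwidth_gbs(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Peak memory bandwidth in GB/s: bus width in bytes times per-pin data rate."""
    return bus_width_bits / 8 * data_rate_gbps


# Assumed baseline: "Polaris 30" with 256-bit GDDR5 at 8 Gbps per pin.
polaris30 = bandwidth_gbs(256, 8.0)

# Assumed GDDR6 speed grades for a 256-bit Navi 10.
for rate in (12.0, 14.0):
    navi = bandwidth_gbs(256, rate)
    gain = (navi / polaris30 - 1) * 100
    print(f"{rate:g} Gbps GDDR6: {navi:.0f} GB/s, +{gain:.0f}% over Polaris 30")
```

Under those assumed speed grades, the two cases line up with the lower and upper ends of the quoted range.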

Five CUs in each of eight shader engines, assuming each CU continues to pack 64 stream processors, works out to 2,560 stream processors on the silicon. This arrangement is in stark contrast to the "Hawaii" silicon from 2013, which crammed 10 CUs per shader engine across four shader engines to achieve the same 2,560 SP count on the Radeon R9 290. The "Fiji" silicon that followed "Hawaii" stuck to the four-shader-engine arrangement. Interestingly, both these chips featured four render backends per shader engine, working out to 64 ROPs. AMD's decision to go with eight shader engines raises hopes that the company will double the ROP count over "Polaris," to 64, by packing two render backends per shader engine. AMD is expected to unveil Navi at its May 27 Computex keynote, followed by a possible early-July launch.
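
The counting above can be written out as a minimal sketch. The figure of four ROPs per render backend is an assumption inferred from the article's own numbers (four shader engines with four backends each giving 64 ROPs), and the function names are hypothetical.

```python
def stream_processors(shader_engines: int, cus_per_engine: int, sps_per_cu: int = 64) -> int:
    """Total stream processors = shader engines x CUs per engine x SPs per CU."""
    return shader_engines * cus_per_engine * sps_per_cu


def rops(shader_engines: int, backends_per_engine: int, rops_per_backend: int = 4) -> int:
    """Total ROPs; 4 ROPs per render backend is inferred, not stated in the article."""
    return shader_engines * backends_per_engine * rops_per_backend


print("Navi 10 (rumored) SPs:", stream_processors(8, 5))   # 8 engines x 5 CUs
print("Radeon R9 290 SPs:    ", stream_processors(4, 10))  # 4 engines x 10 CUs
print("Hawaii/Fiji ROPs:     ", rops(4, 4))                # 4 engines x 4 backends
print("Navi 10 (hoped) ROPs: ", rops(8, 2))                # 8 engines x 2 backends
```

Both layouts reach the same totals; what changes is how many front-end units the work is spread across.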
Sources: KOMACHI Ensaka (Twitter), Edificil (Reddit)

24 Comments on AMD "Navi" Features 8 Streaming Engines, Possible ROP Count Doubling?

#1
Zubasa
If this is true, there might be some hope for RTG.
Assuming Raja actually did his job before he left for Intel, which was to make GCN more scalable.
The Geometry limit was the Achilles heel of GCN in terms of gaming performance.

In Pixel Fill-rate the Radeon VII actually beats the 2080, but when you look at Geometry it is far behind.
The ROP count might not be the actual issue.


Posted on Reply
#2
biffzinker
btarunr said:
CU continues to pack 64 stream processors
The other change is that the CU has supposedly been broken up into 2x32, unlike Fiji/Vega (8x5x2x32 = 2560).
Posted on Reply
#3
snakefist
Raja did *something* at AMD for sure. Head of Research or Engineering Team or whatever, he couldn't *just* reiterate GCN for endless generations. I would go so far as to say that he probably developed at least a base for the next architecture (so say many others). At Intel, his influence also won't be felt for a few years... To consumers, that is.
Posted on Reply
#4
TheLostSwede
I'm not planning on getting a new graphics card this year anyhow, but I do hope AMD can come up with something competitive at least, as it's much needed.
Monopolies aren't good for anyone, as it normally means higher prices, slower innovation and poor selection.
Nvidia might not be a monopoly, but with their performance lead on the higher end of the market, they might as well be.
Here's also fingers crossed that Intel will bring out something competitive when they launch their GPUs.
I long for the days when there were half a dozen competitive GPU makers, but that was a very long time ago and before they were called GPUs...
Posted on Reply
#6
Zubasa
Windyson said:
8 SEs ? unbelievable !
Yeah, it would require a substantial change to the front end.
Remember the 128 ROP BS that popped up around Vega 20? This can easily be another round of BS before launch.
Posted on Reply
#7
biffzinker
Here's a better rendition of the 8 Streaming Engines.
Posted on Reply
#8
_Flare
1. As far as I know, AMD takes almost no hit when using ROP blending, so on that side it has nearly no problems with bandwidth; Nvidia, by contrast, loses a good chunk when using ROP blending.
2. AMD is very flexible in the number of ROPs it wires up, so they could have used 128 ROPs with Hawaii if they had wanted to; they have seen no need for that so far, not even for the Radeon VII or MI60.
3. Even Nvidia shies away from using 8 geometry engines, because the wiring would overwhelm the chip for nearly no efficiency gain.
4. Navi is GCN, and more than 4 shader arrays are forbidden in GCN.

The picture at the bottom can be switched to "avec" (with) blending. The Vegas will be over 100 for the upper 4 numbers, which is not bad at all.
https://www.hardware.fr/articles/955-7/performances-theoriques-pixels.html
It's the latest one Mr. Triolet made before his departure to AMD and later to Intel's graphics division.
Posted on Reply
#10
efikkan
AMD's 7 nm "Navi 10" silicon may finally address two architectural shortcomings of its performance-segment GPUs: memory bandwidth and a shortage of render backends.
Neither memory bandwidth nor render backends are shortcomings of GCN. First of all, if ROPs were a bottleneck, they would have easily added more. Secondly, GCN cards have plenty of memory bandwidth vs. their Nvidia counterparts;
Radeon VII (1 TB/s) vs. RTX 2080 (448 GB/s)
Vega 64 (484 GB/s) vs. GTX 1080 (320 GB/s)
RX 580 (256 GB/s) vs. GTX 1060 (192 GB/s)
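
As a quick illustration of the gaps in that list, here is a small sketch using only the figures quoted in the post (taking "1 TB/s" as 1000 GB/s); the helper name `extra_bandwidth_pct` is invented for this example.

```python
def extra_bandwidth_pct(amd_gbs: float, nvidia_gbs: float) -> float:
    """How much more raw bandwidth the AMD card has, as a percentage."""
    return (amd_gbs / nvidia_gbs - 1) * 100


# Bandwidth figures in GB/s, exactly as quoted in the post above.
pairs = [
    ("Radeon VII", 1000, "RTX 2080", 448),
    ("Vega 64", 484, "GTX 1080", 320),
    ("RX 580", 256, "GTX 1060", 192),
]

for amd, amd_bw, nvidia, nvidia_bw in pairs:
    pct = extra_bandwidth_pct(amd_bw, nvidia_bw)
    print(f"{amd} vs. {nvidia}: {pct:.0f}% more raw bandwidth")
```

These are raw peak numbers only; they say nothing about compression or caching, which affect how much of that bandwidth is effectively used.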
Posted on Reply
#11
M2B
efikkan said:
Neither memory bandwidth nor render backends are shortcomings of GCN. First of all, if ROPs were a bottleneck, they would have easily added more. Secondly, GCN cards have plenty of memory bandwidth vs. their Nvidia counterparts;
Radeon VII (1 TB/s) vs. RTX 2080 (448 GB/s)
Vega 64 (484 GB/s) vs. GTX 1080 (320 GB/s)
RX 580 (256 GB/s) vs. GTX 1060 (192 GB/s)
It doesn't even take that much knowledge to understand these simple facts.
Just looking at benchmarks and comparing AMD cards to their Nvidia rivals at different resolutions is enough.
AMD generally puts more bandwidth on their cards simply because they can't compete head to head and need more raw resources to do so.
Posted on Reply
#12
my_name_is_earl
AMD still struggles to compete with last year's cards. It's over for AMD. It's over. I'm an Nvidia believer now.
Posted on Reply
#13
mtcn77
Zubasa said:
In Pixel Fill-rate the Radeon VII actually beats the 2080, but when you look at Geometry it is far behind.
The ROP count might not be the actual issue.

ROP count is indeed the issue, otherwise color compression scores would be higher. The shaders as a total cannot be pipelined more than the transmitted data packets. It is a simple modem with all the bells and whistles that make it tick. If you select 4-byte packets, the router is overloaded. 8- and 16-byte packing must be the basic unit count for maximum effect.
efikkan said:
Neither memory bandwidth nor render backends are shortcomings of GCN. First of all, if ROPs were a bottleneck, they would have easily added more. Secondly, GCN cards have plenty of memory bandwidth vs. their Nvidia counterparts;
Radeon VII (1 TB/s) vs. RTX 2080 (448 GB/s)
Vega 64 (484 GB/s) vs. GTX 1080 (320 GB/s)
RX 580 (256 GB/s) vs. GTX 1060 (192 GB/s)
Yes, but for latency reasons not all of it can be utilised in short shaders - that all changes when a shader packing API is integrated into the Vulkan pipeline. Timothy Lottes did much on that end.
_Flare said:
1. As far as I know, AMD takes almost no hit when using ROP blending, so on that side it has nearly no problems with bandwidth; Nvidia, by contrast, loses a good chunk when using ROP blending.
2. AMD is very flexible in the number of ROPs it wires up, so they could have used 128 ROPs with Hawaii if they had wanted to; they have seen no need for that so far, not even for the Radeon VII or MI60.
3. Even Nvidia shies away from using 8 geometry engines, because the wiring would overwhelm the chip for nearly no efficiency gain.
4. Navi is GCN, and more than 4 shader arrays are forbidden in GCN.

The picture at the bottom can be switched to "avec" (with) blending. The Vegas will be over 100 for the upper 4 numbers, which is not bad at all.
https://www.hardware.fr/articles/955-7/performances-theoriques-pixels.html
It's the latest one Mr. Triolet made before his departure to AMD and later to Intel's graphics division.
  1. Yes, but that happened due to ROP backends having their own separate caches. With Vega, AMD reverts to the same in-cache ROP bandwidth amplification - that is more bandwidth for cacheable operations, but due to the common cache architecture, buffer overflows lead to cataclysmic performance loss. It is like reverting from dual cores to single-core hyperthreading.
  2. AMD is improving upon bit packing. They already had the most integrated ROP pipeline since Cayman (the 6900 series). The deprecation of ROP functions made that obsolete; now everything is done in shaders and common buffers. That led to a general high-cache, low-shader design, the same as Nvidia's. This is not much to say about efficiency since there are better alternatives to pushing simple pixels; however, as it stands, quality pixels aren't as equitable as shader effects. You just wouldn't base a pick upon, say, rapid packed math - although that is just the thing to fit 4K into a 150 W console form factor. Initiating writes takes up memory interface latency, so two writes for the instance of one is fundamentally efficient.
Posted on Reply
#15
btarunr
Editor & Senior Moderator
biffzinker said:
@btarunr, might want to include credit to this Reddit thread. They're saying you stole the ASCII diagram.
Amd/comments/braa94/_/eof8jvd
Done.
Posted on Reply
#16
rvalencia
efikkan said:
Neither memory bandwidth nor render backends are shortcomings of GCN. First of all, if ROPs were a bottleneck, they would have easily added more. Secondly, GCN cards have plenty of memory bandwidth vs. their Nvidia counterparts;
Radeon VII (1 TB/s) vs. RTX 2080 (448 GB/s)
Vega 64 (484 GB/s) vs. GTX 1080 (320 GB/s)
RX 580 (256 GB/s) vs. GTX 1060 (192 GB/s)
NVIDIA's GPUs have superior memory compression.
Posted on Reply
#17
Vayra86
my_name_is_earl said:
AMD still struggles to compete with last year's cards. It's over for AMD. It's over. I'm an Nvidia believer now.
It was over when they announced they were going to 'focus on midrange' with Polaris. The hidden message there is 'we can't keep up', and every high end release after that simply confirmed it.

They're stuck, and they have been stuck since Hawaii; it's what I've been seeing and saying ever since. Fury X was not competitive, and HBM for the gaming segment was a stopgap measure to keep GCN in the game, not something you do if you like a healthy profit margin. Vega simply didn't perform as it should have (or should have been ready for launch when Nvidia launched GP104), and VII is saved by the 7nm node; 'sorta'.

Beyond that, there is nothing to give. At the same time, they haven't got the technology/performance lead that provides the necessary time to completely revamp GCN from the ground up. Ironically, it's rather similar to Intel's current CPU roadmap. Perhaps that is part of the rationale for Intel to focus on GPU as well; perhaps they've seen you can't be leading all the time without creating new risk (stagnation).
Posted on Reply
#18
efikkan
rvalencia said:
NVIDIA's GPUs have superior memory compression.
Memory compression really only helps for sparse texture/buffer data, and while Nvidia employs more advanced compression than AMD, it doesn't account for 30-50% more effective bandwidth.
Posted on Reply
#19
mtcn77
efikkan said:
Memory compression really only helps for sparse texture/buffer data, and while Nvidia employs more advanced compression than AMD, it doesn't account for 30-50% more effective bandwidth.
AMD is band-limited for texture sampling by the pixel clock rate. They have an advantage, but not in the pixel shader pathway. Anisotropic filtering is also cache-limited, as per every 4th clock cycle, so to start every pixel from the pixel shader - it is a significant difference whether you have 2x the ROPs, like Nvidia, or not.
The other method is the compute shader: it does not throttle TMUs, but the TMU cache is still quarter rate per 16x AF and it does not work like the pixel shader. One benefit of the pixel shader is that it is fully pipelined: you don't go full Netburst-Prescott disaster; it is pipelined, all data is memory mapped and you get the usual benefits. The gain of the compute shader is that it is the non-native version of this pipeline - whether the developer can benefit from his own custom pipeline is his doing. While the pixel shader is streamlined per memory access (reads) for less latency by default, the compute shader has the benefit of write streamlining, since caches are a faster storage medium than memory. It is just a coincidence which fits the attempted end result - using caches instead of VRAM has the added benefit of cutting out the middleman OEM manufacturers at setting the memory timing parameters in their premium GPU lines; caches are a more uniform solution than custom GDDR dies.
Posted on Reply
#20
rvalencia
efikkan said:
Memory compression really only helps for sparse texture/buffer data, and while Nvidia employs more advanced compression than AMD, it doesn't account for 30-50% more effective bandwidth.
Nvidia has had robust immediate-mode tile-cached rendering since Maxwell.

Posted on Reply
#21
theoneandonlymrk
Zubasa said:
If this is true, there might be some hope for RTG.
Assuming Raja actually did his job before he left for Intel, which was to make GCN more scalable.
The Geometry limit was the Achilles heel of GCN in terms of gaming performance.

In Pixel Fill-rate the Radeon VII actually beats the 2080, but when you look at Geometry it is far behind.
The ROP count might not be the actual issue.



While I agree on the geometry comments, I think your chart shows it could still do with more ROPs, perhaps for Navi 20 though, eh.

Been thinking about that name, Navi. I got New Architecture for Vertical Integration, I'm thinking :);) it'll be modular obviously, and obviously adaptable.

The star's a coincidence.
Posted on Reply
#22
Manoa
This sucks :x Only 2500 shaders? Why?! Radeon VII gave us 4000 :x
It's a joke, the 780 Ti from a zillion years ago has 2800 :x
Posted on Reply
#23
efikkan
A chip with 2560 SPs to match RTX 2070 would require a fairly large efficiency gain, which would be appreciated, but I remain sceptical.
Posted on Reply
#24
Aquinus
Resident Wat-man
I always felt that the way Hawaii was set up was more conducive to doing GPGPU as opposed to rendering. It really had quite a lot of power, but rarely would you really see it taken advantage of. To be honest, nVidia is more like this, where each SM doesn't contain quite as many shaders as a GCN CU. If AMD has made their shaders efficient enough (and I'm feeling fairly confident that they have), then this should help mitigate some issues people have been attributing to GCN. Honestly, GCN really isn't a bad architecture. It's just how these GPUs have been designed, because raw compute power doesn't always translate to better graphics performance.

Honestly, consider for a moment that a 390 has more double-precision compute power than a 2080 Ti, which would be hilarious if not for the fact that it gets you absolutely nothing in games.
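
A rough back-of-the-envelope sketch of that double-precision comparison follows. Every spec in it is an assumption based on commonly published reference figures, not something stated in this thread: 2,560 shaders at 1.0 GHz with a 1:8 FP64 rate for the R9 390, and 4,352 shaders at a 1.545 GHz boost with a 1:32 FP64 rate for the RTX 2080 Ti. The function name is made up for illustration.

```python
def fp64_gflops(shaders: int, clock_ghz: float, fp64_ratio: float) -> float:
    """Theoretical FP64 throughput in GFLOPS.

    Assumes 2 FLOPs per shader per clock (one fused multiply-add),
    scaled by the card's FP64:FP32 rate.
    """
    return shaders * 2 * clock_ghz * fp64_ratio


# Assumed reference specs; actual clocks vary by board and workload.
r9_390 = fp64_gflops(shaders=2560, clock_ghz=1.000, fp64_ratio=1 / 8)
rtx_2080_ti = fp64_gflops(shaders=4352, clock_ghz=1.545, fp64_ratio=1 / 32)

print(f"R9 390:      {r9_390:.0f} GFLOPS FP64")
print(f"RTX 2080 Ti: {rtx_2080_ti:.0f} GFLOPS FP64")
print("390 ahead in FP64:", r9_390 > rtx_2080_ti)
```

Games run almost entirely in single precision (or lower), which is why a theoretical FP64 lead like this does not show up in frame rates.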
Posted on Reply