
R870 in two months?

I've heard rumours that the R870 will somehow pack 2,000 shaders into a 40nm core. I'd piss my pants with joy if this were true.

http://www.nordichardware.com/news,7766.html

Though the article is old, I find it believable; the RV670 had 320SPs and the RV770 has 800. So they more than doubled the SP count and it basically (along with other architectural tweaks) doubled the card's performance.

That said, I don't think I'll be upgrading for a while.
 
There are some consistent figures there with the memory bandwidth and TFLOPS. Something else to consider: with the die shrink, and only upping the shader count a small amount, they could most likely take a big leap in clock speed to get the same performance spike, like how the 4830 can be clocked up to 4850 levels. Also, the 2,000-shader estimates have been around since before the 4xxx series was released; a lot of people seem to think they're talking about the 5870 X2 in that case.
 
...which anyone who can operate a pocket calculator could fabricate.
 
One thousand SPs isn't even possible, so you can throw that idea out of the window unless ATi fancy an overhaul... I think 1200 or 1600 is more likely. Anything very close to 1000 wouldn't offer a significant enough advantage over the mature and cheaper 800-SP parts... Just look at 4830 performance; it's pretty damn close to the 4850...

1000 would not be a bad idea if ATi decided to totally change the arch and copied NVIDIA in using scalar instead of superscalar; that would make their 1000 SPs pack a much heavier punch. That, and if they unified them... doesn't ATi still use programmed SPs, like 64 vertex / 128 geometry, or did they stop that?
 
1000 would not be a bad idea if ATi decided to totally change the arch and copied NVIDIA in using scalar instead of superscalar; that would make their 1000 SPs pack a much heavier punch.
No reason to go back to a scalar microarchitecture. ATi's superscalars are clearly more scalable (in silicon and in performance) and much, much cheaper on transistor budget.

That, and if they unified them... doesn't ATi still use programmed SPs, like 64 vertex / 128 geometry, or did they stop that?
R600/RV670/RV770 support SM4.0 thus they are programmable by definition...
 
No reason to go back to scalar SIMDs. ATi's superscalars are clearly more scalable and cheap on transistor budget.


R600/RV670/RV770 support SM4.0 thus they are programmable by definition...

What I mean is, NVIDIA's unified shaders do not have a predefined function; they process whatever needs to be processed. But I know you already know that. What I mean is: are AMD's universal, or are they still like the old cards, where certain clusters have a certain function?
 
What I mean is: are AMD's universal, or are they still like the old cards, where certain clusters have a certain function?

They are unified; you can program them to process pixel/vertex/geometry/HPC work... which is what "fully programmable" means.

...again, parallelism depends on what role those "stream processors" really play. For an RV770:

10 shader blocks (10 [SIMDs] * 16 [Pipes] * 5 [ALUs]) = 800 SPs
 
In fact, nV's SPs are more fixed-function in nature.
Each nV SP ever since G80 has been MADD+MUL, while each AMD SIMD "SP" is 5 MADD.
MADD = multiply + add function ("dual issue")
MUL = multiply function
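Putting rough numbers on those issue rates, a quick sketch assuming the reference shader clocks (1296 MHz for the GTX 280, 750 MHz for the HD 4870):

    # Peak single-precision throughput implied by the per-SP issue rates above.
    # NVIDIA SP: MADD (2 flops) + MUL (1 flop) = 3 flops per clock.
    # AMD "SP" (one ALU of a 5-wide unit): MADD = 2 flops per clock.
    def peak_gflops(sp_count, flops_per_clock, shader_clock_mhz):
        return sp_count * flops_per_clock * shader_clock_mhz / 1000.0

    print(peak_gflops(240, 3, 1296))  # GTX 280 -> ~933 GFLOPS
    print(peak_gflops(800, 2, 750))   # HD 4870 -> 1200 GFLOPS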

For an RV770:

10 shader blocks (10 [SIMDs] * 16 [Pipes] * 5 [ALUs]) = 800 SPs
Pipes?
AMD's ROPs aren't linked to the shaders.
There's just 160 5-way SIMDs.
 
The whole point of using GDDR5 is that it doesn't need a 512-bit-wide bus. I run my PowerColor 4870 PCS+ 1GB @ 840/1000 (4 GHz effective GDDR5), which gives me 128 GB/s of bandwidth, not far off what a GTX 280 gets from GDDR3 on a 512-bit bus. GDDR5 is more expensive, but the 256-bit bus (and simpler wiring) makes up for the difference while boosting performance.
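For reference, the bandwidth arithmetic behind those figures, as a quick sketch (the GTX 280 line assumes its stock 1107 MHz GDDR3, roughly 2214 MT/s effective):

    # Memory bandwidth = (bus width in bytes) x (effective data rate per pin).
    def bandwidth_gb_s(bus_bits, effective_mt_s):
        return bus_bits / 8 * effective_mt_s / 1000.0

    print(bandwidth_gb_s(256, 4000))  # 4870 PCS+ @ 1000 MHz GDDR5 (4 GT/s)  -> 128 GB/s
    print(bandwidth_gb_s(512, 2214))  # GTX 280 @ 1107 MHz GDDR3 (~2.2 GT/s) -> ~141.7 GB/s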


Yet benchmarks don't show anything in the way of how special the use of GDDR5 should be, especially at the rated speeds.

And it seems to be the case that larger-bus GPUs end up with smaller drops in minimum frame rates, as opposed to faster memory that processes things quicker but then tanks under pressure.

I've heard rumours that the R870 will somehow pack 2,000 shaders into a 40nm core. I'd piss my pants with joy if this were true.

http://www.nordichardware.com/news,7766.html

Though the article is old, I find it believable; the RV670 had 320SPs and the RV770 has 800. So they more than doubled the SP count and it basically (along with other architectural tweaks) doubled the card's performance.

That said, I don't think I'll be upgrading for a while.


Yeah, but would 2,000 shaders do any real good? Would it stop the cards from taking such a performance hit when they are being taxed? I don't think any amount of shader units, core frequency or memory speed is going to enable GPUs to withstand the punishment that even current 3D applications can dish out full-fledged. The only thing these ever-growing monsters are going to do is slow the rate at which they start to lose performance, but it will never be enough, because applications will keep getting more demanding. Eventually something has to give.

Ultimately we're looking for gaming GPUs that can run real-time vertex drawing while in motion. Unfortunately we could be several years away from seeing that, given how much of an architectural change would have to take place, and in the meantime they'll keep pumping out these monstrosities, as if they were rival factions in an arms race.
 
ahh thanks for clearing it up.
 
Pipes?
AMD's ROPs aren't linked to the shaders.
There's just 160 5-way SIMDs.

Not ROP, the line that connects dispatch to the shader export you see in the R600 schematic:
[R600 block diagram: block-r600.gif]


You don't just have 160 SIMDs, you have groups of 16/each that share a route. I just lost the term =)

RV770: 16 x 10 = 160, 160 x 5 = 800
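Spelling that counting out, a small sketch of the RV770 shader hierarchy as described in the posts above ("unit" here just stands in for whatever the lost term was):

    # RV770: 10 SIMD cores, each with 16 stream-processing units ("pipes"),
    # each unit packing 5 ALUs -> 800 "stream processors" in marketing terms.
    simd_cores = 10
    units_per_core = 16
    alus_per_unit = 5

    five_way_units = simd_cores * units_per_core        # 160
    stream_processors = five_way_units * alus_per_unit  # 800
    print(five_way_units, stream_processors)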
 
I'm quite happy with the R6xx performance (e.g. the 4870), but not happy with the R6xx power consumption and heat output. If R7xx can get more performance at lower power (or even similar performance at much lower power, viz. an HD5830 series) then I'm very, very interested.

I want stealthy silent performance.
 
I'm quite happy with the R6xx performance (e.g. the 4870), but not happy with the R6xx power consumption and heat output. If R7xx can get more performance at lower power (or even similar performance at much lower power, viz. an HD5830 series) then I'm very, very interested.

I want stealthy silent performance.
Welcome to 6 months ago... :D
RV770, and not R6__, is HD48_0 - and it did all that.
 
This is ridiculous. My 4870 x2 is fine for a year or 2.
 
I doubt NVidia has those sentiments :P
 
Yet benchmarks don't show anything in the way of how special the use of GDDR5 should be, especially at the rated speeds.

And it seems to be the case that larger-bus GPUs end up with smaller drops in minimum frame rates, as opposed to faster memory that processes things quicker but then tanks under pressure.

You don't seem to see the key point here - it's all memory bandwidth! You could have a 1-bit bus and DDR that does 512 times the data rate, and it would be just as good (and the same as far as the GPU knows) as a 512-bit bus with non-DDR... There is nothing increasing the bus width can do, that increasing the RAM speed cannot!
 
Yep.
Memory "bus width" is nothing but one of the several factors that dictate the total memory bandwidth. One cannot say a 512bit bus inherently performs better than a 256bit bus. Or a bus of any given width, for the record...
 
You don't seem to see the key point here - it's all memory bandwidth! You could have a 1-bit bus and DDR that does 512 times the data rate, and it would be just as good (and the same as far as the GPU knows) as a 512-bit bus with non-DDR... There is nothing increasing the bus width can do, that increasing the RAM speed cannot!

Yep.
Memory "bus width" is nothing but one of the several factors that dictate the total memory bandwidth. One cannot say a 512bit bus inherently performs better than a 256bit bus. Or a bus of any given width, for the record...

Even though GPUs benefit more from bandwidth than other types of processors do, bandwidth is not all that matters.

Actually, from a pure performance POV, a bigger bus is better than a smaller one at double the speed. The reasons are various, but the most notable are these:

- Reduced latencies: yes, latencies are related to the actual speed, and usually the end access time is smaller for the faster memory; i.e. 4T @ 500 MHz is actually higher latency than 7T @ 1000 MHz. BUT for the same chunk of data to be sent you need twice the clock cycles, so even though the access time to memory is smaller, the access time to the same amount of required data is bigger. Usually memory accesses are buffered, which reduces this effect, but if you have enough context changes (either you want to use a different instruction or a different data type, or simply data that sits in a different place in memory) within short time periods, the sum of the latencies ends up being much, much higher. (This is the main reason NVIDIA wants 512 bits, probably even with GDDR5; it is very important for CUDA.)

- Advantage with longer instruction words: on a bigger bus you can send more, and bigger, instructions in the same clock. This is also inevitably tied to latencies, because when the instruction word is bigger than your bus width you obviously need more than one cycle to send the instruction. This can be very true for the R6xx/7xx/8xx architecture because of its VLIW (very long instruction word) nature. In order to achieve high shader utilisation, the same long instruction must contain instructions for all 5 units, and:

256 bit / 5 ALUs = 51.2 bits per ALU

Even though I don't know how long instructions are in GPUs, you can clearly see that a 256-bit bus is not enough to send five 64-bit operands, let alone ten if every instruction requires two operands. It is simply easier to fit more big things into bigger buses, so the efficiency is higher.
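A quick sketch of the two calculations above, using only the figures quoted in the post (the operand widths are illustrative, not actual R6xx instruction encodings):

    # 1) Absolute access latency = cycles / clock; the faster memory can need
    #    more latency cycles yet still have lower absolute latency.
    def latency_ns(cycles, clock_mhz):
        return cycles / clock_mhz * 1000.0

    print(latency_ns(4, 500))   # 4T @ 500 MHz  -> 8.0 ns
    print(latency_ns(7, 1000))  # 7T @ 1000 MHz -> 7.0 ns

    # 2) Bus bits available per ALU in a single transfer on a 5-wide VLIW design.
    print(256 / 5)              # 51.2 bits/ALU -- short of five 64-bit operands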
 
DarkMatter,
You ignored the fact that the narrow bus (GDDR5) is much, much faster than the wide bus (GDDR3).
Absolute latency (= time) is what matters, numerical amount of latency cycles is irrelevant.

for the same chunk of data to be sent you need twice the clock cycles
Correct. It's a given that one needs more cycles on the "narrow bus" BUT it doesn't matter as the length of the clock cycle of the fast but narrow bus is half of that on the slow but wide bus.
 
a bigger bus is better than smaller one at double the speed

You can't make that assumption when you realize that a bus at double the speed can handle twice the number of requests, with twice the granularity of the bus. Think of it like this: a bus is nothing but a road. A 512-bit-wide bus has 512 lanes, while a 256-bit-wide bus has 256 lanes. The narrower road has traffic that moves at twice the speed. There are always some gaps in the traffic, and the narrower road will present gaps open for traffic twice as often as the wider one.
 
You can't make that assumption when you realize that a bus at double the speed can handle twice the number of requests, with twice the granularity of the bus. Think of it like this: a bus is nothing but a road. A 512-bit-wide bus has 512 lanes, while a 256-bit-wide bus has 256 lanes. The narrower road has traffic that moves at twice the speed. There are always some gaps in the traffic, and the narrower road will present gaps open for traffic twice as often as the wider one.

It's a give-or-take between that and what I said. A wider bus does have finer granularity, depending on what you understand by granularity in this case.

Following your analogy, one of the things I said can be represented as you needing 192 lanes available at the same time to fit a big vehicle in. The 512-lane road will have those lanes available way more often. On the 256-lane road you will probably have to stall the traffic in order to fit your 192-wide vehicle. The final performance depends on how many times that happens. Twice the requests only matters for isolated or random SMALL chunks of data, which are scarce in graphics, and much scarcer on the R6xx/7xx/8xx series of cards.

Additionally, you have free gaps more often, but each of them is smaller. Over time the total amount of gap is the same, which is what matters in the end. The narrower road does have an advantage for isolated accesses, but for that to really be an advantage your acceleration has to be faster (lower latency). Without fast acceleration that lets you get onto the road quickly, the gaps have to be bigger for you to be able to merge into the lanes (bad analogy, probably). Latency is almost always comparatively higher on faster memories.

If you want another key factor that favors wider buses (and that was also behind my claim), power consumption is one. When circuits run close to their clock limits, power consumption (and heat, and current leakage, and electromigration, and probably many other things I can't think of right now) grows exponentially. Increasing bus width, on the other hand, increases it almost linearly.

Anyway, my assumptions are not really assumptions, because they are based on a study I read some years ago that favored a 256-bit-wide bus over a faster 128-bit one on the graphics cards of that time, with actual empirical testing.

Of course my claim is not true for ALL buses and all implementations, but it is true in the case of graphics cards. I say this because maybe your problem was that my claim looked like I was saying a wider bus is always better, which is not what I wanted to say.

DarkMatter,
You ignored the fact that the narrow bus (GDDR5) is much, much faster than the wide bus (GDDR3).
Absolute latency (= time) is what matters, numerical amount of latency cycles is irrelevant.

Actually, no. Imagine you have to access 2048 bits of data with both 256- and 512-bit buses. It will take 4 cycles on the 512-bit one and 8 cycles on the 256-bit one. Now imagine a typical situation where the faster memory has a higher number of latency cycles but runs at twice the speed. Imagine that translates to 4 ns (256-bit) and 5 ns (512-bit) latencies (this is typical, like CAS 5 DDR2 versus CAS 8 DDR3, for example). That would translate to the 256-bit bus having 4*8 = 32 ns of accumulated latency versus 5*4 = 20 ns on the wider bus. Of course this is a worst-case scenario for the 256-bit one, because it implies that both have to access the memory every cycle and both find all the data they need in every cycle. In buffered situations the issue mentioned above loses importance, but it's still present to some extent, making the wider bus inherently better in that respect.

There's another situation, the one relevant to what Scyphe said, and that's when the buses have to access tons of small chunks of data. In this situation the slower, wider bus is at a disadvantage, because availability of the bus is much more important than the amount of data it can carry. But this situation is extremely rare with buffered memories, and even more so on graphics cards, where data is usually big and coherent with its surroundings. For example, vertices will have X, Y, Z components and pixels will have R, G, B, A.

The end result is a mix of those two extremes, between the need for more space and the need for more availability. Statistics says it is easier to fit (you can fit more) big things into big containers. So when the data chunks are big enough, and in graphics they are, a wider bus is better.

@ both: I never said it is much better, anyway. I would say the difference is within 5%, but it IS essentially and statistically better.
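As a sketch, the worst-case accounting in the 2048-bit example above; it simply reproduces the post's own assumption that the access latency is paid on every bus cycle:

    # Worst case from the example above: the access latency is charged on every
    # bus cycle needed to move the whole 2048-bit chunk.
    def worst_case_latency_ns(total_bits, bus_bits, access_latency_ns):
        cycles = total_bits // bus_bits
        return cycles * access_latency_ns

    print(worst_case_latency_ns(2048, 256, 4))  # 8 cycles * 4 ns = 32 ns
    print(worst_case_latency_ns(2048, 512, 5))  # 4 cycles * 5 ns = 20 ns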
 
EDIT: OK. Even though the above is very true in theory, I have to take that back in the case of 256-bit vs 512-bit now that I have done some calculations with the most common operand widths I know of. The best example to explain what I said above, and what changed my mind about 256 vs 512 afterwards, is this:

You want to transport 48-bit words.

- On a 128-bit bus you can carry 128/48 = 2.66, that is, 2 of them.

- On a 256-bit bus you can carry 256/48 = 5.33, so 5 of them. As you can see, even if the 128-bit bus ran at 2x the speed the advantage for the 256-bit bus is clear: 4 vs 5, 20% slower.

- On a 512-bit bus you can carry 512/48 = 10.66; a 256-bit bus at 2x the speed can carry just the same number of them, with the added benefits of a faster bus.

Now, 48-bit words are common, but I don't know of any word or operand type that would lead to the same situation for 512-bit. I actually don't know how the instruction sets in GPUs are laid out, but 48-bit-wide data is very common. Unless 96-bit data/instruction words are common nowadays, a 512-bit-wide bus does not benefit from the theory I have been presenting. I admit I was defeated on this particular case.
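The packing argument from that edit, as a quick sketch (48-bit words are the post's example; whether real GPU operands line up this way is exactly the open question raised above):

    # Whole words that fit in one bus transfer, scaled by the relative clock.
    def words_per_transfer(bus_bits, word_bits):
        return bus_bits // word_bits

    WORD = 48
    for bus_bits, relative_clock in [(128, 2), (256, 1), (256, 2), (512, 1)]:
        rate = words_per_transfer(bus_bits, WORD) * relative_clock
        print(bus_bits, "bit @", relative_clock, "x clock ->", rate, "words per base cycle")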
 