Monday, September 27th 2010

AMD Radeon HD 6700 Series ''Barts'' Specs Sheet Surfaces

Here is the slide we've been waiting for: the specs sheet of AMD's next-generation Radeon HD 6700 series GPUs, based on a new, radically redesigned core codenamed "Barts". The XT variant denotes the Radeon HD 6770, and Pro the HD 6750. AMD claims that the HD 6700 series packs "Twice the Horsepower" of the previous-generation HD 5700 series. Compared to the "Juniper" die behind the Radeon HD 5700 series, Barts features twice the memory bandwidth thanks to its 256-bit wide high-speed memory interface, splits key components such as the SIMD arrays into two blocks (like on Cypress), and, as we are now learning, uses a more efficient 4-D stream processor design. The HD 6770 (Barts XT) gets 1280 stream processors, and the HD 6750 (Barts Pro) 1120. Both SKUs use the full 256-bit memory bus width.

The most interesting specification here is the shader compute power. Barts XT churns out 2.3 TFLOP/s with 1280 stream processors and a GPU clock of 900 MHz, while the Radeon HD 5870 manages 2.72 TFLOP/s with 1600 stream processors at 850 MHz. So indeed the redesigned SIMD core is working its magic. Z/Stencil performance also shot up more than 100% over the Radeon HD 5700 series. Both the HD 6770 and HD 6750 will be equipped with 5 GT/s memory chips, at least on the reference-design cards. These are technically capable of running at 1250 MHz (5 GHz effective), though they are clocked at 1050 MHz (4.20 GHz effective) on the HD 6770 and 1000 MHz (4 GHz effective) on the HD 6750. Although these design changes will inevitably result in a larger die compared to Juniper, it could still be smaller than Cypress, and hence more energy-efficient.
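For readers who want to check the arithmetic behind these figures, the peak numbers follow directly from the unit counts and clocks: each VLIW stream processor contributes one multiply-add, i.e. 2 FLOPs, per clock. A quick sketch in Python, using only the specs quoted above:

```python
# Rough peak-throughput math behind the figures quoted in the article.

def peak_tflops(stream_processors: int, core_mhz: int) -> float:
    """Peak single-precision throughput in TFLOP/s (2 FLOPs per SP per clock)."""
    return stream_processors * 2 * core_mhz * 1e6 / 1e12

def bandwidth_gbs(bus_width_bits: int, effective_mhz: int) -> float:
    """Peak memory bandwidth in GB/s from bus width and effective data rate."""
    return bus_width_bits / 8 * effective_mhz * 1e6 / 1e9

print(peak_tflops(1280, 900))    # Barts XT (HD 6770)  -> ~2.30 TFLOP/s
print(peak_tflops(1600, 850))    # Cypress XT (HD 5870) -> 2.72 TFLOP/s
print(bandwidth_gbs(256, 4200))  # HD 6770, 256-bit @ 4.2 GHz effective -> 134.4 GB/s
print(bandwidth_gbs(128, 4800))  # HD 5770, 128-bit @ 4.8 GHz effective -> 76.8 GB/s
```

The same formula reproduces the HD 5870's 2.72 TFLOP/s, and the bandwidth function shows where the "twice the memory bandwidth of Juniper" claim comes from.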

Source: PCinLife

245 Comments on AMD Radeon HD 6700 Series ''Barts'' Specs Sheet Surfaces

#1
Tatty_One
Super Moderator
yogurt_21 said:
Well, 1) 5850 vs. 5870, yet you used it as a reason why the 6770 and 6870 would have the same number of ROPs. If you're going to use Cypress, you have to bring in Juniper as the comparison for Barts vs. Cayman, not Cypress Pro vs. Cypress XT. Again, we're talking mid-range to high-end, not lower high-end to higher high-end.

so the gap has to be larger between the two to make sense in pricing and market positioning.


Second, overclock a 5850 to the 5870's clocks and it'll bench just a hair lower; overclock a 5850 past a 5870 and it'll bench higher. So while shaders do help, there are plenty of them on all modern GPUs. This is exactly why far more 5850s sold than 5870s: the performance was similar but the prices were not.

Plus, with the swap from 4 simple + 1 complex to 4 moderately complex ALUs, we're likely going to see more frames per shader out of the 6k series. So if we're talking the same ROPs and more shaders, it's unlikely that Cayman would be that much better than Barts. After all, the chart shows Barts at 1280 medium-complexity shaders, which should be a stark contrast with the 320 complex and 1280 simple on Cypress XT.

If you take a look at the 5770 vs. the 5830, where both have 16 ROPs, clocks are close (with the exception of memory clock) and the memory bus width is different, but the main difference is 800 shaders vs. 1120 shaders (40% more); the difference averages out to 13% in W1zzard's reviews. Now, while I feel 256-bit vs. 128-bit accounts for at least a couple of those frames, it's more than easy enough to make up that amount with overclocking.

So if Cayman is only increasing shaders and TMUs by 50% while keeping the same ROPs, the performance won't be as scalable as the 5770 to 5870, and we'll have a 6770 capable of taking sales away from the 6870, not just in price/performance but in performance in general.

IMO it would be a bad, bad move when they have the chance to repeat the success of the 5xxx series.
Lol, I didn't use the comparison as a "reason" they should be compared to Barts etc.; I used it because in your previous post you said you found it difficult to believe that mid- and high-end cards would have the same ROP count, and my example clearly shows that is not always the case, because the 5850 and 5870 do. All that you have said does not change the fact that currently, in order for the ROP count to be increased, the memory bus must also be increased. So unless you are sure we will see some 512-bit bus versions, whichever way you want to look at it, you are going to pay a huge premium for that. One of the main reasons ATi have been so competitive price-wise recently is that they have gone for the 256-bit bus; NVIDIA's 384-bit+ bus widths cost more to produce, in PCB terms alone.
Using the comparison between the 5830 and the 5770 throws up some odd results. As well as what you have mentioned, despite having double the memory bus, the 5830 has the same ROP count as the 5770. But were you aware that, despite having double the memory bus, the 5830 is actually SLOWER in pixel fill rate than the 5770? Now, that's for a couple of reasons, but my point is that bus and ROP count are just ingredients in the overall performance. People seem to get too hung up on them; you can get to a point where too many ROPs actually strangle performance and show little improvement, where other ingredients can give a greater boost.

Now, if we do see a 512-bit bus... and I am not saying we won't... then, as you have said, there is more potential there, but with that comes a fairly large hike in prices. I have some doubts that AMD want to go down that route personally, although maybe on just the one top-end card. My point all along has simply been twofold:

1. Currently I believe there are limitations on ROP count against memory bus size; you ain't gonna get 64 ROPs on a 256-bit wide bus.
2. There are a lot more factors to overall performance than just bus size and ROP count.

Simple as that really.
Posted on Reply
#2
dalelaroy
Increased ROPs Without an Increased Memory Bus

Tatty_One said:
All that you have said does not change the fact that currently, in order for the ROP count to be increased, the memory bus must also be increased,
Note that although both Redwood and Juniper have 128-bit memory buses, Redwood has 8 ROPs versus Juniper's 16 ROPs. It would not violate the pattern for Cayman to have twice the ROPs of Barts without an increase in memory width. It would simply be applying the Evergreen 128-bit pattern, in which Redwood and Juniper are the only families sharing a bus width, to the Northern Islands 256-bit width, with Barts and Cayman being the only families that share the same bus width.
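dalelaroy's point can be checked against the publicly reported Evergreen specs (assumed accurate here): Redwood and Juniper share a 128-bit bus but differ in ROP count, so bus width alone does not fix the number of ROPs. A small sketch:

```python
# Reported Evergreen specs: bus width does not uniquely determine ROP count.
evergreen = {
    # chip:    (rops, bus_bits)
    "Redwood": (8, 128),
    "Juniper": (16, 128),
    "Cypress": (32, 256),
}

for chip, (rops, bus) in evergreen.items():
    # ROPs per 32 bits of bus width: Juniper packs twice Redwood's density.
    print(f"{chip}: {rops} ROPs on a {bus}-bit bus "
          f"({rops * 32 // bus} ROPs per 32 bits)")
```

Applying the same doubling to Northern Islands would let Cayman carry twice Barts' ROPs on the same 256-bit width, exactly as the post argues.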

I think it is more likely that Cayman will have 384-bit memory, but I also think that it might take less board real estate to simply double the ROPs per memory controller. As for the bandwidth argument, even with GF104 having less bandwidth than Cypress, it seems to have greater ROP performance. Doubling the ROPs may be overkill, but Cayman needs at least double the ROPs performance of Barts to take on GF100 in those applications where ROPs are the limitation.
Posted on Reply
#4
btarunr
Editor & Senior Moderator
Nah, no Barts launch on the 18th ± 2 days, AFAIK. Also, I'd dismiss that new "we are right, they all were wrong" specs sheet some sites are sharing as an encore of "RV770 has 480 stream processors, not 800, as rumors claimed". If Hilbert got those specs from AMD (because that article reads more like a statement of fact than an inquiry), he'd also be under NDA.

In no way am I giving credibility to the information we have; I'm just saying that, at this point, that specs sheet is not one bit more credible.
Posted on Reply
#5
bear jesus
btarunr said:
Nah, no Barts launch on 18th ± 2 days, AFAIK. Also, I'd dismiss that new "we are right, they all were wrong" specs sheet some sites are sharing as "RV770 has 480 stream processors, not 800, as rumors claimed" encore.
I have given up on trying to make sense of all the "information" on all the different tech sites, it's all lies :laugh:

To be honest, as it gets closer to release (whenever that may be), it's time to ignore all the "leaks" and just wait for AMD to say something official.
Posted on Reply
#6
meran
So, it makes more sense to build 2x Barts on one board than one huge chip, am I right or... :toast:
Posted on Reply
#7
dalelaroy
meran said:
So, it makes more sense to build 2x Barts on one board than one huge chip, am I right or... :toast:
Only from the point of view of marketing. Unless....

I still think that Barts will have 1024 shaders, with Barts XT shipping with 960 shaders active. I think yields of fully functional Barts XT will be too low to justify completely replacing Cypress Pro with Barts without a defect-tolerant design. However, the yield of defect-free Barts GPUs would be adequate for fully functional GPUs to be used in a dual-GPU product. Along the same lines of logic, there should be too few Barts GPUs with defective ROPs to justify a mass-market product like Cypress LE, but those GPUs could be salvaged for a dual-GPU product.

This could also explain the Radeon HD 6990. If Cayman XT is, like the GTX 480, a cut-down Cayman, and called the Radeon HD 6870, then, if the dual-GPU variant uses fully functional GPUs, it would make sense to call it a Radeon HD 6990 to signify that it is more than a dual Radeon HD 6870.
Posted on Reply
#8
cheezburger
wahdangun said:
are you afraid to bet? lets see who are the winner,
No, I'm not afraid of betting; it's just that I can't ignore the stupidity, that's all. AMD is not going to make a 500 mm² GPU die just to add more ALUs while keeping a 256-bit bus and 32 ROPs, when adding ALUs costs more die space. That is hard fact!

Just a question: what do you need so many shaders for if your frame rate won't increase from 200 fps to 800 fps... just to be feature-rich? Folding@home is generally irrelevant to the vast majority of high-end gamers, and no one will buy a graphics card just to run Folding@home to save mankind while it does nothing for frame rate. If mankind would die, then let them all die... simple.

I would personally rather throw 500 dollars into the water than save the human race.

Anyway, read the post below before you start thinking 32 ROPs and a 256-bit bus with a ridiculous 2560 shaders will hit the market with such a badly scaling design.
Shader die space in Cypress is 60%, a 4-D shader is 80% the size of a 5-D shader, and the SIMD controllers and TMUs take about 15%, so: 2(334 × 0.6 × 0.8) + 2(334 × 0.15) + 334 × 0.25 = 320.64 + 100.2 + 83.5 = 504.34 mm² + hard wiring = ~510 mm²

That is a huge die, and such a 510 mm² chip has only 32 ROPs???? And I don't see any reason why we'd need 640 ALUs. Folding@home?
And you expect a 510 mm² chip to use a narrow 256-bit bus?

If the shaders turn out to be 5120 (1280 ALUs), then the die size will be:

4(334 × 0.6 × 0.8) + 4(334 × 0.15) + 334 × 0.25 = 641.28 + 200.4 + 83.5 = 925.18 mm² + hard wiring = ~940 mm²...

Shaders like this are pointless if you don't have more ROPs to push them; G92 was bottlenecked by its 16 ROPs while it had 128 ALUs. And now Cayman, with 1280 ALUs but 32 ROPs... that is a big joke...

If the specification turns out to be 1920:96:64 with a 512-bit bus, the story will be vastly different from the above:

1.5(334 × 0.6 × 0.8) + 1.5(334 × 0.15) + 2(334 × 0.25) = 240.48 + 75.15 + 167 = 482.63 mm² + hard wiring = ~484 mm²

480 ALUs is what we need on the existing 40 nm node... no need to go further...
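cheezburger's die-area estimates above can be packaged as a tiny back-of-the-envelope function. To be clear, the area fractions (60% shaders, 15% TMU/SIMD control, 25% everything else, and the 0.8 VLIW4-vs-VLIW5 scale factor) are the poster's own assumptions, not confirmed figures; the code merely reproduces the arithmetic:

```python
# Back-of-the-envelope die-area model using the poster's assumed fractions.
CYPRESS_MM2 = 334.0   # Cypress die area (mm^2)
SHADER_FRAC = 0.60    # assumed fraction of the die spent on shaders
VLIW4_SCALE = 0.80    # assumed 4-D shader size relative to a 5-D shader
TMU_FRAC    = 0.15    # assumed TMU + SIMD-control fraction
OTHER_FRAC  = 0.25    # everything else: ROPs, memory I/O, etc.

def estimated_die(shader_mult: float, other_mult: float = 1.0) -> float:
    """Scale Cypress's shader/TMU area by shader_mult, the rest by other_mult."""
    return (shader_mult * CYPRESS_MM2 * SHADER_FRAC * VLIW4_SCALE
            + shader_mult * CYPRESS_MM2 * TMU_FRAC
            + other_mult * CYPRESS_MM2 * OTHER_FRAC)

print(round(estimated_die(2.0), 2))  # 2x shaders -> ~504.34 mm^2, before wiring overhead
print(round(estimated_die(4.0), 2))  # 4x shaders -> ~925.18 mm^2
```

Whether those fractions are right is exactly what dalelaroy disputes further down the thread, but at least the arithmetic itself checks out.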
yogurt_21 said:


so if caymen is only increasing shaders by 50% and tmu's while keeping the same rop's, the performance won't be as scalable as the 5770 to 5870 and we'll have a 6770 capable of taking sales away from the 6870 not just in price/performance but performance in general.

imo it would be a bad bad move when they have the chance to repeat the success of the 5xxx series.
Hard fact, but people just don't listen.

Tatty_One said:


1. Currently I beleive there are limitations on ROP count against Memory Bus size, you aint gonna get 64 ROP's on a 256bit wide bus.
2. There are a lot more factors to overall performance than just bus size and ROP count.

Simple as that really.
Of course you cannot boost performance by just adding ROPs/bus; you also can't just add ALUs without a major increase in ROPs/bus.
Posted on Reply
#9
wahdangun
cheezburger said:
No, I'm not afraid of betting; it's just that I can't ignore the stupidity, that's all. AMD is not going to make a 500 mm² GPU die just to add more ALUs while keeping a 256-bit bus and 32 ROPs, when adding ALUs costs more die space. That is hard fact!

Just a question: what do you need so many shaders for if your frame rate won't increase from 200 fps to 800 fps... just to be feature-rich? Folding@home is generally irrelevant to the vast majority of high-end gamers, and no one will buy a graphics card just to run Folding@home to save mankind while it does nothing for frame rate. If mankind would die, then let them all die... simple.

I would personally rather throw 500 dollars into the water than save the human race.

Anyway, read the post below before you start thinking 32 ROPs and a 256-bit bus with a ridiculous 2560 shaders will hit the market with such a badly scaling design.

Hard fact, but people just don't listen.

Of course you cannot boost performance by just adding ROPs/bus; you also can't just add ALUs without a major increase in ROPs/bus.
First of all, I don't give a shit about F@H. Second, we don't know for sure; it's useless to speculate right now. Just look at the HD 4870 launch: people speculated it would have 480 shaders, but in the end we got 800 shaders, more than twice the shaders of the HD 3870. And btw, maybe Cayman will only be 20% different in performance from Barts, and if this is a big GPU like NVIDIA's, ATI would like to cut costs and use 256-bit instead. And maybe that's why Barts was launched earlier: to wait for those high-speed GDDR5 chips to be ready, just like the HD 4850 was launched earlier.
Posted on Reply
#10
cheezburger
wahdangun said:
First of all, I don't give a shit about F@H. Second, we don't know for sure; it's useless to speculate right now. Just look at the HD 4870 launch: people speculated it would have 480 shaders, but in the end we got 800 shaders, more than twice the shaders of the HD 3870. And btw, maybe Cayman will only be 20% different in performance from Barts, and if this is a big GPU like NVIDIA's, ATI would like to cut costs and use 256-bit instead. And maybe that's why Barts was launched earlier: to wait for those high-speed GDDR5 chips to be ready, just like the HD 4850 was launched earlier.
You haven't answered my question: why would AMD want to make a huge-die GPU by adding more ALUs/shaders if they knew it would cost more? Why don't they just optimize their ALUs more and add ROPs/bus instead?

This is no longer speculation; this is fact! We all know shaders cost 60% of the die space in the current Evergreen design, and adding more than twice the shaders is nonsense: it makes the GPU as big as Fermi with no frame-rate gain, and such bad scaling is just plain stupid. You could add more shaders after the 3870 because RV670 only has a die size of 179 mm² versus 282 mm² for the 4870, an increase of roughly 60%, while adding extra ALU/TMU SIMD clusters. But if we extrapolate this to Cayman, it will be 534 mm² if you add ALUs the way RV770 did. You miss one thing: if Cayman is ONLY a 20% gain in performance over Barts, why would AMD bother making it, if it's only 20% over a mid-range card while having a die size of 500 mm²? A 480:96:64 part would have better scaling and frame-rate headroom than a 1280:(128)64:32 one.

Guess you don't know anything about how a GPU works. ALUs in a GPU act as program decoders and material generators, while ROPs (Raster Operations Pipeline, or Render Output Units at NVIDIA) handle material/texture loading and finalize the instructions processed by the shaders/ALUs. More ALUs don't guarantee a performance boost; in extreme cases, like highest detail/AA/AF, they keep the frame rate from dropping by a serious margin. For example, RV670 and RV770 don't see much difference at lower detail/lighting, and frame rates are mostly identical. But at extreme detail, RV770 takes the advantage because of its shaders and drops less than RV670. Still, RV670 and RV770 have little difference in pixel fill rate, apart from RV770's higher clock giving a little more fps. So if you want more frame rate, you will need more ROPs.
Posted on Reply
#11
yogurt_21
Tatty_One said:
Lol, I didn't use the comparison as a "reason" they should be compared to Barts etc.; I used it because in your previous post you said you found it difficult to believe that mid- and high-end cards would have the same ROP count, and my example clearly shows that is not always the case, because the 5850 and 5870 do. All that you have said does not change the fact that currently, in order for the ROP count to be increased, the memory bus must also be increased. So unless you are sure we will see some 512-bit bus versions, whichever way you want to look at it, you are going to pay a huge premium for that. One of the main reasons ATi have been so competitive price-wise recently is that they have gone for the 256-bit bus; NVIDIA's 384-bit+ bus widths cost more to produce, in PCB terms alone.
Using the comparison between the 5830 and the 5770 throws up some odd results. As well as what you have mentioned, despite having double the memory bus, the 5830 has the same ROP count as the 5770. But were you aware that, despite having double the memory bus, the 5830 is actually SLOWER in pixel fill rate than the 5770? Now, that's for a couple of reasons, but my point is that bus and ROP count are just ingredients in the overall performance. People seem to get too hung up on them; you can get to a point where too many ROPs actually strangle performance and show little improvement, where other ingredients can give a greater boost.

Now, if we do see a 512-bit bus... and I am not saying we won't... then, as you have said, there is more potential there, but with that comes a fairly large hike in prices. I have some doubts that AMD want to go down that route personally, although maybe on just the one top-end card. My point all along has simply been twofold:

1. Currently I believe there are limitations on ROP count against memory bus size; you ain't gonna get 64 ROPs on a 256-bit wide bus.
2. There are a lot more factors to overall performance than just bus size and ROP count.

Simple as that really.
Again, the 5850 and 5870 are in the same range. To actually separate mid from high, or high from enthusiast, ATI/AMD has given vast spec differences, in fact double in the case of 5770 > 5870 > 5970. So I think the thing you're missing here is that I consider the 5850 a high-end part, not a mid-range one. To me, mid-range spans the $100-200 price point at launch, high-end $300-500, and enthusiast $500+. If you read that correctly, Fermi has no enthusiast single-GPU part in my mind, and only enters that realm in SLI.

And again, the 5850 and the 5870 have the same config, differing only in shaders and clocks. What I referred to in my post above is that clocks make up 99% of the performance difference between the two cards, and when you match their clock speeds on the same rig, the 5870 will barely edge out the 5850. Which proves that the shader difference between the two doesn't affect performance significantly.

Now, doubling the shader count might, but not likely enough to grant as much of a performance difference as there is between the 5870 and 5770, which regardless will skew purchase decisions away from the high-end parts. Given that high-end parts already sell less than mid-range and are more expensive to manufacture, it could be a costly decision.

despite having double the memory bus it has the same ROP count as the 5770, but were you aware that, despite having double the memory bus, the 5830 is actually SLOWER in pixel fill rate than the 5770


Don't know why you posted this, as it proves my point. Since the 5770 has the same ROP/TMU/memory-width-per-shader balance as the 5870, it has a nicely scalable architecture that, as you pointed out, has a better fill rate than the 5830, despite the 5830 having 40% more shaders. So... shaders again aren't enough on their own; they need the raw horsepower of the ROPs combined with the TMUs to get the job done. And no, your conclusion based on the data is incorrect: the 5830 has a SHADER bottleneck, not a ROP/TMU one. That's why the 5770, with 40% fewer shaders and 40% fewer TMUs, can have a higher fill rate (granted, the 200 MHz memory and 50 MHz core clock advantage on the 5770 might be helping the fill rate).
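The fill-rate claim here is easy to verify from the theoretical numbers: pixel fill rate is just ROPs × core clock (one pixel per ROP per clock), so the HD 5770's higher core clock gives it the edge over the HD 5830 despite the identical ROP count. A minimal sketch:

```python
# Theoretical pixel fill rate: one pixel per ROP per clock.
def pixel_fillrate_gps(rops: int, core_mhz: int) -> float:
    """Fill rate in gigapixels per second."""
    return rops * core_mhz / 1000

print(pixel_fillrate_gps(16, 850))  # HD 5770: 16 ROPs @ 850 MHz -> 13.6 GP/s
print(pixel_fillrate_gps(16, 800))  # HD 5830: 16 ROPs @ 800 MHz -> 12.8 GP/s
```

With equal ROP counts, only the core clock differs, which is exactly why the wider 256-bit bus on the 5830 buys it nothing in raw fill rate.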

Based on what we know about ATI: though they cannot increase the ROP count per memory bit within a series, they can disable them. The second thing we know is that Cypress was essentially two separate cores on a single die, and Juniper was a single one of those cores.

It is possible that ATI/AMD already have a working core with 64 ROPs on a 256-bit bus and we're seeing half of that on Barts. Another thing to keep in mind: a few years ago 16 ROPs were the max ATI could do on a 256-bit bus, so at the time I could have argued that they couldn't put 32 ROPs on that bus width, and I would have been wrong.

Besides that, I don't care if they have to go to a 384-bit bus width with 48 ROPs: Cayman needs to increase the ROP count as well as shaders and TMUs to fit in with Barts in the lineup, otherwise Barts will be the odd man out and steal the sales.
Posted on Reply
#12
bear jesus
yogurt_21 said:

Besides that, I don't care if they have to go to a 384-bit bus width with 48 ROPs: Cayman needs to increase the ROP count as well as shaders and TMUs to fit in with Barts in the lineup, otherwise Barts will be the odd man out and steal the sales.
After learning a little more about the limitations in GPU core design, I'm kind of hoping it will be a 384-bit bus, as that looks like the best option for increasing everything without pushing the die size too far. But then again, I am just a noob when it comes to GPU chip design :laugh:
Posted on Reply
#13
dalelaroy
cheezburger said:
Not until I get his 5850 first :D then I'll trade my 9600GT to him for a GT240 for PhysX :D





I don't see any point in adding a ridiculous number of shaders on the existing 40 nm fab. Based on my previous calculation, if Cayman is double Barts, even without the ROPs/bus increase you mentioned, it will turn out like below if the spec is 2560:128:32 with a 256-bit bus.

Shader die space in Cypress is 60%, a 4-D shader is 80% the size of a 5-D shader, and the SIMD controllers and TMUs take about 15%, so: 2(334 × 0.6 × 0.8) + 2(334 × 0.15) + 334 × 0.25 = 320.64 + 100.2 + 83.5 = 504.34 mm² + hard wiring = ~510 mm²

That is a huge die, and such a 510 mm² chip has only 32 ROPs???? And I don't see any reason why we'd need 640 ALUs. Folding@home?
And you expect a 510 mm² chip to use a narrow 256-bit bus?

If the shaders turn out to be 5120 (1280 ALUs), then the die size will be:

4(334 × 0.6 × 0.8) + 4(334 × 0.15) + 334 × 0.25 = 641.28 + 200.4 + 83.5 = 925.18 mm² + hard wiring = ~940 mm²...

Shaders like this are pointless if you don't have more ROPs to push them; G92 was bottlenecked by its 16 ROPs while it had 128 ALUs. And now Cayman, with 1280 ALUs but 32 ROPs... that is a big joke...

If the specification turns out to be 1920:96:64 with a 512-bit bus, the story will be vastly different from the above:

1.5(334 × 0.6 × 0.8) + 1.5(334 × 0.15) + 2(334 × 0.25) = 240.48 + 75.15 + 167 = 482.63 mm² + hard wiring = ~484 mm²

480 ALUs is what we need on the existing 40 nm node... no need to go further...
First of all, I read an interview with an AMD engineer in which he stated that the shaders of Cypress take up 80% of the Cypress die. This was within the context of discussing SIMD pipelines, so he might have meant SIMD pipelines, which would be shaders plus TMUs plus SIMD logic, but even your 60% for shaders plus 15% for TMUs and SIMD logic does not add up to the 80% stated by this engineer. Where do you get your figures?

Second, while it is common to quote 1600 for the number of shaders in Cypress, Cypress actually has 1600 ALUs organized as 320 shaders, arranged in 20 SIMD pipelines of 16 shaders and 4 TMUs each. Each shader has 4 simple ALUs and 1 complex ALU. Barts/Cayman are supposed to have 4 moderate-complexity ALUs per shader.
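The shader/ALU bookkeeping in this paragraph can be written out explicitly; the counts below are the ones stated above (20 SIMD engines × 16 shaders × 5 ALUs), so the familiar "1600 SP" figure falls out of the VLIW5 organization:

```python
# Cypress shader organization as described in the post above.
simd_engines     = 20
shaders_per_simd = 16
alus_per_shader  = 5   # 4 simple + 1 complex (VLIW5)
tmus_per_simd    = 4

shaders = simd_engines * shaders_per_simd  # 320 shaders
alus    = shaders * alus_per_shader        # 1600 ALUs: the commonly quoted "1600 SP"
tmus    = simd_engines * tmus_per_simd     # 80 TMUs
print(shaders, alus, tmus)
```

The same bookkeeping explains why a "1280 SP" Barts with 4-wide shaders means 320 shaders as well, just with one fewer ALU in each.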

Barts/Cayman are not derivatives of Juniper or Cypress. They were designed in parallel with Evergreen by the team(s) that designed RV7xx, including RV740. The engineer that was interviewed stated that the 4 ALU per shader design of Northern Islands took up slightly less space per shader than the 4+1 ALU design of Cypress while delivering between 1.5x to 1.8x the performance per shader of Cypress. The engineer might have meant 1.5x to 1.8x the performance per ALU, deliberately using the wrong term to make things clearer to the interviewer that often mentioned the 1600 shaders of Cypress.

The Radeon HD 5830 has the same number of ROPs and memory controllers as the Radeon HD 4870/4890, and falls between the two of them in average performance despite having 1.4x the number of SIMD pipelines. Chances are that it is not the performance of the individual shaders/TMUs that is crippling Cypress, but the SIMD control logic. My guess is that the NI design team went with a 4 moderate complexity ALU design for NI to simplify the control logic, thus enabling them to achieve at least the per shader performance of RV770 while implementing double precision floating point, as well as the DX11 features. Just getting NI to RV770 level per ALU performance would have given NI 12% higher performance per shader than Cypress. And it is possible that other improvements, including higher utilization of the ALUs due to fewer of them per shader and the number of ALUs per shader being a power of two, increased performance per shader to within 95% of the 4+1 ALU shaders. Thus the 1.5x to 1.8x figure quoted.

My guess is that, since the small die size strategy was well established at the time NI was being designed, and 32nm allows for just a bit over 56% more transistors per mm2 versus 40nm, and the 4 ALU shader design is only slightly smaller than the 4+1 ALU shader design, Turks was to be 1.6x Redwood, Barts 1.6x Juniper, and Cayman 1.6x Cypress with regards to shaders/SIMD pipelines. This would make Turks 128 shaders(512 ALUs), Barts 256 shaders (1024 ALUs), and Cayman 512 shaders (2048 ALUs). When 40nm was cancelled, only Cayman had to be cut down, and this was only to keep the TDP within the limits of what was needed to produce a dual GPU "Cayman".

Bus width is primarily a function of die size, and since Barts would have had about the same die size as Juniper at 32nm, Barts would have started with a 128-bit bus. But with Barts having over 50% more core performance than Juniper, there would have been a push towards either increasing the number of ROPs per memory controller by at least 50% or increasing the memory width by 50%. If they went with the memory width solution, Barts would have had a 192-bit wide bus at 32nm. Cayman was probably not large enough for a 384-bit memory bus at 32nm, so my guess is that the number of ROPs per memory controller was increased.

If indeed the Radeon HD 2900 GT had 12 ROPs (presumably 16 total with 4 disabled), it is possible that Cayman might have had 12 ROPs per memory controller at 32 nm. Well, actually 16 ROPs per memory controller, organized as four clusters of 4 ROPs each, with one ROP cluster per memory controller serving as a spare. I estimate that, at the time the GTX 480 was introduced, approximately 14% of all Radeon HD 5850/5870 yield was being lost to defective ROP clusters. At the time the Radeon HD 5830 was introduced, this yield loss to defective ROP clusters would have been higher, thus the need to salvage a part with one ROP cluster per memory controller disabled. ATI probably anticipated similar yield problems at 32 nm, and at least wanted one spare ROP cluster per memory controller available to improve yields, so the design could have been three ROP clusters per memory controller with the third serving only as a spare. More likely, though, with the need for 50% higher ROP performance to match the 50% higher core performance, ROP clusters per memory controller were doubled, with the fourth cluster per memory controller serving as a spare.

With 32nm being cancelled and NI reimplemented at 40nm, die size grew, and there was increased perimeter on which to implement edge pads, enabling Barts to grow from 192-bits to 256-bits, and perhaps Cayman can now be 384-bit instead of 256-bit. If not however, I do expect Cayman to have at least 50% more ROPs per memory controller.
Posted on Reply
#14
cadaveca
My name is Dave
dalelaroy, I gotta agree with your thoughts about control logic. Given that NVIDIA has now said this exact thing is what went wrong with Fermi in development, and given Huang's explanation, I feel it's safe to say this is definitely a sore spot for the 40nm process. Also, AMD has previously mentioned that the dispatch processor would get a serious revamp.
Posted on Reply
#15
wahdangun
cheezburger said:
You haven't answered my question: why would AMD want to make a huge-die GPU by adding more ALUs/shaders if they knew it would cost more? Why don't they just optimize their ALUs more and add ROPs/bus instead?

This is no longer speculation; this is fact! We all know shaders cost 60% of the die space in the current Evergreen design, and adding more than twice the shaders is nonsense: it makes the GPU as big as Fermi with no frame-rate gain, and such bad scaling is just plain stupid. You could add more shaders after the 3870 because RV670 only has a die size of 179 mm² versus 282 mm² for the 4870, an increase of roughly 60%, while adding extra ALU/TMU SIMD clusters. But if we extrapolate this to Cayman, it will be 534 mm² if you add ALUs the way RV770 did. You miss one thing: if Cayman is ONLY a 20% gain in performance over Barts, why would AMD bother making it, if it's only 20% over a mid-range card while having a die size of 500 mm²? A 480:96:64 part would have better scaling and frame-rate headroom than a 1280:(128)64:32 one.

Guess you don't know anything about how a GPU works. ALUs in a GPU act as program decoders and material generators, while ROPs (Raster Operations Pipeline, or Render Output Units at NVIDIA) handle material/texture loading and finalize the instructions processed by the shaders/ALUs. More ALUs don't guarantee a performance boost; in extreme cases, like highest detail/AA/AF, they keep the frame rate from dropping by a serious margin. For example, RV670 and RV770 don't see much difference at lower detail/lighting, and frame rates are mostly identical. But at extreme detail, RV770 takes the advantage because of its shaders and drops less than RV670. Still, RV670 and RV770 have little difference in pixel fill rate, apart from RV770's higher clock giving a little more fps. So if you want more frame rate, you will need more ROPs.
Sorry, I don't know how to design a GPU; I'm just saying it because of the correlation between each GPU design.
Posted on Reply
#16
bear jesus
I have to admit all this is getting so confusing. I wish AMD would hurry up and start telling us something official about the cards.
Posted on Reply
#17
jasper1605
bear jesus said:
I have to admit all this is getting so confusing. I wish AMD would hurry up and start telling us something official about the cards.
Amen to that! For someone who doesn't understand ultra-tech lingo to begin with, reading conflicting views on ROPs, SIMD lanes, ALUs, MEOW (just for kix), it gets very confusing :(
Posted on Reply
#18
bear jesus
jasper1605 said:
Amen to that! For someone who doesn't understand ultra-tech lingo to begin with, reading conflicting views on ROPs, SIMD lanes, ALUs, MEOW (just for kix), it gets very confusing :(
:laugh:
I have almost given up on trying to understand all this. Although I admit it was a good excuse to read up on GPU design, really I'm only interested in how powerful a card is and how that translates into high fps at high resolution and detail, within a reasonable cost.

I damn AMD for being so quiet about it all. I guess all we can do is wait for the release, as I'm not expecting much official information before then; hopefully AMD has a nice surprise for us all.
Posted on Reply
#19
Tatty_One
Super Moderator
yogurt_21 said:
It is possible that ATI/AMD already have a working core with 64 ROPs on a 256-bit bus and we're seeing half of that on Barts. Another thing to keep in mind is that a few years ago 16 ROPs were the max ATI could do on a 256-bit bus, so at the time I could have argued that they couldn't put 32 ROPs on that bus width, and I would have been wrong.

Besides, I don't care if they have to go to a 384-bit bus width with 48 ROPs; Cayman needs to increase its ROP count as well as shaders and TMUs to fit in above Barts in the lineup, otherwise Barts will be the odd man out and steal the sales.
We could disagree over individual points on this all day.... as it seems we are, and to be honest, I have lost the will to live! So I will just reiterate my original point, which instigated this lengthy discussion, not just with you but with one or two others: current architecture prohibits more than 32 ROPs on a 256-bit memory bus. Not being an engineer, I don't know whether that's because it's technically impossible (because of the interlinked technology) or just totally impractical, which is precisely why NVIDIA have had to raise said bus to 384-bit to fit more ROPs on. Don't you, or anyone else, think that if 64 ROPs could be linked to a cheaper 256-bit bus without too much grief, manufacturers would adopt that higher-performance, lower-cost option (assuming the cost would be lower, as no additional PCB layers would need to be added)? I am not saying it is impossible; I am saying that both AMD's and NVIDIA's architectures, and the relationship between their memory controllers and ROPs, strongly suggest to me that this will not happen.

As I said earlier, I am quite prepared to stand up and proclaim I am wrong if more than 32 appear on a 256-bit bus. I don't, and have never, argued against the benefits of a wider bus with a greater ROP count, just the point that there are many more elements to performance than that. If the 5870/5850 only show it to a small degree, that is probably simply because, in retail, AMD's easiest and cheapest option is just to raise core clocks; I am sure that if they wanted to they could have increased performance some more without increasing the bus width/ROP count.... but why would they want to, given the cards' positioning? I simply think that Cayman may well have more ROPs than 32; I just don't think that they will be on a 256-bit bus :) Just my thoughts and opinions.
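To illustrate the memory controller/ROP relationship with known parts, here's a quick sketch using public reference specs; note that grouping the bus into 64-bit channels is the usual assumption about how these chips are organized, not something either vendor spells out in a spec sheet:

```python
# ROPs per 64-bit memory channel on some known GPUs (reference specs).
# A 256-bit bus is four 64-bit channels; a 384-bit bus is six.
cards = {
    "HD 4870 (RV770)":   (16, 256),
    "HD 5870 (Cypress)": (32, 256),
    "GTX 480 (GF100)":   (48, 384),
}

for name, (rops, bus_bits) in cards.items():
    channels = bus_bits // 64
    print(f"{name}: {rops // channels} ROPs per 64-bit channel")

# No shipping part exceeds 8 ROPs per channel; 64 ROPs on a 256-bit bus
# would mean 16 per channel, double anything either vendor has built.
```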
Posted on Reply
#20
yogurt_21
Tatty_One said:
We could disagree over individual points on this all day.... as it seems we are, and to be honest, I have lost the will to live! So I will just reiterate my original point, which instigated this lengthy discussion, not just with you but with one or two others: current architecture prohibits more than 32 ROPs on a 256-bit memory bus. Not being an engineer, I don't know whether that's because it's technically impossible (because of the interlinked technology) or just totally impractical, which is precisely why NVIDIA have had to raise said bus to 384-bit to fit more ROPs on. Don't you, or anyone else, think that if 64 ROPs could be linked to a cheaper 256-bit bus without too much grief, manufacturers would adopt that higher-performance, lower-cost option (assuming the cost would be lower, as no additional PCB layers would need to be added)? I am not saying it is impossible; I am saying that both AMD's and NVIDIA's architectures, and the relationship between their memory controllers and ROPs, strongly suggest to me that this will not happen.

As I said earlier, I am quite prepared to stand up and proclaim I am wrong if more than 32 appear on a 256-bit bus. I don't, and have never, argued against the benefits of a wider bus with a greater ROP count, just the point that there are many more elements to performance than that. If the 5870/5850 only show it to a small degree, that is probably simply because, in retail, AMD's easiest and cheapest option is just to raise core clocks; I am sure that if they wanted to they could have increased performance some more without increasing the bus width/ROP count.... but why would they want to, given the cards' positioning? I simply think that Cayman may well have more ROPs than 32; I just don't think that they will be on a 256-bit bus :) Just my thoughts and opinions.
As always, none of us are engineers, so it's all speculation (and if there is an AMD engineer watching this thread, wtf? get back to work!). We'll see how it comes out; they could very well prove us all wrong and show such a strong improvement in shader power that we start seeing NVIDIA-style shader counts for all we know. lol
Posted on Reply
#21
cheezburger
dalelaroy said:
First of all, I read an interview with an AMD engineer in which he stated that the shaders of Cypress take up 80% of the Cypress die. This was within the context of discussing SIMD pipelines, so he might have meant SIMD pipelines, which would be shaders plus TMUs plus SIMD logic, but even your 60% for shaders plus 15% for TMUs and SIMD logic does not add up to the 80% stated by this engineer. Where do you get your figures?

Second, while it is common to quote 1600 for the number of shaders in Cypress, Cypress actually has 1600 ALUs organized as 320 shaders, that are arranged in 20 SIMD pipelines having 16 shaders and 4 TMUs each. Each shader has 4 simple ALUs and 1 complex ALU. Barts/Cayman is supposed to have 4 moderate complexity ALUs per shader.

Barts/Cayman are not derivatives of Juniper or Cypress. They were designed in parallel with Evergreen by the team(s) that designed RV7xx, including RV740. The engineer that was interviewed stated that the 4 ALU per shader design of Northern Islands took up slightly less space per shader than the 4+1 ALU design of Cypress while delivering between 1.5x to 1.8x the performance per shader of Cypress. The engineer might have meant 1.5x to 1.8x the performance per ALU, deliberately using the wrong term to make things clearer to the interviewer that often mentioned the 1600 shaders of Cypress.

The Radeon HD 5830 has the same number of ROPs and memory controllers as the Radeon HD 4870/4890, and falls between the two of them in average performance despite having 1.4x the number of SIMD pipelines. Chances are that it is not the performance of the individual shaders/TMUs that is crippling Cypress, but the SIMD control logic. My guess is that the NI design team went with a 4 moderate complexity ALU design for NI to simplify the control logic, thus enabling them to achieve at least the per shader performance of RV770 while implementing double precision floating point, as well as the DX11 features. Just getting NI to RV770 level per ALU performance would have given NI 12% higher performance per shader than Cypress. And it is possible that other improvements, including higher utilization of the ALUs due to fewer of them per shader and the number of ALUs per shader being a power of two, increased performance per shader to within 95% of the 4+1 ALU shaders. Thus the 1.5x to 1.8x figure quoted.

My guess is that, since the small die size strategy was well established at the time NI was being designed, and 32nm allows for just a bit over 56% more transistors per mm2 versus 40nm, and the 4 ALU shader design is only slightly smaller than the 4+1 ALU shader design, Turks was to be 1.6x Redwood, Barts 1.6x Juniper, and Cayman 1.6x Cypress with regards to shaders/SIMD pipelines. This would make Turks 128 shaders (512 ALUs), Barts 256 shaders (1024 ALUs), and Cayman 512 shaders (2048 ALUs). When 32nm was cancelled, only Cayman had to be cut down, and this was only to keep the TDP within the limits of what was needed to produce a dual GPU "Cayman".

Bus width is primarily a function of die size, and since Barts would have had about the same die size as Juniper at 32nm, Barts would have started with a 128-bit bus. But with Barts having over 50% more core performance than Juniper, there would have been a push towards either increasing the number of ROPs per memory controller by at least 50% or increasing the memory width by 50%. If they went with the memory width solution, Barts would have had a 192-bit wide bus at 32nm. Cayman was probably not large enough for a 384-bit memory bus at 32nm, so my guess is that the number of ROPs per memory controller was increased.

If indeed the Radeon HD 2900 GT had 12 ROPs (presumably 16 total with 4 disabled), it is possible that Cayman might have had 12 ROPs per memory controller at 32nm. Well, actually 16 ROPs per memory controller, organized as four clusters of 4 ROPs each, with one ROP cluster per memory controller serving as a spare. I estimate that, at the time the GTX 480 was introduced, approximately 14% of all Radeon HD 5850/5870 yield was being lost to defective ROP clusters. At the time the Radeon HD 5830 was introduced, this yield loss to defective ROP clusters would have been higher, thus the need to salvage a part with one ROP cluster per memory controller disabled. ATI probably anticipated similar yield problems at 32nm, and at least wanted one spare ROP cluster per memory controller available to improve yields, so the design could have been three ROP clusters per memory controller with the third serving only as a spare; but more likely, with the need for 50% higher ROP performance to match the 50% higher core performance, ROP clusters per memory controller were doubled, with the fourth ROP cluster per memory controller serving as a spare.

With 32nm being cancelled and NI reimplemented at 40nm, die size grew, and there was increased perimeter on which to implement edge pads, enabling Barts to grow from 192-bits to 256-bits, and perhaps Cayman can now be 384-bit instead of 256-bit. If not however, I do expect Cayman to have at least 50% more ROPs per memory controller.
That 80% already includes the TMUs/SIMD controllers. Consider that AMD's architecture ties the shaders/ALUs up with the TMUs/SIMD control in the same module, while it separates the ROPs and bus into another section. So basically my calculation is close to it.

The HD 2900 GT was indeed 16 total with 4 disabled; its die size and yield were completely identical to the XT/Pro versions. However, like the 5830, its bad scaling ended up generating more heat and far less performance than expected. Any cut-down version that goes to 3/4, or to odd numbers like Fermi, causes bad scaling and performance loss. Especially with AMD's bus design, it is impossible to go with a 6/12 configuration rather than 8/16; their SIMD clusters and instruction pipeline prevent it. So logically it will either stay the same or double. A 40 ROP/320-bit or 48 ROP/384-bit bus will not be possible in AMD's lineup, at least not in this generation.
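For reference, the peak-FLOPS figures everyone is quoting come straight from ALUs x 2 ops (multiply-add) x clock. A quick sketch with the article's numbers; keep in mind the VLIW4-vs-VLIW5 split is still the rumored design discussed above, not a confirmed spec:

```python
# Peak single-precision throughput: ALUs x 2 FLOPs (multiply-add) x clock.
def peak_tflops(alus, clock_mhz):
    return alus * 2 * clock_mhz / 1e6

hd5870 = peak_tflops(1600, 850)    # Cypress (4+1 VLIW5): 2.72 TFLOP/s
barts_xt = peak_tflops(1280, 900)  # rumored Barts XT (VLIW4): ~2.30 TFLOP/s

# On paper the per-ALU rate is identical; the claimed win for the 4-ALU
# design is utilization -- four symmetric ALUs are easier to keep fed
# than the 4-simple-plus-1-complex arrangement.
print(f"HD 5870: {hd5870:.2f} TFLOP/s, Barts XT: {barts_xt:.2f} TFLOP/s")
```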
Posted on Reply
#22
Wile E
Power User
TheMailMan78 said:
It's just that you have bad cards. I ran Crossfire with 4850s for a very long time without issue.
I've run single 4850, single 4870, crossfire 4850's, 4870+4850, crossfire 4870, 4870x2 + 4870, and finally just 4870X2.

Bugs in every single release past 8.10. Even on completely clean OS installs.
Posted on Reply
#23
bear jesus
Wile E said:
I've run single 4850, single 4870, crossfire 4850's, 4870+4850, crossfire 4870, 4870x2 + 4870, and finally just 4870X2.

Bugs in every single release past 8.10. Even on completely clean OS installs.
To be honest, I'm sure one major reason why some people seem to have bugs and others don't is mainly down to different hardware/OS setups, and also different choices in games.
Posted on Reply
#24
Widjaja
Wile E said:
I've run single 4850, single 4870, crossfire 4850's, 4870+4850, crossfire 4870, 4870x2 + 4870, and finally just 4870X2.

Bugs in every single release past 8.10. Even on completely clean OS installs.
Bugs?

If there are, I have not noticed any with my HD 4850.
Posted on Reply
#25
mdsx1950
Even my 5970s seem to be running without any driver problems. Currently running 10.8.
Maybe it's because I haven't OCed the cards and just left them the way they are.
Posted on Reply
Add your own comment