Monday, November 22nd 2010

AMD Cayman, Antilles Specifications Surface

At last, specifications of AMD's elusive Radeon HD 6970 and Radeon HD 6990 graphics accelerators made it to the internet, with slides exposing details such as stream processor count. The Radeon HD 6970 is based on a new 40 nm GPU by AMD, codenamed "Cayman". The dual-GPU accelerator being designed using two Cayman GPUs is codenamed "Antilles", and carries the product name Radeon HD 6990.

Cayman packs 1920 stream processors spread across 30 SIMD engines, which points to the 4D stream processor architecture, and delivers single-precision computational power of 3 TFLOPs. It packs 96 TMUs, 128 Z/Stencil ROPs, and 32 color ROPs. Its memory bandwidth of 160 GB/s indicates that it uses a 256-bit wide GDDR5 memory interface. The memory amount, however, seems to have been doubled to 2 GB on the Radeon HD 6970. Antilles uses two of these Cayman GPUs, for a combined computational power of 6 TFLOPs, a total of 3840 stream processors, total memory bandwidth of 307.2 GB/s, and a total of 4 GB of memory. Its load and idle board power are rated at 300 W and 30 W, respectively.
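
The headline figures can be cross-checked with some quick arithmetic. The Python sketch below derives the stream processors per SIMD engine and the memory data rate implied by the quoted bandwidth. It assumes a 256-bit bus per GPU (the inference made above) and 16 shader units per SIMD engine as on earlier Radeon designs; the function names are illustrative only.

```python
# Back-of-the-envelope checks on the leaked figures quoted above.
# Assumptions (not from the slides): 256-bit bus per GPU, 16 units per SIMD.

def sps_per_simd(stream_processors: int, simd_engines: int) -> float:
    """Stream processors contained in one SIMD engine."""
    return stream_processors / simd_engines


def implied_data_rate_gbps(bandwidth_gb_s: float, bus_width_bits: int) -> float:
    """Effective per-pin data rate (Gbit/s) needed to reach a given bandwidth."""
    return bandwidth_gb_s * 8 / bus_width_bits


cayman_sps_per_simd = sps_per_simd(1920, 30)
lanes_per_unit = cayman_sps_per_simd / 16           # 4 lanes would match "4D"
cayman_rate = implied_data_rate_gbps(160.0, 256)
antilles_rate = implied_data_rate_gbps(307.2 / 2, 256)  # bandwidth per GPU

print(f"SPs per SIMD: {cayman_sps_per_simd:.0f}, lanes per unit: {lanes_per_unit:.0f}")
print(f"Cayman implied data rate:   {cayman_rate:.2f} Gbit/s per pin")
print(f"Antilles implied data rate: {antilles_rate:.2f} Gbit/s per pin")
```

Under these assumptions the dual-GPU board would run its memory slightly slower per GPU than the single-GPU card, which is consistent with the quoted 307.2 GB/s being a little less than twice 160 GB/s.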

Source: 3DCenter Forum

134 Comments on AMD Cayman, Antilles Specifications Surface

#1
N3M3515
Benetanegia said:
It has 14% fewer shaders, but yeah, it's a good point and maybe I exaggerated a bit, although the reason I mistakenly exaggerated is that I was assuming almost perfect efficiency. To answer your question, the explanation of why that happens is easy. AMD's architecture is not efficient; it's far from being efficient from a utilization POV. Basically it's not the HD6850 which is faster than "it should" be, it's the HD6870 which is not as fast as it should be, because it cannot use all its resources as well as the HD6850. And this is even more true for the HD5870, which with 1600SP "should be" 2x as fast as the HD4890, but it isn't.
In order to be 2x faster than HD4890, doesn't it need to have 2x everything?
2x850MHz
2x4800MHz
2x800 shaders
2xtmus
2xrops

??
#2
Benetanegia
N3M3515 said:
In order to be 2x faster than HD4890, doesn't it need to have 2x everything?
2x850MHz
2x4800MHz
2x800 shaders
2xtmus
2xrops

??
Short answer. No.

In particular, it doesn't need 2x850MHz if it has 2x the shaders. As long as it has 2x the GFLOPS (shaders x clock in GHz x 2) it "should" be twice as fast. It all depends on the architecture tho. Fermi is like that: twice the flops, exactly twice the performance. It also usually means 2x the die area. With AMD, 2x shaders does not equal 2x the performance, but usually they have also managed not to double the die area.

AMD= efficient at manufacturing time
Nvidia= efficient at execution time
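
To put numbers on that rule of thumb, here is a rough Python sketch. The 800 shaders and 850 MHz for the HD4890 come from the quoted post; the HD5870's 850 MHz clock is assumed from its public spec, and the function name is just for illustration.

```python
# Peak single-precision throughput using the rule of thumb from the post:
# one multiply-add per shader per clock counts as 2 FLOPs.

def peak_gflops(shaders: int, clock_mhz: float) -> float:
    """Theoretical peak in GFLOPS (MHz divided by 1000 gives GHz)."""
    return shaders * clock_mhz * 2 / 1000


hd4890 = peak_gflops(800, 850)     # figures from the quoted post
hd5870 = peak_gflops(1600, 850)    # 850 MHz is an assumed spec value
ratio = hd5870 / hd4890

print(f"HD4890: {hd4890:.0f} GFLOPS")
print(f"HD5870: {hd5870:.0f} GFLOPS")
print(f"Theoretical ratio: {ratio:.1f}x")
```

This only shows the paper figure doubling at the same clock; how much of it turns into real frame rate is exactly the utilization question being argued here.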
#3
cadaveca
My name is Dave
Benetanegia said:
NO. Not how set-up affects 4D (as in the dispatcher), but how 4D affects the set-up (as in making the vertex/raster engine more efficient). Repeating my previous post, the vertex engine was not even used to 5% of its capabilities. OK, let me rephrase it: it was not even used to 5% of its alleged capabilities. So why add another one?
Let me put this very simply. Barts has "2" setup engines (more like one dual-issue, but whatever). Together, they process 1 polygon per clock.

Cayman is MORE than twice the theoretical math power of Barts, due to the 4-D switch.

How is the set-up engine that can barely feed Barts supposed to work on Cayman? Does it not have to have twice the output of the Barts set-up in order to be able to feed Cayman?

Of course the previous incarnation sucked! Explain why they were unable to fully utilize vertex setup, and you have your answer? It's all very obvious!
#4
N3M3515
Benetanegia said:
Short answer. No.

In particular, it doesn't need 2x850MHz if it has 2x the shaders. As long as it has 2x the GFLOPS (shaders x clock in GHz x 2) it "should" be twice as fast. It all depends on the architecture tho. Fermi is like that: twice the flops, exactly twice the performance. It also usually means 2x the die area. With AMD, 2x shaders does not equal 2x the performance, but usually they have also managed not to double the die area.

AMD= efficient at manufacturing time
Nvidia= efficient at execution time
And what about 2x memory speed?
#5
Benetanegia
cadaveca said:
Let me put this very simply. Barts has "2" setup engines (more like one dual-issue, but whatever). Together, they process 1 polygon per clock.

Cayman is MORE than twice the theoretical math power of Barts, due to the 4-D switch.

How is the set-up engine that can barely feed Barts supposed to work on Cayman? Does it not have to have twice the output of the Barts set-up in order to be able to feed Cayman?

Of course the previous incarnation sucked! Explain why they were unable to fully utilize vertex setup, and you have your answer? It's all very obvious!
But like I said, it must be the rasterizer which was the bottleneck, not the vertex engine per se. What they have doubled is, afaik, from what I can read on the slides, the vertex engine only. Maybe I always understood this wrong, but Cypress and Barts both have two rasterizers too, and even then the bottleneck was there. It had to be there or in the dispatch unit. But in either case it doesn't matter, because neither has been increased (maybe improved). And again to my point: it's something else that was preventing the vertex engine from achieving its peak of 15 million per frame, so why on earth was it only this unit that got doubled? That's what I cannot understand. Maybe the diagrams of Cypress/Barts were misleading and they did not have 2 rasterizer/dispatch units? I just don't understand it looking at the diagrams.

N3M3515 said:
And what about 2x memory speed?
Not required either. If the memory were holding back performance, overclocking the memory would have increased performance linearly or almost linearly, and that never happened. In fact it was far from it.
#6
cadaveca
My name is Dave
NO diagrams in existence are 100% factual representations of a gpu's design. They merely serve as FLOWCHARTS depicting how data will flow through the gpu, but do not denote actual functionality.

But the kicker here is that although Barts is far more efficient than Cypress, this efficiency increase is almost 100% in the setup engine. In fact, we all know that this is really the only change from Cypress to Barts...besides memory control.

So, the tidbit of info you may be missing is that although Barts is 1120 shaders, AMD also had a design with 1280 shaders (another two SIMD clusters), but a limitation in the set-up engine limited the performance increase to just 2%...2%, from a 12.5% increase in math power!

Also of note is that Barts' memory controller has 50% of the functionality of Cypress' (it literally takes up half the die space), and this led to the reduction of memory speeds in the Barts chips (the smaller controller cannot maintain high speeds very well)...but even so, performance is barely impacted...unless you run high resolutions (and hence Barts being the new "mainstream"). So while the lack of 7Gbps memory may concern some, it should only really affect a small part of the marketplace.
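
Taking the figures in this post at face value, the scaling can be put in numbers. This is only a rough Python sketch with a made-up helper name. Note that going from 1120 to 1280 shaders works out to roughly a 14% increase; 12.5% is the same gap measured against 1280.

```python
# How much of the extra shader power would show up as performance,
# using the shader counts and the 2% gain quoted in the post.

def relative_change(new: float, old: float) -> float:
    """Fractional change going from old to new."""
    return (new - old) / old


shader_gain = relative_change(1280, 1120)     # increase relative to 1120
gap_vs_larger = (1280 - 1120) / 1280          # same gap relative to 1280
perf_gain = 0.02                              # 2% figure from the post
scaling_efficiency = perf_gain / shader_gain  # share of extra math realized

print(f"Shader increase over 1120: {shader_gain:.1%}")
print(f"Gap measured against 1280: {gap_vs_larger:.1%}")
print(f"Share of extra math that became performance: {scaling_efficiency:.0%}")
```

Either way the point stands: only a small fraction of the added math power would have turned into frame rate.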
#8
cadaveca
My name is Dave
Yeah, this is nothing new to ME, personally. I'm trying to explain to Bene that the 4D shader arrangement is what required the higher polygon output, but he doesn't seem to understand why (although, I must say, I do understand where he is coming from).
#9
TheMailMan78
Big Member
cadaveca said:
Yeah, this is nothing new to ME, personally.
Well some of us are not as 133t as you. :laugh:

Don't you have a tweaker to design?
#10
cadaveca
My name is Dave
TheMailMan78 said:
Well some of us are not as 133t as you. :laugh:

Don't you have a tweaker to design?
It's not like this is magic pixie dust; there are very logical steps to this progression in gpu design, and even more so now that they are confined within the limits of the process.

You want another chip like TWKR, tell JF_AMD to give me a job.:laugh: Seems AMD might need some new blood in marketing anyway.
#11
Benetanegia
cadaveca said:
NO diagrams in existence are 100% factual representations of a gpu's design. They merely serve as FLOWCHARTS depicting how data will flow through the gpu, but do not denote actual functionality.
Fair enough.
But the kicker here is that although Barts is far more efficient than Cypress, this efficiency increase is almost 100% in the setup engine. In fact, we all know that this is really the only change from Cypress to Barts...besides memory control.
Kinda. I attribute it to the fact that Barts has a comparable setup engine to Cypress but far fewer shaders to feed. If this is what you refer to as an efficiency increase on the setup engine, then we agree. I don't think there was any other improvement to the "classic" setup engine. There were those improvements to the registers between the setup and the tessellator tho ("non-classic" setup engine, heh :)), but I don't remember reading anything else.

EDIT: And I think that the answer to my question is precisely in those buffers on the set-up output. After reading the scarce info on those buffers at Techreport and Anandtech, it looks like they are just a few series of FIFO registers, and that's probably the info I was missing. The vertex/raster engine can generate many polys a second, but apparently does not have enough room to store them until other units finish their work on previous ones. Hence it stays stalled for long periods of time. Doubling the engine doubled the buffers and with them the performance. Maybe I'm wrong on that, but it IS something I thought was different and could explain why. For the record, previously I thought the buffer between setup and the rest of the chip was an actual cache, bidirectional to be more precise.
So, the tidbit of info you may be missing is that although Barts is 1120 shaders, AMD also had a design with 1280 shaders (another two SIMD clusters), but a limitation in the set-up engine limited the performance increase to just 2%...2%, from a 12.5% increase in math power!
It was also 128 bit and 16 ROPs; that's most probably where the limitation was, not the setup engine. Based on the relation of performance per clock between HD6870 vs HD5850 vs HD5870, I would say that the set-up limit was somewhere between 1120 and 1440 SPs. Probably closer to 1440, because the HD5850 is significantly faster than HD6870 when @900 MHz.
#12
u2konline
thunderising said:
160GBPS that's it??? WHAT SHIT GTX580 is near 200GBPS
Maybe I am looking at something else, but I see 300 GB/s of bandwidth
#13
cadaveca
My name is Dave
Benetanegia said:
It was also 128 bit and 16 ROPs; that's most probably where the limitation was, not the setup engine. Based on the relation of performance per clock between HD6870 vs HD5850 vs HD5870, I would say that the set-up limit was somewhere between 1120 and 1440 SPs. Probably closer to 1440, because the HD5850 is significantly faster than HD6870 when @900 MHz.
AMD would be the source of info claiming it's the set-up engine that limited Barts with 1280SPs vs 1120, so the breakpoint is 1120 for Barts' set-up engine, clear as day (as they cannot add just one SIMD to barts' dual-engine). What remains to be seen is if they have simply doubled up the Barts setup engine, or if it's a complete redesign, but I doubt they'd venture too far away from Barts...at least in overall implementation.

You could be right about the limit being cache, but maybe an increase in set-up registers also allows for doubling of polygons per clock. In fact, I trust AMD wouldn't have added anything they did not need, purely based on them being so limited by the process...Cayman is a HUGE-ASS chip.
#14
Swamp Monster
u2konline said:
Maybe i am looking at something else, but i see 300gb of bandwidth
160 GB/s is for the single-GPU card, but 307 GB/s is for the dual-GPU card.
#15
Benetanegia
cadaveca said:
AMD would be the source of info claiming it's the set-up engine that limited Barts with 1280SPs vs 1120, so the breakpoint is 1120 for Barts' set-up engine, clear as day (as they cannot add just one SIMD to barts' dual-engine).
idk maybe you are right. My only source on that is Anandtech review where they said:
However it’s worth noting that internally AMD was throwing around 2 designs for Barts: a 16 SIMD (1280 SP) 16 ROP design, and a 14 SIMD (1120 SP) 32 ROP design that they ultimately went with. The 14/32 design was faster, but only by 2%. This along with the ease of porting the design from Cypress made it the right choice for AMD, but it also means that Cypress/Barts is not exclusively bound on the shader/texture side or the ROP/raster side.
The rest is mostly assumption on my part, i.e. HD5830 is definitely bottlenecked by 16 ROPs, hence a Barts with 16 ROPs and more SPs than HD5830 would definitely be bottlenecked. I don't even know why AMD tried that one internally tbh.
You could be right about the limit being cache, but maybe an increase in set-up registers also allows for doubling of polygons per clock. In fact, I trust AMD wouldn't have added anything they did not need, purely based on them being so limited by the process...Cayman is a HUGE-ASS chip.
Yeah. All my confusion came from the fact that the architecture is far more "set in stone" than I thought. I just thought that since Cypress/Barts had two rasterizers and only one setup engine, it was also possible to have 2 tessellators and one engine without the engine (or anything in between) becoming a bottleneck, but not necessarily, because the architecture might not permit it, after all. That was my only concern, and it was also stupid on my part that I kept repeating in my head "but why would they have a vertex engine capable of 850 million just to have it unused all the time". I was stuck on that tbh, when the question is "why not"; in the end it's only one poly per clock, you can't (don't need to) go lower than that. :banghead:
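
For what it's worth, the "850 million" and "15 million per frame" figures thrown around in this thread line up with simple arithmetic. A rough Python sketch, assuming an 850 MHz core clock and a 60 fps target (the frame rate is my assumption; it is not stated anywhere above):

```python
# Peak set-up rate from polygons per clock, and what that means per frame.
# Assumptions: 850 MHz core clock, 60 frames per second.

def polygons_per_second(polys_per_clock: int, clock_mhz: float) -> float:
    """Peak polygons per second for a given set-up rate and clock."""
    return polys_per_clock * clock_mhz * 1e6


def polygons_per_frame(polys_per_second: float, fps: float) -> float:
    """Peak polygon budget for one frame at a given frame rate."""
    return polys_per_second / fps


single = polygons_per_second(1, 850)    # one poly per clock
doubled = polygons_per_second(2, 850)   # hypothetical doubled front end

print(f"1 poly/clock: {single / 1e6:.0f} M polys/s, "
      f"{polygons_per_frame(single, 60) / 1e6:.1f} M per frame")
print(f"2 polys/clock: {doubled / 1e6:.0f} M polys/s, "
      f"{polygons_per_frame(doubled, 60) / 1e6:.1f} M per frame")
```

These are peak numbers only; the whole argument here is about how far below them the real chip ends up.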
#16
TheMailMan78
Big Member
cadaveca said:
It's not like this is magic pixie dust; there are very logical steps to this progression in gpu design, and even more so now that they are confined within the limits of the process.

You want another chip like TWKR, tell JF_AMD to give me a job.:laugh: Seems AMD might need some new blood in marketing anyway.
I only want a TWKR chip if it was signed by you.
#17
cadaveca
My name is Dave
TheMailMan78 said:
I only want a TWKR chip if it was signed by you.
Get one to my door, and I'll gladly sign it and return it to you.


:roll:



:shadedshu
#18
pantherx12
cadaveca said:
AMD would be the source of info claiming it's the set-up engine that limited Barts with 1280SPs vs 1120, so the breakpoint is 1120 for Barts' set-up engine, clear as day (as they cannot add just one SIMD to barts' dual-engine). What remains to be seen is if they have simply doubled up the Barts setup engine, or if it's a complete redesign, but I doubt they'd venture too far away from Barts...at least in overall implementation.

You could be right about the limit being cache, but maybe an increase in set-up registers also allows for doubling of polygons per clock. In fact, I trust AMD wouldn't have added anything they did not need, purely based on them being so limited by the process...Cayman is a HUGE-ASS chip.
What, like a non chopped up 6850? (I only say this because of 960 being half of 1920 lol)


*edit* actually, looking at the crossfire review, it even has the performance PowerColor hinted at: 20-50% better than 5870 (depending on resolution and game of course)
#19
cadaveca
My name is Dave
pantherx12 said:
What, like a non chopped up 6850? (I only say this because of 960 being half of 1920 lol)


*edit* actually, looking at the crossfire review, it even has the performance PowerColor hinted at: 20-50% better than 5870 (depending on resolution and game of course)
The only thing is that 6850 is 5-D, and Cayman is 4-D, so there isn't really any way we can make a guess at performance...it's just far too different from the past tech...This is the first break-away from the R600 design.

The potential is there for Cayman to do far more than just +50% of Cypress...it truly depends on how many of those shaders they can keep fed all the time. 5870 is rarely more than 60% loaded, even when it indicates that gpu load is 100%...you can tell this by power consumption.
#20
pantherx12
cadaveca said:
The only thing is that 6850 is 5-D, and Cayman is 4-D, so there isn't really any way we can make a guess at performance...it's just far too different from the past tech...This is the first break-away from the R600 design.
Maybe they've found a neat way of melting them together using the 5th shader as solder :laugh:

And yeah I know what you mean about guessing, if it was just up-scaled barts with the power to feed the shader it's simply the thing I was being silly about earlier :laugh: (or 70% improvement over 6870 if it scaled nicely, which thus far the 5d architecture has not as far as I'm aware )

It has to scale over 5870 by 60% in order to beat 580 in everything, and by 50% to win more than lose but not a straight-up win.

This is one of the more interesting new gpu times IMO :cool:


Sorry for rambly post. I ramble when posting : ]


Assuming all shaders are fed 100% etc, can we work anything out from that? Like what its optimal theoretical performance could be? :laugh:
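
One very naive way to work something out, as a Python sketch: treat performance as peak TFLOPS times shader utilization and ignore everything else (set-up, ROPs, memory, drivers). The 3 TFLOPs figure is from the leaked slides and the 60% load figure is cadaveca's estimate above; the 2.72 TFLOPS peak for the 5870 assumes 1600 SPs at 850 MHz, and the function names are made up.

```python
# Naive model: effective throughput = peak TFLOPS x shader utilization.
# Everything except shader throughput is ignored, so treat this as a toy.

CYPRESS_PEAK_TFLOPS = 2.72   # assumed: 1600 SPs x 0.85 GHz x 2 FLOPs
CYPRESS_UTILIZATION = 0.60   # "rarely more than 60% loaded" (estimate above)
CAYMAN_PEAK_TFLOPS = 3.0     # from the leaked slides


def effective_tflops(peak_tflops: float, utilization: float) -> float:
    """Throughput actually delivered at a given shader utilization."""
    return peak_tflops * utilization


def required_utilization(target_speedup: float) -> float:
    """Cayman utilization needed to reach target_speedup over the 5870."""
    baseline = effective_tflops(CYPRESS_PEAK_TFLOPS, CYPRESS_UTILIZATION)
    return target_speedup * baseline / CAYMAN_PEAK_TFLOPS


peak_ratio = CAYMAN_PEAK_TFLOPS / CYPRESS_PEAK_TFLOPS
print(f"Peak-to-peak ratio: {peak_ratio:.2f}x")
for speedup in (1.5, 1.6):
    needed = required_utilization(speedup)
    print(f"+{speedup - 1:.0%} over 5870 needs about {needed:.0%} utilization")
```

So on paper the peak only goes up by about 10%, and under this toy model any bigger gain would have to come from keeping the shaders busier.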
#21
char[] rager
The 4870x2 has around 2.5 TFLOPS of single-precision compute performance. Am I right?

So if the 6970 has around 3 TFLOPS of single-precision compute performance, it should be faster than the 4870x2?
#22
KainXS
the 5870 is basically the same performance as a 4870x2


?????

why would the 6970 be slower?
#23
TheMailMan78
Big Member
KainXS said:
the 5870 is basically the same performance as a 4870x2


?????
Pretty much man. If the 6970 isn't the same speed as two 5870s in crossfire then it will be fail.
#24
Sapientwolf
TheMailMan78 said:
Pretty much man. If the 6970 isn't the same speed as two 5870s in crossfire then it will be fail.
That's an awful lot to ask for considering there wasn't a change to a smaller fabrication process. It's not gonna happen.
#25
WarEagleAU
Bird of Prey
wow, 30 W at idle is incredible...if you ain't gaming or watching a movie, that is awesome.