Perhaps that would be another bottleneck, but it is not what held the GT200 back. The GT200 did not have 2x the shaders. It did not have 2x the memory bandwidth of an 8800GTX either; the increase was only a little over 50%. It had hardly any more TMUs than a 9800GTX, and they ran at a lower clock. The shader clocks were slower too. The theoretical peak GFLOPS was only around 40% higher (instead of the 100% you would expect from a true 2x part). That's a lot of things that are just not 2x, unlike a 5870, which is 2x a 4890 in EVERYTHING except memory bandwidth.
You've got a good point, though. However, I do not think the "peak threads" limit would matter much if Nvidia is actually cutting it down for their Fermi chips, which consist of 3 billion+ transistors. The peak thread count must be a ceiling so high that it has never actually been reached anyway, like PCI-E 2.0 16x for a single GPU.
^^^^^^^^^^^^^^
Anyway, regarding one of my posts above (the one with all of the benchmarks from Firingsquad): do you think that 2x 4890s in CF, with the memory downclocked all the way to 2400MHz effective on each card so that the two add up to the same 4800MHz effective as a 5870, would still perform any better than a 5870 in *ANY* of the games? Keep in mind that 2x 4890s beat a 5870 in every game tested by Firingsquad, sometimes by a huge margin.
This is something for all of us to keep in mind, and especially ATI, with their proven capability to build a 512-bit memory bus!
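For reference, the bandwidth arithmetic behind that downclocked-CF question can be sketched in a few lines of Python. This is only an illustration: the 256-bit bus width for both cards is an assumption taken from their public specs, not something stated in this thread, and the function name is made up for the example.

```python
def bandwidth_gb_s(effective_mt_s, bus_width_bits):
    """Peak memory bandwidth in GB/s: transfers per second times bytes per transfer."""
    bytes_per_transfer = bus_width_bits / 8
    return effective_mt_s * 1e6 * bytes_per_transfer / 1e9

# Assumption: both the HD5870 and the HD4890 use a 256-bit memory bus.
BUS_WIDTH_BITS = 256

hd5870 = bandwidth_gb_s(4800, BUS_WIDTH_BITS)              # 4800MHz effective, from the post
hd4890_downclocked = bandwidth_gb_s(2400, BUS_WIDTH_BITS)  # 2400MHz effective per card
cf_total = 2 * hd4890_downclocked                          # naive sum over both cards

print(f"HD5870:             {hd5870:.1f} GB/s")              # 153.6 GB/s
print(f"HD4890 downclocked: {hd4890_downclocked:.1f} GB/s")  # 76.8 GB/s
print(f"2x HD4890 summed:   {cf_total:.1f} GB/s")            # 153.6 GB/s
```

The sum only matches on paper. It says nothing about how much of that bandwidth each setup spends on duplicated data, which is the objection raised in the reply.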
Regarding GT200, I'm just sharing what they said; I don't know myself whether that was the case. And yeah, I know not everything was doubled, but at launch it didn't even perform as it should have. One of the driver releases brought a 25% increase across the board while only marginally increasing performance on other cards, so something was happening, whatever the problem was.
RV870 has doubled everything when it comes to execution units, but the underlying hardware has probably not been doubled up; that's what I'm saying. I'm blaming the thread dispatcher because it makes more sense to me than, say, the bottleneck being too few registers or slowish internal communication, because those are far easier problems to overcome without going back to the drawing board. If you read the link from Beyond3D (good read BTW), they do mention some problems in both the setup engine and the thread dispatcher, although they apparently blame the front-end registers, or at least they mention a problem caused by register pressure.
All in all, it's clear that something is happening, because on their charts the more specific and theoretical the test is, the closer the HD5870 is to being 2x the HD4890, while the more factors are put into the equation, the closer it gets to actual gaming performance. The best example, and what I think is to blame after reading the article and benchmarks, is texture filtering:
http://www.beyond3d.com/content/reviews/53/12
Texture fillrate is undeniably faster, almost 3x that of the HD4890, but texture filtering is only marginally faster. That doesn't make sense to me unless something outside the texture filtering units is preventing them from performing. Notice also how little effect slowing the memory bandwidth down to that of the HD4890 has. That's something you can see throughout the entire article, and it shows the HD5870 is not memory bottlenecked.
And oh, BTW, you can't set 2x HD4890 to 2.4 GT/s in order to match the HD5870, because there is much more traffic going on in an SLI/Crossfire setup. Namely, geometry and texture data have to be sent twice. It's not apples to apples.
Similarly, the HD5770 is very different too; you can't extrapolate the HD5770/HD4890 results to the HD5870 based on memory bandwidth. Double the performance doesn't mean it needs double the memory bandwidth. The memory space and memory bandwidth associated with geometry and textures (and data in general) are the same in both cases, because both cards have to render the same thing. That part of the memory (a big one, I must say) is only refreshed based on game time* and not based on the number of frames being rendered.
* If you make a 360-degree turn slowly, both cards will have to load the same geometry/textures at the exact same time, regardless of how many frames per second are being rendered.
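To put rough numbers on that argument, here is a toy model in Python that takes the claim at face value: part of the memory traffic follows game time and the rest scales with frame rate. Every figure in it is hypothetical and picked only to show the shape of the effect; none of them are measurements from either card.

```python
def required_bandwidth_gb_s(game_time_gb_s, per_frame_gb, fps):
    """Toy model: traffic tied to game time plus traffic that repeats every frame."""
    return game_time_gb_s + per_frame_gb * fps

# Hypothetical workload, NOT measured data.
GAME_TIME_GB_S = 30.0  # geometry/texture traffic that follows game time
PER_FRAME_GB = 0.5     # traffic that is paid again for every rendered frame

slower_card = required_bandwidth_gb_s(GAME_TIME_GB_S, PER_FRAME_GB, fps=40)
faster_card = required_bandwidth_gb_s(GAME_TIME_GB_S, PER_FRAME_GB, fps=80)
ratio = faster_card / slower_card

print(f"40 fps: {slower_card:.1f} GB/s")        # 50.0 GB/s
print(f"80 fps: {faster_card:.1f} GB/s")        # 70.0 GB/s
print(f"bandwidth ratio for 2x fps: {ratio:.2f}x")  # 1.40x
```

With these made-up numbers, doubling the frame rate needs 1.4x the bandwidth rather than 2x. The larger the game-time share is, the further below 2x that ratio falls.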