I have said many times, and still maintain, that the problem is IMO in the thread dispatch processor/setup engine.
1- Both RV770 and RV870 tout the same peak execution of 32k (kilo) threads, so the DP/SE probably has not been changed.
2- It's been said that RV870 is the exact same architecture as RV770 + DX11 support on the shaders, so probably only the ISA on the shaders has changed, if at all.
3- I know comparing different architectures is kinda stupid, but it can be valid as a guideline. Nvidia's GT200 had 32k peak threads too, but they have already said (I think it was in the Fermi white paper) that in reality it could only do 10-12k, and that this was part of the reason for the "lacking" performance of GT200, at least at launch. Fermi will have only 24k peak, but thanks to 16 kernels and 2 different dispatch processors they think they will be able to max it out. So even if we can't compare architectures directly, we do know that one of the companies did a thorough study of their hardware to test usage and saw that their 32k thread processor (12k in practice) would not cut it, so they decided to put in two of them: different, weaker ones, but two.
We could speculate whether AMD's dispatch processor was more efficient or not, but given the performance similarity it most probably had similar efficiency, plus the advantage of higher clocks, if anything. Now imagine it was indeed a little more efficient, so that the thread dispatch processor was overkill for RV770, with a lot of headroom they could not really test because the rest of the chip was holding it back. Imagine that RV770 could only do 10-12k on the shader side of things, just like GT200 did as a whole*, and that AMD thought that in theory the DP/SE could really do 24k. In order to release Evergreen as fast as they did, they probably didn't touch the DP at all, given that in theory it could handle 32k, and 24k according to their estimates, which is plenty. But what if the DP can't even do 20k and only does 16k, for example? Then you have a bottleneck where you didn't think you would have one. It's not as if you could do anything about it without a complete redesign, so you release it anyway: in the end it is still a fast card (the fastest), you get to release much sooner, and you expect to improve utilization with future drivers.
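To put rough numbers on that reasoning, here is a minimal sketch in Python. It assumes the simplest possible model: sustained threads in flight are just the smaller of what the dispatch processor can feed and what the shader side can consume. The 12k, 24k and 16k figures are the guesses from the paragraph above, not measurements, and the doubling of the shader side for RV870 is my own assumption based on its doubled stream processor count (1600 vs 800). The function names are made up for illustration.

```python
def sustained_threads(dispatch_limit, shader_limit):
    """Threads actually kept in flight: the slower stage sets the pace."""
    return min(dispatch_limit, shader_limit)


def bottleneck(dispatch_limit, shader_limit):
    """Name the stage that limits throughput in this toy model."""
    if dispatch_limit < shader_limit:
        return "dispatch"
    if shader_limit < dispatch_limit:
        return "shaders"
    return "balanced"


# All figures are speculative, taken from the post (threads in flight).
RV770_SHADER = 12_000              # guess: what the RV770 shader side can use
RV870_SHADER = 2 * RV770_SHADER    # assumption: doubled shaders, doubled demand
DP_ESTIMATED = 24_000              # what AMD might have estimated for the DP
DP_ACTUAL = 16_000                 # the "what if" case: DP tops out lower

for chip, shader in (("RV770", RV770_SHADER), ("RV870", RV870_SHADER)):
    for label, dp in (("estimated DP", DP_ESTIMATED), ("actual DP", DP_ACTUAL)):
        print(chip, label,
              sustained_threads(dp, shader),
              bottleneck(dp, shader))
```

With these made-up numbers RV770 is shader-limited whether the DP does 24k or 16k, so the shortfall would never show up in testing. Only once the shader side is doubled does a 16k DP become the limiting stage.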
My two cents.