Friday, February 10th 2012
NVIDIA GeForce Kepler Packs Radically Different Number Crunching Machinery
NVIDIA is set to kick-start its competitive graphics processor response to AMD's Southern Islands Radeon HD 7000 series with GeForce Kepler 104 (GK104). We are learning through reliable sources that NVIDIA will implement a radically different design (by NVIDIA's standards, anyway) for its CUDA core machinery, while retaining a basic hierarchy of components similar to Fermi's. The new design should ensure greater parallelism. The latest version of GK104's specifications looks like this:
Source: 3DCenter.org
SIMD Hierarchy
- 4 Graphics Processing Clusters (GPC)
- 4 Streaming Multiprocessors (SM) per GPC = 16 SM
- 96 Stream Processors (SP) per SM = 1536 CUDA cores
- 8 Texture Units (TMU) per SM = 128 TMUs
- 32 Raster Operation Units (ROPs)
- 256-bit wide GDDR5 memory interface
- 2048 MB (2 GB) memory amount standard
- 950 MHz core/CUDA core (no hot-clocks)
- 1250 MHz actual (5.00 GHz effective) memory, 160 GB/s memory bandwidth
- 2.9 TFLOP/s single-precision floating point compute power
- 486 GFLOP/s double-precision floating point compute power
- Estimated die area: 340 mm²
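The derived numbers in that list can be sanity-checked from the base clocks and unit counts. A quick sketch (the 1:6 double-precision rate is an assumption inferred from the two quoted FLOP/s figures, not anything NVIDIA has confirmed):

```python
# Sanity-check of the derived numbers in the rumored GK104 spec list.

cuda_cores = 1536
core_clock_mhz = 950           # no hot-clock: shaders run at core clock
mem_clock_mhz = 1250           # GDDR5 is quad data rate -> 5.0 GHz effective
bus_width_bits = 256

# Each CUDA core can issue one fused multiply-add (2 FLOPs) per clock.
sp_gflops = cuda_cores * core_clock_mhz * 2 / 1000
dp_gflops = sp_gflops / 6      # assumed 1:6 DP rate, matching 486 GFLOP/s

bandwidth_gbs = mem_clock_mhz * 4 * (bus_width_bits / 8) / 1000

print(sp_gflops)      # 2918.4 -> the quoted "2.9 TFLOP/s"
print(dp_gflops)      # 486.4  -> the quoted "486 GFLOP/s"
print(bandwidth_gbs)  # 160.0  -> the quoted "160 GB/s"
```

All three quoted figures fall straight out of the clocks and widths, which lends the spec list some internal consistency.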
139 Comments on NVIDIA GeForce Kepler Packs Radically Different Number Crunching Machinery
If we wanted to go crazy, there are all sorts of released products that are technically better; the HD 5970 is to this day ridiculously powerful, and surprisingly cost-efficient. I also omitted the HD 4890, because it was launched months after the rest of the 4xxx series.
My listings are still accurate. There are outliers, but for the most part all of those cards were the original high-end GPU of their corresponding series.
But you're free to believe what you wish. :roll:
1. download nvidia inspector
2. open the advanced driver settings
3. look at the advanced configs (scroll down)
4. set FXAA to 1 (default 0/off)
There are also some hidden settings there, like a frame cap/framerate limit, SLI and/or AA flags, etc.
also, some moar rumour tablez
forum.beyond3d.com/showthread.php?p=1619912
Total average: a 12% difference across all those tests.
Borderlands 2 or new Brothers in Arms running on Kepler? : D
As for that suspicious table: based on the specs, which I think we can agree are more or less accurate, the table was done by somebody who has done his homework. 30%-plus on average above the GTX 580, which brings us to that 10% over the 7970. If you look carefully you'll see the clocks, 1050 and 1425, very high for a stock card and above the reported 950 MHz for the GPU. It is also done at 1080p, where the memory bandwidth disadvantage is less pronounced.
So what I'm saying is that if this is close to real, then NVIDIA will launch the GK104 under the name GTX 680: a slightly faster card than the 7970, with certain weak points due to the fact that the chip was initially designed for the performance segment, but after AMD's launch it can fulfill other expectations. Price? Neither $300 nor $550.
Another note: this article claims GK104 is a 340 mm² die, which is NVIDIA's mid-range, while the HD 7970 has a die size of 375 mm². So much for the "we expected more from AMD" talk.
Not to mention NVIDIA's high end is said to have a 550 mm² die. AMD could easily build a GPU that big and pack in more transistors, but that is usually a very bad business choice, and NVIDIA suffers from it almost every time.
As far as Kepler goes, yes it's a tweaked Fermi in 99% of cases, you can see it in the specs and schematics. The only difference is that they dropped the hot-clocks, which makes SPs substantially smaller and doubled the amount of them per SM to compensate.
No one knows exactly how much smaller the SPs are, but as an example of how much clocks can affect the size of some units: AMD Barts' memory controller is half as big as Cypress/Cayman's because it's designed to work at ~1000 MHz instead of >1200 MHz. Those extra 200 MHz make the memory controller in Cypress/Cayman twice as big. So in the case of Kepler, looking at the specs and the 340 mm², we can assume that non-hot-clocked SPs are around half the size.
Well, I guess that makes sense in order to scale at high clocks, kind of like CPUs having longer pipelines to scale at high frequency. But there is no way it would make that much difference (especially since the whole point of an architecture that aims for high frequency is to make smaller chips with less hardware and lower IPC but more throughput; that's in CPUs though, I'm not sure about GPUs). Maybe the 1536 refers to the bigger GTX 680/780, which would have a 550 mm² die (read that in previous leaks/rumors).
Because even considering the die size, which is much smaller than the 580's, it triples the core count. Even with 28 nm that's only about 40% smaller, and it's near impossible to get perfect scaling.
Based on die area, GK104 has to have around 3.6-4.0 billion transistors; that's twice as much as GF104/114, the chip it's based on. Would you have doubted it so much if NVIDIA had made a 768 SP Fermi(ish) part with a 256-bit memory interface? Twice the SPs at twice the number of transistors, while keeping the 256-bit MC. It's 100% expected, don't you think? And now take this 768 SP "GF124", and it's here where they drop the hot-clocks, thus making each SP much smaller and allowing them to put in twice as many: GK104 is born.
Also remember that doubling SPs per SM is a lot more area/transistor efficient than doubling the number of SMs.
And to finish, never look at die size when comparing; look at transistor count. Scaling varies a lot from one node to another, and transistor density can change a lot as a node matures; e.g. look at Cypress vs. Cayman. GK104 has twice as many transistors as GF104, and that's all you should look at. It's pointless to even compare it to GF100/110, because GF100 is a compute-oriented chip with far more GPGPU features than GF104/114 and GK104. GF104 is only 60% as big as GF100, yet it has 75% of its gaming performance.
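A rough density comparison illustrates the "transistor count, not die size" point. The figures below are the commonly cited approximations for each chip (the GK104 entry is the ~2x-GF104 estimate from the post above, not a confirmed number):

```python
# Rough transistor-density comparison; all figures approximate/rumored.

chips = {
    # name: (transistors in billions, die area in mm^2, node in nm)
    "Cypress (HD 5870)": (2.15, 334, 40),
    "Cayman (HD 6970)":  (2.64, 389, 40),
    "GF104 (GTX 460)":   (1.95, 332, 40),
    "GK104 (rumored)":   (3.90, 340, 28),  # ~2x GF104, per the estimate above
}

for name, (xtors, area, node) in chips.items():
    density = xtors * 1000 / area  # million transistors per mm^2
    print(f"{name}: {density:.1f} Mtransistors/mm^2 at {node} nm")
```

Even on the same 40 nm node the density spread is visible (Cypress vs. Cayman vs. GF104), and the rumored GK104 figure is roughly double GF104's density, which is about what a 40 nm to 28 nm shrink would suggest.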
As for Cypress and Cayman, it seems like that came from the other extreme, doesn't it? As far as I remember it was pretty much getting rid of the SPs that weren't being utilized and changing VLIW5 to VLIW4, ending up with smaller SIMDs that performed the same as their predecessors, allowing them to fit more of them into the 6970. So even though the shader count was lower, it performed like 20% better.
Though I still think there is more behind this. Having hot-clocks has its benefits, but its limitations too; I've heard they don't scale well as frequency increases, while AMD could raise clocks while increasing performance at a constant rate (I could be wrong though, I don't know much about the nitty-gritty details of GPUs).
But in all honesty, I'm betting these will arrive cheap and land below a 7950 in performance.
The reason they used hot-clocks before was apparently to get lower latencies and better single-threaded/lightly-threaded performance, so that compute apps would benefit. Remember, the first chips with hot-clocked shaders ran at core clocks of 600 MHz and below, so the shaders ran at <1200 MHz. Now, even without hot-clocks, they will be running at ~1000 MHz, so that's probably enough. Latencies are further reduced by a shorter pipeline (due to lower clocks) and other means that are required for GPGPU anyway.
Fermi shaders running at 2000 MHz would have been overkill for what's really needed, and would consume more than two 1000 MHz shaders. A compute GPU needs first and foremost multi-threaded performance; single-threaded performance is only required up to a certain level, so that minor tasks don't become a bottleneck.
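The power side of that argument follows from the usual dynamic-power relation, P ≈ C·V²·f: doubling frequency typically also requires a voltage bump, so one hot-clocked shader can burn more power than two slower shaders delivering the same throughput. A sketch with illustrative, made-up voltage figures:

```python
# Illustrative sketch of the hot-clock power argument. The voltages are
# assumptions for the sake of the example, not real silicon numbers.

def dynamic_power(cap, volts, freq_ghz):
    """Relative dynamic power, arbitrary units: P ~ C * V^2 * f."""
    return cap * volts**2 * freq_ghz

# Assumed: a 2.0 GHz shader needs ~1.1 V, a 1.0 GHz shader runs at ~0.9 V.
hot = dynamic_power(cap=1.0, volts=1.1, freq_ghz=2.0)       # one fast SP
cool = 2 * dynamic_power(cap=1.0, volts=0.9, freq_ghz=1.0)  # two slow SPs

print(hot)   # ~2.42
print(cool)  # ~1.62: same raw throughput for about a third less power
```

Under these assumed numbers, two non-hot-clocked SPs match the hot-clocked one's throughput at roughly two-thirds the dynamic power, which is the efficiency trade the post is describing (area moves the other way, since the two slower SPs take more die space).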
I guess it is always going to be difficult for me to have a logical debate with someone who is not.
You have produced ZERO proof (I didn't expect any, since nothing is fact yet), but you have also explained nothing (which I do expect) about why such a massive increase in computational power, one that didn't come for free and supposed a 100% increase in transistor count, is not going to produce any performance gain.
You have not explained why a 2.9 TFLOPS card will not be able to beat the 1.5 TFLOPS card, and if that were the case, why they didn't just create a 1.5 TFLOPS (768 SP) card in the first place. That would have been easy: same architecture, half the SPs, 48 per SM. If going with 96 SPs is going to make the block 50% as (in)efficient as Fermi with 48 SPs, you just don't make it 96 SPs!!
So start by explaining something, anything, and stop calling people fanboys as if that were any kind of argument in your favor, because it is not; it only makes you look like a 12-year-old kid. "It's going to be so, because (you think) it's going to be so, and if you think differently you are a fanboy" is not an argument. More logic:
GK104 is 340 mm², so close to 4 billion transistors: twice as much as GF104 and 33% more than GF110. Logic dictates that NVIDIA did not suddenly create an architecture that is at least 33% less efficient than Fermi (70% compared to GF104), 25% higher clocks notwithstanding. Especially when they have been claiming better efficiency for almost two years now.
No doubt I will get called a Nvidia fanboy now despite running a HD7970 and Eyefinity.... :wtf:
One thing that does interest me about Kepler being a die-shrunk and "tweaked" Fermi is how much performance increase we can expect from future driver improvements. Driver improvements are a given with GCN, as the architecture is relatively immature, but what about Kepler? Could we end up with a case where Kepler comes out of the gate faster than Tahiti but ends up slower in the long run due to a lack of driver improvements?
Obviously this is still conjecture, but it is an interesting avenue to investigate, as I have seen some pretty big boosts in BF3 (at 3560x1920) with the latest HD 79xx RC driver (25/01/2012).
"GK104 is 340 mm2, so close to 4 billion transistor" I am not aware of this information, where did you get 4 billion transistor? Did you estimate it off the 340mm2? in other words, building a case off speculative information?
@Xaser04 no need to struggle. Just read what I've posted thoroughly and comprehend it before venting off more steam.
Nothing matters till we see reviews. I don't care what Kepler has in the wings, it's still smoke and mirrors. Even then it's hogwash if we go on specs and theoretical maximum calculations; AMD has won every time in terms of theoretical output, yet it doesn't actually win. So let's just save the arguments for when we see real performance numbers; then we can bitch, moan and complain about who's the greatest EVAR and who's a loser.