
GK104 Block Diagram Explained

btarunr

Editor & Senior Moderator
Staff member
Specification sheets of NVIDIA's GK104 GPU left people dumbfounded at the CUDA core count, where it read 1536, a 3-fold increase over that of the GeForce GTX 580 (3x 512). The block diagram of the GK104, photographed at the NVIDIA press meet by an HKEPC photographer, reveals how it all adds up. The GK104 is built on the 28 nm fab process, with a die area of around 295 mm², according to older reports. Its component hierarchy is essentially an evolution of that of the Fermi architecture.

The hierarchy starts with the GigaThread Engine, which marshals all the unprocessed and processed information between the rest of the GPU and the PCI-Express 3.0 system interface. Below this are four graphics processing clusters (GPCs), each holding one common resource, the raster engine, and two streaming multiprocessors (SMs). Only this time, innovation has gone into redesigning the SM; it is called the SMX. Each SMX has one next-generation PolyMorph 2.0 engine, an instruction cache, 192 CUDA cores, and other first-level caches. So four GPCs of two SMXs each, and eight SMXs of 192 CUDA cores each, amount to the 1536 CUDA core count. There are four raster units (amounting to 32 ROPs), eight geometry units (each with a tessellation unit), and some third-level cache. There's a 256-bit wide GDDR5 memory interface.
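For those who want the arithmetic spelled out, here is a quick back-of-the-envelope check in plain Python. The per-unit counts are simply the figures quoted above (and the 8 ROPs per raster unit is just the stated 32-ROP total divided across the four raster units); nothing here comes from NVIDIA tooling.

gpcs = 4                      # graphics processing clusters
smx_per_gpc = 2               # SMX units in each GPC
cuda_cores_per_smx = 192      # CUDA cores in each SMX

smx_total = gpcs * smx_per_gpc                # 4 x 2 = 8 SMX units
cuda_cores = smx_total * cuda_cores_per_smx   # 8 x 192 = 1536 CUDA cores

raster_units = 4              # one raster engine per GPC
rops_per_raster_unit = 8      # implied by the stated 32-ROP total
rops = raster_units * rops_per_raster_unit    # 32 ROPs

polymorph_engines = smx_total                 # one PolyMorph 2.0 engine per SMX = 8

print(smx_total, cuda_cores, rops, polymorph_engines)   # 8 1536 32 8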



View at TechPowerUp Main Site
 
Wish I knew what that all actually meant...:wtf:
 
I think they made a good decision to cut individual core performance to fit more SPs in each SM; they did the same thing when they moved from the GT2XX series to the GT4XX series and it was worth it. 1000 MHz stock is really surprising, too. I wonder what problems they ran into with the GK100 to prevent a release (knowing NVIDIA, they're just milking the market because they know they have the faster cards, maybe, but oh well).
 
Wish I knew what that all actually meant...:wtf:

Fermi is a V4 engine with 4 huge cylinders.
Kepler is a V8 with smaller ones..

I think....
 
Hmm, so it's 192 SPs per SM(X); that's how they got to bundle so many of them. Plenty of warp schedulers and dispatchers to feed them too. I think it's a very sleek design. We'll have to wait and see what the efficiency is, though, and one downside is a relatively small L1 cache.
 
it means it is slightly better than the 7970
But you need to include those new "star" (that's btarunr's word for it, not mine) technologies, TXAA and Adaptive V-Sync, which give that "organic" framerate feel! What kinds of fertilizers are used for organic farming?
What was the Cheech and Chong routine about "feel"… feels like…
 
It's a kickass GPU.

I hope I put it simple enough.

Hmm, so it's 192 SPs per SM(X); that's how they got to bundle so many of them. Plenty of warp schedulers and dispatchers to feed them too. I think it's a very sleek design. We'll have to wait and see what the efficiency is, though, and one downside is a relatively small L1 cache.

I'm hearing that, apart from high parallelization at the scheduler level, each small set of cores (a lower-level set than the SMX) has a performance clock/voltage domain of its own. So not all 1536 CUDA cores will be running at the same clock speed (unless there's maximum or bare-minimum load). There will be hundreds of them running at countless combinations of clocks and voltages. It's as if the GPU knows exactly how much energy each single hardware resource needs at a given load.
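If that pans out, the idea would presumably look something like the governor sketched below. This is a purely hypothetical illustration: the operating points, thresholds and function names are invented for the example and do not reflect NVIDIA's actual hardware or firmware.

# Hypothetical per-domain governor: each small group of cores picks its own
# clock/voltage pair from a table, based on its own utilization.
OPERATING_POINTS = [(500, 0.85), (750, 0.95), (1000, 1.05)]  # (MHz, V), made-up values

def pick_operating_point(domain_load):
    """Map a domain's utilization (0.0 to 1.0) to one of its operating points."""
    if domain_load < 0.3:
        return OPERATING_POINTS[0]
    elif domain_load < 0.7:
        return OPERATING_POINTS[1]
    return OPERATING_POINTS[2]

# Different loads on different core groups land on different clocks and voltages.
loads = [0.1, 0.5, 0.9, 0.9]
print([pick_operating_point(l) for l in loads])
# [(500, 0.85), (750, 0.95), (1000, 1.05), (1000, 1.05)]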
 
Looks to be better than I expected. 4x setup and geometry engines and 8x total PolyMorph sounds like they may indeed pip the AMD crew this round. Very modular too, which should allow them (through binning) to have a reasonable range out quite quickly. No mention of tessellation engines (Fermi had 16, I think); have they incorporated that into their PolyMorph v2 or something?
 
I'm hearing that, apart from high parallelization at the scheduler level, each small set of cores (a lower-level set than the SMX) has a performance clock/voltage domain of its own. So not all 1536 CUDA cores will be running at the same clock speed (unless there's maximum or bare-minimum load). There will be hundreds of them running at countless combinations of clocks and voltages. It's as if the GPU knows exactly how much energy each single hardware resource needs at a given load.

That's amazing. Any word on what's the minimum set that can be disabled? I thought that was an entire SMX? But with such control over clocks and voltage it could be a lower level, I guess.

BTW, that opens up an amazing opportunity for harvesting parts for the second SKU, though I'm not sure they'd do it or if it is desirable for us. Instead of requiring the entire chip to be clocked (and voltage-set) to the lowest common denominator, it may be possible for them to clock only the parts that do not meet requirements lower, while the ones that can clock "normally" (high) could remain at the highest clock. It could be hard to implement and maybe even harder to make a SKU out of it, but on the tech level it would be amazing.
 
has a performance clock/voltage domain of its own. So not all 1536 CUDA cores will be running at the same clock speed (unless there's maximum or bare-minimum load)
OK, that's something... if they built into the chip level the idea of using dynamic profiles or Adaptive V-Sync to shut down sections of CUDA cores, and are less dependent on just changing the core clock dramatically... this may, as said, be a "game changer" if NVIDIA really implemented it and made it integral at the chip level... they may have had this from the start, and not just some hocus pocus afterthought!
 
That's amazing. Any word on what's the minimum set that can be disabled? I thought that was an entire SMX? But with such control over clocks and voltage it could be a lower level, I guess.

BTW, that opens up an amazing opportunity for harvesting parts for the second SKU, though I'm not sure they'd do it or if it is desirable for us. Instead of requiring the entire chip to be clocked (and voltage-set) to the lowest common denominator, it may be possible for them to clock only the parts that do not meet requirements lower, while the ones that can clock "normally" (high) could remain at the highest clock. It could be hard to implement and maybe even harder to make a SKU out of it, but on the tech level it would be amazing.

I doubt they will disable less than an SM at a time. If the difference in hardware between two SKUs is less than 10 percent, then the difference in performance is almost guaranteed to be even less than that. No one is going to spend another 50 bucks to get 3 or 4 more fps.

Also, the chance of independently setting the max clock rate for each SM is exactly nil. It might make for marginally higher yields, but it would be a net loss in productivity because of all of the testing that would have to occur. It would also pretty much kill overclocking. Imagine trying to OC a card with 20 different clock-speed sliders and 20 separate voltage tables.
 
I doubt they will disable less than an SM at a time. If the difference in hardware between two SKUs is less than 10 percent, then the difference in performance is almost guaranteed to be even less than that. No one is going to spend another 50 bucks to get 3 or 4 more fps.

2 words: Lower clocks.

Also, the chance of independently setting the max clock rate for each SM is exactly nil. It might make for marginally higher yields, but it would be a net loss in productivity because of all of the testing that would have to occur.

That is a valid point; requiring some kind of "profile" or qualification for each SM could be hard, which is why I did say it could be hard. But considering the huge amount of control that they already put there, I don't think it would be tremendously far-fetched to think about some kind of hardware automation for the next iteration, so that each SM can find (and report) its best clock and use the best voltage accordingly (if it does not do that already).

Anyway, I already questioned the feasibility of my comment regarding the possible SKUs. But tbh they could still make a SKU based on "average" clock or average performance or something like that.

An example: imagine that the chip only had 2 SMs: 1 SM capable of 900 MHz, 1 SM capable of 1000 MHz.

1) Under normal conditions it would be a 900 MHz SKU, because you have to limit the card to the lowest common denominator.
2) With dynamic clocking maybe it could be a 950 MHz SKU, because that's the average clock both SMs would be running at. Each chip would be different, but of course stock performance would be limited to a certain level, and that already occurs on current cards anyway.
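Just to make that arithmetic explicit (nothing here beyond the numbers in the example above), in plain Python:

sm_clocks = [900, 1000]                   # max stable clock of each SM, in MHz
base_sku_clock = min(sm_clocks)           # 900 MHz: lowest common denominator
dynamic_sku_clock = sum(sm_clocks) / len(sm_clocks)   # 950.0 MHz average
print(base_sku_clock, dynamic_sku_clock)  # 900 950.0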

It would also pretty much kill overclocking. Imagine trying to OC a card with 20 different clock-speed sliders and 20 separate voltage tables.

Eehh... you didn't read what Btarunr said, right? You don't have to do anything; the chip does it by itself. You don't have to deal with 20 different sliders. There's just a main one, like always, and the chip finds what's best for each SM at any given time.
 
Sounds simply amazing......:eek:... fingers crossed that they actually deliver and it's not all theory.... The future is looking much greener. So gonna build something needlessly and senselessly overpowered because... THE TECH IS THERE
 
BTW, that opens up an amazing opportunity for harvesting parts for the second SKU, though I'm not sure they'd do it or if it is desirable for us. Instead of requiring the entire chip to be clocked (and voltage-set) to the lowest common denominator, it may be possible for them to clock only the parts that do not meet requirements lower, while the ones that can clock "normally" (high) could remain at the highest clock. It could be hard to implement and maybe even harder to make a SKU out of it, but on the tech level it would be amazing.

That's what I had just said :rolleyes:

:wtf: So let me get this right: by all indications you can OC the GPU core parts (that'll just be the 4x setup and PolyMorph x8?), if there is any more OC headroom, but in all likelihood you won't be able to adjust the shader speed :wtf:, or it's likely to be ineffective in that they may downclock anyway. Me personally, I'm not so keen on redundancy; max all the way, every day :)

Jebus wizz, throw us a bone, gdam it; thumbs up to that, cookie or not ;)
 
That's what I had just said :rolleyes:

:wtf: So let me get this right: by all indications you can OC the GPU core parts (that'll just be the 4x setup and PolyMorph x8?), if there is any more OC headroom, but in all likelihood you won't be able to adjust the shader speed :wtf:, or it's likely to be ineffective in that they may downclock anyway. Me personally, I'm not so keen on redundancy; max all the way, every day :)

Jebus wizz, throw us a bone, gdam it; thumbs up to that, cookie or not ;)

Have a look at this link:

http://imgur.com/a/aQmuA#6n7nC

Here you'll find slides that I think were not posted here such as something about... overclocking!
 
Eehh... you didn't read what Btarunr said, right? You don't have to do anything; the chip does it by itself. You don't have to deal with 20 different sliders. There's just a main one, like always, and the chip finds what's best for each SM at any given time.
I did read what he said. What I was referring to is what you were talking about here:

An example: imagine that the chip only had 2 SMs: 1 SM capable of 900 MHz, 1 SM capable of 1000 MHz.

1) Under normal conditions it would be a 900 MHz SKU, because you have to limit the card to the lowest common denominator.
2) With dynamic clocking maybe it could be a 950 MHz SKU, because that's the average clock both SMs would be running at. Each chip would be different, but of course stock performance would be limited to a certain level, and that already occurs on current cards anyway.
There's no way for the GPU to know at what clocks and voltages it's stable. You have to test it, figure it out, and tell it. If you're trying to overclock a card with asymmetrical maximum clock speeds and voltages, you're going to have to figure out the best clock and voltage for each SM. That's simply unfeasible. The way it's going to work is that you will determine a single max clock speed and voltage for the card, and it will underclock itself when it determines that it doesn't need the additional processing power.
 
There's no way for the GPU to know at what clocks and voltages it's stable. You have to test it, figure it out, and tell it. If you're trying to overclock a card with asymmetrical maximum clock speeds and voltages, you're going to have to figure out the best clock and voltage for each SM. That's simply unfeasible.

I'm not talking about OC, as in users OCing the cards; I never did. I'm talking about factory profiles, and yes, they are feasible.

The way it's going to work is that you will determine a single max clock speed and voltage for the card, and it will underclock itself when it determines that it doesn't need the additional processing power.

Kepler cards already do much more than that, according to the info revealed, which once again makes me think that you have not read about it. When the card detects that power consumption is lower than a previously set value, it overclocks/overvolts itself until the limit is reached.

The user, yes, only sets a base clock and voltage, and the GPU sets a maximum boost clock based on that; then it goes up or down as required by GPU load and power consumption.
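Conceptually, something like the loop below. This is only a rough sketch of the behaviour described, with placeholder numbers and invented names, not anything taken from NVIDIA documentation.

BASE_CLOCK_MHZ = 1006     # user/BIOS base clock (placeholder value)
MAX_BOOST_MHZ = 1100      # boost ceiling derived from the base clock (placeholder)
POWER_LIMIT_W = 195       # previously set board power target (placeholder)
STEP_MHZ = 13             # granularity of each clock adjustment (placeholder)

def next_clock(current_clock, measured_power_w):
    """Raise the clock while under the power limit, back off when over it."""
    if measured_power_w < POWER_LIMIT_W and current_clock < MAX_BOOST_MHZ:
        return min(current_clock + STEP_MHZ, MAX_BOOST_MHZ)
    if measured_power_w > POWER_LIMIT_W and current_clock > BASE_CLOCK_MHZ:
        return max(current_clock - STEP_MHZ, BASE_CLOCK_MHZ)
    return current_clock

clock = BASE_CLOCK_MHZ
for power in [150, 160, 170, 210, 205, 180]:   # sampled board power readings (W)
    clock = next_clock(clock, power)
    print(power, clock)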
 
That's very interesting, though it seems on par with the 7970 without this fancy dynamic clock thing.
The 7970 has a good 30% overclock headroom, but you have to do it manually; NVIDIA will do it when needed.
I'm assuming you will be able to set a maximum clock rate on the Kepler and it will max out when needed,
kinda similar to turbo mode in CPUs,
though setting a certain clock at all times might change everything.
That being said, I'm sure it's gonna be tricky to review this thing! But I can't wait to see the real benchmarks and how the Kepler cores perform without that dynamic clock trick.
 
The hierarchy starts with the GigaThread Engine, which marshals all the unprocessed and processed information between the rest of the GPU and the PCI-Express 3.0 system interface. Below this are four graphics processing clusters (GPCs), each holding one common resource, the raster engine, and two streaming multiprocessors (SMs). Only this time, innovation has gone into redesigning the SM; it is called the SMX. Each SMX has one next-generation PolyMorph 2.0 engine, an instruction cache, 192 CUDA cores, and other first-level caches. So four GPCs of two SMXs each, and eight SMXs of 192 CUDA cores each, amount to the 1536 CUDA core count. There are four raster units (amounting to 32 ROPs), eight geometry units (each with a tessellation unit), and some third-level cache.

Oh, right.
 
Nearly... there :rockout::cry::(
 