
The nVidia memory bandwidth myth explained.

newtekie1

Ok, I've seen a few people say that nVidia's cards have more memory bandwidth, and hence will perform better in applications that use more memory bandwidth. The reasoning behind this is the larger memory bus on the GTX400 series cards. Well, this isn't really true.

The easiest way to explain it is probably a pretty table:
[TABLE=head;sort=4d]Card | Memory Bus | Memory Clock | Memory Bandwidth
GTX480 | 384-bit | 924MHz | 177.4GB/s
GTX470 | 320-bit | 837MHz | 133.9GB/s
HD5870 | 256-bit | 1200MHz | 153.6GB/s
HD5850 | 256-bit | 1000MHz | 128.0GB/s[/TABLE]

Yes, the GTX480 is on top in memory bandwidth, but the lead is not as big as some would expect. The reason? nVidia is using relatively slow GDDR5 memory, almost like last generation's memory on this generation's cards, and they pushed the memory bus up to make up for it. Of course, a larger memory bus means more memory chips, more power consumption, and more signals to keep stable, which naturally leads to lower clock speeds too. Higher-clocked memory is also more expensive, and their cards already cost a lot to produce. On top of that, the memory on the ATi cards overclocks better, again because there are fewer chips and they use higher-quality memory. That means even an HD5870 can be overclocked to surpass an overclocked GTX480 in memory bandwidth, and an HD5850 can be overclocked to surpass an overclocked GTX470.
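For anyone who wants to check the numbers in the table, here is a quick sketch (Python, using only the bus widths and memory clocks listed above) of how GDDR5 bandwidth is calculated; GDDR5 transfers four bits per pin per memory clock:

[CODE]
# Rough sketch: GDDR5 bandwidth = (bus width in bytes) * memory clock * 4 transfers per clock.
# Card figures are taken straight from the table above.

def gddr5_bandwidth_gbs(bus_width_bits, mem_clock_mhz):
    """Theoretical bandwidth in GB/s for GDDR5 (quad data rate)."""
    bytes_per_transfer = bus_width_bits / 8    # bus width in bytes
    effective_mts = mem_clock_mhz * 4          # GDDR5 moves 4 bits per pin per clock
    return bytes_per_transfer * effective_mts / 1000  # MB/s -> GB/s

cards = {
    "GTX480": (384, 924),
    "GTX470": (320, 837),
    "HD5870": (256, 1200),
    "HD5850": (256, 1000),
}

for name, (bus_bits, clock_mhz) in cards.items():
    print(f"{name}: {gddr5_bandwidth_gbs(bus_bits, clock_mhz):.1f} GB/s")

# GTX480: 177.4 GB/s
# GTX470: 133.9 GB/s
# HD5870: 153.6 GB/s
# HD5850: 128.0 GB/s
[/CODE]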
 
Would the difference in memory size matter as well or not? If it does, by how much?

GTX 480 - HD 5870
1536MB > 1024MB (there are 2048MB configurations, but they are not as common)

GTX 470 - HD 5850
1280MB > 1024MB (there are 2048MB configurations, but they are not as common)
 
the memory chips are actually the same on both ati and nvidia. it's the memory controller inside the gpu and the signal routing on the pcb that makes the difference in memory clock
 
the memory chips are actually the same on both ati and nvidia. it's the memory controller inside the gpu and the signal routing on the pcb that makes the difference in memory clock

So custom PCBs really make a difference, besides the material(s) used and the length of the PCB?
 
So custom PCBs really make a difference, besides the material(s) used and the length of the PCB?

yes of course, if there was no difference then the ref design would be the cheapest solution possible, leaving no reason for custom pcb designs
 
Would the difference in memory size matter as well or not? If it does, by how much?
Not entirely sure, but I think it goes like this: as long as you have more memory than you're using, the memory with the highest bandwidth has the advantage. Once you're using more memory than you have, data has to be swapped in and out of that memory, and bandwidth takes an extra hit because of this. In the second case, more memory, not faster memory, has the advantage.

I honestly can't think of any game (except GTA IV perhaps?) that requires more than 1024MB of memory at reasonable resolutions, but that's mostly because I have no clue how much any game uses. Maybe there's a tool for that? Could be interesting for reviews.
 
Would the difference in memory size matter as well or not? If it does, by how much?

GTX 480 - HD 5870
1536MB > 1024MB (there are 2048MB configurations, but they are not as common)

GTX 470 - HD 5850
1280MB > 1024MB (there are 2048MB configurations, but they are not as common)

I think from what we've seen with W1z's reviews of the 2GB ASUS HD 5870, anything over 1GB makes next to no difference in current games, even at 2560x1600. We might see games in the future that show a difference, but I'm not counting on anything coming down the line any time soon.

the memory chips are actually the same on both ati and nvidia. it's the memory controller inside the gpu and the signal routing on the pcb that makes the difference in memory clock

Maybe on the GTX480, as they do use 1250MHz chips, but the GTX470 uses 1000MHz chips while even the HD5850 uses 1250MHz chips.

The memory controller also plays a big part in the clock speeds, which I forgot to mention, and from my understanding a larger bus makes the memory controller(s) work harder too, which leads to lower clock speeds. It is similar to how a motherboard in single-channel mode can clock the memory higher than in dual-channel mode, yet in the end single channel still ends up with less memory bandwidth.
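To put rough numbers on that single- vs dual-channel comparison, here is a small sketch (Python, with made-up example transfer rates, not measurements):

[CODE]
# Illustration of the single- vs dual-channel point above.
# The transfer rates are hypothetical examples, not measured values.

def system_ram_bandwidth_gbs(channels, effective_mts):
    """Theoretical bandwidth: channels * 8 bytes (64-bit channel) * transfers per second."""
    return channels * 8 * effective_mts / 1000

single = system_ram_bandwidth_gbs(channels=1, effective_mts=1800)  # clocks a bit higher
dual = system_ram_bandwidth_gbs(channels=2, effective_mts=1600)    # clocks a bit lower

print(f"single channel @ 1800 MT/s: {single:.1f} GB/s")  # 14.4 GB/s
print(f"dual channel   @ 1600 MT/s: {dual:.1f} GB/s")    # 25.6 GB/s, still far ahead
[/CODE]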
 
Not entirely sure, but I think it goes like this: as long as you have more memory than you're using, the memory with the highest bandwidth has the advantage. Once you're using more memory than you have, data has to be swapped in and out of that memory, and bandwidth takes an extra hit because of this. In the second case, more memory, not faster memory, has the advantage.

I honestly can't think of any game (except GTA IV perhaps?) that requires more than 1024MB of memory at reasonable resolutions, but that's mostly because I have no clue how much any game uses. Maybe there's a tool for that? Could be interesting for reviews.

GPU-Z and MSI Afterburner show how much memory a video card is using; apply 4xAA or 8xAA on some games and you'll see. I've seen Stalker CoP using 1200MB and Flatout: Ultimate Carnage with 32xAA + 8x supersampling using 800MB :)
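If you'd rather read the number from a script than a GUI, something like the following works on NVIDIA cards, assuming a driver recent enough to support nvidia-smi's query interface (just a sketch; GPU-Z and Afterburner remain the easy way):

[CODE]
# Minimal sketch: query current VRAM usage via nvidia-smi (NVIDIA only, and assumes
# a driver new enough to provide the --query-gpu interface).
import subprocess

result = subprocess.run(
    ["nvidia-smi", "--query-gpu=memory.used,memory.total", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print(result.stdout.strip())  # e.g. "812 MiB, 1536 MiB"
[/CODE]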
 
yes of course, if there was no difference then the ref design would be the cheapest solution possible, leaving no reason for custom pcb designs
So the reference design isn't the cheapest? Does that mean that custom PCBs (which you usually pay more for) are cheaper than the reference design?

If the above is true, why not make the reference design better and/or cheaper? I'm confused


Unrelated question: if memory chips are spec'ed at a certain frequency, why run them slower than that?
 
So the reference design isn't the cheapest? Does that mean that custom PCBs (which you usually pay more for) are cheaper than the reference design?

If the above is true, why not make the reference design better and/or cheaper? I'm confused


Unrelated question: if memory chips are spec'ed at a certain frequency, why run them slower than that?

Well, there may be other details, but in the case of 5850s at least, the non-reference models have a cheaper voltage regulator (one that does not support software voltage adjustment; that's a feature, and features cost money). It may also be cheaper for them to design and produce their own coolers, but I am speculating there...
 
So the reference design isn't the cheapest? Does that mean that custom PCBs (which you usually pay more for) are cheaper than the reference design?

If the above is true, why not make the reference design better and/or cheaper? I'm confused

The way I see it, there are two reasons to use a custom PCB.

1.) Use cheaper parts and a cheaper PCB layout to reduce costs.
2.) Use beefier parts and a better PCB layout to make the card better.

A good example of option 1 is most of the PCBs used in HD5830 cards, where the manufacturers cheaped out to make cheaper cards and maximize profits. The result was cards that consumed more power, despite being slower, than reference HD5850 cards. Another example would be the Powercolor PCS+ HD5850, where the components were cheaper, like a cheaper voltage regulator that doesn't allow voltage control.

A good example of option 2 is the ASUS HD5870 Matrix Platinum, where everything on the PCB was beefed way the hell up.

Unrelated question: if memory chips are spec'ed at a certain frequency, why run them slower than that?

Well, one big reason is the limitations of the memory controller, as it can only handle such a big memory bus at certain speeds, as W1z mentioned. Then there is also the possibility that the memory is running at below-spec voltages to help cut power consumption (and the GTX400 cards need every bit of help there they can get). I'm sure there are other reasons; those are just two that come to mind.
 
GPU-Z and MSI Afterburner show how much memory a video card is using; apply 4xAA or 8xAA on some games and you'll see. I've seen Stalker CoP using 1200MB and Flatout: Ultimate Carnage with 32xAA + 8x supersampling using 800MB :)

Yeah, when there's enough memory, a game will take as much as it can, but that doesn't mean it needs all that much, not when we are talking about 1024 MB and above. Most of that memory goes to textures and game data, and that data isn't always recent or going to be reused anytime soon. Sometimes freeing up the memory is actually less efficient than just leaving the data there.

Also, loading something (e.g. a texture) into memory takes orders of magnitude less time than the time the texture will be in use. So even if data must be loaded for almost every frame, performance will not be degraded by more than 1-5%. In the past the output buffers occupied a very big part of the VRAM (I'm talking about the days when we had 128-256 MB), and hence AA and high resolutions could make video cards slow to a crawl. With 512 MB the effect started to shrink, because the size of the output buffers has not changed all that much, and with more than 1 GB it becomes almost negligible, especially considering that newer techniques allow for more efficient use of the memory.
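A rough back-of-the-envelope calculation shows why the output buffers mattered so much on 128-256 MB cards and matter far less now (a sketch only; the real buffer layout depends on the GPU and driver):

[CODE]
# Order-of-magnitude estimate of output buffer size; real layouts vary per GPU/driver.

def output_buffers_mb(width, height, msaa_samples=1, bytes_per_pixel=4):
    pixels = width * height
    color_ms = pixels * bytes_per_pixel * msaa_samples  # multisampled color buffer
    depth_ms = pixels * 4 * msaa_samples                # multisampled depth/stencil buffer
    resolved = pixels * bytes_per_pixel * 2             # resolved front + back buffers
    return (color_ms + depth_ms + resolved) / (1024 * 1024)

print(f"1280x1024, no AA : {output_buffers_mb(1280, 1024):6.1f} MB")    # ~20 MB
print(f"2560x1600, 8xMSAA: {output_buffers_mb(2560, 1600, 8):6.1f} MB") # ~281 MB
[/CODE]

On a 256 MB card the second case alone would eat most of the framebuffer; on a 1 GB+ card it is a much smaller slice, which matches the behaviour described above.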

In that regard, I'm really curious about what's going to happen in the future with Carmack's improved megatextures and the octree data representation he talks about (if he finally implements it in id Tech 5/6). They could potentially make even 1GB of VRAM overkill.

Well, one big reason is the limitations of the memory controller, as it can only handle such a big memory bus at certain speeds, as W1z mentioned. Then there is also the possibility that the memory is running at below-spec voltages to help cut power consumption (and the GTX400 cards need every bit of help there they can get). I'm sure there are other reasons; those are just two that come to mind.

ECC memory comes to mind too. Aside from the fact that adding ECC support probably made the MC slower to boot, ECC memory is usually much slower, and they might not want to see their consumer GPUs crushing their Tesla cards in those CUDA/OpenCL programs that both Tesla and GeForce cards will be able to run. If memory bandwidth were so much greater on the GeForce cards (e.g. 4800 MHz vs 3200 MHz effective), some or many CUDA apps would certainly run much better on them than on the Tesla cards. AMD and Intel have always done the same with their professional-grade CPUs: they keep the best dies for their Xeon and Opteron lines so those look superior and can command higher prices. At stock they usually come in slower SKUs, but they almost invariably overclock further, and with less voltage, than their consumer counterparts.
 
Yeah, when there's enough memory, a game will take as much as it can, but that doesn't mean it needs all that much, not when we are talking about 1024 MB and above. Most of that memory goes to textures and game data, and that data isn't always recent or going to be reused anytime soon. Sometimes freeing up the memory is actually less efficient than just leaving the data there.

Also, loading something (e.g. a texture) into memory takes orders of magnitude less time than the time the texture will be in use. So even if data must be loaded for almost every frame, performance will not be degraded by more than 1-5%. In the past the output buffers occupied a very big part of the VRAM (I'm talking about the days when we had 128-256 MB), and hence AA and high resolutions could make video cards slow to a crawl. With 512 MB the effect started to shrink, because the size of the output buffers has not changed all that much, and with more than 1 GB it becomes almost negligible, especially considering that newer techniques allow for more efficient use of the memory.

In that regard, I'm really curious about what's going to happen in the future with Carmack's improved megatextures and the octree data representation he talks about (if he finally implements it in id Tech 5/6). They could potentially make even 1GB of VRAM overkill.



ECC memory comes to mind too. Aside from the fact that adding ECC support probably made the MC slower to boot, ECC memory is usually much slower, and they might not want to see their consumer GPUs crushing their Tesla cards in those CUDA/OpenCL programs that both Tesla and GeForce cards will be able to run. If memory bandwidth were so much greater on the GeForce cards (e.g. 4800 MHz vs 3200 MHz effective), some or many CUDA apps would certainly run much better on them than on the Tesla cards. AMD and Intel have always done the same with their professional-grade CPUs: they keep the best dies for their Xeon and Opteron lines so those look superior and can command higher prices. At stock they usually come in slower SKUs, but they almost invariably overclock further, and with less voltage, than their consumer counterparts.


Hmm, I think the GTX 4xx cards don't use ECC, because they said it would make GeForce slower.

And btw, has anyone seen GTX 465 benchmarks?

It has more memory bandwidth and more SPs than the GTX 275, but the performance was the same :ohwell: Does anyone know how that happens?
 
Hmm, I think the GTX 4xx cards don't use ECC, because they said it would make GeForce slower.

And btw, has anyone seen GTX 465 benchmarks?

It has more memory bandwidth and more SPs than the GTX 275, but the performance was the same :ohwell: Does anyone know how that happens?

It doesn't indeed, but the memory controller does have ECC, and ECC support is not a magical checkbox somewhere. It's actual transistors implemented in actual silicon, and it's impossible to include more transistors between point A and point B without adding latencies or potentially making the whole thing slower, unstable, or more sensitive to clock changes.

But apart from that, I was speaking of them deliberately crippling memory bandwidth on desktop cards so that they are not much faster than Tesla cards in certain CUDA applications. It just wouldn't make for good marketing of the Teslas, and that's where they will be making most of their money from GF100. Later, GF104, GF106 and GF108, and maybe even a GF102, will be released without all the extra things that GF100 includes solely for CUDA/OpenCL. That could potentially eliminate all the obstacles they ran into with GF100 and allow much faster clocks.
 
it's impossible to include more transistors between point A and point B without adding latencies or potentially making the whole thing slower

actually you can do exactly that with more transistors, basically do things in parallel
 
ECC memory comes to mind too. Aside from the fact that adding ECC support probably made the MC slower to boot, ECC memory is usually much slower, and they might not want to see their consumer GPUs crushing their Tesla cards in those CUDA/OpenCL programs that both Tesla and GeForce cards will be able to run.

Afaik, ECC is enabled only on Tesla; not GeForce and Quadro.
 
So, basically, all Radeon HD5000 and GeForce GTX400 cards have the memory controller embedded in the GPU, just like today's CPUs?
If so, I see ATI doing it Intel-style and NVIDIA doing it AMD-style with their IMCs. Intel's IMC usually allows higher memory overclocks; we see more than 2000MHz being common with Intel, whereas AMD can't compete in memory clock speed.

That is completely beyond me. NVIDIA had a long time to prepare GF100 and they still came out with a low-speed IMC. Correct me if I'm wrong.
 
actually you can do exactly that with more transistors, basically do things in parallel

You are suggesting implementing twice the number of MCs? One for ECC and one for non-ECC? I don't understand what you mean, unless you are talking in general and not about this particular case. If you're talking in general, I agree, to an extent, but if you're not, I can't agree or disagree until I understand what you mean. :)
If you are saying what I think you are saying, I'm not sure adding more silicon like that would help make Fermi faster at all.

For clarification, maybe I worded it badly, but by A/B I meant the input and output of the machine, in this case the MC. Adding something in parallel is not exactly adding something between A and B in the context I was speaking of. It's introducing two machines, one from C to D and one from E to F, both of which go to a switch or something, and it's the switch that has access to A and B. I hope that's clear.


It doesn't indeed, but the memory controller does have ECC, and ECC support is not a magical checkbox somewhere. It's actual transistors implemented in actual silicon, and it's impossible to include more transistors between point A and point B without adding latencies or potentially making the whole thing slower, unstable, or more sensitive to clock changes.
 
The GeForce series doesn't have ECC... at least it's not turned on.



The latter is the most interesting, as under normal circumstances implementing ECC requires a wider bus and additional memory chips. The GTX 400 series will not be using ECC, but we went ahead and asked NVIDIA how ECC will work on Fermi products anyhow.



The short answer is that when NVIDIA wants to enable ECC they can just allocate RAM for the storage of ECC data. When ECC is enabled the available RAM will be reduced by 1/8th (to account for the 9th ECC bit) and then ECC data will be distributed among the RAM using that reserved space. This allows NVIDIA to implement ECC without the need for additional memory channels, at the cost of some RAM and some performance.


From WIKI
http://en.wikipedia.org/wiki/GeForce_400_Series

While the Fermi architecture includes support for the ECC feature on chip,[17] there is no option to enable ECC on GeForce GTX 470 and 480 cards.


Yes, the MC most likely does have the ECC transistors inside it, but they're turned off and would have no effect on the bandwidth.
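For what it's worth, the 1/8th reservation described in the AnandTech quote above works out like this (purely illustrative; GeForce cards never expose the option, so the 1536MB figure is only borrowed from the GTX480):

[CODE]
# Sketch of the software ECC scheme described above: 1/8th of the RAM is set aside
# to hold ECC data for the rest. GeForce cards never expose this; illustrative only.

def ecc_split(total_mb):
    reserved = total_mb / 8       # space set aside for ECC data
    usable = total_mb - reserved  # what applications actually see
    return usable, reserved

usable, reserved = ecc_split(1536)
print(f"usable: {usable:.0f} MB, reserved for ECC: {reserved:.0f} MB")
# usable: 1344 MB, reserved for ECC: 192 MB
[/CODE]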
 
You are not getting what I'm saying, guys. It's not bandwidth that gets hurt. It's the fact that adding more transistors and more traces, even if they are not active, makes the path from A to B longer; that's unavoidable, and that can potentially make the whole machine sitting between A and B slower. Light travels fast, but not fast enough when we are talking about these distances and these clock speeds. It's probably one of the most essential concerns in chip design.


And next time don't use that tone with me. Reminding me of my position, etc.

First of all, I think you misinterpreted the tone.

You know who I am, and I'm sure you remember my story in the "what's wrong with our forums" thread. It was made very clear to me that mods are like any other members. Now, I've seen many mods, you included, verbally punishing people for doing exactly what you did: posting unnecessarily before reading.

If you want to punish me, do it, but please don't use the power of that position you don't want me to mention to threaten me.
 
I honestly doubt less than 1 mm will affect the latency enough to notice a change in performance.

It's likely a distance of a few extra nm. Sure, it's better to be closer, but the difference will be less than a nanosecond.

Take a look at RAM, for instance: it's a good 10 cm away in terms of trace distance, and it reaches 40 ns latency.

Also, electricity in a wire doesn't flow as fast as light; 95% of the speed of light would be a closer guess.
 
You are not getting what I'm saying, guys. It's not bandwidth that gets hurt. It's the fact that adding more transistors and more traces, even if they are not active, makes the path from A to B longer; that's unavoidable, and that can potentially make the whole machine sitting between A and B slower. Light travels fast, but not fast enough when we are talking about these distances and these clock speeds. It's probably one of the most essential concerns in chip design.

c = 3*10^5 km/s = 3*10^11 mm/s = 300 mm/nanosecond.

So to travel the 100 mm to the memory the signal needs 0.3 ns. where do you think the rest of the time is spent if not in the memory controller ?

even at 1 ghz clock speed the latency of a single request is not 1 ns, it's much much more
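Putting those numbers next to the 40 ns figure from earlier makes the point clearer (a quick sketch; it uses the free-space speed of light, and real signals are slower still, which only strengthens the argument):

[CODE]
# Wire delay over a 100 mm trace versus a typical DRAM access, using the figures above.

c_mm_per_ns = 300.0        # speed of light: ~3e8 m/s = 300 mm per nanosecond
trace_length_mm = 100.0    # GPU-to-memory trace length used above
dram_latency_ns = 40.0     # DRAM access latency mentioned earlier in the thread

propagation_ns = trace_length_mm / c_mm_per_ns
print(f"propagation delay: {propagation_ns:.2f} ns")                       # 0.33 ns
print(f"share of a 40 ns access: {propagation_ns / dram_latency_ns:.1%}")  # ~0.8%
[/CODE]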
 
c = 3*10^5 km/s = 3*10^11 mm/s = 300 mm/nanosecond.

So to travel the 100 mm to the memory the signal needs 0.3 ns. where do you think the rest of the time is spent if not in the memory controller ?

Could you define what C is? :P

or is it just a letter representing the final number.
 