
The nVidia memory bandwidth myth explained.

c is the speed of light. As mentioned above, signals in a wire don't actually travel at the speed of light, but it's a usable approximation.
 
so, basically, all Radeon HD 5000 and GeForce GTX 400 cards have the IMC embedded on the GPU, just like today's CPUs?
If so, I see ATI doing the Intel thing and NVIDIA doing the AMD thing with their IMCs. Intel's IMC usually allows higher memory overclocks; more than 2000 MHz is common with Intel, whereas AMD can't compete on memory clock speed.

That is completely beyond me. NVIDIA had a long time to prepare GF100, and they still only came out with a low-speed IMC. Correct me if I'm wrong.

Even though AMD was the first of the two to implement a memory controller directly on the CPU, they seem to struggle doing so, always having problems getting a proper implementation.

First it was the DDR controller on Socket 754 not allowing dual channel. Then it was the DDR controller on Socket 939 not being able to handle 4 sticks at 1T and 400+ MHz. Then their DDR2 controller on AM2 couldn't handle 4 sticks at 1066. And then there were the problems getting a stable DDR3 controller for AM3, which led them to disable it in the first batch of AM3 processors and release them as AM2-only.

With nVidia, the issue is probably more down to having a 384-bit memory controller. It is a little different, but to give you an idea: on motherboards, single channel is 64-bit, so dual channel is 128-bit and triple channel is 192-bit. Graphics cards have been using 256-bit as a standard for a good long while now, and I foresee it being the standard for a long while still, simply because of what we are seeing in the nVidia cards, with the memory controller not being able to handle the higher clock speeds.

They did this before with the G80 cards, and eventually went back to a 256-bit bus with G92. ATi tried it too with their HD 2900 series and its 512-bit bus, and they obviously learned from their mistake. In fact, IIRC, the HD 2900 XT used GDDR3 rated for 1000 MHz but clocked it at only 825 MHz over the 512-bit bus, and it only overclocked into the 970 MHz range. nVidia at the time was using GDDR3 rated for only 900 MHz, clocking it at 900 MHz over the 384-bit bus, and it overclocked to well over 1000 MHz.
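Running those clocks through a quick Python back-of-envelope shows the bus-width trade-off in numbers (the 8800 GTX attribution is my inference from the 384-bit / 900 MHz figures; treat this as a sketch):

```python
# Peak bandwidth = (bus width in bytes) * effective data rate.
# GDDR3 is double data rate, so effective rate = 2 * memory clock.
def peak_gb_per_s(bus_bits: int, mem_clock_mhz: float) -> float:
    return bus_bits / 8 * (2 * mem_clock_mhz) * 1e6 / 1e9

print(peak_gb_per_s(512, 825))  # HD 2900 XT: 105.6 GB/s
print(peak_gb_per_s(384, 900))  # 8800 GTX-class G80: 86.4 GB/s
```

So the wide bus still won on paper bandwidth despite the lower clock, which is the whole appeal of going wide.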
 
c = 3*10^5 km/s = 3*10^11 mm/s = 300 mm/nanosecond.

So to travel the 100 mm to the memory, the signal needs about 0.33 ns. Where do you think the rest of the time is spent, if not in the memory controller?
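A throwaway script for the same arithmetic (the 100 mm trace length is from the post above; the 1 ns clock period is my assumption, just for scale):

```python
# Signal flight time over the ~100 mm to the memory chips, using the
# speed-of-light figure above (real traces are a bit slower, ~0.95c).
c_mm_per_ns = 300.0            # 3*10^11 mm/s = 300 mm/ns
trace_mm = 100.0
flight_ns = trace_mm / c_mm_per_ns
print(f"one-way flight time: {flight_ns:.2f} ns")              # ~0.33 ns

# Assumed 1000 MHz memory clock, i.e. a 1 ns period, for comparison:
period_ns = 1.0
print(f"fraction of one clock period: {flight_ns/period_ns:.0%}")  # 33%
```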

You still don't want to get what I mean: any function added will introduce latencies, because it's going to be required to wait for that function to finish. Even if it's not enabled, you need the function that tells whether ECC is enabled or not. I highly doubt ECC is disabled in hardware. If it is and you know it 100% for sure, then I retract my opinion, but otherwise it is very possible that ECC introduces some latencies that make the MC slower.

I'm not talking about how long the request takes, but about the fact that added functions/silicon always limit the maximum stable clock that anything can achieve.
 
You can implement any function to be executed in parallel, without latency, if you are willing to spend the transistors for it. From there on you can reduce your transistor count by using several clock cycles to do it.
 
OK, not to poke holes, but electricity in a wire does not travel at the speed of light; 95% of c would be closer.


Edit: never mind, I see you noted it :P
 
> because it's going to be required to wait for that function to finish.

just put a logic 1 on the "is ECC data good" gate and you are done, no computation needed

if ECC is enabled, don't send the logic 1, but connect the gate to the output of the magic ECC black box

don't think in terms of sequential programming for logic design; you can do everything at the same time there

ECC consumes storage in memory, so if it were always on you'd have less memory usable (this is the case for Tesla cards, not for GeForce)
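For anyone who thinks in code rather than gates, here's a toy Python model of that bypass. The names and the single-parity-bit check are made up for illustration; real SECDED ECC is much more involved, and this is not NVIDIA's actual design:

```python
# Toy combinational model of the ECC bypass described above.
def compute_parity(data: int) -> int:
    """Single even-parity bit over a data word (a stand-in for the
    'magic ECC black box')."""
    return bin(data).count("1") & 1

def data_good(data: int, parity: int, ecc_enabled: bool) -> bool:
    """The 'is ECC data good' gate: tied to logic 1 when ECC is off,
    driven by the checker's output when it is on."""
    if not ecc_enabled:
        return True             # hard-wired logic 1, nothing computed
    return compute_parity(data) == parity

print(data_good(0b1011, 1, ecc_enabled=False))  # True: ECC bypassed
print(data_good(0b1011, 1, ecc_enabled=True))   # True: parity matches
print(data_good(0b1010, 1, ecc_enabled=True))   # False: flipped bit caught

# Storage cost: a typical SECDED code stores 8 check bits per 64 data
# bits, so always-on ECC would reserve 8/64 = 12.5% of capacity --
# hence Tesla boards report less usable memory than GeForce.
print(8 / 64)   # 0.125
```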
 
@ wizz

yep, that's why we see the actual memory size on GeForce, whereas Tesla cards show less than the actual memory
 
> You can implement any function to be executed in parallel, without latency, if you are willing to spend the transistors for it. From there on you can reduce your transistor count by using several clock cycles to do it.
>
> > because it's going to be required to wait for that function to finish.
>
> just put a logic 1 on the "is ECC data good" gate and you are done, no computation needed
>
> if ECC is enabled, don't send the logic 1, but connect the gate to the output of the magic ECC black box
>
> don't think in terms of sequential programming for logic design; you can do everything at the same time there
>
> ECC consumes storage in memory, so if it were always on you'd have less memory usable (this is the case for Tesla cards, not for GeForce)

You can do many things in parallel, but even then that's going to add some internal latencies. You are adding stages, and that adds complexity which can impact speed. You are not exchanging a few complex stages for many simpler stages, which would result in higher clock speeds; you are adding stages that didn't exist before. So you have to ensure interoperability, you have to ensure that all stages take the same time to execute without adding unnecessary traces or waiting times, etc. Adding things in parallel can make the chip run faster, but it will also make it bigger* and hotter, and that will also limit the clocks. Sounds familiar...

* And that will make the overall travelling time longer. You can't make two things occupy the same space as one. Things are done in parallel because two slightly slower things are faster than a single faster one, but that doesn't change the fact that you are adding travelling time (and making the thing slower): you are space constrained, and having two (many more, actually) things going from A to B at the same time makes each trip longer, and that affects the maximum attainable clock.

The fact of the matter is that the MC in Fermi is much slower than the ones on previous generations of cards, and one of the most significant changes is ECC. Can you completely write off the possibility that adding ECC support had a slowing effect? That's all I'm saying. Many things can make a circuit slower, and I don't think any of us is in a position to deny with certainty that something isn't slowing it down.
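To put toy numbers on the stages-limit-the-clock argument: the maximum stable clock is the reciprocal of the critical path delay, so any extra delay on that path costs frequency. Both delay figures below are invented purely for illustration:

```python
# Max clock is set by the longest register-to-register path:
# f_max = 1 / t_critical. Even a small extra mux on that path
# costs clock speed.
base_path_ns = 0.80      # hypothetical critical path of the old MC
ecc_mux_ns = 0.10        # hypothetical added mux/routing delay for ECC

for extra in (0.0, ecc_mux_ns):
    t = base_path_ns + extra
    print(f"critical path {t:.2f} ns -> f_max {1.0 / t:.2f} GHz")
# critical path 0.80 ns -> f_max 1.25 GHz
# critical path 0.90 ns -> f_max 1.11 GHz
```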
 
> With nVidia, the issue is probably more down to having a 384-bit memory controller. It is a little different, but to give you an idea: on motherboards, single channel is 64-bit, so dual channel is 128-bit and triple channel is 192-bit. Graphics cards have been using 256-bit as a standard for a good long while now, and I foresee it being the standard for a long while still, simply because of what we are seeing in the nVidia cards, with the memory controller not being able to handle the higher clock speeds.

Although graphics cards list 256-bit, 384-bit, and even 512-bit buses, it's still many 64-bit controllers. So, really, 256-bit is akin to 4-channel 64-bit, just like CPUs.

I stole this pic from OC3D, but it illustrates this very clearly:

[Image: block diagram of the Fermi memory subsystem, via OC3D]


As you can see, the 384-bit bus of Fermi GPUs is actually 6 separate 64-bit controllers.

EDIT: here is the HD 5870 (stolen from Bit-Tech), same thing, 256-bit as 4x64-bit:
[Image: block diagram of the HD 5870 memory subsystem, via Bit-Tech]
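The arithmetic behind both diagrams, for completeness (the controller counts are read off the diagrams and the bus widths named earlier in this thread):

```python
# Total bus width is just (number of 64-bit controllers) * 64,
# exactly as the block diagrams show.
for name, n in [("HD 5870", 4), ("GF100 / GTX 400", 6), ("HD 2900 XT", 8)]:
    print(f"{name}: {n} x 64-bit = {n * 64}-bit bus")
```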
 

You are correct, they are 64-bit controllers strung together, but they all still must work together, and it is that working together that limits the speed they can run at while maintaining stability.

On a different note, I wonder if upping the GPU voltage, and hence giving the memory controllers more voltage, would actually improve memory overclocks in some cases.
 
I don't think so, newtekie, because Intel's IMC can clock memory a little higher than AMD's despite having a larger bus.

I think what really limits Fermi is its heat, just like the HD 2900 XT.
 
> You are correct, they are 64-bit controllers strung together, but they all still must work together, and it is that working together that limits the speed they can run at while maintaining stability.
>
> On a different note, I wonder if upping the GPU voltage, and hence giving the memory controllers more voltage, would actually improve memory overclocks in some cases.

The HD 5870 has a memory controller voltage supply separate from the GPU voltage supply... so in that instance, no. I'm not sure about Fermi, but given the separate operating speeds, I assume most modern GPUs use a supply different from vGPU, as separating them can only help stability.
 