Monday, April 19th 2021

GPU Memory Latency Tested on AMD's RDNA 2 and NVIDIA's Ampere Architecture

Graphics cards have evolved over the years to feature multi-level cache hierarchies. These cache levels are engineered to bridge the gap between memory and compute, a growing problem that cripples GPU performance in many applications. GPU vendors such as AMD and NVIDIA use different sizes of register files, L1, and L2 caches, depending on the architecture. For example, the L2 cache on NVIDIA's A100 GPU is 40 MB, nearly seven times larger than the 6 MB on the previous-generation V100. That shows how much new applications benefit from bigger caches, and cache sizes keep growing to meet that need.

Today, we have an interesting report coming from Chips and Cheese. The website has measured the GPU memory latency of the latest generation of cards: AMD's RDNA 2 and NVIDIA's Ampere. Using simple pointer-chasing tests in OpenCL, the site got some interesting results. RDNA 2's cache is fast and massive. Compared to Ampere, its cache latency is much lower, while VRAM latency is about the same. NVIDIA uses a two-level cache system consisting of L1 and L2, which seems to be a rather slow solution. Data traveling from Ampere's SM, which holds the L1 cache, to the outside L2 takes over 100 ns.
AMD, on the other hand, has a three-level cache system, with L0, L1, and L2 cache levels complementing the RDNA 2 design. The latency between L0 and L2, even with L1 between them, is just 66 ns. Infinity Cache, which is essentially an L3 cache, adds only 20 ns of additional latency, making it still faster than NVIDIA's cache solution. The massive GA102 die seems to be a big problem for NVIDIA's L2 cache, as data takes many cycles to travel across it. You can read more about the test here.
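For readers unfamiliar with pointer chasing, here is a minimal sketch of the access pattern in Python. It is not the OpenCL code Chips and Cheese used; the function names and sizes are made up for illustration. Each load depends on the result of the previous one, so the hardware cannot overlap or prefetch them, and the time per step approximates latency. In Python the interpreter overhead dominates the timing, so treat the numbers it produces as a demonstration of the method only.

```python
import random
import time


def build_chain(n, seed=0):
    """Return a list where chain[i] is the next index to visit.

    Sattolo's algorithm produces a single cycle through all n slots,
    so the walk touches every element in a hard-to-predict order.
    """
    rng = random.Random(seed)
    chain = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # j < i keeps the permutation one big cycle
        chain[i], chain[j] = chain[j], chain[i]
    return chain


def chase(chain, steps):
    """Follow the chain for `steps` dependent loads and return the last index."""
    idx = 0
    for _ in range(steps):
        idx = chain[idx]  # the next address depends on this load
    return idx


def ns_per_load(n, steps=200_000):
    """Average time per dependent load for a chain of n elements."""
    chain = build_chain(n)
    start = time.perf_counter_ns()
    chase(chain, steps)
    return (time.perf_counter_ns() - start) / steps


if __name__ == "__main__":
    # Hypothetical footprints; a real test sweeps sizes across cache levels.
    for n in (1_000, 100_000, 1_000_000):
        print(n, "elements:", round(ns_per_load(n), 1), "ns per load")
```

A real test runs the same loop on the GPU and sweeps the array size, so that the latency steps up each time the footprint outgrows a cache level.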
Source: Chips and Cheese

91 Comments on GPU Memory Latency Tested on AMD's RDNA 2 and NVIDIA's Ampere Architecture

#26
lynx29
*licks my rock solid stable 6800*
Posted on Reply
#27
claylomax
Aquinus
Do you remember how much more power a 290/390 would consume when clocking up that 512-bit memory?
Yes, I had one :D
I agree that 128 MB is plenty; I was talking about your reply to the other person. I believe he was referring to the space the cache takes, not the amount of cache.
Posted on Reply
#28
Aquinus
Resident Wat-man
claylomax
I believe he was referring to the space the cache takes; not the amount of cache.
I understand. I'm just saying that it's a worthwhile use of die space given the performance and power characteristics of it.
Posted on Reply
#29
HD64G
Kudos to the AMD engineers who made it happen. And surely they had help from the Zen engineers. AMD publicly said so almost 2 years ago when talking about the combined design efforts on Navi. I had many laughs at the posts that try to diminish AMD's achievement in the memory department of their GPUs. That makes AMD's effort seem even more impressive. If AMD's arch is inferior to NVIDIA's and this cache made it win over NVIDIA at 1440p or lower resolutions, while having lower power consumption and a smaller die, that's what I call genius work from AMD's engineering department, which uses a small % of what NVIDIA's engineering uses. As always, haters gonna hate.
Posted on Reply
#30
Punkenjoy
THANATOS
Wasn't that statement about the actual use of IC?
I never said to get rid of the whole IC, which was clearly stated in my post. What I wanted is to halve it (64 MB instead of 128 MB) and use the saved space for more CUs. BTW, I would love to see a performance penalty graph for a smaller IC to know if that much cache is really needed or if it can be smaller.


With 64 MB, it would probably be fine in 1080p, but the hit rate will be much lower in 4K and the card would probably be memory starved. This graph also shows why the cards perform so well in 1440p but start to fall behind in 4K. Probably 256 MB would be the sweet spot for 4K.

The thing is, caches are much easier to manufacture (fewer defects per area) than compute units. They also consume way less power. Also, the shorter the distance data has to travel, the less power it takes. The operation itself takes very little power; it's moving all the data around that uses power. Having a cache that limits the distance data has to travel greatly reduces power consumption.

Also, Infinity Cache is there to prepare for the next step, multi-chip GPUs. But that is another story.
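To put rough numbers on why resolution matters for hit rate, the sketch below computes pixel counts and the size of a single render target per resolution. The 4 bytes per pixel figure is an assumption (one 32-bit color target); a real frame touches many more buffers and textures, so this only shows how the working set scales, not its absolute size.

```python
RESOLUTIONS = {
    "1080p": (1920, 1080),
    "1440p": (2560, 1440),
    "4K": (3840, 2160),
}
BYTES_PER_PIXEL = 4  # assumption: a single 32-bit render target


def pixels(name):
    """Number of pixels in one frame at the named resolution."""
    width, height = RESOLUTIONS[name]
    return width * height


def target_mib(name):
    """Size of one render target in MiB under the assumption above."""
    return pixels(name) * BYTES_PER_PIXEL / 2**20


def pixel_ratio(high, low):
    """How many times more pixels `high` has than `low`."""
    return pixels(high) / pixels(low)


if __name__ == "__main__":
    for name in RESOLUTIONS:
        print(f"{name}: {pixels(name):,} pixels, {target_mib(name):.1f} MiB per target")
    print("1440p vs 1080p:", round(pixel_ratio("1440p", "1080p"), 2))
    print("4K vs 1440p:", round(pixel_ratio("4K", "1440p"), 2))
```

By pixel count alone, 1440p is about 1.78x 1080p and 4K is 2.25x 1440p, so 4K has four times the pixels of 1080p.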
Posted on Reply
#32
Chrispy_
TBH the cache in RDNA2 is less about performance this gen and more about setting up for chiplets. It's not 100% useless but IPC differences between the 6700XT and similar 5700XT without the cache are really low. Sometimes zero, sometimes negligible. The performance uplift is almost entirely down to the 25-30% increase in clockspeeds.


It's a marketing point for now, one that will lay the groundwork for MCM GPUs next gen. Presumably it makes things smoother for raytracing too, as the calculations now involve lookups for more data than just what is relevant to the pixels any particular CU is working on, ergo more data being required. But for traditional raster-based stuff, the HWU video above shows how little benefit it brings this generation.
Posted on Reply
#33
Punkenjoy
Chrispy_
TBH the cache in RDNA2 is less about performance this gen and more about setting up for chiplets. It's not 100% useless but IPC differences between the 6700XT and similar 5700XT without the cache are really low. Sometimes zero, sometimes negligible. The performance uplift is almost entirely down to the 25-30% increase in clockspeeds.

It's a marketing point for now, one that will lay the groundwork for MCM GPUs next gen. Presumably it makes things smoother for raytracing too, as the calculations now involve lookups for more data than just what is relevant to the pixels any particular CU is working on, ergo more data being required. But for traditional raster-based stuff, the HWU video above shows how little benefit it brings this generation.
Well, the 5700 XT has a 256-bit bus where the 6700 XT has a 192-bit bus. The fact that they both maintain similar performance means that the 96 MB cache here is the "equivalent" of a 64-bit bus, more or less.

That is still significant since a smaller memory bus means less space used by the memory controller, fewer pins on the chip, fewer traces on the card, a simpler card layout, etc.
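The bus comparison above can be turned into peak bandwidth numbers with a short sketch. The per-pin data rates are assumptions taken from the cards' public specifications (14 Gbps GDDR6 on the 5700 XT, 16 Gbps on the 6700 XT), not from this thread, and peak bandwidth says nothing about how much of it a game actually uses.

```python
def bandwidth_gb_s(bus_bits, gbps_per_pin):
    """Peak memory bandwidth in GB/s: bus width times per-pin rate, in bytes."""
    return bus_bits * gbps_per_pin / 8


# Assumed memory speeds from public spec sheets, not from the thread.
RX_5700_XT = bandwidth_gb_s(256, 14)
RX_6700_XT = bandwidth_gb_s(192, 16)

if __name__ == "__main__":
    print("5700 XT:", RX_5700_XT, "GB/s")
    print("6700 XT:", RX_6700_XT, "GB/s")
    print("Gap the cache has to cover:", RX_5700_XT - RX_6700_XT, "GB/s")
```

Under these assumptions the 6700 XT has 64 GB/s less raw bandwidth, because its faster memory makes up part of what the narrower bus gives away.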
Posted on Reply
#34
Kaotik
THANATOS
I think Nvidia adding FP32 functionality to its INT units is a pretty good idea. Although I don't know how many transistors or how much power it cost, gaming performance increased by ~25%, and then there is the advantage in compute workloads. I wouldn't mind if AMD did the same thing.
You got it reversed.
NVIDIA's "INT units" were stripped-down CUDA cores; with Ampere they just cut out slightly fewer features (FP32 capability was stripped out before, but not on Ampere) and started calling them CUDA cores again.
AMD runs everything on the same units, just like NVIDIA's full CUDA cores do.
Posted on Reply
#35
MxPhenom 216
ASIC Engineer
yeeeeman
AMD should thank TSMC a lot for allowing them to add that much cache in such little space.
Using cache is in general the lazy man's way of solving things.
TSMC wouldn't care. As long as AMD's design conforms to all DRC requirements of TSMC 7nm, closes timing, handles congestion, etc., TSMC doesn't care what's actually in the chip.

Also, I don't know where you get the idea that cache is the lazy man's way of solving things, but it is not. It's actually one of the more critical parts of a chip's performance.
Posted on Reply
#36
Kaotik
Chrispy_
TBH the cache in RDNA2 is less about performance this gen and more about setting up for chiplets. It's not 100% useless but IPC differences between the 6700XT and similar 5700XT without the cache are really low. Sometimes zero, sometimes negligible. The performance uplift is almost entirely down to the 25-30% increase in clockspeeds.
It's all about performance, not chiplets.
RDNA2 doesn't offer any IPC increases over RDNA in the Compute Unit department (new units like Ray Accelerators aside) except for what Infinity Cache brings to the table.
They said it outright on release that RDNA2 offers more performance thanks to three things: Higher clocks, Lower power (per clock) and Infinity Cache bandwidth.
Posted on Reply
#37
milewski1015
Steevo
Smoothness

www.techpowerup.com/review/amd-radeon-rx-6900-xt/39.html

See the effects of the Infinity Cache in the charts: the 3090 and 6900 trade blows on FPS, but the 6900 has consistently higher frame rates and fewer low-FPS frames, which equates to a less laggy feeling, i.e. smoothness.
@1d10t This is what @TheinsanegamerN is talking about. More consistent frametimes (and therefore more consistent FPS) results in a smoother experience. The charts Steevo linked show examples of that.

Think about it like case fan hysteresis. If you have a fan curve set up so that they run really quiet and then ramp up significantly once a certain temperature is hit, that's a noticeable change in noise. The change in fan speed is noticeable. However, if you just set a curve that's maybe a little louder initially but a much smoother curve, there's a slow change in RPM, and therefore a less drastic change in noise, which makes it less noticeable to the ear. The same principle applies to frametimes and FPS.
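A quick sketch of that point in Python: two frametime traces with the same average FPS but very different consistency. Both traces are invented for illustration and are not measurements from any card.

```python
def average_fps(frametimes_ms):
    """Average FPS over a trace of per-frame times in milliseconds."""
    return 1000 * len(frametimes_ms) / sum(frametimes_ms)


def percentile_frametime(frametimes_ms, pct=99):
    """Frametime at the given percentile, using the nearest-rank method."""
    ordered = sorted(frametimes_ms)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct% of n
    return ordered[max(rank, 1) - 1]


# Hypothetical traces: both add up to 100 frames in 1000 ms.
steady = [10] * 100
spiky = [8] * 90 + [28] * 10

if __name__ == "__main__":
    for name, trace in (("steady", steady), ("spiky", spiky)):
        worst = percentile_frametime(trace)
        print(name, "avg FPS:", average_fps(trace),
              "99th percentile frametime:", worst, "ms",
              "=", round(1000 / worst, 1), "FPS")
```

Both traces average 100 FPS, yet the spiky one drops to roughly 36 FPS at the 99th percentile, which is the kind of difference a frametime chart makes visible and an average hides.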
Posted on Reply
#38
Punkenjoy
MxPhenom 216
TSMC wouldn't care. As long as AMD's design conforms to all DRC requirements of TSMC 7nm, closes timing, handles congestion, etc., TSMC doesn't care what's actually in the chip.

Also, I don't know where you get the idea that cache is the lazy man's way of solving things, but it is not. It's actually one of the more critical parts of a chip's performance.
You are right. Actually, making a fast cache is way more complicated than it looks. You need a mechanism that checks the cache to know whether the data you are trying to access is there. The larger the cache, the more work it takes to figure out whether it contains the data you are looking for.

This can add latency. The fact that AMD gets lower latency even with more layers of cache shows how well they have mastered caching. They purposely put a lot of effort in there because this is key for multi-chip modules.
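The "check whether the data is there" step can be sketched with a toy direct-mapped cache. This is a textbook simplification, not how RDNA 2 or Ampere implement their caches: real GPU caches are set-associative and compare several tags in parallel in hardware. The class name and sizes are made up for the example.

```python
class DirectMappedCache:
    """Toy cache: each memory block can live in exactly one line."""

    def __init__(self, num_lines, line_bytes=64):
        self.num_lines = num_lines
        self.line_bytes = line_bytes
        self.tags = [None] * num_lines  # one stored tag per cache line
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Return True on a hit; on a miss, fill the line and return False."""
        block = address // self.line_bytes  # which memory block is wanted
        index = block % self.num_lines      # the only line it may occupy
        tag = block // self.num_lines       # identifies the block in that line
        if self.tags[index] == tag:
            self.hits += 1
            return True
        self.tags[index] = tag              # evict whatever was there before
        self.misses += 1
        return False


if __name__ == "__main__":
    cache = DirectMappedCache(num_lines=4)
    for addr in (0, 32, 256, 0):
        print(hex(addr), "hit" if cache.access(addr) else "miss")
```

Every access pays for the index calculation and the tag comparison before any data comes back, which is why lookup design feeds directly into latency.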
Posted on Reply
#39
Colddecked
Chrispy_
TBH the cache in RDNA2 is less about performance this gen and more about setting up for chiplets. It's not 100% useless but IPC differences between the 6700XT and similar 5700XT without the cache are really low. Sometimes zero, sometimes negligible. The performance uplift is almost entirely down to the 25-30% increase in clockspeeds.
But there must be some architectural difference that allows RDNA2 to clock that much higher at the same voltage. It can't just be that AMD got that good at 7nm, can it?
Posted on Reply
#40
THANATOS
Kaotik
You got it reversed.
NVIDIA's "INT units" were stripped-down CUDA cores; with Ampere they just cut out slightly fewer features (FP32 capability was stripped out before, but not on Ampere) and started calling them CUDA cores again.
AMD runs everything on the same units, just like NVIDIA's full CUDA cores do.
As far as I know, Turing had 64 FP32 and 64 INT32 units per SM. Now those 64 INT32 units are capable of either INT32 or FP32, but the original CUDA cores were not, and are not, capable of INT32 as far as I know.
Posted on Reply
#41
1d10t
TheinsanegamerN
You presented an opinion, an opinion that is objectively incorrect. You presented the argument; if you can't prove your argument then all you are doing is shitting up the thread. "smoothness" IS a noun, per Oxford's learner dictionary, and can be measured via frametime measurement.

Oxford: www.oxfordlearnersdictionaries.com/us/definition/english/smoothness#:~:text=smoothness-,noun,any rough areas or holes

I can present a new topic for debate too: "Does 1d10t live up to his username?".
"Ad hominem (Latin for 'to the person'), short for argumentum ad hominem, refers to several types of arguments, some but not all of which are fallacious. Typically this term refers to a rhetorical strategy where the speaker attacks the character, motive, or some other attribute of the person making an argument rather than attacking the substance of the argument itself. This avoids genuine debate by creating a diversion to some irrelevant but often highly charged issue. The most common form of this fallacy is "A makes a claim x, B asserts that A holds a property that is unwelcome, and hence B concludes that argument x is wrong"."

I'm drawing conclusions based on claims by Linus and Anthony, who both say things in the video like "I feel like", "if my memory serves", etc., hence my term "placebo effect".
I'm not going to delve far away from topic, and if you want to establish dominance over my username, by all means you are welcome.
Posted on Reply
#42
mtcn77
Something is up with these results. L0 is something Nvidia integrated first; notice their texture cache is also read-only. So you may have misrepresented the results, in part by not including the texture cache's results.
Posted on Reply
#43
THANATOS
Punkenjoy
With 64 MB, it would probably be fine in 1080p, but the hit rate will be much lower in 4K and the card would probably be memory starved. This graph also shows why the cards perform so well in 1440p but start to fall behind in 4K. Probably 256 MB would be the sweet spot for 4K.

The thing is, caches are much easier to manufacture (fewer defects per area) than compute units. They also consume way less power. Also, the shorter the distance data has to travel, the less power it takes. The operation itself takes very little power; it's moving all the data around that uses power. Having a cache that limits the distance data has to travel greatly reduces power consumption.

Also, Infinity Cache is there to prepare for the next step, multi-chip GPUs. But that is another story.
1. If N21 is fine with 128 MB for 4K or N23 with 32 MB for 1080p, I don't see why 64 MB shouldn't be fine for 1440p. Each higher resolution needs 2x more IC to keep a similar hit rate, as shown in that graph.
3. It starts to fall behind? N21's advantage over Navi 10 gets bigger the higher the resolution is.
If you meant against Ampere, then isn't it actually because Ampere has a lot more CUDA cores and has a problem with utilization at lower resolutions?
4. Infinity Cache has its advantages, but it also uses up a lot of space. I think it would have been better if N22 had shaved off 32 MB, kept only 64 MB of IC, and added 8 CUs instead. N22 is quite inefficient for an RDNA2 GPU because its clocks are too high, and adding more CUs would mean you can clock it lower.
Here is a nice graph of N22 GPU power consumption at different clock speeds made by uzzi38. Link and another Link
Increasing the clocks from 2295 MHz to 2565 MHz caused the power consumption to increase by 59 W!
Posted on Reply
#44
Steevo
mtcn77
Something is up with these results. L0 is something Nvidia integrated first; notice their texture cache is also read-only. So you may have misrepresented the results, in part by not including the texture cache's results.
It's a simple call to cache with a timer set to count ticks of the clock, so no; unless there is some fundamental misunderstanding that you can explain better as an engineer, the results are correct. Ford made the first mass-produced automobile but doesn't make the best one, so your thought process is flawed.

Also, who specifically "misrepresented" what, based on your limited understanding?
Posted on Reply
#45
mtcn77
Steevo
Also, who specifically "misrepresented" what, based on your limited understanding?
That means your code didn't use the texture cache.
Steevo
unless there is some fundamental misunderstanding that you can explain better as an engineer
Since I'm not an engineer, your loophole disregards anything else put forward, but I do indeed want to see AMD on top as a tech enthusiast. I just know that is not present at this time.
Posted on Reply
#46
Punkenjoy
THANATOS
1. If N21 is fine with 128 MB for 4K or N23 with 32 MB for 1080p, I don't see why 64 MB shouldn't be fine for 1440p. Each higher resolution needs 2x more IC to keep a similar hit rate, as shown in that graph.
3. It starts to fall behind? N21's advantage over Navi 10 gets bigger the higher the resolution is.
If you meant against Ampere, then isn't it actually because Ampere has a lot more CUDA cores and has a problem with utilization at lower resolutions?
4. Infinity Cache has its advantages, but it also uses up a lot of space. I think it would have been better if N22 had shaved off 32 MB, kept only 64 MB of IC, and added 8 CUs instead. N22 is quite inefficient for an RDNA2 GPU because its clocks are too high, and adding more CUs would mean you can clock it lower.
Here is a nice graph of N22 GPU power consumption at different clock speeds made by uzzi38. Link and another Link
Increasing the clocks from 2295 MHz to 2565 MHz caused the power consumption to increase by 59 W!
Having 64 MB instead of 96 MB could have meant that the card would end up with an 8 GB memory buffer instead of 12. It would also have meant either a 256-bit or a 128-bit bus. There is a relation between the amount of memory on the card and the amount of Infinity Cache. This is also probably one of the tricks AMD uses to lower memory latency: caching a specific amount of memory per MB of Infinity Cache. This simplifies the caching algorithm (meaning it takes less time to run, i.e. lower latency).

Also, something that I don't have the data on: since the relation to memory bus/memory size seems clear, it's quite possible that the 96 MB block on Navi 22 has less bandwidth than the 128 MB on Navi 21.

Also, all chip makers have simulators in house. They probably already tested the scenario you propose against the one they chose in simulation and decided that it was not worth it. Navi 22 aims for 1440p and not 1080p, too.
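The link between bus width and VRAM capacity mentioned above can be sketched as follows. The assumptions are mine and not from the thread: GDDR6 with one chip per 32-bit channel, chips of 1 GB or 2 GB, and no clamshell (two chips per channel) layouts. The sketch says nothing about whether Infinity Cache size is tied to memory size.

```python
CHANNEL_BITS = 32        # assumption: one GDDR6 chip per 32-bit channel
CHIP_SIZES_GB = (1, 2)   # assumption: 8 Gb and 16 Gb chip densities


def capacities_gb(bus_bits):
    """VRAM sizes reachable with one chip per channel on the given bus."""
    chips = bus_bits // CHANNEL_BITS
    return [chips * size for size in CHIP_SIZES_GB]


if __name__ == "__main__":
    for bus in (128, 192, 256):
        print(f"{bus}-bit bus:", capacities_gb(bus), "GB")
```

Under these assumptions, 12 GB fits a 192-bit bus, while 8 GB fits either a 128-bit or a 256-bit bus, which matches the options listed in the comment.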
Posted on Reply
#47
Steevo
mtcn77
That means your code didn't use the texture cache.


Since I'm not an engineer, your loophole disregards anything else put forward, but I do indeed want to see AMD on top as a tech enthusiast. I just know that is not present at this time.
mtcn77
I think AMD is going to leverage Infinity Cache to compete with Nvidia because they have been behind in the cache bandwidth race since Maxwell.
AMD had been successively expanding the chip resources, albeit never found the medium to express what it can do unequivocally.
My code wasn't involved.

I appreciate the effort to look unbiased, but the facts are the 6900 XT is on par with the 3090, both of which are unavailable to the masses, and the worst part is that the few who have or can get them will probably use them to mine crypto instead of gaming on them or testing hypotheses like ours, or yours.

Also, if a texture cache is read-only, how is data ever written to it beyond a driver call for a texture, which I am assuming the programmer knew about since they know how to test the cache hierarchy latency on modern GPUs?

Plus, Samsung's node and lower clock rate for higher TDP may mean Nvidia had to sacrifice latency for stability at higher clock speeds with capacitive rolloff effects.

And at the end of the day, AMD is making a product that succeeds despite being 1nm less in process size (if that actually means anything in the real world) and that has significantly better performance in the 99th percentile, meaning less stuttering and better-feeling overall performance.
Posted on Reply
#48
mtcn77
Steevo
My code wasn't involved.

I appreciate the effort to look unbiased, but the facts are the 6900 XT is on par with the 3090, both of which are unavailable to the masses, and the worst part is that the few who have or can get them will probably use them to mine crypto instead of gaming on them or testing hypotheses like ours, or yours.

Also, if a texture cache is read-only, how is data ever written to it beyond a driver call for a texture, which I am assuming the programmer knew about since they know how to test the cache hierarchy latency on modern GPUs?

Plus, Samsung's node and lower clock rate for higher TDP may mean Nvidia had to sacrifice latency for stability at higher clock speeds with capacitive rolloff effects.

And at the end of the day, AMD is making a product that succeeds despite being 1nm less in process size (if that actually means anything in the real world) and that has significantly better performance in the 99th percentile, meaning less stuttering and better-feeling overall performance.
I never said Nvidia didn't have the same cache though, you are taking my last quote very differently.

AMD can go up and down, Nvidia still keeps their own. Adding +1 to AMD is not subtracting from the other, per se.

Truth be told, I would like to chime in to the developer side of the equation, but we aren't the who's who of CUDA development the two of us.

PS: I won't sacrifice my integrity for clout.
Posted on Reply
#49
Colddecked
Steevo
And at the end of the day, AMD is making a product that succeeds despite being 1nm less in process size (if that actually means anything in the real world) and that has significantly better performance in the 99th percentile, meaning less stuttering and better-feeling overall performance.
did you mean Nvidia?
Posted on Reply
#50
Vya Domus
Punkenjoy
Having 64 MB instead of 96 MB could have meant that the card would end up with an 8 GB memory buffer instead of 12. It would also have meant either a 256-bit or a 128-bit bus. There is a relation between the amount of memory on the card and the amount of Infinity Cache. This is also probably one of the tricks AMD uses to lower memory latency: caching a specific amount of memory per MB of Infinity Cache. This simplifies the caching algorithm (meaning it takes less time to run, i.e. lower latency).

Also, something that I don't have the data on: since the relation to memory bus/memory size seems clear, it's quite possible that the 96 MB block on Navi 22 has less bandwidth than the 128 MB on Navi 21.

Also, all chip makers have simulators in house. They probably already tested the scenario you propose against the one they chose in simulation and decided that it was not worth it. Navi 22 aims for 1440p and not 1080p, too.
You're mixing things up. There isn't necessarily a relation between bus width and memory size; there is a relationship between bus width and memory configuration. Its size can be anything as long as it is the correct multiple. There is a relation between the bus width and the memory controllers because each controller has to match with a connection to one chip.

There is no fundamental relation between global memory (VRAM) and Infinity Cache either. It's up to AMD to decide how much memory they want; it's just a victim cache.
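For anyone unfamiliar with the term, a victim cache is filled by lines evicted from the level above it rather than directly from memory. The sketch below shows that behaviour with two small LRU caches. It is a generic textbook model with invented names and sizes, and it does not claim to describe how Infinity Cache is actually implemented.

```python
from collections import OrderedDict


class LruCache:
    """Tiny LRU cache that reports which key it evicts."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def lookup(self, key):
        if key in self.lines:
            self.lines.move_to_end(key)  # mark as most recently used
            return True
        return False

    def remove(self, key):
        self.lines.pop(key, None)

    def insert(self, key):
        """Insert key and return the evicted key, or None if nothing left."""
        self.lines[key] = True
        self.lines.move_to_end(key)
        if len(self.lines) > self.capacity:
            evicted, _ = self.lines.popitem(last=False)
            return evicted
        return None


def access(upper, victim, key):
    """One access through an upper cache backed by a victim cache."""
    if upper.lookup(key):
        return "upper hit"
    hit_victim = victim.lookup(key)
    if hit_victim:
        victim.remove(key)        # the line moves back up
    evicted = upper.insert(key)   # fill the upper level
    if evicted is not None:
        victim.insert(evicted)    # the victim cache only receives evictions
    return "victim hit" if hit_victim else "miss"


if __name__ == "__main__":
    upper, victim = LruCache(2), LruCache(4)
    for key in ("a", "b", "c", "a", "c", "b"):
        print(key, access(upper, victim, key))
```

In this model the size of the victim cache is a free design choice, independent of how much memory sits behind it, which is the point the comment makes.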
Posted on Reply