Monday, April 19th 2021

GPU Memory Latency Tested on AMD's RDNA 2 and NVIDIA's Ampere Architecture

Over the years, graphics cards have evolved to feature multi-level cache hierarchies. These cache levels are engineered to bridge the growing gap between memory and compute, a problem that cripples GPU performance in many applications. Different GPU vendors, like AMD and NVIDIA, use different sizes of register files and L1 and L2 caches, depending on the architecture. For example, NVIDIA's A100 GPU carries 40 MB of L2 cache, nearly seven times more than the previous-generation V100. That shows how much modern applications demand ever-larger caches.

Today, we have an interesting report from Chips and Cheese. The website has measured the GPU memory latency of the latest generation of cards - AMD's RDNA 2 and NVIDIA's Ampere. Using simple pointer-chasing tests in OpenCL, it arrived at some interesting results: RDNA 2's cache is both fast and massive. Compared to Ampere, its cache latency is much lower, while VRAM latency is about the same. NVIDIA uses a two-level cache system consisting of L1 and L2, which appears to be a rather slow solution: a trip from Ampere's SM, which holds the L1 cache, out to the L2 takes over 100 ns.
AMD, on the other hand, has a three-level cache system, with L0, L1, and L2 cache levels complementing the RDNA 2 design. The latency from L0 to L2, even with L1 between them, is just 66 ns. Infinity Cache, which is essentially an L3 cache, adds only about 20 ns of additional latency, keeping it faster than NVIDIA's cache solution. NVIDIA's massive GA102 die appears to be part of the problem: signals have to travel a long way across it to reach the L2 cache, which costs many cycles. You can read more about the test here.
Source: Chips and Cheese

92 Comments on GPU Memory Latency Tested on AMD's RDNA 2 and NVIDIA's Ampere Architecture

#1
londiste
The slow uptick between cache levels on RDNA2 is interesting. While Ampere's cache levels are quite clearly distinguished, the RDNA2 graph is much smoother, including Infinity Cache past 32MB.
#2
john_
This probably shows AMD's greater experience with caches, considering that their main business is CPUs. On the other hand, it shows how much faster Nvidia's architecture is, that even with higher cache latencies it performs better.
#3
AnarchoPrimitiv
john_This probably shows AMD's greater experience with caches, considering that their main business is CPUs. On the other hand, it shows how much faster Nvidia's architecture is, that even with higher cache latencies it performs better.
Does it perform better across the board in every game? What GPUs are you comparing, out of curiosity?
#4
yeeeeman
AMD should be very thankful to TSMC for allowing them to add that much cache in such little space.
Using cache is, in general, the lazy man's way of solving things.
#5
nguyen
So Ampere is a compute/bandwidth monster and RDNA2 is a latency monster; in the end, whichever solution grabs the most market share will be the winner.
#6
THANATOS
john_This probably shows AMD's greater experience with caches, considering that their main business is CPUs. On the other hand, it shows how much faster Nvidia's architecture is, that even with higher cache latencies it performs better.
How does it actually show that the Ampere architecture is much faster? Care to elaborate on how big an impact latency has on GPU performance?
BTW Nvidia has higher bandwidth than AMD, and in the high end (GA102) it's significantly higher, but you ignore this.
nguyenSo Ampere is a compute/bandwidth monster and RDNA2 is a latency monster; in the end, whichever solution grabs the most market share will be the winner.
I think Nvidia adding FP32 functionality to its INT units is a pretty good idea. Although I don't know how many transistors or how much power it cost, gaming performance increased by ~25%, and then there is the advantage in compute workloads. I wouldn't mind if AMD did the same thing.
yeeeemanAMD should be very thankful to TSMC for allowing them to add that much cache in such little space.
Using cache is, in general, the lazy man's way of solving things.
For a CPU, yes, but for a GPU, what better alternative do we have? Super-expensive HBM2, or expensive GDDR6X with a wider memory controller?
So you can't really say Infinity Cache was a bad move. I just wonder if a smaller one (1/2 or 1/3 the size) wouldn't be a good enough option, because honestly the IC uses up a lot of space.
#7
londiste
THANATOSI just wonder if a smaller one (1/2 or 1/3 the size) wouldn't be a good enough option, because honestly the IC uses up a lot of space.
128MB is not that much when it comes to caching for 16GB of VRAM.
Assuming the die shots in AMD's presentation are somewhat accurate, Infinity Cache is 15% of the Navi21 die.
#8
evernessince
john_This probably shows AMD's greater experience with caches, considering that their main business is CPUs. On the other hand, it shows how much faster Nvidia's architecture is, that even with higher cache latencies it performs better.
If only GPU architecture were as simple as a single factor determining performance.
THANATOSHow does it actually show that the Ampere architecture is much faster? Care to elaborate on how big an impact latency has on GPU performance?
BTW Nvidia has higher bandwidth than AMD, and in the high end (GA102) it's significantly higher, but you ignore this.

I think Nvidia adding FP32 functionality to its INT units is a pretty good idea. Although I don't know how many transistors or how much power it cost, gaming performance increased by ~25%, and then there is the advantage in compute workloads. I wouldn't mind if AMD did the same thing.

For a CPU, yes, but for a GPU, what better alternative do we have? Super-expensive HBM2, or expensive GDDR6X with a wider memory controller?
So you can't really say Infinity Cache was a bad move. I just wonder if a smaller one (1/2 or 1/3 the size) wouldn't be a good enough option, because honestly the IC uses up a lot of space.
It doesn't. The guy is just making an assumption and an incorrect one at that.
#9
THANATOS
londiste128MB is not that much when it comes to caching for 16GB of VRAM.
Assuming the die shots in AMD's presentation are somewhat accurate, Infinity Cache is 15% of the Navi21 die.
I think it was mentioned somewhere that it's ~20%. 20% of 520mm2 is 104mm2, and that's not a small number if we take into account that the space could have been used for more CUs, for example. BTW one RDNA1 WGP (2xCU) is only 4.1mm2, so I think an RDNA2 WGP could be 5mm2 at most, so by halving the Infinity Cache and saving 52mm2 you could put 25% more CUs into N21. It would be great if we could somehow disable part of the IC and see what kind of effect it has on performance.
#10
Wirko
londisteThe slow uptick between cache levels on RDNA2 is interesting. While Ampere's cache levels are quite clearly distinguished, the RDNA2 graph is much smoother, including Infinity Cache past 32MB.
Yes, that's interesting. The gradual increase above 4MB could indicate that the L3 cache is sectioned (with one part belonging to each memory controller?), and access time increases significantly when a CU needs to access data in a "distant" section. The gradual increase up to 4MB could mean that the L2 is split into sections too, again with varying access times.
#11
W1zzard
THANATOSIf we take into account that space could have been used for more CUs for example
AMD made it clear in press briefings that, given their power and thermal goals, the L3 cache was the better option.
#12
Aquinus
Resident Wat-man
londiste128MB is not that much when it comes to caching for 16GB of VRAM.
Assuming the die shots in AMD's presentation are somewhat accurate, Infinity Cache is 15% of the Navi21 die.
It's plenty. By that logic, the 64GB of memory in my laptop is gimped by the 16MB of cache on my CPU. It's not about the amount; it's about the hit ratio. Also, the cache uses less power, so sure, you could replace it with CUs, but that's also more compute with more memory latency and more heat. That doesn't sound like a winning combo compared to what AMD has now.
#13
Mysteoa
yeeeemanAMD should be very thankful to TSMC for allowing them to add that much cache in such little space.
Using cache is, in general, the lazy man's way of solving things.
This is just a stepping stone for when they go chiplet. They need a fast cache so they don't have to access the VRAM as often when it sits across an IO die.
#14
1d10t
This explains why the RX 6800 series is a serious competitor at 1080p and up to 1440p, even though Ampere has much higher GDDR6X memory bandwidth. Oh, and some YouTubers have also said that playing on the RX 6800 is smoother, so there are other perks you can't measure.

#15
RH92
1d10tThis explains why the RX 6800 series is a serious competitor at 1080p and up to 1440p, even though Ampere has much higher GDDR6X memory bandwidth. Oh, and some YouTubers have also said that playing on the RX 6800 is smoother, so there are other perks you can't measure.
"Smoothness" of a game can be measured with frametimes; there is nothing magical about it that can't be measured!
#16
claylomax
AquinusIt's plenty. By that logic, the 64GB of memory in my laptop is gimped by the 16MB of cache on my CPU. It's not about the amount; it's about the hit ratio. Also, the cache uses less power, so sure, you could replace it with CUs, but that's also more compute with more memory latency and more heat. That doesn't sound like a winning combo compared to what AMD has now.
I think he's talking about how much space it takes on the chip.
#17
1d10t
RH92"Smoothness" of a game can be measured with frametimes; there is nothing magical about it that can't be measured!
Have you watched the video? It's called the placebo effect; have you invented a tool to measure that?
#18
TheinsanegamerN
1d10tHave you watched the video? It's called the placebo effect; have you invented a tool to measure that?
Yeah, it's called frametime measurement.
john_This probably shows AMD's greater experience with caches, considering that their main business is CPUs. On the other hand, it shows how much faster Nvidia's architecture is, that even with higher cache latencies it performs better.
Faster? I mean, outside of raytracing, the 3080 loses to the 6900 XT and 6800 XT at 1440p but wins at 4K. Nvidia also requires significantly more power to do so. I know, Samsung 8nm vs TSMC 7nm, but we've seen what happens when Nvidia's arch is way ahead of AMD's in the Maxwell vs GCN era. Even if you look at SM count instead of core count, the 3090 and 6900 XT are not that different.
#19
1d10t
TheinsanegamerNYeah, it's called frametime measurement.
Again, have you watched the video? There's also a frame counter in the top right corner. Here's a link to save you time: Linus
#20
TheinsanegamerN
1d10tAgain, have you watched the video? There's also a frame counter in the top right corner. Here's a link to save you time: Linus
Again, you miss the point. "Smoother" is a descriptor that can be measured. If it's a benefit, then surely you can link some evidence of benchmarks showing AMD has better frametimes, yeah?
#21
1d10t
TheinsanegamerNAgain, you miss the point. "Smoother" is a descriptor that can be measured. If it's a benefit, then surely you can link some evidence of benchmarks showing AMD has better frametimes, yeah?
Smooth is an adjective, not a noun, and has no metrics associated with it. I don't need to prove anything because I have already presented a topic for debate.
#22
Vya Domus
londiste128MB is not that much when it comes to caching for 16GB of VRAM.
For GPUs it is a ludicrous amount of cache. Just a few years ago you were looking at less than 1 KB of cache, all levels combined, per thread in a GPU. Now that amount has gone up by at least an order of magnitude.
#23
THANATOS
W1zzardAMD made it clear in press briefings that, given their power and thermal goals, the L3 cache was the better option.
Wasn't that statement about the actual use of the IC?
I never said to get rid of the whole IC, as was clearly stated in my post. What I wanted was to halve it (64MB instead of 128MB), with the saved-up space used for more CUs. BTW I would love to see a performance-penalty graph for a smaller IC, to know whether that much cache is really needed or it could be smaller.
#24
TheinsanegamerN
1d10tSmooth is an adjective, not a noun, and has no metrics associated with it. I don't need to prove anything because I have already presented a topic for debate.
You presented an opinion, an opinion that is objectively incorrect. You presented the argument; if you can't prove your argument, then all you are doing is shitting up the thread. "Smoothness" IS a noun, per Oxford's Learner's Dictionary, and can be measured via frametime measurement.

Oxford: www.oxfordlearnersdictionaries.com/us/definition/english/smoothness#:~:text=smoothness-,noun,any rough areas or holes

I can present a new topic for debate too: "Does 1d10t live up to his username?"
#25
Aquinus
Resident Wat-man
claylomaxI think he's talking about how much space it takes on the chip.
That's not really the only consideration, though. Given the latency improvement and how it contributes less heat than more CUs and/or faster or wider memory, I'd call it a win. This is a far better solution than the alternatives. Do you remember how much more power a 290/390 would consume when clocking up that 512-bit memory? Trust me, the Infinity Cache is a far better solution. This is actually why I advocate for HBM; the power consumption figures are fantastic.
Copyright © 2004-2021 www.techpowerup.com. All rights reserved.
All trademarks used are properties of their respective owners.