Monday, April 19th 2021

GPU Memory Latency Tested on AMD's RDNA 2 and NVIDIA's Ampere Architecture

Graphics cards have evolved over the years to feature multi-level cache hierarchies. These cache levels bridge the growing gap between memory and compute, a problem that cripples GPU performance in many applications. Different GPU vendors, like AMD and NVIDIA, size their register files, L1, and L2 caches differently depending on the architecture. For example, NVIDIA's A100 GPU carries 40 MB of L2 cache, roughly seven times more than the previous-generation V100 - a sign of how much modern applications demand ever-larger caches.

Today, we have an interesting report from Chips and Cheese. The website decided to measure the GPU memory latency of the latest generation of cards - AMD's RDNA 2 and NVIDIA's Ampere - using simple pointer-chasing tests in OpenCL, with interesting results. RDNA 2's cache is fast and massive. Compared to Ampere, its cache latency is much lower, while VRAM latency is about the same. NVIDIA uses a two-level cache system consisting of L1 and L2, which seems to be a rather slow solution: a request that misses the SM-local L1 and goes out to L2 takes over 100 ns.
AMD, on the other hand, has a three-level cache system, with L0, L1, and L2 cache levels complementing the RDNA 2 design. Latency from L0 out to L2, even with L1 between them, is just 66 ns. Infinity Cache, which is essentially an L3 cache, adds only about 20 ns of additional latency, keeping it faster than NVIDIA's cache hierarchy. NVIDIA's massive GA102 die appears to be part of the problem: requests have to traverse the large die to reach the distributed L2, costing many cycles. You can read more about the test here.
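The pointer-chasing method behind these numbers is simple: each load's address depends on the previous load's result, so the hardware cannot overlap the accesses and the measured time per hop is pure latency. Below is a minimal CPU-side sketch of the idea in Python (the actual tests were OpenCL kernels running on the GPU; the array size and hop count here are purely illustrative):

```python
import random
import time

def build_chain(n):
    """Build a random cyclic pointer chain: chain[i] holds the index of the
    next element, forming a single n-element cycle in shuffled order."""
    order = list(range(n))
    random.shuffle(order)
    chain = [0] * n
    for i in range(n - 1):
        chain[order[i]] = order[i + 1]
    chain[order[-1]] = order[0]  # close the cycle
    return chain

def chase(chain, hops):
    """Follow the chain; every step depends on the previous one, so the
    accesses serialize and the average time per hop approximates latency."""
    idx = 0
    start = time.perf_counter()
    for _ in range(hops):
        idx = chain[idx]
    elapsed = time.perf_counter() - start
    return idx, elapsed / hops

chain = build_chain(1 << 16)  # 65,536-entry chain (illustrative size)
_, per_hop = chase(chain, 100_000)
print(f"avg time per dependent access: {per_hop * 1e9:.1f} ns")
```

Sweeping the chain size upward makes the per-hop time rise in steps as the working set spills out of each cache level; that is how per-level latencies like the 66 ns and 100 ns figures above are inferred.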
Source: Chips and Cheese
Add your own comment

91 Comments on GPU Memory Latency Tested on AMD's RDNA 2 and NVIDIA's Ampere Architecture

#76
mtcn77
medi01
Given that it was an "army", i would want to see a test showing that dropping BW to 384GB/s does not drop performance.
I did these studies back in the day between the 900 series and the 290 series. The funny thing was, the 900 series was utilizing its bandwidth better at lower resolutions, and the 290 series couldn't catch a break at higher resolutions because the 900 series was actually rendering the scene much faster - it takes less time to issue memory requests from L0 and L2 than from main memory.
TL;DR: they use their bandwidth differently now. Lessening it won't slow access unless the GPU is working on two kernels at once, such as graphics + compute in a heterogeneous compute architecture. What bounds performance these days isn't bandwidth, but being cache-strapped.
Posted on Reply
#77
1d10t
Doc-J
All information by "Paid" Linus channel is useless....
Only see his face when reviewing the RTX3000 cards said it all XD
So Linus runs a paid channel - who doesn't?
Kayotikz
........

Do you need someone to comprehend a sentence for you?

FPS ≠ Frametime

If you need me to break it down for you even more..... You should probably pick up a comprehension class
Great, would you help me with that? I'm clueless.

And my original post was:

RH92
:shadedshu: Have you read my response ? Im not talking about the placebo effect ( which describes the subjective perception ) , im talking about the objective perception of ''smoothness'' in games which can be accurately extrapolated from frametime measurements ! There is a reason all major reviewers have integrated frametime measurements in their reviews .........


Then why reply to my comments? I clearly described that as a placebo effect in terms of smoothness, and you just took a specific word out of context.
Posted on Reply
#78
Chrispy_
medi01
Given that it was an "army", i would want to see a test showing that dropping BW to 384GB/s does not drop performance.

A single one from a source at least remotely reputable would do.


You started with "it does not" and it appears you are assuming it with high confidence.
High confidence is cool, but let's not mix facts and fictions, shall we?
There is no reputable source; it's an army of users tweaking their cards, not professional testers - Reddit, these forums, other forums. Underclocking the VRAM is almost pointless because it doesn't let you drop the voltage by enough to really matter, though I've read plenty of instances where people chasing that last 1% flashed modded BIOSes with lowered memory clocks to hit the absolute limits of undervolting, with zero consequences on benchmark performance compared to the same card with faster RAM. I assumed that the excess bandwidth of the 5700-series was just common knowledge at this point. The reason it gets a 256-bit bus is product segmentation - they wanted an 8GB card, not a 6GB card, and the only sensible way to do that was an overkill (for the GPU's shader ability) 256-bit bus; the alternative of 128-bit with denser modules would probably have been too bandwidth-starved.

If you want facts, look at any review of a 14 Gbps 5600 XT vs a 14 Gbps 5700 - they are essentially the same card/board/VRAM, just with the 5600 XT limited to a 192-bit bus and 6GB of VRAM. If you cannot see that lopping off a quarter of the bandwidth makes negligible difference, then I'm not going to spoonfeed you. I hoped you'd have the intelligence to draw your own conclusions from that, which is exactly why I mentioned it in that earlier post.
Posted on Reply
#79
medi01
The thought that AMD wastes resources on fast VRAM and a wide memory bus that does nothing is too remarkable to rest on "random dudes on the internet".
Posted on Reply
#80
mtcn77
medi01
The thought that AMD wastes resources on fast VRAM and a wide memory bus that does nothing is too remarkable to rest on "random dudes on the internet".
I bet you are the heterogeneous compute architecture expert, hi.
Posted on Reply
#81
Chrispy_
medi01
The thought that AMD wastes resources on fast VRAM and a wide memory bus that does nothing is too remarkable to rest on "random dudes on the internet".
It's no more wasted than the 12GB of GDDR6 on the 3060. Nvidia didn't want to use 12GB, but they chose that rather than offering a 6GB card because nobody would pay the MSRP for a 6GB model - a capacity already considered "lower spec" in 2017.

AMD were competing with 8GB cards (2070, 2060S) and wouldn't have been willing to release a 6GB/192-bit card at the price they were targeting.

This is GPU basics 101. Are you seriously asking these questions for real or just trolling?
Posted on Reply
#82
mtcn77
Chrispy_
Are you seriously asking these questions for real or just trolling?
He is not trolling; he just doesn't understand that memory devices provide only a portion of overall bandwidth now.

PS: I still chuckle at those who said cache was the worst use of process gains. If it weren't for cache, these GPUs would slow down at the slightest memory overload...
Posted on Reply
#83
Punkenjoy
The Radeon 5600 XT's game clock per the TPU DB is 1375 MHz, where the 5700's is 1625 MHz. As per TPU, the 5700 is 7% faster despite having 9.5% more fillrate and TFLOPS.

But the 5700 is the cut-down version of the RX 5700 XT, which has the same bandwidth but 10% more cores and 7.5% higher frequency, and is 14% faster. So it's probably true that the 5700 non-XT has more memory bandwidth than it needs; for the 5700 XT, I'm not sure it's as clear-cut.

Anyway, whether the 5600 XT is memory-starved has very little to do with the IPC comparison between the 5700 XT and 6700 XT. The 6700 XT was downclocked to match the 5700 XT. Maybe with a 30% lower clock the 6700 XT wouldn't need the equivalent of a 256-bit bus with 14 Gbps memory, but that isn't the speed the card runs at right now.

Like I said, AMD's engineers have simulators; they mostly know how multiple configurations will perform before finalizing the layout. If they went with 96 MB of Infinity Cache plus a 192-bit bus, it's probably because they found that was enough. Also, raw bandwidth is only one part of the equation - power consumption is another. If they can spend less of the power budget on memory, they can crank the GPU core clocks higher.

AMD claims an Infinity Cache access uses 1.3 pJ where a memory access uses between 7 and 8 pJ. That is a huge difference. The cache hit rate they claim is above 50% at 4K for the 128 MB block; it should be similar for 96 MB at 1440p.

In the end, chip design is never perfect - it's a balance. You weigh so many factors in the hope of getting the best chip, and I strongly think the cache is here to stay. With MCM, die size might not be as big a deal as it is right now anyway; you will be able to put many smaller chips into a single package.
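Those per-access energy figures translate directly into memory-subsystem power at a given bandwidth: power is simply bits per second times joules per bit. A rough back-of-the-envelope sketch, assuming the quoted pJ figures are per bit transferred (an assumption on my part; the post doesn't state the unit) and using the 5700 XT's 448 GB/s as the bandwidth:

```python
def memory_power_watts(bandwidth_gb_s, energy_pj_per_bit):
    """Power = (bits/s) * (J/bit). 1 GB/s = 1e9 B/s = 8e9 bit/s; 1 pJ = 1e-12 J."""
    return bandwidth_gb_s * 1e9 * 8 * energy_pj_per_bit * 1e-12

# Figures from the post: ~1.3 pJ per Infinity Cache access vs 7-8 pJ per
# VRAM access (treated here as per-bit costs - an illustrative assumption).
for label, pj in [("Infinity Cache", 1.3), ("GDDR6 VRAM", 7.5)]:
    print(f"{label}: {memory_power_watts(448, pj):.1f} W at 448 GB/s")
```

Under these assumptions, serving the same bandwidth from cache instead of VRAM cuts the memory energy cost by roughly a factor of six - power budget that can go to core clocks instead, as the post argues.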
Posted on Reply
#84
mtcn77
With past Nvidia references, I think that is right on the money, yes.
Although I don't get how it is a separate resource like the Xbox's ESRAM. Could AMD make their own after the fact?
Posted on Reply
#85
Kaotik
THANATOS
As far as I know Turing had 64 FP32 and 64 INT32 Units per SM. Now those 64 INT32 units are capable of either INT32/FP32, but the original Cuda cores were or are not capable of INT32 as far as I know.
Don't know where you got that idea from; of course they've always supported INT(32). If you can do FP, you can do INT - it's the other way around that's not guaranteed.
Here's a quote from the Fermi whitepaper. Fermi was the first GPU with CUDA cores.
Figure 5. NVIDIA's Fermi GPU architecture consists of multiple streaming multiprocessors (SMs), each consisting of 32 cores, each of which can execute one floating point or integer instruction per clock.
www.nvidia.com/content/PDF/fermi_white_papers/P.Glaskowsky_NVIDIAFermi-TheFirstCompleteGPUComputingArchitecture.pdf

The Turing INT units were essentially CUDA cores without FP capabilities, so NVIDIA didn't count them as CUDA cores. Now that they no longer strip the FP capability, they've started calling them CUDA cores again, because that's what they are.
Posted on Reply
#86
medi01
mtcn77
He is not trolling, however not understanding that memory devices provide just a portion of overall bandwidth now.
Chrispy_
AMD were competing with 8GB cards
So AMD couldn't have both 8GB and lower bandwidth.
Amazing stuff.
Posted on Reply
#87
mtcn77
medi01
So AMD couldn't have both 8GB and lower bandwidth.
Amazing stuff.
I had the same inclination as you, but then a local developer who had worked on UE4 game development assured me that HSA had nothing to do with brute force, but with deferred rendering pipeline management. They don't do it to execute two workloads simultaneously. Bandwidth doesn't matter anymore the way it does with forward rendering.
Posted on Reply
#88
Chrispy_
medi01
So AMD couldn't have both 8GB and lower bandwidth.
Amazing stuff.
Correct, kind of. They could technically have made one but it would have been a disaster.

GDDR memory modules are made in power-of-two capacities, and they all currently have 32-bit access paths per BGA module. That means you get 6 modules on a 192-bit card, 8 on a 256-bit card, 12 on a 384-bit card, and so on. Every recent experiment deviating from that formula has caused problems - see the 550 Ti and 660 Ti, which had only half the bandwidth to half their memory, and, more severely, the infamous lawsuit Nvidia lost over the "3.5GB of VRAM" on the GTX 970. Asymmetric memory layouts are bad, and only the VRAM under symmetric access is realistically usable - so 3.5GB of the 4GB GTX 970, and 768MB of the 1GB on the 550 Ti and 660 Ti.

The market and current gaming requirements determine what amount of VRAM is acceptable at any given price point.

AMD's choices for the 5700-series were:
  • Cheaper silicon/PCB with 192-bit memory and 6GB - VRAM-limited and hurts performance of their highest performance part in 2019. Within 6 months some games were already using over 6GB.
  • Cheaper silicon/PCB with 192-bit memory and 12GB - GDDR6 was pretty expensive, that would have added $50-100 to every card at retail.
  • Very cheap silicon/PCB with 128-bit memory and 8GB - not enough bandwidth, likely no faster than the 256-bit RX580/RX590 at higher resolutions, so a commercial failure as a premium product.
  • Expensive silicon/PCB with 256-bit memory and 8GB - what they went for. Competitive with Nvidia, not too expensive on GDDR6, and with enough capacity for all current games on the market.
Sure, the 192-bit, 6GB 5600 XT has proved that even 192-bit is enough bandwidth for the 5700-series, but it's always a balancing act. The bandwidth and VRAM capacity need to match the capabilities of the GPU - there's no point putting 8GB of GDDR6 on a GeForce 710 because it will never run anything at a high enough resolution to use that much capacity or bandwidth. Meanwhile, people worry that 10GB isn't enough on the 3080 for future 4K gaming and today's 8K gaming, the same way people who bought expensive AMD Fury cards with only 4GB were burned by lack of VRAM long before their GPU was too slow to run new games.
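The options above follow mechanically from the 32-bit-per-module rule: pick a bus width and module density, and the module count, capacity, and bandwidth are fixed. A small sketch of the arithmetic, using the 1 GB modules and 14 Gbps data rate discussed in this thread:

```python
def gddr6_config(bus_width_bits, data_rate_gbps, module_gb):
    """Each GDDR6 BGA module has a 32-bit interface, so module count and
    total capacity/bandwidth follow directly from the bus width."""
    modules = bus_width_bits // 32
    capacity_gb = modules * module_gb
    bandwidth_gb_s = bus_width_bits * data_rate_gbps / 8  # bits -> bytes
    return modules, capacity_gb, bandwidth_gb_s

# The 5700-series candidate configurations from the post, at 14 Gbps
# with 1 GB (8 Gb) modules:
for bus in (128, 192, 256):
    m, cap, bw = gddr6_config(bus, 14, 1)
    print(f"{bus}-bit: {m} modules, {cap} GB, {bw:.0f} GB/s")
```

This reproduces the trade-off in the bullet list: 128-bit caps you at 4 GB/224 GB/s with these modules (8 GB needs doubled-density parts), 192-bit gives 6 GB/336 GB/s, and 256-bit gives the shipped 8 GB/448 GB/s.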
Posted on Reply
#89
dragontamer5788
Chrispy_
The bandwidth and VRAM capacity need to match the capabilities of the GPU - there's no point putting 8GB of GDDR6 on a Geforce 710 because it'll never manage to run anything at high enough resolutions to use that much capacity/bandwidth
Even more confusing: GPU workloads change over time. Hypothetically, some memory-hard cryptocoin algorithm could come out that would make a GeForce 710 with 8GB of GDDR6X the best machine ever. (Cryptocoin people seem to have fun making really weird algorithms with weird attributes.)

In more video-game-oriented tasks, various workloads have different compute and RAM requirements. For example, texture compression/decompression is compute-heavy but light on RAM bandwidth and VRAM capacity. Programmers have made texture compression standard because GPU compute strength is growing faster than VRAM bandwidth and capacity. And particular targets (such as game consoles or the iPhone) have an outsized effect on the programming styles and decisions of video game programmers.

So it's a moving target. The society of GPU programmers, video game programmers, and artists creates workloads based on the machine's capabilities, and then the GPU designers create GPUs based on the workloads the programmers made. No one is really "in charge" of the future; it just sort of happens. Every now and then the GPU hardware designers say "here's how we'll do things moving forward" (e.g. Mantle), but then the low-level software engineers still have to build something (e.g. DirectX 12) before it is adopted by the common shader/effects programmer.

---------
Every experiment to deviate from the formula in recent history has caused problems
XBox 360 had eDRAM that no one liked. Kinda different, but kinda the same.

There are lots of "good" designs, assuming programmers are willing to follow the new design. But if they don't want to, you end up with bad performance. I'm sure hardware engineers would love to wave a magic wand and get all programmers to agree to a new methodology. But similarly, video game programmers want to wave a magic wand and get hardware engineers to build GPUs the way they're already programming, without any "funny business".
Posted on Reply
#90
Punkenjoy
Again, the fact that the RX 5600 XT and 5700 have different memory configurations but similar performance does not prove that the RX 5700 XT could have been fine with a 192-bit bus. The RX 5700 is a cut-down, lower-clocked 5700 XT.

But the guy is right: they could have had 8 GB of VRAM and lower bandwidth by using 12 Gbps GDDR6 instead of 14 Gbps, and probably saved some pennies there. You're missing that option, which they did not select.

I suspect the RX 5700 XT requires that 256-bit bus and 14 Gbps memory - maybe not in every scenario, but in some cases - and the 5700 is just the leftover that can't qualify as a 5700 XT.

As for memory layout, he's talking about weird memory layouts like the 970's, not necessarily eDRAM. Anyway, in the current PC market, AMD and Nvidia would shoot themselves in the foot if they required games to have very specific optimizations for their architecture to perform. Luckily, you could optimize for Infinity Cache, but you'd still get very good performance even if you did nothing.
Posted on Reply
#91
mtcn77
dragontamer5788
Every now and then, the GPU hardware designers say "here's how we'll do things moving forward" (ex: Mantle), but then the low level software engineers still have to build something (ex: DirectX12),
Even Mantle had nothing to do with forward rendering, despite AMD having much to do with its public perception as such. For it to be graphics plus another compute stream, it required tight manual controls and profiling to keep the execution mask in top condition, which none of the developers did.
The simple explanation is that the GPU doesn't operate at register speed; it just shuffles registers at the wavefront rate. So it becomes a matter of always masking the execution resources that are due for execution. It heralded the discovery of scalarization, which became so critical that four iterative threads became obsolete and shortest execution latency became the new paradigm. Its inherent simplification of instruction scheduling is essentially RDNA's dual-issuing dual schedulers, imo. Now they don't have to fiddle with the execution mask to find wavefronts with an optimal workload; there is no more instruction waiting due to the work set being wider than wavefront capacity. Every cycle, four instructions fill the pipeline with no lag.
Posted on Reply
Add your own comment