Monday, April 19th 2021
GPU Memory Latency Tested on AMD's RDNA 2 and NVIDIA's Ampere Architecture
Over the years, graphics cards have evolved to feature multi-level cache hierarchies. These cache levels bridge the growing gap between memory and compute, a problem that cripples GPU performance in many applications. Different GPU vendors, such as AMD and NVIDIA, use different sizes of register files and L1 and L2 caches, depending on the architecture. For example, NVIDIA's A100 GPU carries 40 MB of L2 cache, nearly seven times more than the previous-generation V100, which shows how modern workloads keep demanding ever-larger caches.
Today, we have an interesting report coming from Chips and Cheese. The website decided to measure the GPU memory latency of the latest generation of cards - AMD's RDNA 2 and NVIDIA's Ampere - using simple pointer-chasing tests in OpenCL, with interesting results. RDNA 2's cache is fast and massive. Compared to Ampere, cache latency is much lower, while VRAM latency is about the same. NVIDIA uses a two-level cache system consisting of L1 and L2, which appears to be a rather slow solution: data travelling from an Ampere SM, which holds the L1 cache, out to the L2 takes over 100 ns. AMD, on the other hand, has a three-level cache system, with L0, L1, and L2 levels complementing the RDNA 2 design. The latency from L0 to L2, even with L1 in between, is just 66 ns. Infinity Cache, which is essentially an L3 cache, adds only about 20 ns of additional latency, keeping it faster than NVIDIA's cache hierarchy. NVIDIA's massive GA102 die seems to be part of the problem, as requests take many cycles just to traverse it on the way to the L2. You can read more about the test here.
Source:
Chips and Cheese
92 Comments on GPU Memory Latency Tested on AMD's RDNA 2 and NVIDIA's Ampere Architecture
Aka: my numbers are grossly wrong compared to whatever they're doing in their tests. It'd be nice if people shared their pointer-chasing code, to ensure that we're actually comparing apples-to-apples. Both NVidia and AMD provide "count clock tick" functions in CUDA / ROCm, so these tests are actually rather easy to conduct. But if our code differs, then we will have grossly different results.
EDIT: My methodology is to use Sattolo's algorithm (danluu.com/sattolo/) to generate a perfect cycle: a random list of pointers whose cycle length equals the length of the list. Ex: if you have a list of size 100, then node = node->next will wrap around after exactly 100 steps. This prevents prefetching shenanigans on CPUs (and maybe GPUs, though I don't think GPUs have prefetching). Since it's random and 100% dependent on the RAM, there's no way for a prefetch predictor to predict the next pointer fetch ahead of time.
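For reference, a minimal sketch of that setup step, assuming a 32-bit index buffer rather than actual pointers (function name and seed are illustrative, not from the original code):

```cpp
#include <cstdint>
#include <random>
#include <utility>
#include <vector>

// Build a pointer-chasing buffer with Sattolo's algorithm: the result is a
// permutation consisting of a single cycle of length n, so "idx = buf[idx]"
// only wraps around after visiting every element, which defeats stride-based
// prefetchers.
std::vector<uint32_t> make_sattolo_cycle(std::size_t n)
{
    std::vector<uint32_t> buf(n);
    for (std::size_t i = 0; i < n; ++i)
        buf[i] = static_cast<uint32_t>(i);

    std::mt19937_64 rng(12345);
    // Sattolo's variant of Fisher-Yates: j is drawn from [0, i-1], never i
    // itself, which guarantees one big cycle instead of several small ones.
    for (std::size_t i = n - 1; i > 0; --i) {
        std::uniform_int_distribution<std::size_t> dist(0, i - 1);
        std::swap(buf[i], buf[dist(rng)]);
    }
    return buf;
}
```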
But the fact that we have:
NAVI 21, 256 bit, 128 MB IC
NAVI 22, 192 bit, 96 MB IC
NAVI 23, 128 bit, 64 MB IC (if rumours are true)
seems to point to a direct link between Infinity Cache size and memory bus width.
Since the IC runs at core clock and is built into the die, it makes sense that AMD tuned its performance to match what the process and die needed; it's all their own product.
I have two versions of the OpenCL kernel - one with the loop unrolled 10x (below), and a simple one. Timing is basically: enqueue kernel, start timing, clFinish (block until the command queue is finished), stop timing. Then divide the elapsed time by the number of chase iterations.
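A rough sketch of what such a 10x-unrolled chase kernel and that timing procedure can look like (illustrative only, not the exact code from the post; kernel and argument names are assumed):

```cpp
// OpenCL C source for a dependent pointer chase, manually unrolled 10x.
// Every load feeds the next one, so the chain length sets the measured latency.
static const char *kChaseKernel = R"CLC(
__kernel void chase(__global const uint *buf, __global uint *result, uint iters)
{
    uint idx = 0;
    for (uint i = 0; i < iters; i += 10) {
        idx = buf[idx]; idx = buf[idx]; idx = buf[idx]; idx = buf[idx]; idx = buf[idx];
        idx = buf[idx]; idx = buf[idx]; idx = buf[idx]; idx = buf[idx]; idx = buf[idx];
    }
    result[0] = idx;   // keep the chain live so the compiler can't drop it
}
)CLC";

// Host side (sketch): enqueue the kernel with a single work-item, start the
// timer, clFinish() to block until the queue drains, stop the timer, then
// divide the elapsed time by `iters` to get nanoseconds per dependent load.
```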
I filled the list so that the pointer chasing happens with a fixed stride. I observed no prefetching at all on GPUs, so any stride that exceeds the burst read size (32B on Nvidia, 64B on AMD and Intel iGPUs) would be sufficient. I used a stride slightly over 4K to defeat CPU prefetchers, which won't prefetch past a 4K page boundary.
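A minimal sketch of such a fill, assuming 32-bit indices (a stride of, say, 4160 bytes would be "slightly over 4K"):

```cpp
#include <cstdint>
#include <vector>

// Fill the chase buffer so every entry points a fixed stride ahead (modulo the
// region size). Any stride beyond the GPU burst read size avoids hitting the
// same burst twice in a row; a stride just over 4 KB also stops CPU
// prefetchers, which won't follow accesses across a 4 KB page boundary.
std::vector<uint32_t> make_stride_chain(std::size_t bytes, std::size_t strideBytes)
{
    const std::size_t n      = bytes / sizeof(uint32_t);
    const std::size_t stride = strideBytes / sizeof(uint32_t);
    std::vector<uint32_t> buf(n);
    for (std::size_t i = 0; i < n; ++i)
        buf[i] = static_cast<uint32_t>((i + stride) % n);
    return buf;
}
```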
Some tests with varying (small) stride:
Interestingly enough there's no L3-level prefetching on Intel, as latency goes all the way up to memory latency once you use a stride greater than 64B. For CPUs I had to use a much bigger stride before I got results comparable to other sites/tests.
At first I thought that part of our methodology might make a difference, so I tried out your access pattern (sattolo). It's interesting how Nvidia gets a smoother rise in latency with your methodology, once cache capacity is exceeded. But the conclusion for cache/mem latency doesn't change, so that's not it:
I got someone to test (stride) with a Vega 64, and memory latency was around 338.57ns (without loop unrolling). Some things to try, probably in order:
- Run your code on a CPU as a sanity check (see the sketch after this list), and make sure results are in the same neighborhood as those of other tests/review sites
- Unroll the loop. Some GPUs have very high branch overhead and it's not hidden by memory latency. Terascale went down from ~680 ns to ~560 ns with a 10x unrolled loop. RDNA2 saw a smaller improvement (around 10ns less with unrolling). Don't expect the driver/compiler to do it for you - I checked the generated Terascale assembly and it didn't. Strangely this doesn't affect the Nvidia GPUs I tested, so either Nvidia has low branch overhead, or is able to overlap branch latency with cache/mem latency.
- What region size are you testing? With very large test sizes you might be running into TLB miss penalties. Modern GPUs use virtual memory (you can even check out Pascal PTE fields here: gp100-mmu-format.pdf (nvidia.github.io))
- Are you using 32-bit or 64-bit offsets? GPUs might be slower with 64-bit arithmetic, while 32-bit stuff is generally fast. Remember address calculation is part of latency here.
- Check what the clock tick function is actually counting, if you want to use that. For example, on CPUs you have the TSC (always counts at base clock) and AMD's APERF or Intel's fixed counter 1 (which count unhalted cycles). Alternatively, run for a very large number of iterations so the test takes at least a second or two, then use OS/library time-measuring functions. That's what I did. It's more portable anyway.
- Make sure your GPU isn't stuck in a low power state and running at very low frequency
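For the CPU sanity check mentioned above, a minimal sketch (assumed, not anyone's actual code) could look like this - chase the same buffer on the host and time it with std::chrono:

```cpp
#include <chrono>
#include <cstdint>
#include <vector>

// Time a dependent pointer chase on the CPU; the result should land in the
// same neighborhood as published CPU cache/memory latency numbers for the
// tested region size.
double chase_ns_per_load(const std::vector<uint32_t> &buf, std::uint64_t iters)
{
    using clk = std::chrono::steady_clock;
    std::uint32_t idx = 0;

    auto t0 = clk::now();
    for (std::uint64_t i = 0; i < iters; ++i)
        idx = buf[idx];                        // dependent load chain
    auto t1 = clk::now();

    volatile std::uint32_t sink = idx;         // keep the chain live
    (void)sink;
    return std::chrono::duration<double, std::nano>(t1 - t0).count()
           / static_cast<double>(iters);
}
```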
Oh and finally, it's great to see other people writing tests for this stuff! Keep at it! It's easy to make mistakes, and the more independent implementations we have, the less likely it is that we'll make the same mistake.
* ROCm / HIP on Vega 64
* 752 cycles (1 thread / workgroup), 1132 cycles (64 threads / workgroup). Suggesting an HBM2 latency of 461 ns (at the 1.63 GHz clock).
SCLK of 1630 MHz / 1.63 GHz. "clock()" returns the number of clock ticks in ROCm, measured by the GPU itself. buffer[0], buffer[1], and buffer[2] overwrite the original data but whatever, simple enough to do, lol.
"count" gives a precise count to the number of memory references. ptr = buffer[ptr] is the pointer-chasing code. buffer[] was setup using Sattolo's algorithm, as described before.
---------
- Check. CPU was ~10 ns for L3 cache, the number I expected.
- I did not unroll the loop. I checked the assembly, and no unrolling / optimization seems to have been done. Based on my understanding of memory latency, the AMD Vega 64 should have been mostly stalling on the "s_waitcnt" instruction (wait for memory). I expect GPUs to overlap branch latency with cache/mem latency, but you're right that it's not necessarily guaranteed, and the loop could very well be part of the measured latency equation. I'll give that a shot later.
- Various region sizes, from 1kB to 4GB. I forget the exact size for the 752-cycle number.
- 32-bit offset from a pointer. I'll have to check the assembly instructions to see if it compiles into a 64-bit offset.
- clock() compiles into the Vega ISA clock-tick function, either S_MEMTIME or S_MEMREALTIME. I admit I forget which off the top of my head, but it was a single instruction.
As you can see from the ROCm code, I run the "clock" over the entire set of iterations, and then divide (end - start) / count later in the code. I checked the status of the GPU during the benchmark: it goes up to 1630 MHz while running, and drops down to 1000 MHz or so afterwards.
Fortnite
Dirt 5
WoW RT
Godfall is a close call. You are comparing the wrong things.
Mem bandwidth on 5700XT is 448GB/s
Mem bandwidth on 6700XT is just 384GB/s
See what that cache is doing?
I did a batch of frame captures via Nvidia Nsight, and here's some data:
In all cases there are a lot of warps waiting on memory, but clearly that's not the whole story. Superposition, for example, still has decent throughput despite having so many warps stalled on memory latency, because there are a lot of pending warps the scheduler can choose from to feed the execution units. Unigine Heaven stalls less on memory latency, especially at 4K, but gets texture bound.
Anyway it's very hard to say for sure without profiling a workload, and even then interpreting the results is tricky. I suspect that GPUs like GA102 and Navi 21 are big enough that some workloads (particularly at 1080P) might not provide enough parallelism. But that's a guess in the end, since I don't have either of those GPUs to play with. Ugh, accidentally edited my previous reply out because TPU combined the posts and I thought it double posted.
Anyway, Vega_Shader_ISA_28July2017.pdf (amd.com) says S_MEMREALTIME doesn't count at core clocks. If clock() gets compiled to that, you'll need to determine what it's counting in proportion to. S_MEMTIME appears to be core clocks?
The 338.57 ns Vega 64 figure was for a 128 MB region. That person also overclocked to 1682 core, 1060 HBM and got 313.29 ns.
Our pointer chasing loops differ in that your loop branch condition depends on the memory load's result, meaning it can't overlap with memory latency. It'd have to be part of measured latency in your test, even on a CPU with OOO execution. But on a CPU it's not going to make more than a cycle or two of difference. I don't know about GPUs.
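To make the difference concrete, here are the two loop shapes side by side (illustrative, not either poster's exact code):

```cpp
#include <cstdint>

// Counter-based loop: the branch only tests i, so it can resolve while the
// next load is still outstanding; branch overhead can hide under memory latency.
std::uint32_t chase_counted(const std::uint32_t *buf, std::uint32_t iters)
{
    std::uint32_t idx = 0;
    for (std::uint32_t i = 0; i < iters; ++i)
        idx = buf[idx];
    return idx;
}

// Wrap-based loop: the branch tests the value that was just loaded, so it
// cannot resolve until the load completes - its cost becomes part of the
// measured latency.
std::uint32_t chase_until_wrap(const std::uint32_t *buf, std::uint32_t start)
{
    std::uint32_t idx = buf[start];
    while (idx != start)
        idx = buf[idx];
    return idx;
}
```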
And this dropped the cycle count down to 577 cycles, or 353 nanoseconds. I needed to add "buffer[3] = ptr" to prevent the loop from being optimized away (time == 0 for a few tries, lol). Running it multiple times makes this loop 60 clock ticks (~36 nanoseconds) faster on average.
Which is pretty close to your result for a 16MB region. I'm still using clock() on the actual GPU instead of the CPU clock, so I'll start investigating / disassembling the code to see if it's S_MEMTIME or whatever.
It's not yet perfectly replicated, but maybe there's a fundamental clock-speed difference (either HBM2 or GPU clock) which could very well account for the last 15 nanoseconds of difference.
-----------
Notes:
1. My initial 700+ cycle / 500 nanosecond result was from ROCm 3.x last year. Today, with a change to ROCm 4.0, it seems like the code might be faster (consistently under 700 cycles when I run the unedited HIP code). So maybe AMD had a few device driver tweaks that made things faster under the hood?
2. The loop is clearly unrolled now.
You can see the "global_load_dword" call, as well as the repeated s_waitcnt vmcnt(0) (wait for the outstanding-memory count to reach zero). The global_load is asynchronous: it starts the memory load "in the background", while the s_waitcnt actually forces the GPU core to wait for RAM.
So maybe the device driver change was just the compiler-upgrade to better unroll this loop automatically?
3. "global_load_dword v1, v[5:6], off" shows that the memory load is a 32-bit operation + a 64-bit base register v[5:6] is the base register (myBuffer). So that answers that question we had earlier...
---------------
Notes:
* Today I'm running ROCm 4.0
* extractkernel tool to grab the disassembly
* constexpr uint64_t runSize = (1 << 22); (4 million numbers x 4 bytes each == 16 MB tested)
Agreed. But here are some notes:
* Vega (and older) are 16 x 4 architecturally. That is: each vALU of the CU has 16-wide execution units, and those execution units repeat EVERY instruction 4-times, resulting in your 64-wide programming model.
* NVidia and RDNA are 32x1. The CUs are 32-wide naturally, and do none of the "repetition" stuff. As such, NVidia and RDNA are better tailored for low-occupancy situations.
* NVidia and RDNA both have stalls: NVidia tracks stalls at the assembly level. RDNA handles stalls automatically.
* GPUs are all 32-wide or above from a programming-model standpoint. That means even the smallest of reasonable workloads is 32-wide; if not, no one would even bother with the whole shader / DirectX / HLSL crap and would just program the thing on your big, beefy CPU, already optimized for single-threaded data. Additional CUDA blocks or OpenCL workgroups can provide additional latency hiding, but it should be noted that a single warp / single wavefront (with a good enough compiler and enough instruction-level parallelism at the C/C++ level) can in fact hide latency - see the sketch after the next paragraph.
As such, occupancy doesn't really mean much unless we know exactly the code that was being run. In fact, higher occupancy may be a bad thing: GPUs have to share their vGPRs / registers with other "occupants". So going occupancy 2, or occupancy 4+ means having 1/2 or 1/4th the registers. So the optimization of code (registers vs occupancy) is a very complicated problem.
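To illustrate the single-wavefront ILP point, a hedged sketch: two independent chase chains per thread let the compiler keep two loads in flight, so their latencies overlap (buffer names and signature are illustrative):

```cpp
#include <hip/hip_runtime.h>
#include <cstdint>

// Two dependent-load chains in one thread. Because chain B does not depend on
// chain A, the compiler can issue both global loads before waiting on either,
// overlapping the two memory latencies within a single wavefront - no extra
// occupancy required.
__global__ void chase_two_chains(const uint32_t *bufA, const uint32_t *bufB,
                                 uint32_t *out, uint32_t iters)
{
    uint32_t a = 0, b = 0;
    for (uint32_t i = 0; i < iters; ++i) {
        a = bufA[a];   // chain A
        b = bufB[b];   // chain B, independent of chain A
    }
    out[0] = a;        // keep both chains live
    out[1] = b;
}
```

Whether the compiler actually schedules it that way depends on where it places the s_waitcnt instructions, which is worth confirming in the disassembly.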
16 MB worries me a bit, since Vega 64 has a 4 MB L2. With a random access pattern, you may get (un)lucky 1/4 of the time and hit cache, which is why I used 128 MB as a reference point for mem latency (or 1 GB for RDNA 2 since it went that far before mem latency leveled off). That seems to be more of a problem with Sattolo's algorithm than using a fixed stride, judging by how the GTX 980 Ti saw a slower jump in latency after going past 3 MB:
For variable latency (memory) stalls, both use barriers like s_waitcnt. It's just part of static scheduling info on Nvidia. Yes that's true, you can get ILP within a single wavefront/warp through nonblocking loads and pipelined execution. Also Intel Gen is unique here - it can use variable SIMD widths from 4-wide to 32-wide.
I got achieved occupancy through Nsight. If occupancy were limited by vGPR or shmem (LDS) pressure, I'd just see lower occupancy. That may be why Unigine Heaven gets lower occupancy at 4K than at 1080P, even though there should be more parallelism available at 4K.
-------
I'm a bit curious about the slowdown at 2kB.
I would guess that the slowdown of Sattolo's at 2kB is because maybe... the code-cache may share some cache with the data-cache? So maybe the 4kB region is data+code?? Just guessing here.
I don't think code and data share the same cache until L2. On Nvidia, constant data and code share a L1.5 cache, which is 32 KB per SM on GP104 according to 1903.07486.pdf (arxiv.org), but it looks like OpenCL sees none of that (thanks NV). That's also a fun read by the way. They say GP104 L2 hit latency is ~216c using CUDA, handwritten SASS, and %%clock for timing. I get ~266c from my higher level OpenCL test.
Switching it to a malloc/free probably would have worked. But alas, it's late and time for bed, lol. Needless to say, the code is messy and primarily geared at the x64 vs x1 scenario (which is what I was most interested in: how the memory bandwidth / latency scale with higher work-item counts).
Anyway, maybe I'll try that tomorrow.
Ah, I see my mistake. That's 2MB vs 4MBs. Gotcha. I thought it was 2kB and 4kB there, but I guess I didn't read the "KB" correctly.
So yeah, the 3MB test-point, where Sattolo's method is slightly slower. Maybe that's instruction-cache taking up some space? I dunno what else it could be.
I agree with you. Nvidia's texture cache isn't so easily harnessed, even developers ask around for that.
The L1 texture cache on Kepler/Maxwell/Pascal not showing up in my tests is a different issue. You have to opt-in to use it in CUDA by passing "-Xptxas -dlcm=ca" (without quotes) on the nvcc command line. I tried for a while to make it work in CUDA with zero success. I got zero L1 tex cache hits even with those parameters passed to nvcc. And there's no way to use the L1 tex cache at all from OpenCL.
;) All information from the "paid" Linus channel is useless....
Just seeing his face when reviewing the RTX 3000 cards said it all XD
Do you need someone to comprehend a sentence for you?
FPS ≠ Frametime
If you need me to break it down for you even more..... You should probably pick up a comprehension class
Been there, done that for others who consider AMD the best. AMD has a long way to go in order to take the lead. They have a cross-license with Nvidia, which means they have access to anything Nvidia has put forward in the past. Likewise, AMD will grant Nvidia access to their own innovations in the future.
First, we have to see AMD take the lead and then stay there as Nvidia starts to pick up the spoils of AMD's success. I wouldn't say they didn't do much, but a lot has come from their Nvidia partnership, enough to discount the influence, imo. It is different. You can use compression to target buffered writes, which reduces bandwidth calls, or to improve bandwidth utilisation to the memory devices; however you want to look at it. It cannot buffer random color data, but for static bits of color information, the ROP units can churn out more pixels.
The difference with the way Nvidia does it is that, up until recently - I mean until Vega debuted - they were the first with tile-based rendering and buffered memory access. When accesses to memory are buffered, direct "immediate mode" access is gone: the GPU won't touch memory directly and instead goes through the L2. You have to store such data in buffers in order to send it to memory in packets, and that's where the texture cache comes in. L0 buffers textures, then sends them over to L2; the ROP units then issue writes to memory, always with the same buffering protocol, so there is no sparse access. The code bubbles - or rasterizer bubbles, I should say - don't cause memory-access bubbles, since the memory interface is detached from the execution units and only accepts fully buffered requests.
Not an engineer, this is the best I can make of it.
To put it into perspective, the 5600XT has 25% less memory bandwidth than the vanilla 5700 (most of them use the same 14 Gbps GDDR6 as the 5700), and it shares the same 2304-shader configuration. Yet, despite similar game clocks, it is within 5% of the 5700's performance, falling behind only at 4K, where the GPU doesn't have enough shader strength anyway - it struggled at 4K even in the 2019 games of its day. So the extra bandwidth of the 5700 and 5700XT is academic in pretty much any game, past and present. You could double it to 900GB/s and the 5700XT still wouldn't be any faster.
So let's be clear: if you underclocked the VRAM on a 5700XT to bring its bandwidth down to the 384GB/s of the 6700XT, I'd bet money on the performance drop being almost undetectable - within margin of error, most likely. The 448GB/s of the 5700XT is a paper spec that only rears its head in synthetic VRAM bandwidth-measuring tests or Ethereum mining, where bandwidth is all that matters (and miners would turn the GPU core clocks down to 1MHz to save power if that were even possible). I'm not a GPU architect, but from having read several interviews with GPU architects and deep-dives into architecture, it's clear that you can design a more efficient architecture that does more at a given clockspeed - this is what drives architectural progress - and you can tweak a previous-generation architecture's layout with the hindsight of knowing which areas limited clockspeeds in that generation.
One is a whole new architecture designed to be better than the old architecture, the other is a band-aid that refines the old architecture by picking the low-hanging fruit. I'm not saying that either approach is bad, but you can only take the band-aid approach once; If you tweak the architecture properly to optimise layout based on previous gen experience, there's no low-hanging fruit to pick for a second go around.
A single one from a source at least remotely reputable would do. You started with "it does not" and it appears you are assuming it with high confidence.
High confidence is cool, but let's not mix facts and fictions, shall we?