Monday, April 19th 2021

GPU Memory Latency Tested on AMD's RDNA 2 and NVIDIA's Ampere Architecture

Graphics cards have evolved over the years to feature multi-level cache hierarchies. These cache levels are engineered to bridge the gap between memory and compute, a growing problem that cripples GPU performance in many applications. Different GPU vendors, like AMD and NVIDIA, size their register files and L1 and L2 caches differently depending on the architecture. For example, NVIDIA's A100 GPU carries 40 MB of L2 cache, nearly seven times as much as the previous-generation V100. That shows how much modern applications demand of the cache, and sizes keep growing to meet those needs.

Today, we have an interesting report coming from Chips and Cheese. The website has measured the GPU memory latency of the latest generation of cards - AMD's RDNA 2 and NVIDIA's Ampere - using simple pointer-chasing tests in OpenCL, with interesting results. RDNA 2 cache is fast and massive. Compared to Ampere, cache latency is much lower, while VRAM latency is about the same. NVIDIA uses a two-level cache system consisting of L1 and L2, which seems to be a rather slow solution: a request that misses the L1 cache inside an SM and travels out to the L2 takes over 100 ns.
AMD, on the other hand, has a three-level cache system: L0, L1, and L2 levels complement the RDNA 2 design. The latency from L0 out to L2, even with L1 between them, is just 66 ns. Infinity Cache, which is essentially an L3 cache, adds only about 20 ns of additional latency, keeping it faster than NVIDIA's cache solutions. NVIDIA's massive GA102 die appears to be a problem for its L2 cache, as requests take many cycles to traverse it. You can read more about the test here.
Source: Chips and Cheese

92 Comments on GPU Memory Latency Tested on AMD's RDNA 2 and NVIDIA's Ampere Architecture

#51
dragontamer5788
I tested pointer-chasing on Vega 64 with ROCm, and got numbers in excess of 700 cycles (single-thread) and 1100 cycles (64 threads/workgroup) to HBM2 RAM. Given that Vega is ~1.5 GHz, that's basically 500+ ns of latency to main memory.

Aka: my numbers are grossly wrong compared to whatever they're doing in their tests. It'd be nice if people shared their pointer-chasing code, to ensure that we're actually comparing apples-to-apples. Both NVidia and AMD provide "count clock tick" functions in CUDA / ROCm, so these tests are actually rather easy to conduct. But if our code differs, then we will have grossly different results.

EDIT: My methodology is to use (danluu.com/sattolo/), aka Sattolo's algorithm, to randomize a perfect cycle (a random list of pointers where the cycle length equals the length of the list). Ex: if you have a list of size 100, then node = node->next will cycle after exactly 100 items. This prevents prefetching shenanigans on CPUs (and maybe GPUs, even though I don't think GPUs have prefetching). Since it's random and 100% dependent on the RAM, there's no way for the prefetch-predictor to predict the pointer-fetch ahead of time.
Posted on Reply
#52
MxPhenom 216
ASIC Engineer
PunkenjoyYou are right. Actually, making a fast cache is way more complicated than it looks. You need a mechanism that checks the cache to know whether the data you are trying to access is there. The larger the cache, the more work you have to do to figure out whether it contains the data you are looking for.

This can add latency. The fact that AMD gets lower latency even with more cache levels shows how well they have mastered caching. They purposely put a lot of effort there, because this is a key thing for multi-chip modules.
Right, cache hit/miss logic
Posted on Reply
#53
Punkenjoy
Vya DomusYou're mixing things up, there isn't necessarily a relation between bus width and memory size; there is a relationship between bus width and memory configuration, whose size can be anything as long as it is the correct multiple. There is a relation between the bus width and the memory controllers, because each controller has to match with a connection to one chip.

There is no fundamental relation between global memory (VRAM) and Infinity Cache either; it's up to AMD to decide how much cache they want, it's just a victim cache.
Note that my comment was a hypothesis, because neither you nor I really know whether there is a link between the two. You say there is none, but you have nothing that proves your assessment.

But the fact that we have:
NAVI 21: 256-bit, 128 MB IC
NAVI 22: 192-bit, 96 MB IC
and, if rumours are true, NAVI 23: 128-bit, 64 MB IC
seems to point to a direct link between IC size and memory bus width.
Posted on Reply
#54
Steevo
Colddeckeddid you mean Nvidia?
I did, but AMD does have the upper hand in 99th percentile frame rates; IC improves their frame times significantly compared to their old idea of “more cores”.

Since IC runs at the same speed as the core clock and it's built in, it makes sense AMD tuned its performance to match what the process and die needed; it's all their own product.
Posted on Reply
#55
chlamchowder
dragontamer5788I tested pointer-chasing on Vega 64 with ROCm, and got numbers in excess of 700 cycles (single-thread) and 1100 cycles (64 threads/workgroup) to HBM2 RAM. Given that Vega is ~1.5 GHz, that's basically 500+ ns of latency to main memory.

Aka: my numbers are grossly wrong compared to whatever they're doing in their tests. It'd be nice if people shared their pointer-chasing code, to ensure that we're actually comparing apples-to-apples. Both NVidia and AMD provide "count clock tick" functions in CUDA / ROCm, so these tests are actually rather easy to conduct. But if our code differs, then we will have grossly different results.

EDIT: My methodology is to use (danluu.com/sattolo/), aka Sattolo's algorithm, to randomize a perfect cycle (a random list of pointers where the cycle length equals the length of the list). Ex: if you have a list of size 100, then node = node->next will cycle after exactly 100 items. This prevents prefetching shenanigans on CPUs (and maybe GPUs, even though I don't think GPUs have prefetching). Since it's random and 100% dependent on the RAM, there's no way for the prefetch-predictor to predict the pointer-fetch ahead of time.
I'm the author of the Chips and Cheese article. To summarize the code (it's a bit of a mess, lots of OpenCL scaffolding):

I have two versions of the OpenCL kernel - one with the loop unrolled 10x (below), and a simple one. Timing is basically: enqueue kernel, start timing, clFinish (block until the command queue is finished), stop timing. Then divide the elapsed time by the number of chase iterations.
__kernel void unrolled_latency_test(__global const int* A, int count, __global int* ret) {
    int current = A[0];
    int result = 0;
    for (int i = 0; i < count; i += 10)
    {
        result += current;
        current = A[current];
        // ... repeat the two lines above, 9 more times
    }

    ret[0] = result; // prevent the compiler from optimizing it away
}

I filled the list so the pointer chasing happens with a fixed stride. I observed that there was no prefetching at all on GPUs, so a stride that exceeds the burst read (32B on Nvidia, 64B on AMD and Intel iGPUs) is sufficient. I used a stride slightly over 4K to defeat CPU prefetchers, which won't prefetch past a 4K page boundary.
uint32_t stride = 1211;
int* A = (int*)malloc(sizeof(int) * list_size);
for (int i = 0; i < list_size; i++)
{
    A[i] = (i + stride) % list_size;
}


Some tests with varying (small) stride:

Interestingly enough, there's no L3-level prefetching on Intel, as latency goes all the way up to memory latency once you use a stride greater than 64B. For CPUs, I had to use a much bigger stride before I got results comparable to other sites/tests.

At first I thought that part of our methodology might make a difference, so I tried out your access pattern (sattolo). It's interesting how Nvidia gets a smoother rise in latency with your methodology, once cache capacity is exceeded. But the conclusion for cache/mem latency doesn't change, so that's not it:


I got someone to test (stride) with a Vega 64, and memory latency was around 338.57ns (without loop unrolling). Some things to try, probably in order:
  1. Run your code on a CPU as a sanity check, and make sure results are in the same neighborhood as that of other tests/review sites
  2. Unroll the loop. Some GPUs have very high branch overhead and it's not hidden by memory latency. Terascale went down from ~680 ns to ~560 ns with a 10x unrolled loop. RDNA2 saw a smaller improvement (around 10ns less with unrolling). Don't expect the driver/compiler to do it for you - I checked the generated Terascale assembly and it didn't. Strangely this doesn't affect the Nvidia GPUs I tested, so either Nvidia has low branch overhead, or is able to overlap branch latency with cache/mem latency.
  3. What region size are you testing? With very large test sizes you might be running into TLB miss penalties. Modern GPUs use virtual memory (you can even check out Pascal PTE fields here: gp100-mmu-format.pdf (nvidia.github.io))
  4. Are you using 32-bit or 64-bit offsets? GPUs might be slower with 64-bit arithmetic, while 32-bit stuff is generally fast. Remember address calculation is part of latency here.
  5. Check what the clock tick function is actually counting, if you want to use it. For example, on CPUs you have TSC (always counts at base clock), AMD APERF, or Intel fixed counter 1 (counts unhalted cycles). Alternatively, run enough iterations that the test takes at least a second or two, then use OS/library time-measuring functions. That's what I did. It's more portable anyway.
  6. Make sure your GPU isn't stuck in a low power state and running at very low frequency
Oh and finally, it's great to see other people writing tests for this stuff! Keep at it! It's easy to make mistakes, and the more independent implementations we have, the less likely it is that we'll make the same mistake.
Posted on Reply
#56
dragontamer5788
chlamchowderI got someone to test (stride) with a Vega 64, and memory latency was around 338.57ns (without loop unrolling). Some things to try, probably in order:
I appreciate that. Here's my numbers and methodology.

* ROCm / HIP on Vega64
* 752 cycles (1 thread/workgroup), 1132 cycles (64 threads/workgroup), suggesting an HBM2 latency of 461 ns (1.63 GHz clock).

__global__ void test(uint32_t* buffer){
    clock_t start, end;
    uint32_t count = 0;
    uint32_t ptr = buffer[0];

    start = clock();

    while(ptr != 0){
        count++;
        ptr = buffer[ptr];
    }

    end = clock();

    buffer[0] = count;
    buffer[1] = ((end-start) >> 32) & 0xFFFFFFFF;
    buffer[2] = (end-start) & 0xFFFFFFFF;
}

========================ROCm System Management Interface========================
================================================================================
GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
0 47.0c 61.0W 1630Mhz 945Mhz 17.65% auto 220.0W 8% 100%
================================================================================

SCLK of 1630 MHz / 1.63 GHz. "clock()" returns the number of clock ticks in ROCm, measured by the GPU itself. buffer[0], buffer[1], and buffer[2] overwrite the original data but whatever, simple enough to do, lol.

"count" gives a precise count to the number of memory references. ptr = buffer[ptr] is the pointer-chasing code. buffer[] was setup using Sattolo's algorithm, as described before.

---------
Run your code on a CPU as a sanity check, and make sure results are in the same neighborhood as that of other tests/review sites
Check. CPU was ~10ns for L3 cache, the number I expected.
Unroll the loop. Some GPUs have very high branch overhead and it's not hidden by memory latency. Terascale went down from ~680 ns to ~560 ns with a 10x unrolled loop. RDNA2 saw a smaller improvement (around 10ns less with unrolling). Don't expect the driver/compiler to do it for you - I checked the generated Terascale assembly and it didn't. Strangely this doesn't affect the Nvidia GPUs I tested, so either Nvidia has low branch overhead, or is able to overlap branch latency with cache/mem latency.
I did not unroll the loop. I checked the assembly, and no unrolling/optimization seems to have been done. Based on my understanding of memory latency, the AMD Vega 64 should have been mostly stalling on the "s_waitcnt" instruction (wait for memory). I expect GPUs to overlap branch latency with cache/mem latency, but you're right that it's not necessarily guaranteed, and the loop could very well be part of the measured latency equation. I'll give that a shot later.
What region size are you testing? With very large test sizes you might be running into TLB miss penalties. Modern GPUs use virtual memory (you can even check out Pascal PTE fields here: gp100-mmu-format.pdf (nvidia.github.io))
Various region sizes, from 1 kB to 4 GB. I forget the exact size for the 752-cycle number.
Are you using 32-bit or 64-bit offsets? GPUs might be slower with 64-bit arithmetic, while 32-bit stuff is generally fast. Remember address calculation is part of latency here.
32-bit offset from a pointer. I'll have to check the assembly instructions to see if it compiles into a 64-bit offset.
Check what the clock tick function is actually counting, if you want to use that? For example on CPUs, you have TSC (always counts at base clock), AMD APERF or Intel fixed counter 1 (counts number of unhalted cycles). Alternatively, run for a very large number of iterations to make the test run for at least a second or two, then use OS/library time measuring functions. That's what I did. It's more portable anyway.
clock() compiles into the Vega-ISA clock-tick function, either S_MEMTIME or S_MEMREALTIME. I admit that I forget off the top of my head, but it was a single-instruction.

As you can see from the ROCm code, I run the "clock" over the entire set of iterations, and then divide later (end-start) / count in some later code.
Make sure your GPU isn't stuck in a low power state and running at very low frequency
I checked the status of the GPU during the benchmark. The GPU goes up to 1630 MHz during the benchmark, and drops down to 1000MHz or so after the benchmark.
Posted on Reply
#57
medi01
This is exactly the title I'd expect when AMD is wiping the floor with NV at mem latency.
TheinsanegamerNI mean outside of raytracing, the 3080 loses to the 6900xt and 6800xt at 1440p,
It tends to lose in RT in newer games, created with RDNA2 in mind:

Fortnite
Dirt 5
WoW RT

Godfall is a close call.
Chrispy_TBH the cache in RDNA2 is less about performance this gen and more about setting up for chiplets. It's not 100% useless but IPC differences between the 6700XT and similar 5700XT without the cache are really low. Sometimes zero, sometimes negligible. The performance uplift is almost entirely down to the 25-30% increase in clockspeeds.


It's a marketing point for now that will lay the groundwork for MCM GPUs next gen. Presumably it makes things smoother for raytracing too, as the calculations now involve lookups for more data than just what's relevant to the pixels any particular CU is working on, ergo more data being required - but for traditional raster-based stuff, the HWU video above shows how little benefit it brings to this generation.
You are comparing the wrong things.

Mem bandwidth on 5700XT is 448GB/s
Mem bandwidth on 6700XT is just 384GB/s

See what that cache is doing?
Posted on Reply
#58
chlamchowder
THANATOSCare to elaborate how big an impact latency has on GPU performance?
BTW, Nvidia has higher bandwidth than AMD, and in the high end (GA102) it's significantly higher, but you ignore this.
As I mentioned in the original article (Measuring GPU Memory Latency – Chips and Cheese), latency starts to matter when occupancy (available parallelism) is too low to hide latency. Exactly how often that happens and to what extent is going to vary a lot with workload.

I did a batch of frame captures via Nvidia Nsight, and here's some data:



In all cases there are a lot of warps waiting on memory, but clearly that's not the whole story. Superposition, for example, still has decent throughput despite having so many warps stalled on memory latency, because there are a lot of pending warps the scheduler can choose from to feed the execution units. Unigine Heaven stalls less on mem latency, especially at 4K, but gets texture bound.

Anyway it's very hard to say for sure without profiling a workload, and even then interpreting the results is tricky. I suspect that GPUs like GA102 and Navi 21 are big enough that some workloads (particularly at 1080P) might not provide enough parallelism. But that's a guess in the end, since I don't have either of those GPUs to play with.
dragontamer5788I appreciate that. Here's my numbers and methodology.

* ROCm / HIP on Vega64
* 752 cycles (1 thread / workgroup), 1132 cycles (64 threads/workgroup). Suggesting a HBM2 latency of 461ns (1.63 GHz clock).

__global__ void test(uint32_t* buffer){
    clock_t start, end;
    uint32_t count = 0;
    uint32_t ptr = buffer[0];

    start = clock();

    while(ptr != 0){
        count++;
        ptr = buffer[ptr];
    }

    end = clock();

    buffer[0] = count;
    buffer[1] = ((end-start) >> 32) & 0xFFFFFFFF;
    buffer[2] = (end-start) & 0xFFFFFFFF;
}

========================ROCm System Management Interface========================
================================================================================
GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
0 47.0c 61.0W 1630Mhz 945Mhz 17.65% auto 220.0W 8% 100%
================================================================================

SCLK of 1630 MHz / 1.63 GHz. "clock()" returns the number of clock ticks in ROCm, measured by the GPU itself. buffer[0], buffer[1], and buffer[2] overwrite the original data but whatever, simple enough to do, lol.

"count" gives a precise count to the number of memory references. ptr = buffer[ptr] is the pointer-chasing code. buffer[] was setup using Sattolo's algorithm, as described before.

---------



Check. CPU was ~10ns for L3 cache, the number I expected.



I did not unroll the loop. I checked the assembly, and no unrolling / optimization seems to have been done. Based on my understanding of memory-latency, the AMD Vega64 should have been mostly stalling on the "s_waitcnt" instruction (wait for memory). I expect GPUs to overlap branch latency with cache/mem latency, but you're right that its not necessarily guaranteed. But you're right in that the loop could very well be part of the measured latency equation. I'll give that a shot later.



Various regions, from 1kB to 4GBs. I forgot the size for the 752 cycle number precisely.



32-bit offset from a pointer. I'll have to check the assembly instructions to see if it compiles into a 64-bit offset.



clock() compiles into the Vega-ISA clock-tick function, either S_MEMTIME or S_MEMREALTIME. I admit that I forget off the top of my head, but it was a single-instruction.

As you can see from the ROCm code, I run the "clock" over the entire set of iterations, and then divide later (end-start) / count in some later code.



I checked the status of the GPU during the benchmark. The GPU goes up to 1630 MHz during the benchmark, and drops down to 1000MHz or so after the benchmark.
Ugh, accidentally edited my previous reply out because TPU combined the posts and I thought it double posted.

Anyway, Vega_Shader_ISA_28July2017.pdf (amd.com) says S_MEMREALTIME doesn't count at core clocks. If clock() gets compiled to that, you'll need to determine what it's counting in proportion to. S_MEMTIME appears to count core clocks?

The 338.57 ns Vega 64 figure was for a 128 MB region. That person also overclocked to 1682 core, 1060 HBM and got 313.29 ns.

Our pointer chasing loops differ in that your loop branch condition depends on the memory load's result, meaning it can't overlap with memory latency. It'd have to be part of measured latency in your test, even on a CPU with OOO execution. But on a CPU it's not going to make more than a cycle or two of difference. I don't know about GPUs.
Posted on Reply
#59
dragontamer5788
chlamchowderThe 338.57 ns Vega 64 figure was for a 128 MB region. That person also overclocked to 1682 core, 1060 HBM and got 313.29 ns.
I just measured 634 clock ticks on a 16 MB region, or a ~388 ns figure, which is at least in the same region as your result. I tried -O2 and no optimizations; it didn't change anything.
Our pointer chasing loops differ in that your loop branch condition depends on the memory load's result, meaning it can't overlap with memory latency. It'd have to be part of measured latency in your test, even on a CPU with OOO execution. But on a CPU it's not going to make more than a cycle or two of difference. I don't know about GPUs.
That's probably it. To provide for more adequate "loop-hiding", I changed the code into the following:

start = clock();

while(count != runSize){
    count++;
    ptr = myBuffer[ptr];
}

end = clock();

And this dropped the cycle count down to 577 cycles, or 353 nanoseconds. I needed to add "buffer[3] = ptr" to prevent the loop from being optimized away (time == 0 for a few tries, lol). Running multiple times makes this loop 60 clock ticks (~36 nanoseconds) faster on average.

Which is pretty close to your result for a 16 MB region. I'm still using clock() on the actual GPU instead of the CPU clock, so I'll start investigating/disassembling the code to see if it's S_MEMTIME or whatever.

It's not yet perfectly replicated, but maybe there's a fundamental clock-speed difference (either HBM2 or GPU clock), which could very well account for the last 15 nanoseconds of difference.

-----------

Notes:

1. My initial 700+ cycle / 500 nanosecond result was from ROCm 3.x last year. Today, with the change to ROCm 4.0, the code seems to be faster (consistently under 700 cycles when I run the unedited HIP code). So maybe AMD made a few device-driver tweaks under the hood?

2. The loop is clearly unrolled now.

0000000000001038 <BB0_1>:
s_waitcnt vmcnt(0) // 000000001038: BF8C0F70
v_lshlrev_b64 v[5:6], 2, v[1:2] // 00000000103C: D28F0005 00020282
s_sub_i32 s4, s4, 32 // 000000001044: 8184A004
v_add_co_u32_e32 v5, vcc, v3, v5 // 000000001048: 320A0B03
v_addc_co_u32_e32 v6, vcc, v4, v6, vcc // 00000000104C: 380C0D04
global_load_dword v1, v[5:6], off // 000000001050: DC508000 017F0005
s_cmp_lg_u32 s4, 0 // 000000001058: BF078004
s_waitcnt vmcnt(0) // 00000000105C: BF8C0F70
v_lshlrev_b64 v[5:6], 2, v[1:2] // 000000001060: D28F0005 00020282
v_add_co_u32_e32 v5, vcc, v3, v5 // 000000001068: 320A0B03
v_addc_co_u32_e32 v6, vcc, v4, v6, vcc // 00000000106C: 380C0D04
global_load_dword v1, v[5:6], off // 000000001070: DC508000 017F0005
s_waitcnt vmcnt(0) // 000000001078: BF8C0F70
v_lshlrev_b64 v[5:6], 2, v[1:2] // 00000000107C: D28F0005 00020282
v_add_co_u32_e32 v5, vcc, v3, v5 // 000000001084: 320A0B03
v_addc_co_u32_e32 v6, vcc, v4, v6, vcc // 000000001088: 380C0D04
global_load_dword v1, v[5:6], off // 00000000108C: DC508000 017F0005
s_waitcnt vmcnt(0) // 000000001094: BF8C0F70
v_lshlrev_b64 v[5:6], 2, v[1:2] // 000000001098: D28F0005 00020282

You can see the "global_load_dword" call, as well as the repeated s_waitcnt vmcnt(0) (wait for outstanding memory-count zero). The global_load is asynchronous: it starts the memory load "in the background", while the s_waitcnt actually forces the GPU-core to wait for RAM.

So maybe the device driver change was just the compiler-upgrade to better unroll this loop automatically?

3. "global_load_dword v1, v[5:6], off" shows that the memory load is a 32-bit operation with a 64-bit base register: v[5:6] is the base register (myBuffer). So that answers the question we had earlier...

---------------

Notes:

* Today I'm running ROCm 4.0
* extractkernel tool to grab the disassembly
* constexpr uint64_t runSize = (1 << 22); (4-million numbers x 4-bytes each == 16MBs tested).
chlamchowderAs I mentioned in the original article (Measuring GPU Memory Latency – Chips and Cheese), latency starts to matter when occupancy (available parallelism) is too low to hide latency. Exactly how often that happens and to what extent is going to vary a lot with workload.

[snip]

Anyway it's very hard to say for sure without profiling a workload, and even then interpreting the results is tricky. I suspect that GPUs like GA102 and Navi 21 are big enough that some workloads (particularly at 1080P) might not provide enough parallelism. But that's a guess in the end, since I don't have either of those GPUs to play with.
Agreed. But here's some notes.

* Vega (and older) are 16 x 4 architecturally. That is: each vALU of the CU has 16-wide execution units, and those execution units repeat EVERY instruction 4-times, resulting in your 64-wide programming model.

* NVidia and RDNA are 32x1. The CUs are 32-wide naturally, and do none of the "repetition" stuff. As such, NVidia and RDNA are better tailored for low-occupancy situations.

* NVidia and RDNA both have stalls: NVidia tracks stalls at the assembly level. RDNA handles stalls automatically.

* GPUs are all 32-wide or wider in the programming model. That means even the smallest of reasonable workloads is 32-wide; if it weren't, no one would bother with the whole shader / DirectX / HLSL crap and would just run the stuff on a big, beefy CPU already optimized for single-threaded work. Additional CUDA blocks or OpenCL workgroups can provide additional latency hiding, but it should be noted that a single warp / single wavefront (with a good enough compiler and enough instruction-level parallelism at the C/C++ level) can in fact hide latency.

As such, occupancy doesn't really mean much unless we know exactly what code was being run. In fact, higher occupancy may be a bad thing: GPUs have to share their vGPRs/registers with the other "occupants", so occupancy 2 or 4+ means having 1/2 or 1/4 of the registers. Optimizing code for registers vs. occupancy is a very complicated problem.
Posted on Reply
#60
chlamchowder
dragontamer5788I just tested 634 clock ticks on a 16MB region, or a ~388ns figure. Which is at least in the same region that you've tested. I tried with -O2 and no-optimizations, didn't change anything.



That's probably it. To provide for more adequate "loop-hiding", I changed the code into the following:

start = clock();

while(count != runSize){
    count++;
    ptr = myBuffer[ptr];
}

end = clock();

And this dropped the cycle count down to 577 cycles, or 353 nanoseconds. I needed to add "buffer[3] = ptr" to prevent the loop from being optimized away (time == 0 for a few tries, lol). Running multiple times makes this loop 60-clock ticks (~36 nanoseconds) faster on the average.
Oh goodie. At least we're in the same ballpark now. That last bit could be a clock speed difference, yeah.

16 MB worries me a bit, since Vega 64 has a 4 MB L2. With a random access pattern, you may get (un)lucky about 1/4 of the time and hit cache, which is why I used 128 MB as a reference point for mem latency (or 1 GB for RDNA 2, since latency kept climbing that far before leveling off). That seems to be more of a problem with Sattolo's algorithm than with a fixed stride, judging by how the GTX 980 Ti saw a slower jump in latency after going past 3 MB:
dragontamer5788Which is pretty close to your result for a 16MB region. I'm still using clock() on the actual GPU instead of the CPU-clock, so I'll start investigating / disassembling the code to see if its S_MEMTIME or whatever.

Its not yet perfectly replicated, but maybe there's a fundamental clock-speed difference (either HBM2 or GPU-clock) which could very well be the last 15-nanoseconds worth of difference.

-----------

Notes:

1. My initial 700+ cycle / 500 nanosecond result was from ROCm 3.x last year. Today, with a change to ROCm 4.0, it seems like the code might be faster (consistently under 700 cycles when I run the unedited HIP code). So maybe AMD had a few device driver tweaks that made things faster under the hood?

2. The loop is clearly unrolled now.

0000000000001038 <BB0_1>:
s_waitcnt vmcnt(0) // 000000001038: BF8C0F70
v_lshlrev_b64 v[5:6], 2, v[1:2] // 00000000103C: D28F0005 00020282
s_sub_i32 s4, s4, 32 // 000000001044: 8184A004
v_add_co_u32_e32 v5, vcc, v3, v5 // 000000001048: 320A0B03
v_addc_co_u32_e32 v6, vcc, v4, v6, vcc // 00000000104C: 380C0D04
global_load_dword v1, v[5:6], off // 000000001050: DC508000 017F0005
s_cmp_lg_u32 s4, 0 // 000000001058: BF078004
s_waitcnt vmcnt(0) // 00000000105C: BF8C0F70
v_lshlrev_b64 v[5:6], 2, v[1:2] // 000000001060: D28F0005 00020282
v_add_co_u32_e32 v5, vcc, v3, v5 // 000000001068: 320A0B03
v_addc_co_u32_e32 v6, vcc, v4, v6, vcc // 00000000106C: 380C0D04
global_load_dword v1, v[5:6], off // 000000001070: DC508000 017F0005
s_waitcnt vmcnt(0) // 000000001078: BF8C0F70
v_lshlrev_b64 v[5:6], 2, v[1:2] // 00000000107C: D28F0005 00020282
v_add_co_u32_e32 v5, vcc, v3, v5 // 000000001084: 320A0B03
v_addc_co_u32_e32 v6, vcc, v4, v6, vcc // 000000001088: 380C0D04
global_load_dword v1, v[5:6], off // 00000000108C: DC508000 017F0005
s_waitcnt vmcnt(0) // 000000001094: BF8C0F70
v_lshlrev_b64 v[5:6], 2, v[1:2] // 000000001098: D28F0005 00020282

You can see the "global_load_dword" call, as well as the repeated s_waitcnt vmcnt(0) (wait for outstanding memory-count zero). The global_load is asynchronous: it starts the memory load "in the background", while the s_waitcnt actually forces the GPU-core to wait for RAM.

So maybe the device driver change was just the compiler-upgrade to better unroll this loop automatically?

3. "global_load_dword v1, v[5:6], off" shows that the memory load is a 32-bit operation + a 64-bit base register v[5:6] is the base register (myBuffer). So that answers that question we had earlier...
Cool. Good to see ROCm unrolling the loop. Maybe Nvidia does that under the hood as well, because I get identical results with and without unrolling, but I have no way to extract assembly from compiled OpenCL kernels. Only AMD CodeXL does that, and I can only get it working on Terascale. Ugh.
dragontamer5788Agreed. But here's some notes.

* Vega (and older) are 16 x 4 architecturally. That is: each vALU of the CU has 16-wide execution units, and those execution units repeat EVERY instruction 4-times, resulting in your 64-wide programming model.
The way I saw it was each 16-wide SIMD can execute a 64-wide wavefront with latency = reciprocal throughput = 4 cycles.
dragontamer5788* NVidia and RDNA are 32x1. The CUs are 32-wide naturally, and do none of the "repetition" stuff. As such, NVidia and RDNA are better tailored for low-occupancy situations.
Yep, if you only have one wavefront active in a CU, you're limited to 1 wavefront every 4 cycles. On RDNA and Nvidia Pascal/Ampere, you get one warp every cycle. On Nvidia Turing, you get one warp every two cycles.
dragontamer5788* NVidia and RDNA both have stalls: NVidia tracks stalls at the assembly level. RDNA handles stalls automatically.
You mean execution stalls? Nvidia encodes stall cycles into static scheduling info, visible in disassembly in words without associated instructions for Kepler/Maxwell/Pascal, and tacked on to instructions in Turing/Ampere. Fermi/GCN/RDNA track execution dependencies in hardware.

For variable latency (memory) stalls, both use barriers like s_waitcnt. It's just part of static scheduling info on Nvidia.
dragontamer5788* GPUs are all 32-wide or above from a programming model. That means even the smallest of reasonable workloads is 32-wide. If not, no one would even bother with the whole shader / DirectX / HLSL crap and just program the stuff on your big, beefy CPU already optimized for single-threaded data. Additional CUDA blocks or OpenCL workgroups can provide additional latency hiding, but it should be noted that a single-warp / single-wavefront (with a good enough compiler and enough Instruction-level-parallelism at the C/C++ level) can in fact hide latency.

As such, occupancy doesn't really mean much unless we know exactly the code that was being run. In fact, higher occupancy may be a bad thing: GPUs have to share their vGPRs / registers with other "occupants". So going occupancy 2, or occupancy 4+ means having 1/2 or 1/4th the registers. So the optimization of code (registers vs occupancy) is a very complicated problem.
Yes that's true, you can get ILP within a single wavefront/warp through nonblocking loads and pipelined execution. Also Intel Gen is unique here - it can use variable SIMD widths from 4-wide to 32-wide.
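As a CPU-side illustration of that point (not GPU code — the function name `chase2` is mine, not from the actual test), two pointer chases that never depend on each other let nonblocking loads overlap, which is exactly the single-thread ILP being described:

```c
/* chase2 (hypothetical name): two independent pointer chases. Chain A's
   loads never depend on chain B's results, so hardware with nonblocking
   loads can keep both in flight at once -- two chains take roughly the
   same wall time as one, hiding latency inside a single thread/wavefront. */
unsigned chase2(const unsigned *next, unsigned a, unsigned b, unsigned steps)
{
    for (unsigned i = 0; i < steps; i++) {
        a = next[a]; /* chain A depends only on the previous a */
        b = next[b]; /* chain B is independent of chain A */
    }
    return a ^ b;    /* combine results so neither chain is dead code */
}
```

A compiler can schedule both loads per iteration back to back, since neither result feeds the other's address until the next iteration.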

I got achieved occupancy through Nsight. If occupancy is limited by vGPR or shmem (LDS) pressure, I'd just see lower occupancy. That may be why Unigine Heaven gets lower occupancy at 4K than at 1080p, even though there should be more parallelism available at 4K.
Posted on Reply
#61
Minus Infinity
No doubt Lovelace will have a big cache after seeing how well infinity cache did on RDNA2.
Posted on Reply
#62
dragontamer5788
chlamchowder16 MB worries me a bit, since Vega 64 has a 4 MB L2.
Agreed. But my code fails to compile for some reason with 128MB, lol. So I just did a test at 16MB and didn't feel like debugging. It's imperfect, but at least I got ya an answer quicker :)
That seems to be more of a problem with Sattolo's algorithm than using a fixed stride, judging by how the GTX 980 Ti saw a slower jump in latency after going past 3 MB:
Yeah. Agreed.

-------

I'm a bit curious about the slowdown at 2kB.

I would guess that the slowdown of Sattolo's at 2kB is because maybe... the code-cache may share some cache with the data-cache? So maybe the 4kB region is data+code?? Just guessing here.
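For reference, the pattern-generation step being discussed — Sattolo's algorithm — can be sketched in plain C like this (function names are mine, not from the actual test code):

```c
#include <stdlib.h>

/* Sattolo's algorithm: like Fisher-Yates, but j is drawn from [0, i)
   instead of [0, i], which guarantees the permutation is one single
   cycle. A pointer chase over it touches every element exactly once
   before wrapping, and the random order defeats hardware prefetchers
   in a way a fixed stride does not. */
void sattolo(unsigned *next, unsigned n)
{
    for (unsigned i = 0; i < n; i++)
        next[i] = i;                  /* start from the identity */
    for (unsigned i = n - 1; i > 0; i--) {
        unsigned j = rand() % i;      /* j < i, never j == i */
        unsigned tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }
}

/* Follow the chain from `start` until it comes back around. */
unsigned cycle_length(const unsigned *next, unsigned start)
{
    unsigned len = 1;
    for (unsigned p = next[start]; p != start; p = next[p])
        len++;
    return len;
}
```

Because the result is a single cycle, `cycle_length` from any starting index equals the array size — that's the property the latency test relies on.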
Posted on Reply
#63
chlamchowder
dragontamer5788Agreed. But my code fails to compile for some reason with 128MB, lol. So I just did a test at 16MB and didn't feel like debugging. It's imperfect, but at least I got ya an answer quicker :)
Did you statically allocate memory? You could try playing with mcmodel if you're using gcc. Or just do malloc/free.
dragontamer5788Yeah. Agreed.

-------

I'm a bit curious about the slowdown at 2kB.

I would guess that the slowdown of Sattolo's at 2kB is because maybe... the code-cache may share some cache with the data-cache? So maybe the 4kB region is data+code?? Just guessing here.
2 KB? I didn't test with any size smaller than 8 KB because no GPU has a L1 cache smaller than that. The slightly higher latency at 8 KB is probably noise.

I don't think code and data share the same cache until L2. On Nvidia, constant data and code share a L1.5 cache, which is 32 KB per SM on GP104 according to 1903.07486.pdf (arxiv.org), but it looks like OpenCL sees none of that (thanks NV). That's also a fun read by the way. They say GP104 L2 hit latency is ~216c using CUDA, handwritten SASS, and %%clock for timing. I get ~266c from my higher level OpenCL test.
Posted on Reply
#64
dragontamer5788
chlamchowderDid you statically allocate memory? You could try playing with mcmodel if you're using gcc. Or just do malloc/free.
Statically allocated, and then had "* 64" scattered all over the place to handle the scenario where 64 threads (erm... work-items?) work in parallel. Twas a magic number, so I didn't feel like double-checking all the "* 64"s all over my code to figure out which ones were necessary to change.

Switching it to a malloc/free probably would have worked. But alas, it's late and time for bed, lol. Needless to say, the code is messy and primarily geared at the x64 vs x1 scenario (which is what I was most interested in: how the memory bandwidth / latency scaled with higher work-item counts).

Anyway, maybe I'll try that tomorrow.
chlamchowder2 KB? I didn't test with any size smaller than 8 KB because no GPU has a L1 cache smaller than that. The slightly higher latency at 8 KB is probably noise.


Ah, I see my mistake. That's 2 MB vs 4 MB. Gotcha. I thought it was 2 kB and 4 kB there, but I guess I didn't read the "MB" correctly.

So yeah, the 3MB test-point, where Sattolo's method is slightly slower. Maybe that's instruction-cache taking up some space? I dunno what else it could be.
Posted on Reply
#65
Jism
john_This probably shows AMD's better experience with caches, considering that their main business is CPUs. On the other hand it shows how much faster architecture Nvidia's is, that even with higher cache latencies it performs better.
Wrong. But anyway: Nvidia uses texture compression to probably compensate for the slightly higher latency.
Posted on Reply
#66
mtcn77
JismWrong. But anyway: Nvidia uses texture compression to probably compensate for the slightly higher latency.
I see an Indian if I have ever seen one. :)
I agree with you. Nvidia's texture cache isn't so easily harnessed, even developers ask around for that.
Posted on Reply
#67
chlamchowder
mtcn77I see an Indian if I have ever seen one. :)
I agree with you. Nvidia's texture cache isn't so easily harnessed, even developers ask around for that.
Those two aren't directly related. Texture compression saves memory bandwidth, but doesn't change latency (and might increase it because compression/decompression may take extra cycles).

The L1 texture cache on Kepler/Maxwell/Pascal not showing up in my tests is a different issue. You have to opt-in to use it in CUDA by passing "-Xptxas -dlcm=ca" (without quotes) on the nvcc command line. I tried for a while to make it work in CUDA with zero success. I got zero L1 tex cache hits even with those parameters passed to nvcc. And there's no way to use the L1 tex cache at all from OpenCL.
Posted on Reply
#68
Jism
The texture compression compensates for the slightly higher latency.
Posted on Reply
#69
Vya Domus
JismWrong. But anyway: Nvidia uses texture compression to probably compensate for the slightly higher latency.
You mean color compression ? Every GPU uses color compression these days, even the one in your phone.
Posted on Reply
#70
Doc-J
SteevoSmoothness

www.techpowerup.com/review/amd-radeon-rx-6900-xt/39.html

See the effects of the Infinity Cache in the charts, 3090 & 6900 trade for FPS, but the 6900 has consistently higher frame rates and fewer low FPS frames, which equates to less laggy feeling, IE... smoothness
Yes, a "placebo" effect: 2 of 10 games run more smoothly on the 6900, while the other 8 run equally well or better on the RTX 3000 cards (including games that are heavy to render, like RDR2).
;)
1d10tThis explains why RX 6800 series is a serious competitor at 1080p and up to 1440p, even though the Ampere has a much wider GDDR6x memory bandwidth. Oh and some YouTubers have also said that playing on the RX 6800 is smoother, so there another perks you can't measure.

All information from the "paid" Linus channel is useless....
Just watching his face when he reviewed the RTX 3000 cards said it all XD
Posted on Reply
#71
Kayotikz
1d10tAgain, have you watched the video? There's also a frame counter in the top right corner. Here's link to save your time Linus
........

Do you need someone to comprehend a sentence for you?

FPS ≠ Frametime

If you need me to break it down for you even more..... You should probably pick up a comprehension class
Posted on Reply
#72
mtcn77
chlamchowderThose two aren't directly related. Texture compression saves memory bandwidth, but doesn't change latency (and might increase it because compression/decompression may take extra cycles).

The L1 texture cache on Kepler/Maxwell/Pascal not showing up in my tests is a different issue. You have to opt-in to use it in CUDA by passing "-Xptxas -dlcm=ca" (without quotes) on the nvcc command line. I tried for a while to make it work in CUDA with zero success. I got zero L1 tex cache hits even with those parameters passed to nvcc. And there's no way to use the L1 tex cache at all from OpenCL.
Thanks for your feedback. I knew it to be true, because it isn't the first time a developer has had difficulty with the texture cache - which is the L0 - and it is better to be open about it than to assume, like other members here, that the results are final. They are not, and there is nothing wrong with that, since we are only trying to cast light on the matter, not aiming at unsubstantiated fanfare.
Been there, done that for others who consider AMD the best. AMD has a long way to go to put themselves in the lead. They have a cross-license with Nvidia, which means they have access to anything Nvidia has put forward in the past. Likewise, AMD will grant Nvidia access to their own innovations in the future.
First we have to see AMD take the lead, and then stay there as Nvidia starts to pick up the spoils of AMD's success. I wouldn't say they didn't do much, but a lot has come from their Nvidia partners, which discounts the influence, imo.
Vya DomusYou mean color compression ? Every GPU uses color compression these days, even the one in your phone.
It is different. You can use compression to target buffered writes, which reduces bandwidth demand - or improves bandwidth utilisation to the memory devices, however you want to look at it. It cannot compress random color data, but for static bits of color information, the ROP units can churn through more pixels.
The difference with the way Nvidia does it is that, up until recently - I mean until Vega debuted - they were the first with tile-based rendering and buffered memory access. When accesses to memory are buffered, direct "immediate mode" access is gone: the GPU won't access memory directly and instead accesses the L2. You have to store such data in buffers in order to send it to memory in packets, and that's where the texture cache comes in. The L0 buffers textures, then sends them over to the L2; the ROP units then issue writes to memory, always with the same buffering protocol, so there is no sparse access. The code bubbles - or rasterizer bubbles, I should say - don't cause memory access bubbles, since the interface is detached from the execution units and only accepts fully buffered orders.
Not an engineer, this is the best I can make of it.
Posted on Reply
#73
RH92
1d10tHave you watched the video? It's called placebo effect, have you invented a tool to measure it?
:shadedshu: Have you read my response? I'm not talking about the placebo effect (which describes subjective perception), I'm talking about the objective perception of "smoothness" in games, which can be accurately extrapolated from frametime measurements! There is a reason all major reviewers have integrated frametime measurements into their reviews.........

Posted on Reply
#74
Chrispy_
medi01You are comparing the wrong things.

Mem bandwidth on 5700XT is 448GB/s
Mem bandwidth on 6700XT is just 384GB/s

See what that cache is doing?
Not really, it's been proven by a whole army of undervolters and underclockers (of which I'm one) that the 5700XT has far more bandwidth than it can use.

To put it into perspective, the 5600XT has a 25% reduction in memory bandwidth compared to the vanilla 5700 (most of them use the same 14Gbps GDDR6 as the 5700), and it also shares the same 2304-shader configuration. Yet, despite similar game clocks, it is within 5% of the 5700's performance, and the gap only shows at 4K, where the GPU doesn't have enough shader strength anyway - it struggled at 4K even in the 2019 games of its day. So the extra bandwidth of the 5700 and 5700XT is academic in pretty much any game, past and present. You could double it to 900GB/s and the 5700XT still wouldn't be any faster.

So let's be clear: if you underclocked the VRAM on a 5700XT to bring its bandwidth down to the 384GB/s of the 6700XT, I'd bet money on the performance drop being almost undetectable - within margin of error, most likely. The 448GB/s of the 5700XT is a paper spec that only rears its head in synthetic VRAM bandwidth tests or Ethereum mining, where bandwidth is all that matters (and miners would turn the GPU core clocks down to 1MHz to save power if that were even possible).
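For reference, the bandwidth figures in this thread all follow from the usual formula: per-pin data rate times bus width, divided by 8 bits per byte. A trivial sketch (the function name is mine):

```c
/* Peak memory bandwidth in GB/s: per-pin data rate in Gbps times
   bus width in bits, divided by 8 bits per byte. */
double mem_bw_gbs(double gbps_per_pin, int bus_bits)
{
    return gbps_per_pin * (double)bus_bits / 8.0;
}
```

14 Gbps x 256-bit gives the 5700XT's 448GB/s, 16 Gbps x 192-bit gives the 6700XT's 384GB/s, and 14 Gbps x 192-bit gives 336GB/s - the ~25% cut versus 448GB/s that the 5600XT comparison above refers to.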
ColddeckedBut there must be some architectural difference that allows RDNA2 to clock that much higher at the same voltage. It can't just be AMD just got that good at 7nm, can it?
I'm not a GPU architect, but from having read several interviews with GPU architects and deep dives into architectures, it's clear that you can design a more efficient architecture that does more at a given clockspeed - this is what drives architectural progress - and you can tweak previous-generation layouts with the hindsight of knowing which areas limited clockspeeds in that generation.

One is a whole new architecture designed to be better than the old one; the other is a band-aid that refines the old architecture by picking the low-hanging fruit. I'm not saying either approach is bad, but you can only take the band-aid approach once: if you tweak the architecture properly to optimise the layout based on previous-gen experience, there's no low-hanging fruit left for a second go-around.
Posted on Reply
#75
medi01
Chrispy_Not really, it's been proven by a whole army of undervolters and underclockers (of which I'm one) that the 5700XT has far more bandwidth than it can use.
Given that it was an "army", I would want to see a test showing that dropping the bandwidth to 384GB/s does not drop performance.

A single one from a source at least remotely reputable would do.
Chrispy_I'd bet money on
You started with "it does not", and it appears you are assuming it with high confidence.
High confidence is cool, but let's not mix facts and fiction, shall we?
Posted on Reply