
GPU Memory Latency Tested on AMD's RDNA 2 and NVIDIA's Ampere Architecture

AleksandarK
News Editor
Graphics cards have evolved over the years to feature multi-level cache hierarchies. These levels of cache are engineered to bridge the gap between memory and compute, a growing problem that cripples GPU performance in many applications. Different GPU vendors, such as AMD and NVIDIA, use different sizes of register files and L1 and L2 caches, depending on the architecture. For example, NVIDIA's A100 GPU carries 40 MB of L2 cache, roughly seven times more than the previous-generation V100. That alone shows how much modern applications benefit from larger caches, and the demand keeps growing.

Today, we have an interesting report coming from Chips and Cheese. The website decided to measure the GPU memory latency of the latest generation of cards - AMD's RDNA 2 and NVIDIA's Ampere. Using simple pointer-chasing tests in OpenCL, they arrived at some interesting results. RDNA 2 cache is fast and massive. Compared to Ampere, cache latency is much lower, while VRAM latency is about the same. NVIDIA uses a two-level cache system consisting of L1 and L2, which seems to be a rather slow solution: a request leaving an Ampere SM, which holds the L1 cache, and going out to the L2 takes over 100 ns.
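For readers curious how such a measurement works, below is a minimal sketch of a pointer-chasing kernel in OpenCL C. This is an illustration of the general technique, not the actual Chips and Cheese code: the test buffer holds a pseudo-random permutation of indices, so every load's address depends on the result of the previous load, which serializes the accesses and makes the run time a direct measure of latency.

Code:
// Hypothetical pointer-chasing latency kernel (OpenCL C), for illustration only.
// "chain" holds a pseudo-random permutation of indices, so each load depends
// on the previous one and cannot be overlapped or prefetched.
__kernel void pointer_chase(__global const uint *chain,
                            __global uint *result,
                            const uint iterations)
{
    uint idx = 0;
    for (uint i = 0; i < iterations; ++i) {
        idx = chain[idx];   // the next address comes from the current load
    }
    *result = idx;          // store the result so the loop is not optimized away
}

The host side would launch a single work-item, time the kernel for buffer sizes ranging from a few KB (fitting in L0/L1) up to hundreds of MB (spilling to VRAM), and divide the elapsed time by the iteration count to get an approximate latency per access at each size.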



AMD, on the other hand, has a three-level cache system: L0, L1, and L2 levels complement the RDNA 2 design. The latency from L0 to L2, even with L1 in between, is just 66 ns. Infinity Cache, which is essentially an L3 cache, adds only about 20 ns of additional latency, keeping it faster than NVIDIA's cache solution. NVIDIA's massive GA102 die also appears to work against it: requests spend many cycles just traversing the die to reach the L2 cache. You can read more about the test here.

View at TechPowerUp Main Site
 
The slow uptick between cache levels on RDNA2 is interesting. While Ampere's cache levels are quite clearly distinguished, the RDNA2 graph is much smoother, including Infinity Cache past 32 MB.
 
This probably shows AMD's better experience with caches, considering that their main business is CPUs. On the other hand, it shows how much faster Nvidia's architecture is, since even with higher cache latencies it performs better.
 
This probably shows AMD's better experience with caches, considering that their main business is CPUs. On the other hand, it shows how much faster Nvidia's architecture is, since even with higher cache latencies it performs better.
Does it perform better across the board in every game? What GPUs are you comparing, out of curiosity?
 
AMD should thank TSMC a lot for letting them fit that much cache into such a small space.
Using cache is, in general, the lazy man's way of solving things.
 
So Ampere is a compute/bandwidth monster and RDNA2 is a latency monster; in the end, whichever solution grabs the most market share will be the winner.
 
This probably shows AMD's better experience with caches, considering that their main business is CPUs. On the other hand, it shows how much faster Nvidia's architecture is, since even with higher cache latencies it performs better.
How does it actually show that the Ampere architecture is much faster? Care to elaborate on how big an impact latency has on GPU performance?
BTW, Nvidia has higher bandwidth than AMD, and in the high end (GA102) it's significantly higher, but you ignore this.

So Ampere is a compute/bandwidth monster and RDNA2 is a latency monster; in the end, whichever solution grabs the most market share will be the winner.
I think Nvidia adding FP32 functionality to its INT units is a pretty good idea. Although I don't know how many transistors or how much power it cost, gaming performance increased by ~25%, and then there is the advantage in compute workloads. I wouldn't mind if AMD did the same thing.

AMD should thank TSMC a lot for letting them fit that much cache into such a small space.
Using cache is, in general, the lazy man's way of solving things.
For CPUs, yes, but for GPUs what better alternative do we have? Super-expensive HBM2, or expensive GDDR6X with a wider memory controller?
So you can't really say Infinity Cache was a bad move. I just wonder if a smaller one (1/2 or 1/3 smaller) wouldn't be a good enough option, because honestly IC uses up a lot of space.
 
I just wonder if a smaller one (1/2 or 1/3 smaller) wouldn't be a good enough option, because honestly IC uses up a lot of space.
128MB is not that much when it comes to caching for 16GB of VRAM.
Assuming the die shots in AMD's presentation are somewhat accurate, Infinity Cache is 15% of the Navi 21 die.
 
This probably shows AMD's better experience with caches, considering that their main business is CPUs. On the other hand, it shows how much faster Nvidia's architecture is, since even with higher cache latencies it performs better.

If only GPU architecture were as simple as a single factor determining performance.

How does it actually show that the Ampere architecture is much faster? Care to elaborate on how big an impact latency has on GPU performance?
BTW, Nvidia has higher bandwidth than AMD, and in the high end (GA102) it's significantly higher, but you ignore this.


I think Nvidia adding FP32 functionality to its INT units is a pretty good idea. Although I don't know how many transistors or how much power it cost, gaming performance increased by ~25%, and then there is the advantage in compute workloads. I wouldn't mind if AMD did the same thing.


For CPUs, yes, but for GPUs what better alternative do we have? Super-expensive HBM2, or expensive GDDR6X with a wider memory controller?
So you can't really say Infinity Cache was a bad move. I just wonder if a smaller one (1/2 or 1/3 smaller) wouldn't be a good enough option, because honestly IC uses up a lot of space.

It doesn't. The guy is just making an assumption, and an incorrect one at that.
 
128MB is not that much when it comes to caching for 16GB of VRAM.
Assuming the die shots in AMD's presentation are somewhat accurate, Infinity Cache is 15% of the Navi 21 die.
I think it was mentioned somewhere that it was ~20%. 20% of 520 mm² is 104 mm², and that's not a small number if we take into account that the space could have been used for more CUs, for example. BTW, one RDNA1 WGP (2x CU) is only 4.1 mm², so I think an RDNA2 WGP could be 5 mm² at most, so by halving Infinity Cache and saving 52 mm² you could put 25% more CUs into N21. It would be great if we could somehow disable part of the IC and see what kind of effect it has on performance.
 
The slow uptick between cache levels on RDNA2 is interesting. While Ampere's cache levels are quite clearly distinguished, the RDNA2 graph is much smoother, including Infinity Cache past 32 MB.
Yes, that's interesting. The gradual increase above 4MB could indicate that the L3 cache is sectioned (with one part belonging to each memory controller?), and access time increases significantly when a CU needs to access data in a "distant" section. The gradual increase up to 4MB could mean that L2 is split into sections too, again with varying access time.
 
If we take into account that the space could have been used for more CUs, for example
AMD made it clear in press briefings that, given their power and thermal goals, the L3 cache was the better option.
 
128MB is not that much when it comes to caching for 16GB of VRAM.
Assuming the die shots in AMD's presentation are somewhat accurate, Infinity Cache is 15% of the Navi 21 die.
It's plenty. By that logic, the 64GB of memory in my laptop is gimped by the 16MB of cache on my CPU. It's not about the amount, it's about the hit ratio. Also the cache uses less power, so sure, you could replace it with CUs, but that's also more compute with more memory latency and more heat. That doesn't sound like a winning combo compared to what AMD has now.
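To put the hit-ratio point into numbers, here is a back-of-the-envelope sketch using the standard average-memory-access-time formula. The latencies and hit rates in it are hypothetical placeholders, not measured values:

Code:
/* Hypothetical sketch: average memory access time (AMAT) as a function of
   last-level cache hit rate. All numbers are illustrative placeholders. */
#include <stdio.h>

int main(void)
{
    const double cache_ns = 90.0;   /* assumed last-level cache latency */
    const double vram_ns  = 250.0;  /* assumed VRAM latency             */

    for (double hit = 0.0; hit <= 1.0; hit += 0.25) {
        double amat = hit * cache_ns + (1.0 - hit) * vram_ns;
        printf("hit rate %.2f -> average latency %.1f ns\n", hit, amat);
    }
    return 0;
}

The point is that even a moderate hit rate in a 128 MB cache shaves a large chunk off the average trip to memory, which is why the cache-to-VRAM size ratio matters less than how often the working set actually fits.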
 
AMD should thank TSMC a lot for letting them fit that much cache into such a small space.
Using cache is, in general, the lazy man's way of solving things.
This is just a stepping stone for when they go chiplet. They need a fast cache so they don't need to access the VRAM often when it's across an I/O die.
 
This explains why the RX 6800 series is a serious competitor at 1080p and up to 1440p, even though Ampere has much higher GDDR6X memory bandwidth. Oh, and some YouTubers have also said that playing on the RX 6800 is smoother, so there are other perks you can't measure.

 
This explains why the RX 6800 series is a serious competitor at 1080p and up to 1440p, even though Ampere has much higher GDDR6X memory bandwidth. Oh, and some YouTubers have also said that playing on the RX 6800 is smoother, so there are other perks you can't measure.

"Smoothness" of a game can be measured with frametimes; there is nothing magic about it that can't be measured!
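For anyone wondering what that measurement looks like in practice, here is a minimal sketch (with made-up frametime values, not real benchmark data) that turns a list of frametimes into the usual smoothness metrics: average FPS and 1% low FPS.

Code:
/* Hypothetical sketch: average FPS and 1% low FPS from a frametime log.
   The sample values below are invented for illustration. */
#include <stdio.h>
#include <stdlib.h>

static int cmp_desc(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x < y) - (x > y);   /* sort frametimes from slowest to fastest */
}

int main(void)
{
    double frametimes_ms[] = { 8.3, 8.5, 8.4, 16.9, 8.6, 8.3, 8.4, 25.1, 8.5, 8.4 };
    size_t n = sizeof frametimes_ms / sizeof frametimes_ms[0];

    double total = 0.0;
    for (size_t i = 0; i < n; ++i)
        total += frametimes_ms[i];
    double avg_fps = 1000.0 * n / total;

    qsort(frametimes_ms, n, sizeof frametimes_ms[0], cmp_desc);
    size_t worst = (n / 100 > 0) ? n / 100 : 1;   /* worst 1% of frames */
    double worst_total = 0.0;
    for (size_t i = 0; i < worst; ++i)
        worst_total += frametimes_ms[i];
    double low1_fps = 1000.0 * worst / worst_total;

    printf("average FPS: %.1f, 1%% low FPS: %.1f\n", avg_fps, low1_fps);
    return 0;
}

A card with the same average FPS but occasional long frames will show much worse 1% lows, which is exactly the "smoothness" difference people describe.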
 
It's plenty. By that logic, the 64GB of memory in my laptop is gimped by the 16MB of cache on my CPU. It's not about the amount, it's about the hit ratio. Also the cache uses less power, so sure, you could replace it with CUs, but that's also more compute with more memory latency and more heat. That doesn't sound like a winning combo compared to what AMD has now.
I think he's talking about how much space it takes on the chip.
 
"Smoothness" of a game can be measured with frametimes; there is nothing magic about it that can't be measured!

Have you watched the video? It's called the placebo effect; have you invented a tool to measure it?
 
Have you watched the video? It's called the placebo effect; have you invented a tool to measure it?
Yeah, it's called frametime measurement.

This probably shows AMD's better experience with caches, considering that their main business is CPUs. On the other hand, it shows how much faster Nvidia's architecture is, since even with higher cache latencies it performs better.
Faster? I mean, outside of ray tracing, the 3080 loses to the 6900 XT and 6800 XT at 1440p, but wins at 4K. Nvidia also requires significantly more power to do so. I know, Samsung 8 nm vs TSMC 7 nm, but we've seen what happens when Nvidia's arch is way ahead of AMD's, as in the Maxwell vs GCN era. Even if you look at SM count instead of core count, the 3090 and 6900 XT are not that different.
 
Again, have you watched the video? There's also a frame counter in the top right corner. Here's a link to save your time: Linus
Again, you miss the point. "Smoother" is a descriptor that can be measured. If it's a benefit, then surely you can link some evidence of benchmarks showing AMD has better frametimes, yeah?
 
Again, you miss the point. "Smoother" is a descriptor that can be measured. If it's a benefit, then surely you can link some evidence of benchmarks showing AMD has better frametimes, yeah?

"Smooth" is an adjective, not a noun, and has no metrics associated with it. I don't need to prove anything because I have already presented a topic for debate.
 
128MB is not that much when it comes to caching for 16GB of VRAM.

For GPUs it is a ludicrous amount of cache. Just a few years ago you were looking at less than 1 KB of cache, combined across all levels, per thread in a GPU. Now that amount has gone up by at least an order of magnitude.
 
AMD made it clear in press briefings that, given their power and thermal goals, the L3 cache was the better option.
Wasn't that statement about the actual use of IC?
I never said to get rid of the whole IC, which was clearly stated in my post. What I wanted was to halve it (64 MB instead of 128 MB), and the saved-up space would be used for more CUs. BTW, I would love to see a performance penalty graph for a smaller IC, to know if that much cache is really needed or if it can be smaller.
 
"Smooth" is an adjective, not a noun, and has no metrics associated with it. I don't need to prove anything because I have already presented a topic for debate.
You presented an opinion, an opinion that is objectively incorrect. You presented the argument; if you can't prove your argument, then all you are doing is shitting up the thread. "Smoothness" IS a noun, per Oxford's learner's dictionary, and can be measured via frametime measurement.

Oxford: https://www.oxfordlearnersdictionaries.com/us/definition/english/smoothness#:~:text=smoothness-,noun,any rough areas or holes

I can present a new topic for debate too: "Does 1d10t live up to his username?".
 