
GPU Memory Latency Tested on AMD's RDNA 2 and NVIDIA's Ampere Architecture

Joined
Jul 9, 2015
Messages
3,413 (1.06/day)
System Name M3401 notebook
Processor 5600H
Motherboard NA
Memory 16GB
Video Card(s) 3050
Storage 500GB SSD
Display(s) 14" OLED screen of the laptop
Software Windows 10
Benchmark Scores 3050 scores good 15-20% lower than average, despite ASUS's claims that it has uber cooling.
Not really, it's been proven by a whole army of undervolters and underclockers (of which I'm one) that the 5700XT has far more bandwidth than it can use.
Given that it was an "army", I would want to see a test showing that dropping bandwidth to 384 GB/s does not drop performance.

A single one from a source at least remotely reputable would do.

I'd bet money on
You started with "it does not" and it appears you are assuming it with high confidence.
High confidence is cool, but let's not mix facts and fictions, shall we?
 
Joined
Jun 3, 2010
Messages
2,540 (0.50/day)
Given that it was an "army", I would want to see a test showing that dropping bandwidth to 384 GB/s does not drop performance.
I did these studies back in the day between the 900 series and the 290 series. The funny thing was, the 900 series was utilizing its bandwidth better at lower resolutions, and the 290 series couldn't catch a break at higher resolutions because the 900 series was actually rendering the scene considerably faster, since it takes less time to service memory requests from L0 and L2 than from main memory.
TL;DR: GPUs use their bandwidth differently now. Reducing it won't slow access unless the GPU is working on two kernels at once, such as graphics + compute in a heterogeneous compute setup. What bounds performance these days isn't the bandwidth, but being cache strapped.
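(Aside: to put rough numbers on "closer cache is faster", here's a minimal average-memory-access-time sketch in Python. Every latency and hit rate in it is a made-up placeholder for illustration, not a measured figure for any real GPU.)

Code:
# Hypothetical average memory access time (AMAT) for a simple three-level hierarchy.
# All latencies (ns) and hit rates below are illustrative placeholders, NOT
# measured values for any real GPU.

def amat(l0_ns, l2_ns, dram_ns, l0_hit, l2_hit):
    """Expected latency of one access: L0 is always paid, deeper levels only on misses."""
    l0_miss = 1.0 - l0_hit
    l2_miss = 1.0 - l2_hit
    return l0_ns + l0_miss * (l2_ns + l2_miss * dram_ns)

# Example: 20 ns L0, 80 ns L2, 250 ns DRAM, with 70% / 60% hit rates (assumed).
print(f"AMAT ~ {amat(20, 80, 250, 0.70, 0.60):.1f} ns")
# A better L2 hit rate helps latency-bound work far more than extra DRAM
# bandwidth ever could, which is the "cache strapped" point above.
print(f"AMAT ~ {amat(20, 80, 250, 0.70, 0.90):.1f} ns")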
 
Joined
Sep 28, 2012
Messages
963 (0.23/day)
System Name Poor Man's PC
Processor AMD Ryzen 5 7500F
Motherboard MSI B650M Mortar WiFi
Cooling ID Cooling SE 206 XT
Memory 32GB GSkill Flare X5 DDR5 6000Mhz
Video Card(s) Sapphire Pulse RX 6800 XT
Storage XPG Gammix S70 Blade 2TB + 8 TB WD Ultrastar DC HC320
Display(s) Mi Gaming Curved 3440x1440 144Hz
Case Cougar MG120-G
Audio Device(s) MPow Air Wireless + Mi Soundbar
Power Supply Enermax Revolution DF 650W Gold
Mouse Logitech MX Anywhere 3
Keyboard Logitech Pro X + Kailh box heavy pale blue switch + Durock stabilizers
VR HMD Meta Quest 2
Benchmark Scores Who need bench when everything already fast?
All information from the "paid" Linus channel is useless....
Just seeing his face when reviewing the RTX 3000 cards said it all XD

So Linus runs a paid channel, who doesn't?

........

Do you need someone to comprehend a sentence for you?

FPS ≠ Frametime

If you need me to break it down for you even more..... You should probably pick up a comprehension class

Great, would you help me with that? I'm clueless.

And my original post was:
:shadedshu: Have you read my response? I'm not talking about the placebo effect (which describes subjective perception), I'm talking about the objective perception of "smoothness" in games, which can be accurately extrapolated from frametime measurements! There is a reason all major reviewers have integrated frametime measurements into their reviews.........


Then why reply to my comments? I clearly described that as a placebo effect in terms of smoothness, and you just took a specific word out of context.
 
Joined
Feb 20, 2019
Messages
7,304 (3.86/day)
System Name Bragging Rights
Processor Atom Z3735F 1.33GHz
Motherboard It has no markings but it's green
Cooling No, it's a 2.2W processor
Memory 2GB DDR3L-1333
Video Card(s) Gen7 Intel HD (4EU @ 311MHz)
Storage 32GB eMMC and 128GB Sandisk Extreme U3
Display(s) 10" IPS 1280x800 60Hz
Case Veddha T2
Audio Device(s) Apparently, yes
Power Supply Samsung 18W 5V fast-charger
Mouse MX Anywhere 2
Keyboard Logitech MX Keys (not Cherry MX at all)
VR HMD Samsung Oddyssey, not that I'd plug it into this though....
Software W10 21H1, barely
Benchmark Scores I once clocked a Celeron-300A to 564MHz on an Abit BE6 and it scored over 9000.
Given that it was an "army", I would want to see a test showing that dropping bandwidth to 384 GB/s does not drop performance.

A single one from a source at least remotely reputable would do.


You started with "it does not" and it appears you are assuming it with high confidence.
High confidence is cool, but let's not mix facts and fictions, shall we?

There is no reputable source; it's an army of users tweaking their cards, not professional testers. Reddit, these forums, other forums. Underclocking the VRAM is almost pointless because it doesn't let you drop the voltage by enough to really matter, though I've read plenty of instances where people chasing that last 1% flashed modded BIOSes with lowered memory clocks in an effort to hit the absolute limits of what undervolting is possible, with zero consequences on benchmark performance compared to the same card with faster RAM. I assumed the excess bandwidth of the 5700-series was just common knowledge at this point. The reason it gets a 256-bit bus is product segmentation: they wanted an 8GB card, not a 6GB card, and the only sensible way to do that was to use a (for the GPU's shader ability) overkill 256-bit bus; the alternative of a 128-bit bus with denser modules would probably have been too bandwidth-starved.

If you want facts, look at any review of a 14Gbps 5600XT vs a 14Gbps 5700 - they are essentially the same card/board/VRAM, just with the 5600XT limited to a 192-bit bus and 6GB of VRAM. If you cannot see that lopping off a quarter of the bandwidth makes negligible difference, then I'm not going to spoonfeed you. I hoped you'd have the intelligence to draw your own conclusions from that, which is exactly why I mentioned it in that earlier post.
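(For anyone who wants to sanity-check the arithmetic, theoretical GDDR6 bandwidth is just bus width divided by 8 times the data rate. A quick Python sketch; the bus widths and 14 Gbps figures are the public spec-sheet numbers for the cards mentioned above.)

Code:
# Theoretical GDDR6 bandwidth: (bus width in bits / 8) * data rate in Gbps = GB/s
def gddr6_bandwidth_gbs(bus_bits, data_rate_gbps):
    return bus_bits / 8 * data_rate_gbps

cards = {
    "RX 5700 / 5700 XT (256-bit, 14 Gbps)": (256, 14),
    "RX 5600 XT (192-bit, 14 Gbps)": (192, 14),
}
for name, (bus, rate) in cards.items():
    print(f"{name}: {gddr6_bandwidth_gbs(bus, rate):.0f} GB/s")

# Going from 256-bit to 192-bit at the same data rate lops off exactly a quarter:
print(f"Reduction: {1 - 192 / 256:.0%}")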
 
Joined
Jul 9, 2015
Messages
3,413 (1.06/day)
System Name M3401 notebook
Processor 5600H
Motherboard NA
Memory 16GB
Video Card(s) 3050
Storage 500GB SSD
Display(s) 14" OLED screen of the laptop
Software Windows 10
Benchmark Scores 3050 scores good 15-20% lower than average, despite ASUS's claims that it has uber cooling.
The idea that AMD wastes resources on fast VRAM and a wide memory bus that does nothing is too remarkable to take on the word of "random dudes on the internet".
 
Joined
Jun 3, 2010
Messages
2,540 (0.50/day)
The idea that AMD wastes resources on fast VRAM and a wide memory bus that does nothing is too remarkable to take on the word of "random dudes on the internet".
I bet you are the heterogeneous compute architecture expert, hi.
 
Joined
Feb 20, 2019
Messages
7,304 (3.86/day)
System Name Bragging Rights
Processor Atom Z3735F 1.33GHz
Motherboard It has no markings but it's green
Cooling No, it's a 2.2W processor
Memory 2GB DDR3L-1333
Video Card(s) Gen7 Intel HD (4EU @ 311MHz)
Storage 32GB eMMC and 128GB Sandisk Extreme U3
Display(s) 10" IPS 1280x800 60Hz
Case Veddha T2
Audio Device(s) Apparently, yes
Power Supply Samsung 18W 5V fast-charger
Mouse MX Anywhere 2
Keyboard Logitech MX Keys (not Cherry MX at all)
VR HMD Samsung Oddyssey, not that I'd plug it into this though....
Software W10 21H1, barely
Benchmark Scores I once clocked a Celeron-300A to 564MHz on an Abit BE6 and it scored over 9000.
The idea that AMD wastes resources on fast VRAM and a wide memory bus that does nothing is too remarkable to take on the word of "random dudes on the internet".
It's no more wasted than the 12GB GDDR6 on the 3060 12GB. Nvidia didn't want to use 12GB but they chose that rather than offering a 6GB card because nobody would pay the MSRP for a 6GB model that was already considered "lower spec" in 2017.

AMD were competing with 8GB cards (2070, 2060S) and wouldn't have been willing to release a 6GB/192-bit card at the price they were targeting.

This is GPU basics 101. Are you seriously asking these questions for real or just trolling?
 
Joined
Jun 3, 2010
Messages
2,540 (0.50/day)
Are you seriously asking these questions for real or just trolling?
He is not trolling; he just doesn't understand that the memory devices provide only a portion of the overall bandwidth now.

PS: I still chuckle at the individuals who said cache was the worst use of process gains. If it weren't for the availability of cache, these GPUs would slow down at the slightest memory overload...
 
Joined
Oct 12, 2005
Messages
682 (0.10/day)
The Radeon 5600 XT game clock per the TPU DB is 1375 MHz where the 5700 is 1625 MHz. As per TPU, the 5700 is 7% faster despite having 9.5% more fillrate and TFLOPS.

But the 5700 is the cut-down version of the RX 5700 XT, which has the same bandwidth but 10% more cores and 7.5% higher frequency. That card is 14% faster, so it's probably true that the 5700 non-XT has more memory bandwidth than it needs. For the 5700 XT, I'm not sure it's as clear-cut.

But anyway, whether or not the 5600 XT is memory starved has very little to do with the IPC comparison between the 5700 XT and the 6700 XT. The 6700 XT was downclocked to match the 5700 XT. Maybe with a 30% lower clock the 6700 XT wouldn't need the equivalent of a 256-bit bus with 14 Gbps memory, but that isn't the speed the card is running at right now.

Like I said, AMD engineers have simulators; they mostly know how multiple configurations perform before finalizing the layout. If they went with 96 MB of Infinity Cache + a 192-bit bus, it's probably because they found that it was enough. Also, raw bandwidth is only one part of the equation; power consumption is another. If they can spend less of the power budget on the memory, they can crank the clocks higher in the GPU cores.

AMD claims that an Infinity Cache access uses 1.3 pJ where a memory access uses between 7 and 8 pJ. That is a huge difference. The cache hit rate they claim is above 50% at 4K for the 128 MB block; it should be similar for 96 MB at 1440p.

In the end, chip design is never perfect; it's a balance. You balance so many factors in the hope of getting the best chip, and I strongly think the cache is here to stay. With MCM, die size might not be as big a deal as it is right now anyway; you will be able to put many smaller chips into a single package.
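(Taking the figures quoted above at face value, 1.3 pJ per Infinity Cache access, 7-8 pJ per DRAM access and a ~50% hit rate at 4K, here's a minimal sketch of the blended energy per access and the usual "effective bandwidth" framing. These are AMD's own numbers, not independent measurements, and 384 GB/s is simply what a 192-bit bus at 16 Gbps works out to.)

Code:
# Blended energy per access and "effective bandwidth" with a large last-level cache.
# The pJ figures and the ~50% hit rate are the ones quoted in the post above
# (AMD's own numbers), not independent measurements.

def blended_energy_pj(hit_rate, cache_pj=1.3, dram_pj=7.5):
    return hit_rate * cache_pj + (1 - hit_rate) * dram_pj

def effective_bandwidth_gbs(dram_gbs, hit_rate):
    # Every cache hit is traffic the GDDR6 bus never sees, so the external bus
    # only has to serve the misses.
    return dram_gbs / (1 - hit_rate)

for hit in (0.0, 0.5, 0.6):
    print(f"hit rate {hit:.0%}: {blended_energy_pj(hit):.2f} pJ/access, "
          f"~{effective_bandwidth_gbs(384, hit):.0f} GB/s effective on a 384 GB/s bus")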
 
Joined
Jun 3, 2010
Messages
2,540 (0.50/day)
With past Nvidia references I think that is right on the money, yes.
Although I don't get how it is a separate resource like the Xbox ESRAM. Could AMD make their own after the fact?
 
Joined
Dec 22, 2011
Messages
286 (0.06/day)
Processor Ryzen 7 5800X3D
Motherboard Asus Prime X570 Pro
Cooling Deepcool LS-720
Memory 32 GB (4x 8GB) DDR4-3600 CL16
Video Card(s) Gigabyte Radeon RX 6800 XT Gaming OC
Storage Samsung PM9A1 (980 Pro OEM) + 960 Evo NVMe SSD + 830 SATA SSD + Toshiba & WD HDD's
Display(s) Samsung C32HG70
Case Lian Li O11D Evo
Audio Device(s) Sound Blaster Zx
Power Supply Seasonic 750W Focus+ Platinum
Mouse Logitech G703 Lightspeed
Keyboard SteelSeries Apex Pro
Software Windows 11 Pro
As far as I know, Turing had 64 FP32 and 64 INT32 units per SM. Now those 64 INT32 units are capable of either INT32 or FP32, but the original CUDA cores were not (and are not) capable of INT32, as far as I know.
I don't know where you got that idea from; of course they've always supported INT(32). If you can do FP, you can do INT; it's the other way around that's not guaranteed.
Here's a quote from Fermi's whitepaper. Fermi was the first GPU with CUDA cores.
Figure 5. NVIDIA's Fermi GPU architecture consists of multiple streaming multiprocessors (SMs), each consisting of 32 cores, each of which can execute one floating point or integer instruction per clock.
The Turing INT units were essentially CUDA cores without FP capabilities, so Nvidia didn't count them as CUDA cores. Now that they haven't stripped the FP capabilities, they've started calling them CUDA cores again, because that's what they are.
 
Joined
Jul 9, 2015
Messages
3,413 (1.06/day)
System Name M3401 notebook
Processor 5600H
Motherboard NA
Memory 16GB
Video Card(s) 3050
Storage 500GB SSD
Display(s) 14" OLED screen of the laptop
Software Windows 10
Benchmark Scores 3050 scores good 15-20% lower than average, despite ASUS's claims that it has uber cooling.
Joined
Jun 3, 2010
Messages
2,540 (0.50/day)
So AMD couldn't have both 8GB and lower bandwidth.
Amazing stuff.
I had the same inclination as you, but then a local developer who had worked on UE4 game development assured me that HSA had nothing to do with brute force, but with deferred rendering pipeline management. They don't do it to execute two workloads simultaneously. Bandwidth doesn't matter anymore the way it does with forward rendering.
 
Joined
Feb 20, 2019
Messages
7,304 (3.86/day)
System Name Bragging Rights
Processor Atom Z3735F 1.33GHz
Motherboard It has no markings but it's green
Cooling No, it's a 2.2W processor
Memory 2GB DDR3L-1333
Video Card(s) Gen7 Intel HD (4EU @ 311MHz)
Storage 32GB eMMC and 128GB Sandisk Extreme U3
Display(s) 10" IPS 1280x800 60Hz
Case Veddha T2
Audio Device(s) Apparently, yes
Power Supply Samsung 18W 5V fast-charger
Mouse MX Anywhere 2
Keyboard Logitech MX Keys (not Cherry MX at all)
VR HMD Samsung Oddyssey, not that I'd plug it into this though....
Software W10 21H1, barely
Benchmark Scores I once clocked a Celeron-300A to 564MHz on an Abit BE6 and it scored over 9000.
So AMD couldn't have both 8GB and lower bandwidth.
Amazing stuff.
Correct, kind of. They could technically have made one but it would have been a disaster.

GDDR memory modules are made in power-of-two capacities, and they all currently have 32-bit access paths per BGA module. That means you get 6 modules on a 192-bit card, 8 on a 256-bit card, 12 on a 384-bit card, etc. Every recent experiment to deviate from that formula has caused problems - see the 550 Ti and 660 Ti, which only had full bandwidth to part of their memory, and a more severe example of that scenario was the infamous lawsuit Nvidia lost over "3.5GB of VRAM" on the GTX 970. Asymmetric memory layouts are bad, and only the VRAM that falls under symmetric access is realistically usable - so 3.5GB of the 4GB on the GTX 970, and only the symmetrically-addressed portion on the 550 Ti and 660 Ti.

The market and current gaming requirements determine what amount of VRAM is acceptable at any given price point.

AMD's choices for the 5700-series were:
  • Cheaper silicon/PCB with 192-bit memory and 6GB - VRAM-limited and hurts performance of their highest performance part in 2019. Within 6 months some games were already using over 6GB.
  • Cheaper silicon/PCB with 192-bit memory and 12GB - GDDR6 was pretty expensive, that would have added $50-100 to every card at retail.
  • Very cheap silicon/PCB with 128-bit memory and 8GB - not enough bandwidth, likely no faster than the 256-bit RX580/RX590 at higher resolutions, so a commercial failure as a premium product.
  • Expensive silicon/PCB with 256-bit memory and 8GB - what they went for. Competitive with Nvidia, not too expensive on GDDR6, and with enough capacity for all current games on the market.
Sure, the 192-bit, 6GB 5600XT has proved that even 192-bit is enough bandwidth for the 5700-series, but it's always a balancing act. The bandwidth and VRAM capacity need to match the capabilities of the GPU - there's no point putting 8GB of GDDR6 on a GeForce 710 because it'll never run anything at a high enough resolution to use that much capacity/bandwidth. Meanwhile, people are worried that 10GB isn't enough on the 3080 for future 4K gaming and today's 8K gaming, the same way people who bought expensive AMD Fury cards with only 4GB were burned by a lack of VRAM long before the GPU was too slow to run new games.
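(To make the power-of-two module constraint concrete, here's a small sketch that enumerates the capacities you can actually build from 32-bit GDDR6 packages in the two common densities, 8 Gb and 16 Gb per package. It's only meant to show why the realistic symmetric options were 6 GB or 12 GB on 192-bit and 8 GB or 16 GB on 256-bit.)

Code:
# Each GDDR6 package has a 32-bit interface, so package count = bus width / 32.
# Common package densities: 8 Gb (1 GB) and 16 Gb (2 GB).
MODULE_SIZES_GB = (1, 2)

for bus_bits in (128, 192, 256, 384):
    modules = bus_bits // 32
    capacities = [modules * size for size in MODULE_SIZES_GB]
    print(f"{bus_bits}-bit bus -> {modules} packages -> {capacities} GB (symmetric layouts only)")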
 
Joined
Apr 24, 2020
Messages
2,561 (1.75/day)
The bandwidth and VRAM capacity need to match the capabilities of the GPU - there's no point putting 8GB of GDDR6 on a Geforce 710 because it'll never manage to run anything at high enough resolutions to use that much capacity/bandwidth

Even more confusing: GPU workloads change over time. Hypothetically, some memory-hard cryptocoin algorithm could come out that would make a GeForce 710 with 8GB of GDDR6X the best machine ever. (Cryptocoin people seem to have fun making really weird algorithms with weird attributes.)

In more video-game-oriented tasks, there are various workloads with different compute and RAM requirements. For example, texture compression/decompression is heavy on compute but light on RAM bandwidth and VRAM capacity. Programmers have chosen to do texture compression as standard because GPU compute strength is growing faster than VRAM bandwidth and VRAM capacity. And particular targets (such as game consoles or the iPhone) have an outsized effect on the programming styles and decisions of video game programmers.
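(As a rough illustration of why compressed textures are such an easy bandwidth win: the bytes-per-texel figures below are the standard BC1/BC7 block rates, while the texel count and frame rate are invented numbers just to show the scale.)

Code:
# Texture read traffic at different storage formats.
# BC1 packs a 4x4 texel block into 8 bytes (0.5 B/texel); BC7 uses 16 bytes (1 B/texel);
# uncompressed RGBA8 is 4 B/texel. The workload below is an invented example.
BYTES_PER_TEXEL = {"RGBA8 (uncompressed)": 4.0, "BC7": 1.0, "BC1": 0.5}

texels_touched_per_frame = 50_000_000  # made-up number
fps = 60

for fmt, bpt in BYTES_PER_TEXEL.items():
    gb_per_s = texels_touched_per_frame * bpt * fps / 1e9
    print(f"{fmt}: ~{gb_per_s:.1f} GB/s of texture reads")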

So it's a moving target. The society of GPU programmers / video game programmers / artists creates workloads based on the machine's capabilities, and then the GPU designers create GPUs based on the workloads the programmers made. No one is really "in charge" of the future; it just sort of happens. Every now and then the GPU hardware designers say "here's how we'll do things moving forward" (ex: Mantle), but then the low-level software engineers still have to build something (ex: DirectX12) before it is adopted by the common shader/effects programmer.

---------

Every experiment to deviate from the formula in recent history has caused problems

XBox 360 had eDRAM that no one liked. Kinda different, but kinda the same.

There are lots of "good" designs, assuming the programmers are willing to follow the new design. But if the programmers don't want to, you end up with bad performance. I'm sure hardware engineers would love it if they could wave a magic wand and get all programmers to agree to a new methodology. But similarly, video game programmers want to wave a magic wand and get hardware engineers to build GPUs the way they're already programming, without any "funny business".
 
Joined
Oct 12, 2005
Messages
682 (0.10/day)
Again, the fact that the RX 5600 XT and the 5700 have different memory configurations but similar performance does not prove that the RX 5700 XT could have been fine with a 192-bit bus. The RX 5700 is a cut-down and lower-clocked 5700 XT.

But the guy is right: they could have kept 8 GB of VRAM with lower bandwidth by using 12 Gbps GDDR6 instead of 14 Gbps, and they would probably have saved some pennies there. You are missing that option, which they did not select.

I suspect that the RX 5700 XT requires that 256-bit bus and 14 Gbps memory, maybe not in every scenario but in some cases, and the 5700 is just the leftover silicon that can't qualify as a 5700 XT.

As for memory layout, he's talking about weird memory layouts like the 970's, not necessarily eDRAM. Anyway, in the current PC market AMD and Nvidia would shoot themselves in the foot if they required games to have very specific optimisations for their architecture to perform. Luckily, you could optimise for Infinity Cache, but you still get very good performance even if you don't do anything.
 
Joined
Jun 3, 2010
Messages
2,540 (0.50/day)
Every now and then, the GPU hardware designers say "here's how we'll do things moving forward" (ex: Mantle), but then the low level software engineers still have to build something (ex: DirectX12),
Even Mantle had nothing to do with forward rendering, even though AMD had much to do with its public perception as such. For it to be graphics plus another compute stream, it required tight manual controls and profiling to keep the execution mask in top condition, which none of the developers did.
The simple explanation is that the GPU doesn't operate at register speed; it just shuffles registers at the wavefront rate. So it becomes a matter of always masking in the execution resources that are due for execution. It heralded the discovery of scalarization, which became so critical that 4 iterative threads became obsolete and shortest execution latency became the new paradigm. Its inherent simplification of instruction scheduling is essentially RDNA's dual-issuing dual schedulers, imo. Now they don't have to fiddle with the execution mask to find wavefronts with an optimal workload, and there is no more instruction waiting due to the working set being wider than the wavefront capacity. Every cycle, 4 instructions fill the pipeline with no lag.
 
Deleted member 193792 (Guest)
You mean execution stalls? Nvidia encodes stall cycles into static scheduling info, visible in disassembly in words without associated instructions for Kepler/Maxwell/Pascal, and tacked on to instructions in Turing/Ampere. Fermi/GCN/RDNA track execution dependencies in hardware.
So Fermi is on par with GCN, but Turing/Ampere (Volta derivatives) are inferior?

I thought nVidia added back the hardware scheduler in their Volta microarchitecture... isn't that the reason Async Compute has actual gains on newer nVidia GPUs?

Fermi was the first GPU with CUDA cores.
Wasn't G80 (GeForce 8800 GTX) the first GPGPU with unified (FP32) shader cores and CUDA support?
 