Friday, March 17th 2017

AMD Ryzen Infinity Fabric Ticks at Memory Speed

Memory clock speed goes a long way toward improving the performance of an AMD Ryzen processor, according to new information from the company. Infinity Fabric, the high-bandwidth interconnect that links the two quad-core complexes (CCXs) on 6-core and 8-core Ryzen processors with other uncore components, such as the PCIe root complex and the integrated southbridge, is synced with the memory clock. AMD made this revelation in response to a question posed by Reddit user CataclysmZA.

Infinity Fabric, a successor to HyperTransport, is AMD's latest interconnect technology, connecting the various components on the Ryzen "Summit Ridge" processor and on the upcoming "Vega" GPU family. According to AMD, it is a 256-bit wide bi-directional crossbar. Think of it as a town square for the chip, where tagged data and instructions change hands between the various components. Within the CCX, the L3 cache performs some inter-core connectivity. The speed of the Infinity Fabric crossbar on a "Summit Ridge" Ryzen processor is determined by the memory clock. When paired with DDR4-2133 memory, for example, the crossbar ticks at 1066 MHz (SDR, actual clock). Using faster memory, according to AMD, therefore has a direct impact on the bandwidth of this interconnect.
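As a back-of-the-envelope illustration of the relationship described above, here is a small Python sketch. The bandwidth formula (crossbar width times fabric clock, assuming one transfer per fabric clock per direction) is a simplifying assumption for illustration, not an AMD-published figure.

```python
def fabric_clock_mhz(ddr_rating):
    """DDR transfers data twice per clock, so the actual (SDR) clock is
    half the DDR4-XXXX transfer rate; per the article, Infinity Fabric
    ticks at this same clock."""
    return ddr_rating / 2

def fabric_bandwidth_gbs(ddr_rating, width_bits=256):
    # Assumes the 256-bit crossbar moves one transfer per fabric clock
    # per direction -- a simplification, not an AMD-published formula.
    return fabric_clock_mhz(ddr_rating) * 1e6 * (width_bits / 8) / 1e9

print(fabric_clock_mhz(2133))                      # ~1066 MHz, as in the article
print(round(fabric_bandwidth_gbs(2133), 1))        # estimated GB/s per direction
print(round(fabric_bandwidth_gbs(3200), 1))        # faster RAM -> more fabric bandwidth
```

Under these assumptions, moving from DDR4-2133 to DDR4-3200 raises the estimated per-direction crossbar bandwidth proportionally, which is why faster memory helps beyond raw memory bandwidth alone.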
Source: CataclysmZA on Reddit
Add your own comment

95 Comments on AMD Ryzen Infinity Fabric Ticks at Memory Speed

#26
uuuaaaaaa
IceScreamer
Yea, never thought about that actually, makes sense now that you mention it.

Also a question, could this Infinity Fabric in theory enable on-die Crossfire/SLI connection between two GPUs, removing (or reducing) the need for software?
I think that is exactly what they want to do, since Vega also supports connecting to this Infinity Fabric thing. It also applies to inter-CPU connections on their Naples server platform, which sports a healthy 8-channel memory configuration.
Posted on Reply
#27
fynxer
So if you're building a Ryzen gaming rig on a budget,

less is more:

it's better to use something like 8 GB of expensive, super-fast memory to get more performance.

What AMD should do is revive their Radeon memory brand and sell super-fast DDR4 memory with Ryzen-only profiles at very low cost to push their Ryzen CPU business.

This way gamers are more inclined to upgrade to Ryzen if they can get maximum performance at a reasonable price. What they don't make in the memory business they will gain tenfold in the CPU business.
Posted on Reply
#28
RejZoR
This also explains stability issues with super-high-clocked RAM: it also clocks the Infinity Fabric bus very high...
Posted on Reply
#29
bug
Legacy-ZA
Dual Channel seems to be one of the problems, if they brought it out with Triple / Quad, these would have performed way better.
The number of channels (or bandwidth) is not the issue here. The issue is that the crossbar switch operates at the same frequency as the RAM. With slower RAM, the crossbar switch has higher latency -> the interconnect is slower.

Not a big issue per se, but it depends on whether memory speed limitations can be fixed with a simple BIOS update or whether they require hardware changes.
Posted on Reply
#30
Hood
So, the gist of this thread is, Ryzen should have been designed like an Intel CPU, with more memory channels, a monolithic core design, and a more capable IMC. Of course, it would cost more to make (just like Intel). So let's just make Ryzen into a clone of Intel's HEDT chips, and at the same price level. But hey, at least we can put on an AMD case badge, to let everyone know how much we hate Intel...
Posted on Reply
#31
papupepo
The majority of game developers assume all the cores of a processor are linked through the L3 cache, because that is how Intel's processors work. They might carelessly share critical data and pay an enormous cost on Ryzen.

Programmers must know about cache mechanisms. They can calculate concurrently but must update in one thread. If data is shared by many cores and all of them update it, the result is a total mess, especially on Ryzen.
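The "calculate concurrently but update in one thread" idea above can be sketched in a few lines of Python (a generic illustration of the principle, not Ryzen-specific code): each worker accumulates into its own slot, and the final merge happens in a single thread instead of all threads contending on one shared total.

```python
import threading

data = list(range(100_000))
n_workers = 4
partials = [0] * n_workers  # one slot per worker: no shared counter

def worker(idx):
    # Each thread accumulates into a local variable and writes its
    # result exactly once, instead of repeatedly updating shared state.
    lo = idx * len(data) // n_workers
    hi = (idx + 1) * len(data) // n_workers
    s = 0
    for x in data[lo:hi]:
        s += x
    partials[idx] = s

threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
for t in threads:
    t.start()
for t in threads:
    t.join()

total = sum(partials)  # the single "update in one thread" step
print(total)
```

The same pattern applies in any language; on a split-cache design like Ryzen's, it also keeps each worker's hot data out of the other cores' caches until the final merge.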
Posted on Reply
#32
bug
papupepo
Programmers must know about cache mechanisms.
No, not really. The cache is there to assist while programmers do their thing. In a world of virtualization, software rarely knows what it's running on anyway.
Optimizing data chunks with respect to cache size is required in some instances, but knowing the intricacies of a cache's implementation is certainly not a requirement for a programmer.
Posted on Reply
#33
deu
mastrdrver
Good video showing how talking across the CCXs through the fabric hurts performance. It also shows that the MS Windows 10 scheduler needs some tweaking.


Good video to illustrate the issue! :) Now we just need people to understand that this is something that is fixable :) (like really fixable.)
Posted on Reply
#34
papupepo
bug
Optimizing data chunks wrt cache size is required in some instances, but knowing the intricacies of cache's implementation is certainly not a requirement for a programmer.
This is a principle, not a technical detail. And this principle is obvious to anyone who has studied any kind of cache mechanism before.
Posted on Reply
#35
bug
papupepo
This is a principle, not a technical detail. And this principle is obvious to anyone who has studied any kind of cache mechanism before.
You lost me.
Posted on Reply
#36
cdawall
where the hell are my stars
chaosmassive
AMD needs to drop this "CPU block" style; interfaces between 'groups' of CPUs tend to be bottlenecked by bandwidth.
Look back at Intel's C2Q and Pentium D, linked at FSB speed, which Intel ultimately dropped.
AMD needs to make real 'individual' cores, with the L3 cache shared across all 8 cores like Intel does.

I don't know, maybe AMD tried to save R&D cost by making a 'blueprint' of a 4-core configuration and simply 'copy-pasting' cores onto silicon.
You mean like the Athlon X2, Phenom, and Phenom II...
Posted on Reply
#37
r9
So again the bottleneck is the connection between the two CCXs and the L3.
So if the Windows scheduler handles threads and the L3 cache properly, not moving threads between the two CCXs, this Infinity Fabric should not be an issue; and AMD said the Windows scheduler is aware of the Ryzen architecture.
I'm confused.
And, wishful thinking, but maybe in a future BIOS update we could unlink the bus from the memory and overclock it.
Posted on Reply
#38
papupepo
bug
You lost me.
Did I? Parallel programming is extremely difficult. You must know many principles for it. You can't benefit from multicore processors if you write a program carelessly.

If you are not a skilled programmer and don't know much about parallel programming, you should write single-threaded programs. Caches always help you there.

And you should know the details of the hardware if you write performance-critical software, like a game.
Posted on Reply
#39
mcraygsx
deu
Good video to illustrate the issue! :) Now we just need people to understand that this is something that is fixable :) (like really fixable.)
That was fantastic Video. This also means each time benchmarking results can/will vary depending on how Windows is scheduling threads between two CCX.
Posted on Reply
#40
bug
r9
So again the bottleneck is the connection between the two CCX and L3.
So if the Windows scheduler handles the threads and L3 cache properly, not moving threads between the two CCX this Infinity Fabric should not be an issue, and AMD said that Windows Scheduler is aware of the Ryzen Architecture.
I'm confused.
And wishful thinking but maybe in the next BIOS updates we could unlink the bus from the memory and overclock it.
AMD themselves said Win scheduler is not the issue, but what do they know? http://www.windowscentral.com/amd-says-windows-scheduler-isnt-blame-ryzen-performance

papupepo
Did I? Parallel programming is extremely difficult. You must know many principles for it. You can't benefit from multicore processors if you write a program carelessly.

If you are not a skilled programmer and don't know much about parallel programming, you should write single-threaded programs. Caches always help you there.

And you should know the details of hardware if you write performance-critical software, like a game.
Parallel programming is not extremely difficult. In fact, it can be fairly easy to do (look at Erlang or Go's goroutines). But most of the time it is more tedious to write and harder to test/maintain.
Caching has nothing to do with multi-threading. Caching is there to avoid memory read/writes, it doesn't actually care whether the CPU is running 1 or 1,000 threads.
L1 and L2 caches are always split and I know of no one trying to write multithreaded code in order not to upset L1 and L2 caches. If anything, that's a compiler's or a scheduler's job. I don't see why things would be any different when we're talking about L3 cache.
Posted on Reply
#41
r9
bug
AMD themselves said Win scheduler is not the issue, but what do they know? http://www.windowscentral.com/amd-says-windows-scheduler-isnt-blame-ryzen-performance



Parallel programming is not extremely difficult. In fact, it can be fairly easy to do (look at Erlang or Go's goroutines). But most of the time it is more tedious to write and harder to test/maintain.
Caching has nothing to do with multi-threading. Caching is there to avoid memory read/writes, it doesn't actually care whether the CPU is running 1 or 1,000 threads.
L1 and L2 caches are always split and I know of no one trying to write multithreaded code in order not to upset L1 and L2 caches. If anything, that's a compiler's or a scheduler's job. I don't see why things would be any different when we're talking about L3 cache.
Ryzen's L3 cache is split between the two CCXs, so it has 2x8 MB instead of 1x16 MB. Which means that if a thread is moved from one CCX to the other, the cached information needs to be moved to the appropriate L3.
That's where the Infinity Fabric bottleneck comes into play and heavily affects performance.
Posted on Reply
#42
bug
r9
Ryzen's L3 cache is split between the two CCXs, so it has 2x8 MB instead of 1x16 MB. Which means that if a thread is moved from one CCX to the other, the cached information needs to be moved to the appropriate L3.
That's where the Infinity Fabric bottleneck comes into play and heavily affects performance.
I know that. But that's AMD's design decision. And when going for max performance, programmers will need to account for that. But I don't think it's fair to say programmers should (much less must) take into account cache implementation details when writing a game.
And even so, thread core affinity is typically the responsibility of the OS.
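Even though placement is normally the scheduler's job, a program can override it. A minimal, Linux-only Python sketch follows; the idea that logical CPUs 0-3 map to a single CCX is an assumption for illustration, and the actual mapping varies by system.

```python
import os

# Linux-only sketch: thread/process placement is normally the OS
# scheduler's responsibility, but a program can pin itself explicitly.
# Pinning to CPUs {0, 1, 2, 3} *might* keep the work on one CCX, but
# the logical-CPU-to-CCX mapping is an assumption, not a guarantee.
if hasattr(os, "sched_setaffinity"):  # not available on Windows/macOS
    want = {0, 1, 2, 3} & os.sched_getaffinity(0)
    if want:
        os.sched_setaffinity(0, want)  # 0 = the calling process
    print(os.sched_getaffinity(0))
```

On Windows, the equivalent would go through the Win32 affinity APIs; the point is just that affinity is settable from user code when the default scheduling proves costly.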
Posted on Reply
#43
r9
bug
I know that. But that's AMD's design decision. And when going for max performance, programmers will need to account for that. But I don't think it's fair to say programmers should (much less must) take into account cache implementation details when writing a game.
And even so, thread core affinity is typically the responsibility of the OS.
And calling the interconnect Infinity Fabric is like putting a racing stripe on a car and expecting it to go faster.
Something is not adding up here. From the information that was floating around, it sounded like Infinity Fabric is the bottleneck due to threads moving between CCXs.
But with AMD releasing that statement that nothing is wrong with the Windows scheduler, it looks like that bus is the bottleneck in all scenarios.
And it sounds like all the memory issues are related to the bus being in sync with the memory.
Looks like a huge oversight on AMD's side.
But I'm willing to bet they will offer a significant IPC improvement in Zen 2.0, and it will be largely due to addressing the bus speed.
Posted on Reply
#44
Captain_Tom
chaosmassive
AMD needs to drop this "CPU block" style; interfaces between 'groups' of CPUs tend to be bottlenecked by bandwidth.
Look back at Intel's C2Q and Pentium D, linked at FSB speed, which Intel ultimately dropped.
AMD needs to make real 'individual' cores, with the L3 cache shared across all 8 cores like Intel does.

I don't know, maybe AMD tried to save R&D cost by making a 'blueprint' of a 4-core configuration and simply 'copy-pasting' cores onto silicon.
It is far cheaper, and nearly infinitely scalable, for AMD to do it this way.

You can thank this new tech for Ryzen's low cost and its massive 32-core brethren. In fact, I hope (and expect) AMD will apply this to their GPU architectures within a year. Imagine a 1200 mm^2, 10,000-SP monster gaming card.
Posted on Reply
#45
OSdevr
Different CPUs have different cache designs, and they have become quite complicated. A game developer may be able to slightly improve performance using good programming habits, but they could just as easily hinder it. Caches are designed to improve performance for the average program. Also memory allocation isn't the program's job. It's done by the OS and the OS indeed plays tricks with caches (page coloring for example).

It is possible to play the caches like a fiddle (Memtest86+ doesn't disable them), but it's quite difficult and not something that can be done under an OS.

BTW, hasn't AMD done something like this before? I think they once had an FSB that was synced with main memory, or had to be for good performance.
Posted on Reply
#46
prtskg
erek
Why did they even decide against a monolithic design? Can't believe we're talking about two separate modules called CCXs (CPU Complexes)... it just seems like an obsolete design, going back to the first dual cores that had to reach out to the FSB to communicate with each other. This is unbelievable to me. I know it's better than going out to the FSB, but imagine how crazy Ryzen could have been with a monolithic design... it'd be crazy fast, I imagine...

Tired of anything related to modules with slow interconnects.
I thought the reason was obvious. They don't have enough money and human resources to do monolithic designs for CPUs from 2 to 8 cores, APUs from 2 to 4 cores, GPUs from small to big, custom chips for consoles, other embedded designs, etc. They also needed some interconnect for servers as well as the HPC APU. So they chose the best compromise for AMD: a design that helps them with computational tasks, aka server CPUs and APUs, over gaming. And I think they did well; I never expected them to come so close to Intel. Zen CPUs should serve them well in servers, and this will give them enough money for better products down the line. I'm now happy enough with their product to assemble some AM4 systems down the line, something I didn't do with their BD products.
Posted on Reply
#47
AcesNDueces
The improvements seen have very little to do with actual memory bandwidth and more to do with the side benefit of the higher memory speed increasing the Infinity Fabric clock. Essentially, faster RAM overclocks the uncore/southbridge/cache speeds. That's where the jump is coming from.
Posted on Reply
#48
Legacy-ZA
bug
The number of channels (or bandwidth) is not the issue here. The issue is that the crossbar switch operates at the same frequency as the RAM. With slower RAM, the crossbar switch has higher latency -> the interconnect is slower.

Not a big issue per se, but it depends on whether memory speed limitations can be fixed with a simple BIOS update or whether they require hardware changes.
I did say "one" of the issues. *sigh*
Posted on Reply
#49
bug
Legacy-ZA
I did say "one" of the issues. *sigh*
You were still wrong.
Posted on Reply
#50
bug
r9
And calling the interconnect Infinity Fabric is like putting a racing stripe on a car and expecting it to go faster.
Something is not adding up here. From the information that was floating around, it sounded like Infinity Fabric is the bottleneck due to threads moving between CCXs.
But with AMD releasing that statement that nothing is wrong with the Windows scheduler, it looks like that bus is the bottleneck in all scenarios.
And it sounds like all the memory issues are related to the bus being in sync with the memory.
Looks like a huge oversight on AMD's side.
But I'm willing to bet they will offer a significant IPC improvement in Zen 2.0, and it will be largely due to addressing the bus speed.
From what I've read, AMD consciously compromised on the memory performance front to get the product out. My guess is they'll enable faster DDR for this generation and come up with an improved solution in the next iteration.

But this compromise is just like when we "compromise" and buy whatever CPU we can, even if we know a better one is just around the corner. If we'd wait for the perfect CPU, we'd never buy anything. The same as AMD, if they wanted to fix everything, they'd never release. Because once fixed, the bottleneck would simply move somewhere else, and once that was fixed the bottleneck would move again and so on.
Posted on Reply
Add your own comment