Monday, March 6th 2017

AMD's Ryzen Cache Analyzed - Improvements; Improveable; CCX Compromises

The lower-than-expected performance of AMD's Ryzen 7 in some applications seems to stem from one particular problem: memory. Even before the Ryzen chips launched, reports had AMD confirming that most of the tweaks and programming for the new architecture went into maximizing core performance, at the expense of memory compatibility and performance. Apparently, until the Ryzen line-up is completed with the upcoming Ryzen 5 and Ryzen 3 processors, the company will be hard at work on improving Ryzen's cache handling and memory latency.

Hardware.fr has done a pretty good job of exploring, through AIDA64, the cache and memory subsystem deficiencies in what would otherwise be an exceptional processor design. Namely, there seems to be some problem with Ryzen's L3 cache and memory subsystem implementation. Paired with the same memory configuration and at the same 3 GHz clocks, for instance, Ryzen shows memory latency that is up to 30 ns higher (at 90 ns) than the average latency found on Intel's i7 6900K or even AMD's FX 8350 (both at around 60 ns).
Update: The lack of information regarding the test system left some gray areas in the interpretation of the results. Hardware.fr's tests, and the results below, were obtained with the 8-core chips set to 3 GHz and with SMT and HT deactivated. Memory for the Ryzen and Intel platforms was DDR4-2400 with 15-15-15-35 timings, and memory for the AMD FX platform was DDR3-1600 at 9-9-9-24 timings. Both memory configurations used 4x 4 GB modules, totaling 16 GB of memory.

From further test results, we see that Intel's L1 cache is still leagues ahead of AMD's implementation; that AMD's L2 is faster overall than Intel's, though it does incur a roughly 2 ns latency penalty; and that AMD's L3 cache is well behind Intel's in all metrics but L3 cache copies, with latency being almost 3x greater than on Intel's 6900K.
The problem is revealed through an increasing working-set size. In the case of the 6900K, which has a 32 KB L1 cache, performance is greatest up to that workload size. Larger workloads that don't fit in the L1 cache then "spill" into the 6900K's 256 KB L2 cache; workloads larger than 256 KB and smaller than 16 MB are served by the 6900K's 20 MB L3 cache; and any workload larger than 16 MB forces the processor to access main system memory, with latency increasing until it reaches the RAM's ~70 ns access time.
However, on AMD's Ryzen 7 1800X, latency is a wholly different beast. Everything is fine in the L1 and L2 caches (32 KB and 512 KB, respectively), but the behavior changes completely once a workload moves into the 1800X's 16 MB L3 cache. Up to 4 MB of cache utilization, we see the expected increase in latency; however, latency goes through the roof well before the chip's 16 MB of L3 cache is completely filled. This clearly derives from Ryzen's modularity, with each CCX (made up of 4 cores and 8 MB of L3 cache, besides all the other duplicated logic) being able to access only 8 MB of L3 cache at any point in time.
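As a rough illustration of the spill behavior described above, the following Python sketch maps a working-set size to the first cache level large enough to hold it, using only the capacities quoted in the text. The function name `serving_level` and the table layout are hypothetical, and the model is a deliberate simplification: it ignores associativity, prefetching, and the exclusive nature of Ryzen's L3, and it assumes the 1800X is seen from a single CCX with 8 MB of local L3.

```python
KB = 1024
MB = 1024 * KB

# (level name, capacity in bytes), ordered from fastest to slowest.
# Capacities are the ones quoted in the article. The 1800X L3 entry is the
# 8 MB slice local to one CCX rather than the 16 MB chip total; that choice
# is an assumption of this sketch, not a measured property.
HIERARCHIES = {
    "i7-6900K": [("L1", 32 * KB), ("L2", 256 * KB), ("L3", 20 * MB)],
    "Ryzen 7 1800X (one CCX)": [("L1", 32 * KB), ("L2", 512 * KB), ("L3", 8 * MB)],
}


def serving_level(chip, working_set_bytes):
    """Return the first level large enough to hold the whole working set."""
    for name, capacity in HIERARCHIES[chip]:
        if working_set_bytes <= capacity:
            return name
    # No listed cache level is big enough, so the model falls back to RAM.
    return "RAM"


if __name__ == "__main__":
    for size in (16 * KB, 384 * KB, 4 * MB, 12 * MB, 32 * MB):
        row = ", ".join(
            f"{chip}: {serving_level(chip, size)}" for chip in HIERARCHIES
        )
        print(f"{size // KB:>6} KB -> {row}")
```

In this toy model, a 12 MB working set still fits in the 6900K's L3 but already exceeds what a single CCX on the 1800X can hold locally, which matches the qualitative shape of the latency curves described here, though not their exact values.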
The difference in access speeds between 4 MB and 8 MB workloads can be explained through AMD's own admission that Ryzen's core design incurs different access times depending on which part of the L3 cache a core accesses. The fact that this cache is "mostly exclusive" - which means that it may hold other information that's not of immediate use to the task at hand - can account for some memory accesses on its own. The L3 cache is essentially a victim cache, meaning that it is filled with the information that no longer fits in the chip's L1 or L2 cache levels, so a workload that uses no more than the 4 cores of a single CCX can only access up to 8 MB of L3 cache. However, even if we were to distribute the workload between cores from both CCXs, so as to be able to access the entirety of the 1800X's 16 MB cache, we'd still be constrained by the inter-CCX bandwidth of AMD's Data Fabric interconnect: 22 GB/s, which is much lower than the L3 cache's 175 GB/s, and even lower than RAM bandwidth. The Data Fabric also has to carry data from AMD's IO Hub PCIe lanes, which potentially eats into the (already meagre) available bandwidth.
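To put the bandwidth gap in perspective, here is a back-of-the-envelope calculation in Python of how long it would take to move one CCX's worth of L3 contents (8 MB) at the two bandwidth figures quoted above. The helper name `transfer_time_us` is hypothetical, and the estimate assumes each link sustains its quoted bandwidth with no added latency or contention, which real transfers would not achieve.

```python
GB = 1e9       # bandwidth figures are treated as decimal gigabytes per second
MB = 2 ** 20   # cache sizes are treated as binary megabytes

DATA_FABRIC_GBPS = 22.0   # inter-CCX bandwidth quoted in the article
L3_GBPS = 175.0           # L3 cache bandwidth quoted in the article


def transfer_time_us(size_bytes, bandwidth_gbps):
    """Ideal time in microseconds to move size_bytes at bandwidth_gbps GB/s."""
    return size_bytes / (bandwidth_gbps * GB) * 1e6


if __name__ == "__main__":
    payload = 8 * MB  # the L3 capacity of one CCX
    fabric = transfer_time_us(payload, DATA_FABRIC_GBPS)
    local = transfer_time_us(payload, L3_GBPS)
    print(f"Data Fabric: {fabric:.1f} us")
    print(f"Local L3:    {local:.1f} us")
    print(f"Ratio:       {fabric / local:.2f}x")
```

Under these assumptions the fabric transfer takes roughly eight times as long as the same transfer at L3 speed (175 / 22 ≈ 8), which gives a sense of why migrating a thread's cached data across CCXs would be costly.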

AMD's Zen architecture is surely an interesting beast, and these kinds of results really go to show the amount of give-and-take design work that AMD had to go through in order to achieve a cost-effective, scalable, and at the same time performant architecture through its CCX modules. This kind of behavior may even give us some answers with regard to Ryzen's lower-than-expected gaming performance, since games are well known to be sensitive to a processor's cache performance profile.
Source: Hardware.fr

120 Comments on AMD's Ryzen Cache Analyzed - Improvements; Improveable; CCX Compromises

#26
ssdpro
EarthDog: I wonder if AIDA64 was updated... we were told directly by FinalWire not to use it for data until they updated it... AMD didn't send them Ryzen pre-launch...
See: forums.aida64.com/topic/3768-aida64-compatibility-with-amd-ryzen-processors/
3) L1 cache bandwidth and latency scores, as well as memory bandwidth and latency scores are already accurately measured.
The 1800X sits right between a Celeron J1900 (2013) and an Opteron 2378 (2008).
Posted on Reply
#27
XiGMAKiD
medi01: Huh?
69.3 vs 98 is... 3 times?

PS
Are they testing the "Core from the left quad accessing L3 of the right quad" scenario? (CCX in the title hints at that, but nothing in the chaotic text of OP talks about it.)
You're looking at the wrong table; that's system memory latency. What OP means is L3 cache latency: 17.3 vs 46.6.
Posted on Reply
#28
Assimilator
C_Wiz: L3 is split in half and communication between the two CCX is through the same link that links the CCX to the memory controller, PCIe, etc., at a much lower speed.
Interesting - this should mean the 4c/8t Ryzen parts won't suffer from this penalty, so their performance should be correspondingly better.
Posted on Reply
#29
lexluthermiester
ssdpro: The 1800X sits right between a Celeron J1900 (2013) and an Opteron 2378 (2008).
Citation please.
Posted on Reply
#30
ssdpro
lexluthermiester: Citation please.
Download AIDA64's most recent beta and run the latency benchmark for yourself. If you do not have an 1800X or an AIDA64 license you can take the 98ns figure cited in this article, download the trial of AIDA64 (beta), and view the data yourself. ;)
Posted on Reply
#31
C_Wiz
hardware.fr
ssdpro: Download AIDA64's most recent beta and run the latency benchmark for yourself. If you do not have an 1800X or an AIDA64 license you can take the 98ns figure cited in this article, download the trial of AIDA64 (beta), and view the data yourself. ;)
Nope, I'm sorry, but this summary of our article is missing so many facts and has so many inaccuracies that it's very misleading at this point. I hope it will be fixed soon.

Again:
- You can't compare L3 values (especially L3 latency), they are wrong (in orange, for a reason)
- FYI, the table that they took from our article of RAM latency is done at 3 GHz with SMT and HT off. Real RAM latency @ stock is around 89.6 with DDR4-2400. That's still much higher than other CPUs with same RAM, but you can't compare a 3 GHz value to other CPUs @ stock.

Hopefully this news will get fixed; please check the original article with Google Translate if you want more details.
Posted on Reply
#32
ssdpro
C_Wiz: Nope, I'm sorry, but this summary of our article is missing so many facts and has so many inaccuracies that it's very misleading at this point. I hope it will be fixed soon.

Again:
- You can't compare L3 values (especially L3 latency), they are wrong (in orange, for a reason)
- FYI, the table that they took from our article of RAM latency is done at 3 GHz with SMT and HT off. Real RAM latency @ stock is around 89.6 with DDR4-2400. That's still much higher than other CPUs with same RAM, but you can't compare a 3 GHz value to other CPUs @ stock.

Hopefully this news will get fixed; please check the original article with Google Translate if you want more details.
I found the summary to be consistent with actual tests of the CPU with RAM at 2666. If you think TechPowerUp's summary of your article made some manipulation of the data, I guess that is between you and them. You can simply run AIDA64 tests and find similar results. I actually found 92ns for memory latency.

Whether we are splitting hairs over the 98ns in this article, the 92ns, or this recent 89.6ns you reference, what we have is some pretty bad latency compared to AMD's other offerings or Intel products. As a result of these findings, coupled with gaming performance, we have a stock that continues its slide.
Posted on Reply
#33
r9
Camm: One does wonder if the 4 core parts will suffer the same fate since it will be one straight core complex.
More or less what I wanted to know yesterday.
r9: Can somebody disable 4 cores and SMT and do some game benchmarks? Just to get a glimpse of what to expect from the Ryzen 3 CPUs.
Also that would take Windows scheduler optimization out of the equation.
The issue is the scheduler not distinguishing between actual and SMT cores, assigning threads to SMT cores that are four times slower than actual cores.
It also moves threads between CCXs, causing bottlenecking from the split L3 cache and the slow inter-cache link.
Explained here:
www.reddit.com/r/Amd/comments/5x7oaq/ryzens_memory_latency_problem_a_discussion_of/.
Posted on Reply
#34
Captain_Tom
This greatly explains the gaming performance. In other words, Zen shouldn't perform worse (in IPC) than Intel if either 1) the game only uses 4 threads, or 2) the game uses 8 or more threads.

Most modern games only really use 6 threads (while jumping to 8 when necessary) depending on the workload, and thus AMD loses in most games.


Makes me once again say that AMD should try to make a 4.5 - 5.0 GHz 4c/8t Ryzen 7 chip for $275. They need a version made specifically for high-FPS gamers.
Posted on Reply
#35
C_Wiz
hardware.fr
ssdpro: I found the summary to be consistent with actual tests of the CPU with RAM at 2666. If you think TechPowerUp's summary of your article made some manipulation of the data, I guess that is between you and them. You can simply run AIDA64 tests and find similar results. I actually found 92ns for memory latency.
I'm saying there are many errors in the summary, such as quoting latency in milliseconds instead of nanoseconds, and a lot of context is missing, for example by quoting our tables without giving the actual configuration of said test. A lot can be put down to the language barrier and mistranslation by Google Translate. I'm simply trying to give readers here some more accurate information.

We alerted TPU this morning to the discrepancies; I have 0 doubt they will fix the summary ;)
Posted on Reply
#36
Captain_Tom
theGryphon: All this makes the current Ryzen performance even more impressive. I mean, it's a chip with basically a handicapped cache/memory implementation but it still trades blows with Intel chips clock-to-clock. This actually makes me think that the real Ryzen IPC (how it handles the instructions) is significantly better than Intel's.

At the end, this is good news for AMD: they have a clear improvement path --> Lower those L3 and system memory latency figures!

It's clear that the CCX design relies on the interconnect bandwidth, so AMD has two paths going forward: 1) either find a way to increase that bandwidth for a truly scalable architecture, or 2) go Intel's route and design a chip that uses a larger CCX (with 16 cores), or 3) Do both.

It seems to me AMD should really do both if they want to also become a player in the server market again. 32-core (2 x CCX), 4-chip configurations with up to 128 cores/system is not too much to ask in the server business...

Or (totally fantasizing now, or am I?), they could truly innovate and ditch the multi-chip system designs but rather build up on the scalability idea to come up with 16-core CCX's that can do up to 8-way (on-chip) interconnects, yielding a full chip with 128 cores. Think about the implications for business clients: a single 128-core chip on a small board, meaning much-easier-to-deal-with systems with much lower power utilization (4 chips on a huge board means huge power overhead). Then, similar to what they do in GPUs, they can trim it down to create a product line-up. I have a feeling this is AMD's way (vision), but it's a goal that's a long way off at the moment...
If I had to guess, AMD will go the improved interconnect route. It is just cheaper (and infinitely scalable) to make a system by essentially taping multiple clusters together.

In fact I am pretty sure they plan to build up their Navi GPUs in the same way (interconnected clusters) so that they can make some monster 400 W single-GPU chips.
Posted on Reply
#37
ssdpro
C_Wiz: I'm saying there are many errors in the summary, such as quoting latency in milliseconds instead of nanoseconds
Now that I did see. I don't think TPU was doing that with malicious intent... I think that is more in the "brain fart" category on their part.

I have visited your site and understand it would be more appropriate for TPU to outline the precise configuration to better represent the data. I believe the conclusion remains the same - latency is higher than we would like.

Just to make sure no one confuses anything (check my previous posts if necessary), I think this product is impressive and a remarkable value. It fell a little below AMD's hype and our expectations but is a remarkable achievement for a company previously on the verge. Even as is, it has provided some competition for Intel and with some tuning may do some decent disruption.
Posted on Reply
#38
TRWOV
L3 performance has been AMD's Achilles' heel for quite some time; I'm kind of surprised that they haven't corrected this yet. I suppose that a Windows patch to make it "Ryzen aware" will have to be developed (just as was the case with P4's HT, Athlon 64, Core Duo, Bulldozer, etc.) in order to minimize the impact on real-world performance.

Considering all the constraints that AMD has stacked against them (budget, market share, smaller workforce, etc.), it's amazing what they managed to do. I for sure will replace all my crunchers with 1700s, that's a given. :D

I'll keep my 4590 and 3770K for gaming though. Maybe I'll replace them with 4-core R5s down the line, but they still do their work just fine.
Posted on Reply
#39
Steevo
TRWOV: L3 performance has been AMD's Achilles' heel for quite some time; I'm kind of surprised that they haven't corrected this yet. I suppose that a Windows patch to make it "Ryzen aware" will have to be developed (just as was the case with P4's HT, Athlon 64, Core Duo, Bulldozer, etc.) in order to minimize the impact on real-world performance.

Considering all the constraints that AMD has stacked against them (budget, market share, smaller workforce, etc.), it's amazing what they managed to do. I for sure will replace all my crunchers with 1700s, that's a given. :D
This. I called the memory issues when we kept seeing AMD test systems with only 8 or 16 GB of slower RAM. The cache issues are a continuation of the plague that affected prior designs and held them back. AMD seems to have overcome or at least masked the issues with over-engineering in other parts of the chip, but the gaming results, and other very out-of-order operations, will continue to show the cache weakness.


The only thing I am unsure about, reading other reports, is how much better thread handling will improve the efficiency of the chip. It appears that the Windows task scheduler is doing a poor job, as it's unaware of the nuances of the hardware and may send threads to the other CCX, and the huge increase in cache latency is what hurts the most. So keeping threads in the same CCX, and/or treating some threads as affinity bound, should help performance. The implied AI in this situation (I haven't seen any definitive tests showing that program performance increases over runs) may be able to work as intended, or perhaps we are already seeing its effects in the already good but not great performance.
Posted on Reply
#40
Dimi
I am just going to wait for Skylake-X, and if it's not affordable enough I'll go for a 6850K and OC it once they go back to under $500. I'm thinking of using a few NVMe drives, so Ryzen with its 24 PCIe lanes does not offer what I'm looking for right now.

I've seen some benchmarks of the 1800X performing WORSE than a 7700K while streaming a game while doing other tasks.

They tried, and I had hopes, but I'm gonna give this one a pass.
Posted on Reply
#41
r9
Captain_Tom: This greatly explains the gaming performance. In other words, Zen shouldn't perform worse (in IPC) than Intel if either 1) the game only uses 4 threads, or 2) the game uses 8 or more threads.

Most modern games only really use 6 threads (while jumping to 8 when necessary) depending on the workload, and thus AMD loses in most games.


Makes me once again say that AMD should try to make a 4.5 - 5.0 GHz 4c/8t Ryzen 7 chip for $275. They need a version made specifically for high-FPS gamers.
Missing the point there. It can be 2 threads and still bottleneck if the software tries to move a thread from CCX0 to CCX1.
That is something that games and the OS do quite often to balance load among cores.
Doing that means moving the data from CCX0's L3 cache to CCX1's L3 cache, which causes the bottleneck because of the ultra slow L3 interconnect.
The solution should be in sight: they just need to make the Windows scheduler aware of the design and move a thread only within the CCX where that thread originates.
That way it eliminates moving data between the L3 caches of both modules.

This can hopefully be confirmed by benching a game that doesn't use more than 4 threads, with SMT and one of the CCXs disabled on the Ryzen 7.
That eliminates all the above scenarios.
Posted on Reply
#42
Joss
r9: The solution should be in sight: they just need to make the Windows scheduler aware of the design and move a thread only within the CCX where that thread originates.
That way it eliminates moving data between the L3 caches of both modules.
Yep, that makes sense.
It would make the solution software only, exciting.
Posted on Reply
#43
Raevenlord
News Editor
niboar: Hi, the memory latency is in "ns" (nanoseconds, 1/1,000,000,000 of a second), not "ms" (milliseconds, 1/1,000 of a second).
PiotrekDG: And it's not a typo, it appears 5 times in the text, while "ns" never appears.
It isn't a typo; I filed that under the recently created "laughable brain farts" category of my own posting analysis. Thank you for calling my attention to that =)
C_Wiz: Author of the article here. I know the language barrier doesn't make things easy, but there are a few inaccuracies here in this summary. Some quick points on what we found:

- Memory latency (not L3) is higher (and ns, not ms ;))
- L3 is split in half and communication between the two CCX is through the same link that links the CCX to the memory controller, PCIe, etc., at a much lower speed.

Plus many other things regarding CCX etc. I don't know how good a job Google Translate does of our article but I'd suggest people interested give it a shot (page 22/23 maybe 24 [we found another issue with game performance that's linked to Windows 10] is what you're looking for).

To answer another question, yes, L3 readings are inaccurate in AIDA (that's why we show them in orange in the table). We do use another test (a beta benchmark from AIDA, too) to check latency at different block sizes; that one is the basis of our analysis.

G.
Hello =) Thank you for taking the time to comment and try and improve understanding on some of these issues. The language barrier is certainly part of the problem. And congrats on such an in-depth look at what makes Ryzen tick!

I'll take the time to read and pore over your comments and some of the questions posed to see if I can shed some light on some other things.
C_Wiz: - You can't compare L3 values (especially L3 latency), they are wrong (in orange, for a reason)
I can compare them between your own results, which were all done with the same configuration between the 6900K and the 1800X, right? That's what I compare in the article.
C_Wiz: I'm saying there are many errors in the summary, such as quoting latency in milliseconds instead of nanoseconds, and a lot of context is missing, for example by quoting our tables without giving the actual configuration of said test. A lot can be put down to the language barrier and mistranslation by Google Translate. I'm simply trying to give readers here some more accurate information.

We alerted TPU this morning to the discrepancies; I have 0 doubt they will fix the summary ;)
Latency in milliseconds or nanoseconds doesn't really change anything: the discrepancy remains the same, and the units of measurement remained constant. It's a "brain-farted" technicality, which doesn't affect the overall picture. Unfortunate, yes, but it doesn't change anything in the grand scheme of things.

Regarding the absent configuration, that was a stark neglect on my part, and I will update the article accordingly, so thanks for bringing that to my attention =) Time isn't as plentiful as we would like, which is why I'm only now here improving the article.
ssdpro: Now that I did see. I don't think TPU was doing that with malicious intent... I think that is more in the "brain fart" category on their part.

I have visited your site and understand it would be more appropriate for TPU to outline the precise configuration to better represent the data. I believe the conclusion remains the same - latency is higher than we would like.
^

This. I will, however, edit the piece including the noted configuration.
lexluthermiester: There were a few problems with this article. The use of "ms" (milliseconds) instead of "ns" (nanoseconds) was fairly glaring. CPU operating reaction speeds have not been measured in "ms" since the early 80's. There were also a few grammatical errors which have been fixed. You're welcome.
I will ignore the delivery of your criticism and focus on the content. Thank you for it.
Xzibit: AIDA64 tweeted
Kind of hard to have a working AIDA64 for Ryzen when the company tweets that it can't fix it until they get a Ryzen chip, the same day that article is published.
For me, that was the whole point of the post. AIDA64 is a benchmarking utility, but until it has been "fixed", as in, properly optimized for Ryzen, I think it presents a great opportunity to see Ryzen's behavior on non-optimized workloads (i.e., what all games currently are).
Posted on Reply
#44
geon2k2
r9: More or less what I wanted to know yesterday.
I would also want to know if the 4-core, 8-thread part will be affected.
Anyway, that is the most interesting part of this launch; the 16 core, while it is nice and powerful, is too much for current software.
Posted on Reply
#45
trparky
Can these issues be fixed in software, or is this a design flaw that simply can't be fixed until the next version of Ryzen? As a person who hoped and prayed that AMD would be able to give Intel a much deserved kick to their balls, all of this news about Ryzen's performance (or lack thereof) is a major letdown to me.
Posted on Reply
#46
akumod77
Why not compare any Ryzen against the i7 7700K at the same clock speed, memory timings, and core/thread count?

For example, because Ryzen won't OC much, clock them both at 3.9 GHz, 4c/8t. I know we are gimping the i7 7700K, but I'm just curious to know what the result of an "almost the same" setup would be. Gaming and productivity benches needed.
Posted on Reply
#47
r9
akumod77: Why not compare any Ryzen against the i7 7700K at the same clock speed, memory timings, and core/thread count?

For example, because Ryzen won't OC much, clock them both at 3.9 GHz, 4c/8t. I know we are gimping the i7 7700K, but I'm just curious to know what the result of an "almost the same" setup would be. Gaming and productivity benches needed.
Where did you get these graphs from?
Posted on Reply
#48
eidairaman1
The Exiled Airman
2400 DDR4 is slower than my 2133 DDR3 at 2400 with my timings below.
Posted on Reply
#49
lexluthermiester
Raevenlord: I will ignore the delivery of your criticism and focus on the content. Thank you for it.
My delivery was intended as constructive, helpful criticism. Don't let it bruise your ego.
Raevenlord: For me, that was the whole point of the post. AIDA64 is a benchmarking utility, but until it has been "fixed", as in, properly optimized for Ryzen, I think it presents a great opportunity to see Ryzen's behavior on non-optimized workloads (i.e., what all games currently are).
If AIDA 64 and game engines worked in similar ways, that logic would be flawless. But they don't, so that logic fails. What is needed is a utility that works the hardware it's testing properly to give accurate results and information.
Posted on Reply