
AMD's Ryzen Cache Analyzed - Improvements; Improveable; CCX Compromises

I wonder if AIDA64 was updated... we were told directly by FinalWire not to use it for data until they updated it... AMD didn't send them Ryzen hardware pre-launch...

See: https://forums.aida64.com/topic/3768-aida64-compatibility-with-amd-ryzen-processors/

3) L1 cache bandwidth and latency scores, as well as memory bandwidth and latency scores are already accurately measured.

The 1800X sits right between a Celeron J1900 (2013) and an Opteron 2378 (2008).
 
Huh?
69.3 vs 98 is... 3 times?

PS
Are they testing the "core from the left quad accessing L3 of the right quad" scenario? (CCX in the title hints at that, but nothing in the chaotic text of the OP talks about it.)

You're looking at the wrong table, that's system memory latency. What OP means is L3 cache latency, 17.3 vs 46.6
 
The L3 is split in half, and communication between the two CCXs goes through the same link that connects the CCX to the memory controller, PCIe, etc., at a much lower speed.

Interesting - this should mean the 4c/8t Ryzen parts won't suffer from this penalty, so their performance should be correspondingly better.
 
Citation please.
Download AIDA64's most recent beta and run the latency benchmark for yourself. If you do not have an 1800X or an AIDA64 license, you can take the 98 ns figure cited in this article, download the trial of AIDA64 (beta), and view the data yourself. ;)
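The pointer-chasing technique that latency benchmarks like this rely on can be sketched in a few lines. To be clear, this is an illustrative sketch, not AIDA64's actual method (which isn't public), and in Python the interpreter overhead dwarfs the cache latency itself, so only the shape of the technique carries over, not the absolute numbers:

```python
import random
import time

def pointer_chase(n_elems):
    """Build a random cyclic permutation and chase it. The unpredictable
    access order defeats the hardware prefetcher, so each hop pays the
    latency of whichever cache level the working set fits in."""
    order = list(range(n_elems))
    random.shuffle(order)
    nxt = [0] * n_elems
    # Link each element to the next one in the shuffled order (one big cycle).
    for a, b in zip(order, order[1:] + order[:1]):
        nxt[a] = b
    idx = order[0]
    t0 = time.perf_counter()
    for _ in range(n_elems):
        idx = nxt[idx]  # dependent load: next address depends on this one
    elapsed = time.perf_counter() - t0
    # Average ns per hop (includes interpreter overhead, so illustrative only).
    return elapsed / n_elems * 1e9

# Small (cache-resident) vs. larger (cache-spilling) working sets:
for n in (1 << 10, 1 << 20):
    print(f"{n:8d} elements: {pointer_chase(n):7.1f} ns/hop")
```

The key property is the dependent load: each access's address comes from the previous access, so latencies cannot be overlapped, which is exactly what distinguishes a latency benchmark from a bandwidth one.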
 
Download AIDA64's most recent beta and run the latency benchmark for yourself. If you do not have an 1800X or an AIDA64 license, you can take the 98 ns figure cited in this article, download the trial of AIDA64 (beta), and view the data yourself. ;)
Nope, I'm sorry, but this summary of our article is missing so many facts / has so many inaccuracies that it's very misleading at this point. I hope it will be fixed soon.

Again :
- You can't compare L3 values (especially L3 latency), they are wrong (in orange, for a reason)
- FYI, the table they took from our article on RAM latency was done at 3 GHz with SMT and HT off. Real RAM latency @ stock is around 89.6 ns with DDR4-2400. That's still much higher than other CPUs with the same RAM, but you can't compare a 3 GHz value to other CPUs @ stock.

Hopefully this news will get fixed, please check the original article with Google Translate if you want more details.
 
Nope, I'm sorry, but this summary of our article is missing so many facts / has so many inaccuracies that it's very misleading at this point. I hope it will be fixed soon.

Again :
- You can't compare L3 values (especially L3 latency), they are wrong (in orange, for a reason)
- FYI, the table they took from our article on RAM latency was done at 3 GHz with SMT and HT off. Real RAM latency @ stock is around 89.6 ns with DDR4-2400. That's still much higher than other CPUs with the same RAM, but you can't compare a 3 GHz value to other CPUs @ stock.

Hopefully this news will get fixed, please check the original article with Google Translate if you want more details.
I found the summary to be consistent with actual tests of the CPU with RAM at 2666. If you think TechPowerUp's summary of your article manipulated the data, I guess that is between you and them. You can simply run AIDA64 tests and find similar results. I actually found 92 ns for memory latency.

Whether we are splitting hairs over the 98 ns in this article, the 92 ns, or this recent 89.6 ns you reference, what we have is some pretty bad latency compared to AMD's other offerings or Intel products. As a result of these findings, coupled with gaming performance, we have a stock that continues its slide.
 
One does wonder if the 4-core parts will suffer the same fate, since they will be one straight core complex.

More or less what I wanted to know yesterday.

Can somebody disable 4 cores and SMT and do some game benchmarks? Just to get a glimpse of what to expect from the Ryzen 3 CPUs.
That would also take the Windows scheduler optimization out of the equation:
The issue of the scheduler not distinguishing between actual and SMT cores, assigning threads to SMT logical cores that are four times slower than actual cores.
And moving threads between CCXs, causing bottlenecks from the split L3 cache and the slow inter-cache link.
Explained here:
https://www.reddit.com/r/Amd/comments/5x7oaq/ryzens_memory_latency_problem_a_discussion_of/
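For anyone who wants to experiment before any scheduler patch lands, process affinity can be set by hand. Below is a minimal, Linux-only Python sketch; the assumption that the first half of the logical CPUs corresponds to one CCX is mine and should be checked against `lscpu -e` on the actual machine:

```python
import os

def pin_to_cpus(cpus):
    """Restrict this process (and the threads it spawns) to the given
    set of logical CPUs. On an 1800X, {0..7} would be one CCX *if* the
    OS enumerates CCX0 first -- an assumption worth verifying."""
    if not hasattr(os, "sched_setaffinity"):
        raise OSError("CPU affinity control here requires Linux")
    os.sched_setaffinity(0, cpus)  # 0 = the current process
    return os.sched_getaffinity(0)

if __name__ == "__main__":
    avail = os.sched_getaffinity(0)
    # Pin to the first half of the available CPUs (hypothetically one CCX).
    half = set(sorted(avail)[: max(1, len(avail) // 2)])
    print("pinned to:", pin_to_cpus(half))
```

Running a game pinned this way (or via `taskset` / Windows Task Manager affinity) would keep its threads' working set in a single L3, which is exactly the cross-CCX migration cost the Reddit thread above discusses.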
 
This largely explains the gaming performance. In other words, Zen shouldn't perform worse (in IPC) than Intel if either 1) the game only uses 4 threads, or 2) the game uses 8 or more threads.

Most modern games only really use 6 threads (while jumping to 8 when necessary) depending on the workload, and thus AMD loses in most games.


Makes me once again say that AMD should try to make a 4.5 - 5.0 GHz 4c/8t Ryzen 7 chip for $275. They need a version made specifically for high-FPS gamers.
 
I found the summary to be consistent with actual tests of the CPU with RAM at 2666. If you think TechPowerUp's summary of your article manipulated the data, I guess that is between you and them. You can simply run AIDA64 tests and find similar results. I actually found 92 ns for memory latency.
I'm saying there are many errors in the summary, such as quoting latency in milliseconds instead of nanoseconds, and a lot of context missing, for example by quoting our tables without giving the actual configuration of said test. A lot can be put down to the language barrier and mistranslation by Google Translate. I'm simply trying to give readers here some more accurate information.

We alerted TPU this morning of the discrepancies; I have zero doubt they will fix the summary ;)
 
All this makes the current Ryzen performance even more impressive. I mean, it's a chip with a basically handicapped cache/memory implementation, but it still trades blows with Intel chips clock-for-clock. This actually makes me think that the real Ryzen IPC (how it handles the instructions) is significantly better than Intel's.

At the end, this is good news for AMD: they have a clear improvement path --> Lower those L3 and system memory latency figures!

It's clear that the CCX design relies on the interconnect bandwidth, so AMD has a few paths going forward: 1) find a way to increase that bandwidth for a truly scalable architecture, 2) go Intel's route and design a chip that uses a larger CCX (with 16 cores), or 3) do both.

It seems to me AMD should really do both if they want to also become a player in the server market again. 32-core (2 x CCX), 4-chip configurations with up to 128 cores/system is not too much to ask in the server business...

Or (totally fantasizing now, or am I?), they could truly innovate and ditch multi-chip system designs, instead building on the scalability idea to come up with 16-core CCXs that can do up to 8-way (on-chip) interconnects, yielding a full chip with 128 cores. Think about the implications for business clients: a single 128-core chip on a small board, meaning much-easier-to-deal-with systems with much lower power utilization (4 chips on a huge board means huge power overhead). Then, similar to what they do in GPUs, they can trim it down to create a product line-up. I have a feeling this is AMD's way (vision), but it's a goal that's a long way off at the moment...


If I had to guess, AMD will go the improved-interconnect route. It is just cheaper (and infinitely scalable) to essentially tape multiple clusters together.

In fact, I am pretty sure they plan to build up their Navi GPUs in the same way (interconnected clusters) so that they can make some monster 400 W single-GPU chips.
 
I'm saying there are many errors in the summary, such as quoting latency in milliseconds instead of nanoseconds
Now that I did see. I don't think TPU was doing that with malicious intent... I think that is more in the "brain fart" category on their part.

I have visited your site and understand it would be more appropriate for TPU to outline the precise configuration to better represent the data. I believe the conclusion remains the same - latency is higher than we would like.

Just to make sure no one confuses anything (check my previous posts if necessary): I think this product is impressive and a remarkable value. It fell a little below AMD's hype and our expectations, but it is a remarkable achievement for a company previously on the verge. Even as is, it has provided some competition for Intel, and with some tuning may do some decent disruption.
 
L3 performance has been AMD's Achilles' heel for quite some time; kind of surprised that they haven't corrected it yet. I suppose a Windows patch to make it "Ryzen-aware" will have to be developed (just as was the case with the P4's HT, Athlon 64, Core Duo, Bulldozer, etc.) in order to minimize the impact on real-world performance.

Considering all the constraints that AMD has stacked against them (budget, market share, smaller workforce, etc.), it's amazing what they managed to do. I for sure will replace all my crunchers with 1700s, that's a given. :D

I'll keep my 4590 and 3770K for gaming, though. Maybe I'll replace them with 4-core R5s down the line, but they still do their work just fine.
 
L3 performance has been AMD's Achilles' heel for quite some time; kind of surprised that they haven't corrected it yet. I suppose a Windows patch to make it "Ryzen-aware" will have to be developed (just as was the case with the P4's HT, Athlon 64, Core Duo, Bulldozer, etc.) in order to minimize the impact on real-world performance.

Considering all the constraints that AMD has stacked against them (budget, market share, smaller workforce, etc.), it's amazing what they managed to do. I for sure will replace all my crunchers with 1700s, that's a given. :D


This. I called the memory issues when we kept seeing AMD test systems with only 8 or 16 GB of slower RAM. The cache issues are a continuation of the plague that affected prior designs and held them back. They seem to have overcome, or at least masked, the issues by over-engineering other parts of the chip, but the gaming results, and other heavily out-of-order workloads, will continue to show the cache weakness.


The only thing I am unsure about, reading other reports, is how much thread handling will improve the efficiency of the chip. It appears the Windows task scheduler is doing a poor job, as it's unaware of the nuances of the hardware and may send threads to other CCXs; the huge increase in cache latency is what hurts the most. So keeping threads in the same CCX, and/or treating some threads as affinity-bound, should help performance. The implied AI in this situation (I haven't seen any definitive tests showing that program performance increases over runs) may be able to work as intended, or perhaps we are already seeing its effects in the already good, but not great, performance.
 
I am just going to wait for Skylake-X, and if it's not affordable enough I'll go for a 6850K and OC it once they go back to under $500. I'm thinking of using a few NVMe drives, so Ryzen with its 24 PCIe lanes does not offer what I'm looking for right now.

I've seen some benchmarks of the 1800X performing WORSE than a 7700K while streaming a game and doing other tasks.

They tried; I had hopes, but I'm gonna give this one a pass.
 
This largely explains the gaming performance. In other words, Zen shouldn't perform worse (in IPC) than Intel if either 1) the game only uses 4 threads, or 2) the game uses 8 or more threads.

Most modern games only really use 6 threads (while jumping to 8 when necessary) depending on the workload, and thus AMD loses in most games.


Makes me once again say that AMD should try to make a 4.5 - 5.0 GHz 4c/8t Ryzen 7 chip for $275. They need a version made specifically for high-FPS gamers.

Missing the point there. It can be 2 threads and still bottleneck if the software tries to move a thread from CCX0 to CCX1,
which is something that games and the OS do quite often to balance load among cores.
Doing that means moving the data from CCX0's L3 cache to CCX1's L3 cache, which causes the bottleneck because of the ultra-slow L3 interconnect.
The solution should be in sight: they just need to make the Windows scheduler aware of the design and move threads only within the CCX where they originate.
That way it eliminates moving data between the L3 caches of both modules.

This hopefully can be confirmed by benching a game that doesn't use more than 4 threads, with SMT and one of the CCXs on the Ryzen 7 disabled.
That eliminates all the above scenarios.
 
The solution should be in sight: they just need to make the Windows scheduler aware of the design and move threads only within the CCX where they originate.
That way it eliminates moving data between the L3 caches of both modules
Yep, that makes sense.
It would make the solution software-only; exciting.
 
Hi, the memory latency is in "ns" (nano = 1/1,000,000,000 second), not "ms" (milli = 1/1,000 second).

And it's not a typo: it appears 5 times in the text, while "ns" never appears.
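Converting the latency to clock cycles makes the scale of the ns/ms mix-up obvious. A quick back-of-envelope sketch (the 3.6 GHz figure is an assumed clock in the 1800X's operating range):

```python
def latency_cycles(latency_s, freq_hz):
    """Convert a latency in seconds to CPU clock cycles at a given frequency."""
    return latency_s * freq_hz

FREQ = 3.6e9  # assumed 3.6 GHz core clock

# 98 ns of memory latency is a few hundred cycles -- plausible for DRAM.
print(f"98 ns -> {latency_cycles(98e-9, FREQ):,.0f} cycles")
# 98 ms would be hundreds of millions of cycles -- obviously not a cache figure.
print(f"98 ms -> {latency_cycles(98e-3, FREQ):,.0f} cycles")
```

At 3.6 GHz, 98 ns works out to roughly 350 cycles, which is the right order of magnitude for a DRAM access; 98 ms would imply a memory access taking a third of a second of core time.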

It isn't a typo; I filed that under the recently created "laughable brain farts" category of my own posting analysis. Thank you for calling my attention to that =)

Author of the article here. I know the language barrier doesn't make things easy, but there are a few inaccuracies in this summary. Some quick points on what we found:

- Memory latency (not L3) is higher (and in ns, not ms ;))
- The L3 is split in half, and communication between the two CCXs goes through the same link that connects the CCX to the memory controller, PCIe, etc., at a much lower speed.

Plus many other things regarding the CCX, etc. I don't know how good a job Google Translate does on our article, but I'd suggest interested people give it a shot (page 22/23, maybe 24 [we found another issue with game performance that's linked to Windows 10], is what you're looking for).

To answer another question: yes, L3 readings are inaccurate in AIDA64 (that's why we show them in orange in the table). We use another test (a beta benchmark from AIDA64, too) to check latency at different block sizes; that one is the basis of our analysis.

G.

Hello =) Thank you for taking the time to comment and try to improve understanding of some of these issues. The language barrier is certainly part of the problem. And congrats on such an in-depth look at what makes Ryzen tick!

I'll take the time to read and pore over your comments and some of the questions posed to see if I can shed some light on some other things.

- You can't compare L3 values (especially L3 latency), they are wrong (in orange, for a reason)

I can compare them between your own results, which were all done with the same configuration between the 6900K and the 1800X, right? That's what I compare in the article.


I'm saying there are many errors in the summary, such as quoting latency in milliseconds instead of nanoseconds, and a lot of context missing, for example by quoting our tables without giving the actual configuration of said test. A lot can be put down to the language barrier and mistranslation by Google Translate. I'm simply trying to give readers here some more accurate information.

We alerted TPU this morning of the discrepancies; I have zero doubt they will fix the summary ;)

Latency labeled in milliseconds instead of nanoseconds doesn't really change anything here: the discrepancy remains the same, and the units of measurement remained constant across the comparison. It's a "brain-farted" technicality that doesn't affect the overall picture. Unfortunate, yes, but it doesn't change anything in the grand scheme of things.

Regarding the absent configuration, a stark neglect on my part, which I will update accordingly, so thanks for bringing that to my attention =) Time isn't as plentiful as we would like, hence why only now I'm here improving the article.

Now that I did see. I don't think TPU was doing that with malicious intent... I think that is more in the "brain fart" category on their part.

I have visited your site and understand it would be more appropriate for TPU to outline the precise configuration to better represent the data. I believe the conclusion remains the same - latency is higher than we would like.

^

This. I will, however, edit the piece to include the noted configuration.

There were a few problems with this article. The use of "ms" (milliseconds) instead of "ns" (nanoseconds) was fairly glaring; CPU reaction speeds have not been measured in "ms" since the early '80s. There were also a few grammatical errors, which have been fixed. You're welcome.

I will ignore the delivery of your criticism and focus on the content. Thank you for it.


AIDA64 tweeted


Kind of hard to have a working AIDA64 for Ryzen when the company tweets that it can't fix it until they get a Ryzen chip, the same day that article is published.

For me, that was the whole point of the post. AIDA64 is a benchmarking utility, but until it has been "fixed", as in properly optimized for Ryzen, I think it presents a great opportunity to see Ryzen's behavior on non-optimized workloads (i.e., what all games currently are).
 
More or less what I wanted to know yesterday.


I would also like to know if the 4-core, 8-thread part will be affected.
Anyway, that is the most interesting part of this launch; the 8-core, 16-thread part, while nice and powerful, is too much for current software.
 
Can these issues be fixed in software, or is it a design flaw that simply can't be fixed until the next version of Ryzen? As a person who hoped and prayed that AMD would be able to give Intel a much-deserved kick to the balls, all this news about Ryzen's performance (or lack thereof) is a major letdown to me.
 
Why not compare a Ryzen against an i7 7700K at the same clock speed, memory timings, and core/thread count?

E.g., since Ryzen won't OC much, clock them both @ 3.9 GHz, 4c/8t. I know we'd be gimping the i7 7700K, but I'm just curious what the result of an "almost the same" setup would be. Gaming & productivity benches needed.
 

Attachments

  • bf1.png (11.3 KB · Views: 303)
  • farcry-primal.png (11.3 KB · Views: 612)
Why not compare a Ryzen against an i7 7700K at the same clock speed, memory timings, and core/thread count?

E.g., since Ryzen won't OC much, clock them both @ 3.9 GHz, 4c/8t. I know we'd be gimping the i7 7700K, but I'm just curious what the result of an "almost the same" setup would be. Gaming & productivity benches needed.

Where did you get these graphs from?
 
2400 DDR4 is slower than my 2133 DDR3 at 2400 with my timings below.
 
I will ignore the delivery of your criticism and focus on the content. Thank you for it.
My delivery was intended as constructive, helpful criticism. Don't let it bruise your ego.

For me, that was the whole point of the post. AIDA64 is a benchmarking utility, but until it has been "fixed", as in properly optimized for Ryzen, I think it presents a great opportunity to see Ryzen's behavior on non-optimized workloads (i.e., what all games currently are).
If AIDA64 and game engines worked in similar ways, that logic would be flawless. But they don't, so it fails. What is needed is a utility that properly exercises the hardware it's testing, to give accurate results and information.
 