Tuesday, August 8th 2017

AMD Confirms Ryzen Marginality Performance Issue Under Linux, TR and EPYC Clear

AMD has confirmed an issue with Ryzen processors under certain Linux workloads, which causes segmentation faults during very heavy, continuous loads on the Ryzen silicon (parallel compilation workloads in particular). Tests like the Phoronix Test Suite's stress run quickly bring Ryzen processors to their knees with multiple segmentation faults. While the problem is easy to trigger under very heavy workloads, it is virtually absent under normal Linux desktop workloads and benchmarking.

AMD also confirmed that this issue is not present in EPYC or Threadripper processors, but is isolated to early Ryzen samples under Linux (AMD's testing under Windows has found no such behavior). AMD's analysis has also found that these Ryzen segmentation faults aren't isolated to a particular motherboard vendor, but are problems with the processors themselves. AMD encourages Ryzen customers who believe they are affected by the problem to contact AMD Customer Care. Some of those who have contacted Customer Care about the segmentation faults turned out to be affected by thermal, power, or other problems, but AMD says it is committed to working with those encountering this performance marginality issue under Linux. AMD will also be stepping up its Linux testing/QA for future consumer products.

Sources: Phoronix, AMD Confirms Ryzen Issue - Phoronix

44 Comments on AMD Confirms Ryzen Marginality Performance Issue Under Linux, TR and EPYC Clear

#1
justimber
waiting for bashers in 3...2....1

hehe
Is it just for this specific test, or does it also happen whenever Ryzen gets loaded with the same amount of workload?
#2
notb
Segmentation fault is not a "marginality performance issue"...
It's a huge problem - just like with the FMA before.
And as people are pointing out on CPU-specific forums - it most likely can be fixed, but each fix of this sort takes a bit of performance. So will AMD address this? Especially after they called it "marginal"?

BTW: this is not specific to Linux. The same problem should happen under Windows at this kind of load. It's just that people normally don't use Windows for such tasks.

Good news: no one has succeeded in replicating this issue on an EPYC system, yet.


justimber said:

Is it just for this specific test, or does it also happen whenever Ryzen gets loaded with the same amount of workload?
It could happen in a similar workload as well, not "same amount".
#3
bencrutz
notb said:
BTW: this is not specific to Linux. The same problem should happen under Windows at this kind of load. It's just that people normally don't use Windows for such tasks.
"AMD's testing of this issue under Windows hasn't uncovered problematic behavior."

source
#4
notb
bencrutz said:
"AMD's testing of this issue under Windows hasn't uncovered problematic behavior."

source
I assume their testing also hasn't found the issue under Linux. Or has it and they've knowingly released a faulty CPU? :)

I'm waiting for 3rd party tests. It would be great if Phoronix did it (one of the last proper CPU-testing websites), but I doubt they would do a Windows-based review. :-(

I have to say... I might have got a Ryzen if AMD had a bug bounty programme. This could be more profitable than mining crypto. :-P
BTW: Intel and Apple started bug bounty programs... maybe they feel more secure with their products...
#5
R-T-B
notb said:
I assume their testing also hasn't found the issue under Linux. Or has it and they've knowingly released a faulty CPU? :)
I'm pretty sure they could detect it quickly with the multiplatform Phoronix Test Suite.

This is more likely due to the fact that Linux tends to be more "low level" in hardware init than Windows. Less is left to the BIOS. That said, it does explain why my Gentoo install constantly segfaults on compile. I thought a higher SOC voltage alleviated this, but we shall see.
#6
bencrutz
notb said:
I assume their testing also hasn't found the issue under Linux. Or has it and they've knowingly released a faulty CPU? :)

I'm waiting for 3rd party tests. It would be great if Phoronix did it (one of the last proper CPU-testing websites), but I doubt they would do a Windows-based review. :-(

I have to say... I might have got a Ryzen if AMD had a bug bounty programme. This could be more profitable than mining crypto. :p
BTW: Intel and Apple started bug bounty programs... maybe they feel more secure with their products...
well, i'm just stating the fact.

you are free to assume anything.
#7
TheGuruStud
You can make this happen on intel, too...
#8
bug
Please note this statement comes from the PR department.

The only known way to reliably reproduce this is on Linux (https://github.com/suaefar/ryzen-test). If AMD did not acknowledge they identified the issue, I'm having a hard time believing they know for certain that Windows, Epyc or Threadripper are not affected. Their internal testing has come up empty so far, but as pointed out above, their internal testing failed to spot the problem on Linux as well, until someone from outside stepped in and pinpointed it for them.

Keep in mind there's still the possibility that this is not widespread and can be fixed with a firmware update. But until AMD identifies the issue, we just don't know.
#9
birdie
bencrutz said:
"AMD's testing of this issue under Windows hasn't uncovered problematic behavior."

source
Windows is affected. This bug has already been reproduced under WSL.
#10
TheinsanegamerN
notb said:
Segmentation fault is not a "marginality performance issue"...
It's a huge problem - just like with the FMA before.
And as people are pointing out on CPU-specific forums - it most likely can be fixed, but each fix of this sort takes a bit of performance. So will AMD address this? Especially after they called it "marginal"?

BTW: this is not specific to Linux. The same problem should happen under Windows at this kind of load. It's just that people normally don't use Windows for such tasks.

Good news: no one has succeeded in replicating this issue on an EPYC system, yet.



It could happen in a similar workload as well, not "same amount".
Well, based on the article you didn't read, AMD is already in the process of taking care of it. "AMD encourages Ryzen customers who believe to be affected by the problem to contact AMD Customer Care. Some of those who have contacted customer care about the segmentation faults have in turn been affected by thermal, power, or other problems, but AMD says they are committed to working with those encountering this performance marginality issue under Linux."

Also, "AMD also confirmed this issue is not present in EPYC or ThreadRipper processors, but are isolated to early Ryzen processors under Linux" seems to signify it is an issue with early batches of silicon. Worst case, AMD could simply replace those chips, as it sounds like the issue was already fixed on newer ryzens.
#11
bug
TheinsanegamerN said:
Well, based on the article you didn't read, AMD is already in the process of taking care of it. "AMD encourages Ryzen customers who believe to be affected by the problem to contact AMD Customer Care. Some of those who have contacted customer care about the segmentation faults have in turn been affected by thermal, power, or other problems, but AMD says they are committed to working with those encountering this performance marginality issue under Linux."

Also, "AMD also confirmed this issue is not present in EPYC or ThreadRipper processors, but are isolated to early Ryzen processors under Linux" seems to signify it is an issue with early batches of silicon. Worst case, AMD could simply replace those chips, as it sounds like the issue was already fixed on newer ryzens.
Again, how can you tell Epyc or Threadripper (or Windows) are not affected if you don't know what the actual problem is? To me, that's a logical fracture.

Also, when they tell me about "performance marginality", it seems more like a big FU to users than AMD "already in the process of taking care of it". Wth is "performance marginality"?
#12
notb
TheinsanegamerN said:
Well, based on the article you didn't read, AMD is already in the process of taking care of it. "AMD encourages Ryzen customers who believe to be affected by the problem to contact AMD Customer Care. Some of those who have contacted customer care about the segmentation faults have in turn been affected by thermal, power, or other problems, but AMD says they are committed to working with those encountering this performance marginality issue under Linux."
Actually, I did. And I'm sure you did as well. You're just showing a comprehension marginality issue. This quote does not say that AMD is taking care of the problem. What it says is: AMD is committed to working with those that have this problem.
It doesn't even say "help" like in "update firmware" or "replace CPU".
It says "work with".
Also, "AMD also confirmed this issue is not present in EPYC or ThreadRipper processors, but are isolated to early Ryzen processors under Linux" seems to signify it is an issue with early batches of silicon. Worst case, AMD could simply replace those chips, as it sounds like the issue was already fixed on newer ryzens.
That's so cute of them!
What I'd expect is proof that it doesn't affect EPYC, not confirmation.

bug said:

Wth is "performance marginality"?
Most likely "fixing this will eat 5% of Ryzen performance, so no way".
#13
bencrutz
birdie said:
Windows is affected. This bug has already been reproduced under WSL.
source?


bug said:
Again, how can you tell Epyc or Threadripper (or Windows) are not affected if you don't know what the actual problem is? To me, that's a logical fracture.

Also, when they tell me about "performance marginality", it seems more like a big FU to users than AMD "already in the process of taking care of it". Wth is "performance marginality"?
quote: "AMD was also able to confirm this issue is not present with AMD Epyc or AMD ThreadRipper processors, but isolated to these early Ryzen processors under Linux. We will also now be receiving Threadripper and Epyc hardware for testing to confirm their Linux state."

from the same source

if TR & Epyc were also affected, why would AMD send hardware for them to test, eh?

AMD stated TR / Epyc are not affected. Do you have any source that claims otherwise? Or should we all just go along with your assumption? :banghead:
#14
bug
bencrutz said:
source?




quote: "AMD was also able to confirm this issue is not present with AMD Epyc or AMD ThreadRipper processors, but isolated to these early Ryzen processors under Linux. We will also now be receiving Threadripper and Epyc hardware for testing to confirm their Linux state."

from the same source

if TR & Epyc were also affected, why would AMD send hardware for them to test, eh?

AMD stated TR / Epyc are not affected. Do you have any source that claims otherwise? Or should we all just go along with your assumption? :banghead:
Logic says you can't announce you fixed something before you announce that you have identified it, but since that seems to fly right over your head, I'm not sure there's another way to explain it to you.
#15
R-T-B
TheGuruStud said:
You can make this happen on intel, too...
Not really. Please clarify what you mean.
#16
Solaris17
Creator Solaris Utility DVD
TheGuruStud said:
You can make this happen on intel, too...
What's wrong? Is it not shady AF when AMD does it, though?

TheGuruStud said:
I'm glad no one is surprised. Intel has ALWAYS been shady as F. Aside from their marketing/brainwashing, they've generally directed their evilness towards AMD. Now, consumers are fair game, it seems.
#17
efikkan
Raevenlord said:
An issue on AMD's Ryzen performance under certain Linux workloads…

AMD also confirmed this issue is not present in EPYC or Threadripper processors, but are isolated to early Ryzen samples under Linux (AMD's testing under Windows has found no such behavior.)
The issue actually has nothing to do with performance at all, nor with Linux.

The issue was reported over at AMD's forums 3 months ago, as well as in the BSD and Gentoo communities, and of course lately by Phoronix. Our friends over there have done some extensive debugging.

There are at least two distinct symptoms:
1. "Segfaults" - Under load, pointers may get corrupted, which results in undefined behavior. This is why compilation fails "randomly".
2. uOP cache errors

Some examples I've grabbed from the thread over at AMD:
Example Linux:
"mce: [Hardware Error]: Machine check events logged"
"mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 5: bea0000000000108"
"mce: [Hardware Error]: TSC 0 ADDR 1ffffa94be452 MISC d012000101000000 SYND 4d000000 IPID 500b000000000"
"mce: [Hardware Error]: PROCESSOR 2:800f11 TIME 1500732880 SOCKET 0 APIC 2 microcode 8001126"

Example BSD:
MCA: Bank 1, Status 0x90200000000b0151
MCA: Global Cap 0x0000000000000117, Status 0x0000000000000000
MCA: Vendor "AuthenticAMD", ID 0x800f11, APIC ID 14
MCA: CPU 14 COR ICACHE L1 IRD error
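
For anyone who wants to make sense of the raw status values in the two logs above, here is a minimal Python sketch that decodes the generic x86 machine-check status layout. It is not taken from the AMD thread or from any kernel source: the function name decode_status and the lookup tables are illustrative assumptions, and it only covers the architectural flag bits (63 down to 57) and the low 16-bit error code, not any AMD-specific fields.

```python
# Architectural flag bits in an x86 machine-check status register (MCi_STATUS).
STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error record
    62: "OVER",   # an earlier error record was overwritten
    61: "UC",     # error was NOT corrected by hardware
    60: "EN",     # reporting was enabled for this error
    59: "MISCV",  # the MISC register holds valid data
    58: "ADDRV",  # the ADDR register holds valid data
    57: "PCC",    # processor context may be corrupt
}

# Field names for "memory hierarchy" error codes, laid out as 0000 0001 RRRR TTLL.
TRANSACTION = ["ICACHE", "DCACHE", "GCACHE", "?"]
LEVEL = ["L0", "L1", "L2", "LG"]
REQUEST = ["ERR", "RD", "WR", "DRD", "DWR", "IRD", "PREFETCH", "EVICT", "SNOOP"]


def decode_status(status: int) -> dict:
    """Split a 64-bit status word into flags, error codes and a short summary."""
    flags = [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]
    code = status & 0xFFFF  # low 16 bits: generic MCA error code
    summary = None
    if (code >> 8) == 0x01:  # memory hierarchy (cache) error class
        rrrr = (code >> 4) & 0xF
        request = REQUEST[rrrr] if rrrr < len(REQUEST) else "?"
        summary = f"{TRANSACTION[(code >> 2) & 3]} {LEVEL[code & 3]} {request}"
    return {
        "flags": flags,
        "corrected": "UC" not in flags,
        "error_code": code,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "summary": summary,
    }


if __name__ == "__main__":
    # The two status words quoted in the Linux and BSD examples.
    for label, value in [("Linux", 0xBEA0000000000108), ("BSD", 0x90200000000B0151)]:
        print(label, decode_status(value))
```

Run against the two quoted status words, the sketch reports the BSD value as a corrected "ICACHE L1 IRD" event, which matches the decoded line in that log, and the Linux value as an uncorrected error with the PCC flag set.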


The problems have nothing to do with Linux. Linux is a kernel, but the problems are reproduced on Linux, BSD and Windows Subsystem for Linux (WSL) (which runs on the Windows kernel). Both gcc and llvm have been tested, and the problems have been reproduced during compilation of gcc, mesa, chromium, thunderbird, libreoffice, ffmpeg, the Linux kernel, the BSD kernel and more. Memory configurations and timings have been eliminated as a cause.

Corruption of pointers and instructions is a hardware defect. It remains unknown if the two distinct symptoms are caused by a single bug, or by two unrelated bugs. The bug is present in all Ryzen chips of the B1 stepping (even brand new ones). The bug(s) are not specific to compiler workloads; the symptoms occur under stress, which is most easily reproduced with heavy compilation tasks. The bug(s) will cause "random" application and system instability. Other tasks, such as Prime95 or Cinebench, do not run into these problems, since they stress different parts of the CPU. It seems like an internal synchronization issue in the prefetcher, resulting in undefined behavior when certain conditions apply.
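
The "reproduce it under stress" idea can be sketched as a small Python harness that runs the same job many times in parallel and counts how many runs die from a segmentation fault. This is only an illustration under stated assumptions, not any of the community test scripts: the names run_job and stress are made up here, and the command to run (for example a compile step) is left to the reader. On POSIX systems, Python's subprocess module reports a child killed by a signal as a negative return code, which is what the tally relies on.

```python
import signal
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_job(command: list[str]) -> int:
    """Run one job quietly and return its exit status (negative = killed by a signal)."""
    completed = subprocess.run(
        command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return completed.returncode


def stress(command: list[str], workers: int, rounds: int) -> dict:
    """Run `command` in `workers` parallel copies, `rounds` times, and tally outcomes."""
    tally = {"ok": 0, "segfault": 0, "other_failure": 0}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(rounds):
            for status in pool.map(run_job, [command] * workers):
                if status == 0:
                    tally["ok"] += 1
                elif status == -signal.SIGSEGV:
                    tally["segfault"] += 1  # the job itself was killed by SIGSEGV
                else:
                    tally["other_failure"] += 1
    return tally
```

A call such as stress(["make", "-C", "some/source/tree"], workers=16, rounds=100) would be one hypothetical way to use it. Note that wrapper tools may report a crashed child as an ordinary non-zero exit rather than a signal, so the "other_failure" count deserves a look as well.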

A proper solution would require a new stepping. Hopefully the existing Ryzen parts can eliminate the problems through a firmware update, which of course might cause a performance penalty. Users have tried disabling the uOP cache and/or SMT, etc., which seems to reduce the symptoms but not eliminate them. As long as this remains unsolved, people should postpone buying these chips for workstation/productive workloads. Note, there is so far no clear indication that this bug poses problems for games, so it's quite possible that the chips are "stable enough" for certain workloads.

While AMD have had reports of this since early May (and some indications in April), it's possible that they've thought of this as an obscure Linux bug and not prioritized this during the summer vacation. But the evidence is now mounting, some users have gotten several new chips through RMA and the problem is still there.
#18
xorbe
efikkan said:
... bug is present for all Ryzen chips of the B1 stepping (even brand new ones)...
Source? The thread you linked seems to contradict that: see post 638 (successful RMA) and 717 (not all Ryzen). My chip seems immune to both ryzen_segv and the parallel compile loop for 24 hours, with no freezes or cache errors. Is yours failing the compile stress test?
#19
TheGuruStud
Solaris17 said:
What's wrong? Is it not shady AF when AMD does it, though?
What are you reaching for, here? AMD took time to investigate it and have released a statement on the matter.

Maybe a V bump can fix it like the last bug fixed with microcode.
#20
bug
TheGuruStud said:
What are you reaching for, here? AMD took time to investigate it and have released a statement on the matter.

Maybe a V bump can fix it like the last bug fixed with microcode.
I think the difference is some won't take a statement from AMD's PR at face value.
#21
Solaris17
Creator Solaris Utility DVD
TheGuruStud said:
What are you reaching for, here? AMD took time to investigate it and have released a statement on the matter.

Maybe a V bump can fix it like the last bug fixed with microcode.
Just found it odd you don't seem as upset about this as with Intel. Intel released microcode fixes for this and was put on the rack by users who experienced problems and never told them. I'm curious how that's Intel's fault. They fixed their bug in April of their own accord, and 1 month later OCaml asked that the fix be patched into upstream Debian.

It just seems to me the community likes to watch fires and only call the fireman for some of them. Blaming their lack of response on the rest.

AMD is no saint. Lots of chip manufacturers fix errata silently. That's a big part of what BIOS updates are. But BIOS updates that contain fixes aren't headlined on CNN or tech forums around the globe.

Just wondering why it's so easy for you to hate the way a company fixed an issue they never got an official report on, and later fixed; only to be chastised for not paying "attention". But AMD on the other hand gets an official complaint, looks into it, and reports on it, and now they are saints. I'm confused. AMD has fixed errata of their own accord without publicity. Shouldn't they also be a target for your wrath?
#22
TheGuruStud
Solaris17 said:
Just found it odd you don't seem as upset about this as with Intel. Intel released microcode fixes for this and was put on the rack by users who experienced problems and never told them. I'm curious how that's Intel's fault. They fixed their bug in April of their own accord, and 1 month later OCaml asked that the fix be patched into upstream Debian.

It just seems to me the community likes to watch fires and only call the fireman for some of them. Blaming their lack of response on the rest.

AMD is no saint. Lots of chip manufacturers fix errata silently. That's a big part of what BIOS updates are. But BIOS updates that contain fixes aren't headlined on CNN or tech forums around the globe.

Just wondering why it's so easy for you to hate the way a company fixed an issue they never got an official report on, and later fixed; only to be chastised for not paying "attention". But AMD on the other hand gets an official complaint, looks into it, and reports on it, and now they are saints. I'm confused. AMD has fixed errata of their own accord without publicity. Shouldn't they also be a target for your wrath?
You're completely ignoring Intel's history of being complete c****. They're the bug kings and pretend it doesn't exist. Publications begging for their money will go ham on AMD for lesser bugs or ones that do not matter. Remember the infamous Phenom TLB bug? It literally affected zero people. It was remotely possible in a server environment, so AMD fixed it with an update and RMA'd the CPUs. It was so blown out of proportion, b/c it shouldn't even have been a thing in consumer land. Meanwhile, Intel has SATA ports fail on every one of those chipsets made, Atom CPUs degrading, and they get a couple days of mild stories about it and no one cares.

There's a double standard here alright and it's not from me or to AMD's benefit.
#23
Solaris17
Creator Solaris Utility DVD
TheGuruStud said:
You're completely ignoring Intel's history of being complete c****. They're the bug kings and pretend it doesn't exist.
huh? Intel is openly transparent.

https://www.google.com/search?q=Intel+errata&oq=Intel+errata&aqs=chrome..69i57j0l5.1823j0j7&sourceid=chrome&ie=UTF-8

They post their errata PDFs publicly.

https://www.intel.com/content/www/us/en/search.html?toplevelcategory=none&query=errata&keyword=errata&:cq_csrf_token=undefined
#24
TheGuruStud
Solaris17 said:
huh? Intel is openly transparent.

https://www.google.com/search?q=Intel+errata&oq=Intel+errata&aqs=chrome..69i57j0l5.1823j0j7&sourceid=chrome&ie=UTF-8

They post their errata PDFs publicly.

https://www.intel.com/content/www/us/en/search.html?toplevelcategory=none&query=errata&keyword=errata&:cq_csrf_token=undefined
Let me reword that...their fanboys and paid shills/beneficiaries pretend it doesn't exist.

Edited b/c I can't English properly, murrica.
#25
Solaris17
Creator Solaris Utility DVD
TheGuruStud said:
Let me reword that...their fanboys and paid shills/sponsors pretend it doesn't exist.
ah sorry maybe we got off on the wrong foot. Now I understand :toast:

I thought you were slipping, stud. With a reg date of '07, I just couldn't grasp that you might have lost your touch for facts like the new users have.