Tuesday, August 8th 2017

AMD Confirms Ryzen Marginality Performance Issue Under Linux, TR and EPYC Clear

AMD has confirmed an issue affecting Ryzen processors under certain Linux workloads: very heavy, continuous loads (parallel compilation in particular) can cause segmentation faults. Tests such as the Phoronix Test Suite's stress run quickly bring Ryzen processors to their knees with multiple segmentation faults. While the problem is easy to trigger under very heavy workloads, it is virtually absent under normal Linux desktop workloads and benchmarking.

AMD also confirmed that this issue is not present in EPYC or Threadripper processors, but is isolated to early Ryzen samples under Linux (AMD's testing under Windows has found no such behavior). AMD's analysis has also found that these Ryzen segmentation faults aren't isolated to a particular motherboard vendor, but are problems with the processors themselves. AMD encourages Ryzen customers who believe they are affected by the problem to contact AMD Customer Care. Some of those who have contacted Customer Care about the segmentation faults turned out to be affected by thermal, power, or other problems, but AMD says it is committed to working with those encountering this performance marginality issue under Linux. AMD will also be stepping up its Linux testing/QA for future consumer products.
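
For readers who want to see how such faults show up in practice, here is a minimal Python sketch of a parallel stress loop that counts jobs killed by SIGSEGV. It is not AMD's or Phoronix's test procedure; the function names (`died_from_segfault`, `run_once`, `count_segfaults`) are hypothetical, and the command to run is an assumption you would replace with a real compile step.

```python
from __future__ import annotations

import signal
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def died_from_segfault(returncode: int) -> bool:
    """On POSIX, subprocess reports death by signal N as return code -N."""
    return returncode == -signal.SIGSEGV


def run_once(cmd: list[str]) -> int:
    """Run one job, discard its output, and return its exit status."""
    result = subprocess.run(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return result.returncode


def count_segfaults(cmd: list[str], jobs: int, rounds: int) -> int:
    """Run `cmd` `jobs` times in parallel, `rounds` times over,
    and count how many runs were killed by SIGSEGV."""
    segfaults = 0
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        for _ in range(rounds):
            for code in pool.map(run_once, [cmd] * jobs):
                if died_from_segfault(code):
                    segfaults += 1
    return segfaults
```

A call such as `count_segfaults(["gcc", "-c", "big_file.c"], jobs=16, rounds=100)` (with a source file of your own) would approximate a heavy parallel compile load. A non-zero result only shows that some process crashed; it does not by itself prove a hardware fault.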
Sources: Phoronix, AMD Confirms Ryzen Issue - Phoronix

45 Comments on AMD Confirms Ryzen Marginality Performance Issue Under Linux, TR and EPYC Clear

#26
TheGuruStud
Solaris17: ah sorry maybe we got off on the wrong foot. Now I understand :toast:

I thought you were slipping stud reg date of 07 I just couldnt grasp that you may have lost your touch for facts like the new users have.
Yeah, sorry, I get testy, b/c I'm old enough to remember all of intel's lies and deceptions. My favorite is probably when they owned that benchmark tool back in the athlon 64 days...my brain is shot. Was it PCmark? They hid it, but very poorly. They got the incredibly crooked results published in college textbooks. It took me 5 mins to figure it out. The address of the benchmark company was the same as Intel's...busted.
Posted on Reply
#27
bug
TheGuruStud: You're completely ignoring Intel's history of being complete c****. They're the bug kings and pretend it doesn't exist. Publications begging for their money will go ham on AMD for lesser bugs or ones that do not matter. Remember the infamous phenom TLB bug? It literally affected zero people. It was remotely possible in a server environment, so AMD fixed it with an update and RMA'd the CPUs. It was so blown out of proportion, b/c it shouldn't even have been a thing in consumer land. Meanwhile, Intel has sata ports fail on every one of those chipsets made, atom CPUs degrading and they get a couple days of mild stories about it and no one cares.

There's a double standard here alright and it's not from me or to AMD's benefit.
So you admit you hammer Intel more, because Intel is evil...
TheGuruStud: Let me reword that...their fanboys and paid shills/beneficiaries pretend it doesn't exist.

Edited b/c I can't English properly, murrica.
... and then proceed to call others fanboys and paid shills.
Posted on Reply
#28
trparky
I would say that Intel deserves the hate that they receive, just like how Microsoft, Dell, HP, Symantec, and a whole host of other companies deserve the hate that they get. They are world class companies, they should be handling things far better than they do. Instead they try to pass the buck/blame, pretend the issue doesn't exist, or other forms of trying to sweep the dirt under the rug.
Posted on Reply
#29
trparky
I don't care who the company is, you bite the damn bullet and admit you did wrong. I am far more likely to trust a company that admits that they screwed up than a company that was found out years later.

There's a reason why I don't shop at Home Depot, they tried to cover up the credit card hack. Only after it was exposed that they came out and said "We're sorry". Well "sorry" ain't good enough!
Posted on Reply
#30
Scrizz
trparky: I don't care who the company is, you bite the damn bullet and admit you did wrong. I am far more likely to trust a company that admits that they screwed up than a company that was found out years later.

There's a reason why I don't shop at Home Depot, they tried to cover up the credit card hack. Only after it was exposed that they came out and said "We're sorry". Well "sorry" ain't good enough!
This. I don't know why people think certain companies "care" about their customers.
It's a business. They are in it for the money regardless of who it is or what they say.
Posted on Reply
#31
trparky
I understand, in the end it's all about business. What these companies need to understand is that when bad things happen the way they handle it can and will make people think differently. Fool me once, shame on me. Fool me twice... well, go f*** yourself.
Posted on Reply
#32
justimber
birdie: Windows is affected. This bug has already been reproduced under WSL.
wait....it's still different...my basic knowledge of Linux tells me that WSL is just Linux running under Windows (10 to be specific. please correct me if I'm wrong)..so it's still inconclusive to say that Windows is affected.
Posted on Reply
#33
bug
trparky: I understand, in the end it's all about business. What these companies need to understand is that when bad things happen the way they handle it can and will make people think differently. Fool me once, shame on me. Fool me twice... well, go f*** yourself.
And when AMD sends their PR to say the problem is no more, but doesn't provide a technical explanation of the issue (or even imply that they know what it is), what do you say about that? Is that the right way to handle this?
Posted on Reply
#34
trparky
bug: And when AMD sends their PR to say the problem is no more, but doesn't provide a technical explanation of the issue (or even imply that they know what it is), what do you say about that? Is that the right way to handle this?
I tend to not trust what the public relations departments of companies say, after all, PR departments are known for spinning things in their favor. In the case of AMD, much like Intel, I would like to see a somewhat technical write-up on what the issue was and how it was corrected.

As for why this issue occurred on Linux and not on Windows, it could be that Linux (being that Linux tends to be more on the enthusiast front) was using some kind of instruction set in a weird way whereas Windows tends to be more conservative in terms of using newer processor instruction sets since Microsoft wants to make sure that Windows runs on just about anything including some old-ass Pentium 4 machine.
Posted on Reply
#35
R-T-B
It appears more to be an ASLR bug from some reading. Linux uses this, Windows not so much.
Posted on Reply
#36
bug
trparky: I tend to not trust what the public relations departments of companies say, after all, PR departments are known for spinning things in their favor. In the case of AMD, much like Intel, I would like to see a somewhat technical write-up on what the issue was and how it was corrected.
Agreed, that's what I'm waiting for as well. With an explanation about what "performance marginality" is to go with it :D
Posted on Reply
#37
efikkan
justimber: wait....it's still different...my basic knowledge of Linux tells me that WSL is just Linux running under Windows (10 to be specific. please correct me if I'm wrong)..so it's still inconclusive to say that Windows is affected.
Linux is a kernel. WSL implements a Linux-compatible userland which makes Linux applications able to run on the Windows kernel. There is no Linux code in Windows.
R-T-B: It appears more to be an ASLR bug from some reading. Linux uses this, Windows not so much.
ASLR is a software implementation in the kernel. The problems have been reproduced with ASLR disabled, but the amount of occurrences might be slightly reduced, since ASLR increases the stress on the prefetcher.

The errors in the uOP cache are clearly a corruption happening inside the CPU core; micro operations are generated in the front-end/prefetcher, and since the hardware detects these there, it's clearly a hardware bug.
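
For anyone wanting to check the ASLR state on their own Linux system before trying to reproduce the problem, a minimal Python sketch follows. It assumes the standard `/proc/sys/kernel/randomize_va_space` sysctl file; the helper names `describe_aslr` and `read_aslr_setting` are hypothetical, and nothing here reflects how the reports above were actually gathered.

```python
from pathlib import Path

# Meanings of the kernel's randomize_va_space sysctl values.
ASLR_MODES = {
    0: "disabled",
    1: "partial randomization (stack, mmap base, vDSO)",
    2: "full randomization (also heap/brk)",
}


def describe_aslr(raw: str) -> str:
    """Translate the raw sysctl text (e.g. '2\\n') into a description."""
    value = int(raw.strip())
    return ASLR_MODES.get(value, f"unknown mode {value}")


def read_aslr_setting(path: str = "/proc/sys/kernel/randomize_va_space") -> str:
    """Read the current setting from procfs (Linux only)."""
    return describe_aslr(Path(path).read_text())
```

Calling `read_aslr_setting()` on a Linux machine reports the current mode; writing `0` to the same file as root turns ASLR off until the next reboot.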
Posted on Reply
#38
R-T-B
efikkan: Linux is a kernel. WSL implements a Linux-compatible userland which makes Linux applications able to run on the Windows kernel. There is no Linux code in Windows.


ASLR is a software implementation in the kernel. The problems have been reproduced with ASLR disabled, but the amount of occurrences might be slightly reduced, since ASLR increases the stress on the prefetcher.

The errors in the uOP cache are clearly a corruption happening inside the CPU core; micro operations are generated in the front-end/prefetcher, and since the hardware detects these there, it's clearly a hardware bug.
This bug seems to be a silicon quality issue honestly.

I was terribly plagued by this and found that a final 1.2v SOC completely eliminated the bug in all forms and test suites for me.

Weird.
Posted on Reply
#39
trparky
If it's a silicon quality issue then it kind of does make sense that both Threadripper and Epyc wouldn't have this issue, they both are made out of the higher quality silicon while us mere peasants buying the Ryzen CPUs would be stuck with the... less than desirable stuff.
Posted on Reply
#40
notb
R-T-B: This bug seems to be a silicon quality issue honestly.

I was terribly plagued by this and found that a final 1.2v SOC completely eliminated the bug in all forms and test suites for me.

Weird.
You're manipulating frequency or just the voltage?

And define "completely eliminated". This is not how electronics work. This problem has already manifested, so it is there - that's the only sure thing.
So you can't say that a problem has been eliminated just with stability tests. Lowering voltage might only make this less probable.

Now we need an explanation and a proof that it won't happen...
Posted on Reply
#41
R-T-B
notb: And define "completely eliminated". This is not how electronics work. This problem has already manifested, so it is there - that's the only sure thing.
So you can't say that a problem has been eliminated just with stability tests. Lowering voltage might only make this less probable.
Just voltage. And yes, I can. I think the SOC voltage out the gate is set too low for the quality of silicon they have (note that 1.2v is significantly higher than stock SOC voltage). This isn't really something new and novel. Yes the problem is still there but the problem is effectively eliminated for my practical purposes.
Posted on Reply
#42
efikkan
R-T-B: This bug seems to be a silicon quality issue honestly.

I was terribly plagued by this and found that a final 1.2v SOC completely eliminated the bug in all forms and test suites for me.

Weird.
As many have reported, the first thing AMD's support tells them is to increase the voltage. Many of them have increased it way beyond what you have, and the problem is still not completely gone. You will reach dangerous voltages before you can get high enough. Increasing the voltage also significantly impacts the lifespan of the chip, which means that any stability issues (including this one) will be more likely to occur over time.

All chips seem to have the potential, but silicon quality seems to play a factor in how likely it is to occur. As you know, bumping the voltage does lower the rise/fall time of the transistors, but it's still not enough to guarantee synchronicity, and would not eliminate all disturbances towards the end of a cycle. A proper fix would require a realignment of the circuits in this region of the CPU.
Posted on Reply
#43
R-T-B
efikkan: As many have reported, the first thing AMD's support tells them is to increase the voltage. Many of them have increased it way beyond what you have, and the problem is still not completely gone. You will reach dangerous voltages before you can get high enough. Increasing the voltage also significantly impacts the lifespan of the chip, which means that any stability issues (including this one) will be more likely to occur over time.

All chips seem to have the potential, but silicon quality seems to play a factor in how likely it is to occur. As you know, bumping the voltage does lower the rise/fall time of the transistors, but it's still not enough to guarantee synchronicity, and would not eliminate all disturbances towards the end of a cycle. A proper fix would require a realignment of the circuits in this region of the CPU.
There's a difference between the SOC (basically uncore) voltage and core voltage though. Are they telling them to increase SOC voltage at all? I'm unsure if I discovered something new or not. Everything I read tells that AMD support tells them to lower SOC voltage to stock, I'm doing the opposite.
Posted on Reply
#44
justimber
efikkan: Linux is a kernel. WSL implements a Linux-compatible userland which makes Linux applications able to run on the Windows kernel. There is no Linux code in Windows.


ASLR is a software implementation in the kernel. The problems have been reproduced with ASLR disabled, but the amount of occurrences might be slightly reduced, since ASLR increases the stress on the prefetcher.

The errors in the uOP cache are clearly a corruption happening inside the CPU core; micro operations are generated in the front-end/prefetcher, and since the hardware detects these there, it's clearly a hardware bug.
thanks for clarifying.
Posted on Reply
#45
bencrutz
bug: Logic says you can't announce you fixed something before you announce that you have identified, but since that seems to fly right over your head, I'm not sure there's another way to explain it to you.
it is fixed

threadripper is not affected
Posted on Reply