
My research into AMD's Linux "Performance Marginality" issue:

I've been doing some behind-the-scenes research into AMD's so-called Linux "Performance Marginality." When I began, I had big plans to write an independent test program to prove the crash can also happen in Windows. Unfortunately, I never quite got there, and it appears I may even have been off on my expected results. The crash is triggered by ASLR, and Windows doesn't use this, generally. JavaScript might, but find me any webpage that spawns a 16-thread JavaScript process that isn't mining coins malware-style and I'll be genuinely shocked.

What did come of this is a document where I detailed my results with the RMA. If nothing else, there is heavy evidence indicating there is not a new stepping, but actually just improved binning to mitigate the issue among those who complain. It's circumstantial evidence at this point, but given AMD has repeatedly declined to comment when asked how they fix this, I am very very suspicious at this point they aren't simply gluing threadripper grade dies to Ryzen CPUs on request, and standard Ryzen grade CPUs simply don't have a fully functional ASLR function under load (at least, at the binning level they chose).

I'm putting the document I typed up below, including evidence, in hopes you guys can do more research and maybe find enough to make this case a bit more than circumstantial. As it is, I'm out of time and energy to pursue this further, but it certainly seems suspect.

BEGIN PM (Originally sent to W1zzard and company, advised to share with community):

As a user of Gentoo Linux, I have been hit hard by the so-called Ryzen “Performance Marginality.” This manifests itself as an event in which several build jobs running concurrently will crash a random process on the system, usually (but not necessarily) one of the running build jobs. The problem is well documented, and AMD is offering RMAs to affected users. The thing is, that makes it sound like not everyone is affected. Truth be told, after a lot of online research, it is my opinion that anyone with a processor older than build week 25 is affected. Since anything newer than build week 20 has not made it into retail yet (at least, if user reports can be believed), this means nearly all Ryzen processors on the market at present time are affected by this issue.
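For anyone trying to reproduce this, here is a minimal Python sketch of a harness that runs several copies of a job at once and reports which ones died from a signal such as SIGSEGV. The names `describe_exit` and `run_parallel` are made up for illustration, and the workload is a harmless placeholder; a real test would substitute a heavy compile job. It only sees crashes in the jobs it launched itself, so a crash in some unrelated process would still have to be spotted in the system logs.

```python
import signal
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def describe_exit(returncode):
    """Return a short label for a subprocess return code.

    On POSIX, subprocess reports death-by-signal as a negative number,
    for example -11 when the child was killed by SIGSEGV.
    """
    if returncode == 0:
        return "ok"
    if returncode < 0:
        try:
            return "killed by " + signal.Signals(-returncode).name
        except ValueError:
            return "killed by signal %d" % -returncode
    return "exit code %d" % returncode


def run_parallel(command, jobs):
    """Run `jobs` copies of `command` concurrently and label each result."""
    def run_one(_):
        finished = subprocess.run(command, capture_output=True)
        return describe_exit(finished.returncode)

    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(run_one, range(jobs)))


if __name__ == "__main__":
    # Placeholder workload (an assumption): swap in a real build command.
    placeholder = [sys.executable, "-c", "pass"]
    for label in run_parallel(placeholder, jobs=16):
        print(label)
```

Looping this for hours with a real build command and counting the "killed by" lines would give a rough failure rate to compare between CPUs.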

This is a big deal, and not just on Linux. Why?

The issue vanishes in Linux for nearly all users when they turn off Kernel ASLR (Address Space Layout Randomization). This is a critical security feature that is not presently used much in Windows (and frankly, may never be) but is already being used inside web browsers in VMs such as JavaScript engines. I’d be very interested in how a loaded Ryzen VM performs with JavaScript long term, for example. I’m sure this issue can manifest itself elsewhere if ASLR is truly being corrupted under load.
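To check what a given Linux box is currently doing, here is a small Python sketch that reads the `kernel.randomize_va_space` sysctl. This assumes that sysctl is the setting people are toggling; it controls address randomization for user processes, whereas "Kernel ASLR" could also be read as KASLR, which is a separate boot-time option. The helper name `describe_aslr` is made up for illustration.

```python
from pathlib import Path

# Sysctl that controls address randomization for user processes on Linux.
ASLR_SYSCTL = Path("/proc/sys/kernel/randomize_va_space")

# Meanings as described in the kernel's sysctl documentation.
ASLR_MODES = {
    0: "disabled",
    1: "partial (stack, mmap base, vDSO)",
    2: "full (also randomizes the heap/brk)",
}


def describe_aslr(raw_value):
    """Translate the raw sysctl text into a human-readable mode."""
    value = int(raw_value.strip())
    return ASLR_MODES.get(value, "unknown mode %d" % value)


if __name__ == "__main__":
    if ASLR_SYSCTL.exists():
        print("ASLR:", describe_aslr(ASLR_SYSCTL.read_text()))
    else:
        print("No randomize_va_space sysctl here (not Linux?)")
```

Recording this value next to each crash report would make it easier to tell whether the "ASLR off fixes it" reports hold up.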

What else is newsworthy here? Well, the issue does not appear to be fixed. By that I mean, there is no new stepping. By all accounts, the most likely “fix” AMD is employing is to simply bin the processor better (that means picking better-performing dies from the silicon wafer). This also explains why Threadripper and EPYC are “unaffected.” They are ALREADY binned higher.

To test this theory, I submitted my processor for an RMA. All users are reportedly getting “fresh from the presses” Ryzens manufactured not too long ago. Personally, my theory is that they are being pulled straight from the assembly-line binning process and used for RMAs. The fact that my CPU took nearly 2 weeks to “prepare” but got to me almost overnight only supports this theory. Anyhow, my new CPU was made in Week 33. You can see it compared with my old Week 9 Ryzen below:

oldryzen296.jpg


newryzen831.jpg


Note, in the images above, the older CPU container has a plastic shield that is much more “shiny” for some reason. It obscures the laser markings a bit but they should still be legible. I think it is just a packaging difference.
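For comparing build weeks across many user reports, here is a rough Python sketch that pulls a year and week out of a heatspreader marking. The marking layout (a four-digit YYWW group followed by three letters, as in the made-up examples "UA 1709PGT" and "UA 1733SUS") is an assumption and not something AMD documents, the year 2017 is assumed, and the week 25 threshold is only the claim made in this post.

```python
import re

# Assumed layout: YYWW followed by a three-letter suffix, e.g. "1709PGT".
BATCH_CODE = re.compile(r"\b(\d{2})(\d{2})[A-Z]{3}\b")

# Threshold claimed in this post (build week 25, assumed to be in 2017).
FIXED_FROM = (2017, 25)


def parse_batch_code(marking):
    """Return (year, week) from a marking string, or None if none is found."""
    match = BATCH_CODE.search(marking.upper())
    if match is None:
        return None
    year = 2000 + int(match.group(1))
    week = int(match.group(2))
    if not 1 <= week <= 53:
        return None
    return year, week


def predates_claimed_fix(marking):
    """True if the chip was built before the claimed week 25 cutoff."""
    parsed = parse_batch_code(marking)
    if parsed is None:
        raise ValueError("no batch code found in %r" % marking)
    return parsed < FIXED_FROM
```

With those assumptions, a Week 9 marking counts as predating the cutoff and a Week 33 marking does not, which matches the two chips described here.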

The new CPU has been opened on the bottom (no sticker), as prior reports indicated. It was also shipped rather pathetically. Unfortunately, I forgot to photograph this fact in my excitement, but I can certify there was no bottom “security” sticker and online reports support this. Have a look at the poor packaging anyways for kicks:

packaging.jpg


The CPU, as predicted, is much higher binned or otherwise a “golden” chip. It does 4.1 GHz on all cores at 1.425 V, where my old Ryzen took 1.475 V to attain 4.0 GHz on all cores. It also lets the IMC fly up to 3600 MHz where before, 3200 MHz was a struggle. Here are some relevant comparison shots.

A basic overview of my old Ryzen. Lacking memory/voltage tabs, but this is all I could ever push out of it, and my “daily driver” clocks were lower. IMC was at 3200 MHz with 4 single-rank Samsung B-die DIMMs. Clock was 4 GHz at 1.475 V.

cpuz.png



My new Ryzen. Clocks higher, with less volts. Obviously better binned or otherwise golden. IMC goes outrageously high at 3600 MHz. Same memory/DIMMs as above.

1.png
2.png
3.png


Oh, and yes, the issue is fixed.

What does this all mean?

I think AMD is binning run of the mill Ryzen CPUs so low that ASLR is effectively broken as soon as things get "hot" under load. I don't have direct confirmation of this yet, but a lot of circumstantial evidence, mostly found via myself and this thread here:

https://community.amd.com/thread/215773

It's a long read, but the evidence is there, if you look. I'd recommend the later/within last 2 month posts as they cover the RMA process and reports of binning/testing going on prior to chip arrival.
 
After editing/typing all that, please let me remind you I'd like to keep this thread an information/research thread, no fanboyism allowed.

I'd like to start the discussion by asking if anyone knows a good JavaScript "stress test" of sorts one could run alongside, say, Prime95. If my theory is right, it should eventually crash, or something equally strange will happen.

Right now I have JetStream 1.1 but I have no idea how to loop it long term.

http://browserbench.org/JetStream/
 
Thanks for the information @R-T-B, it was an interesting read, and I'm actually considering RMA'ing the Ryzen CPU in my brother's rig as a result.
 
Thanks for the information @R-T-B, it was an interesting read, and I'm actually considering RMA'ing the Ryzen CPU in my brother's rig as a result.

If you do, be aware they make you go through a little song and dance routine of making sure your voltage/cooling settings are adequate and have you test a fairly crazy set of voltages. I personally (being this was before I resigned) just got fed up with it, posted my settings and voltages and flashed my press credentials, which got the process escalated immediately and had them overnight me a CPU (lulz). I'm told it "normally" takes a good few months, sadly.

EDIT:

Example:

Thank you for submitting your RMA. I’m sorry to hear that you’re experiencing stability issues with your system. Please be assured that I am here to help find a resolution to your problem


Before approving your RMA, I would like to firstly perform some troubleshooting and focus on your system’s hardware configuration.


Please provide the details of the following hardware components in your system:

• Make and model of motherboard?

• Motherboard BIOS version?

• Make and model of RAM?

• Make and model of the power supply unit?


Please could you let me know the current settings you have for the CPU VCORE, SOC, and RAM? It would be very helpful if you could provide with pictures of your BIOS screens with these settings.


In addition, through troubleshooting with other customers we have found that the layout of the components inside the system case have caused sub-optimal cooling of the CPU causing a variety of issues.


I would like to better understand your system cooling to rule out any thermal issues. Please could you provide a picture of the whole interior of your system showing the CPU cooler?


Also, could you let me know the reported CPU temperature during heavy load or when the errors occur?


Thanks for contacting AMD
 
If you do, be aware they make you go through a little song and dance routine of making sure your voltage/cooling settings are adequate and have you test a fairly crazy set of voltages. I personally (being this was before I resigned) just got fed up with it, posted my settings and voltages and flashed my press credentials, which got the process escalated immediately and had them overnight me a CPU (lulz). I'm told it "normally" takes a good few months, sadly.

EDIT:

Example:
Thanks for the heads up on that; it is a massive song and dance routine to go through.
 
Thanks for the heads up on that; it is a massive song and dance routine to go through.

What finally got me was when the rep asked if I "had a cooler attached." o_O

I was like... you mean which cooler? No, just like do you, at all? I was like, no, not doing this anymore... summon supervisor! :laugh:
 
A few users out of thousands experience this, and suddenly everyone has a problem, even when they experience none.
 
A few users out of thousands experience this, and suddenly everyone has a problem, even when they experience none.

If you'd read this well-researched thread, this is basically due to the lack of usage of ASLR outside of Linux. It's similar to how no one "experienced" the old Prime95 AVX bug, despite everyone having it, without (wait for it...) running Prime95.

This isn't a fanboy thread and I'd like to keep it free of that, thanks. The current best outcome would be to develop a Windows tool to prove you are affected, and I have come seeking help for that.
 
A few users out of thousands experience this, and suddenly everyone has a problem, even when they experience none.
Well, that’s his point: you “can” create the problem, and easily, in Linux, just not as easily in Windows. It might not be an issue today, but next year, who knows, some ASLR functionality appears in Windows and you only then realize you’re on a bad CPU.
 
Well, that’s his point: you “can” create the problem, and easily, in Linux, just not as easily in Windows. It might not be an issue today, but next year, who knows, some ASLR functionality appears in Windows and you only then realize you’re on a bad CPU.

Pretty much.

I'm also slightly alarmed that their "fix" seems to be simply to throw better-binned silicon at people who complain, and not globally change the binning process. Unless maybe they have? I don't know; week 25+ CPUs have not hit the market yet.
 
wait, how did you conclude it's a 'heat' issue or that different bins should result in different failure rates/times? i dont remember heat being mentioned on phoronix & its user comments

if week 25+, not to mention threadripper/epyc are 'permanently fixed', doesnt that mean it's more to do with physical microscopic manufacturing defects?

for some reason i never thought of this aspect of virtualization, is ASLR of a client actually randomized on the non-ASLR host's memory (at least within the preallocated chunk of the VM process)?

i want to know more about the ram limits, we really need to confirm if different cpus result in different memory support even after all the agesa updates

guess it's a good thing i've still been waiting & waiting due to the ram+nand+gpu price inflations before building...
 
If you'd read this well-researched thread, this is basically due to the lack of usage of ASLR outside of Linux. It's similar to how no one "experienced" the old Prime95 AVX bug, despite everyone having it, without (wait for it...) running Prime95.

This isn't a fanboy thread and I'd like to keep it free of that, thanks. The current best outcome would be to develop a Windows tool to prove you are affected, and I have come seeking help for that.
not sure to whom you're replying, but I'd say with your tone, there's a reason for that... *hint hint*
 
wait, how did you conclude it's a 'heat' issue or that different bins should result in different failure rates/times? i dont remember heat being mentioned on phoronix & its user comments

I don't know it's heat for certain (actually, I more suspect it's load related since I wrote that). Frankly, all we really 100% know is that for some reason the RMA'd chips are binned better. Why is anyone's guess, but I would assume it's because of poor binning causing the issue if we're going to conjecture.

not sure to whom you're replying, but I'd say with your tone, there's a reason for that... *hint hint*

I was replying to the quoted party.

Reason for what? Your comment is confusing. I'm not attempting any sort of tone, though maybe the old PM I copied and pasted to support these claims has one, I really didn't check... my bad there. I'm all about sorting out what makes this issue tick and how AMD is handling it, nothing more.

For the record, AMD support deserves a gold star for how they treated me, though telling them I was a press member probably helped with that...
 
ASLR is effectively broken as soon as things get "hot" under load
I wonder if running more volts through the IMC would result in ASLR becoming more stable. It's entirely possible that ASLR is doing something in a particular way where the CPU becomes unstable, which doesn't sound too different from another Linux issue with the OCaml compiler where certain conditions could make the machine unstable. A lot like AVX, there are a number of things happening within a given CPU cycle, and transistors that are more leaky are going to have more trouble switching at such high frequencies. If you're right and they're giving out better-binned CPUs to get around it, it's entirely possible that a little more voltage in the right place might have the same effect, but at the cost of more heat.
 
I wonder if running more volts through the IMC would result in ASLR becoming more stable. It's entirely possible that ASLR is doing something in a particular way where the CPU becomes unstable, which doesn't sound too different from another Linux issue with the OCaml compiler where certain conditions could make the machine unstable. A lot like AVX, there are a number of things happening within a given CPU cycle, and transistors that are more leaky are going to have more trouble switching at such high frequencies. If you're right and they're giving out better-binned CPUs to get around it, it's entirely possible that a little more voltage in the right place might have the same effect, but at the cost of more heat.


Pre-RMA, I nearly fixed the issue by upping SOC voltage to 1.2 V (later it came back with a vengeance though), so you might be onto something.
 
Pre-RMA, I nearly fixed the issue by upping SOC voltage to 1.2 V (later it came back with a vengeance though), so you might be onto something.
it's not adding up, how can week25 or ALL threadrippers/epycs not have the issue? binning isnt exact, there are still variances, how would some small difference in target voltage or temperature or stable clock result in a very specific calculation error being permanently fixed?

the only logical way to test the bin hypothesis is by (running the errata scripts people made while) underclocking/overvolting/watercooling/timing loosening old ryzen cpus & overclocking/undervolting/overheating/timing tightening new ryzen cpus
 
it's not adding up, how can week25 or ALL threadrippers/epycs not have the issue?

Threadripper/EPYC have always been top 5% binned.

My current theory is that the reason all the rma'd cpus are "hot off the presses" is that they are essentially made to order with higher binned dies. Of course I could be wrong, but my build number was very close to when my RMA was approved.

We'll only really know when week 25+ cpus make it to market. It will be interesting to see if all of them are higher binned as well. All I know is RMA requests, for whatever reason, seem to be higher binned. It could be that AMD is just doing that for "added insurance" against a re-rma.

Oddly, however, contrary to my hypothesis, I can't seem to make my new Ryzen segfault by lowering SOC voltage very low (I tried 0.8 V). I may be completely off on this after all. I will fully admit a lot of this is my "best guess" for what is going on.
 
The thing with bins is not only with desktop parts.

Mobile does it and always did. You can buy two phones of the same model, but the difference between the worst and best voltage bin is HUGE, heat and battery life wise. The community often makes graphs of their samples, pretty much logical-looking charts. Also the cheating with NAND speeds etc... like screens with useless gorilla his a** or not... there are batches...

It is a lottery IMHO.
 
Threadripper/EPYC have always been top 5% binned.

My current theory is that the reason all the rma'd cpus are "hot off the presses" is that they are essentially made to order with higher binned dies. Of course I could be wrong, but my build number was very close to when my RMA was approved.

We'll only really know when week 25+ cpus make it to market. It will be interesting to see if all of them are higher binned as well. All I know is RMA requests, for whatever reason, seem to be higher binned. It could be that AMD is just doing that for "added insurance" against a re-rma.

Oddly, however, contrary to my hypothesis, I can't seem to make my new Ryzen segfault by lowering SOC voltage very low (I tried 0.8 V). I may be completely off on this after all. I will fully admit a lot of this is my "best guess" for what is going on.
how are they going to give old stock during rma? the old stock has been shipped to stores, there is no reason for them to keep some for rma since they are constantly manufacturing new ones, take some new ones as needed to fill rmas

i thought some week25 did hit the market, but dont remember

if TR/E is 5%, that's no guarantee, amd would have to be sure that something like top 30% are fine, but this goes against the official statement that week25+ is fine (unless they make a more convoluted binning process, but TR/E got released... around week25 didnt they, what's the oldest known week for one?)

was this issue confirmed on the fewer core models or only the 8cores?
 
Was there ever an official statement from AMD that week 25+ are OK? I was under the impression that was just a Phoronix claim/guesstimate.
 
I am very very suspicious at this point they aren't simply gluing threadripper grade dies to Ryzen CPUs on request
Please explain what you mean by this. Was that a tongue in cheek comment? Or do you really mean they delidded and replaced the lid on a different processor die?

I only ask because I wonder if one of those Frankenstein processors escaped AMD and somehow got released into the retail distribution channel? That might explain why a poster I was helping on another site received a "brand new" :rolleyes: ??? Ryzen 1600 from Overclockers in the UK where the lid clearly had been removed and replaced as a "blue substance" (I am assuming TIM) was oozing out from all around the edges of the lid. The box was sealed with an ESD precaution label. Customer Support at Overclockers seemed shocked and puzzled and even paid for return shipping, suggesting ("guessing") it was a "warehouse/packing error" at AMD because it should have really been brand new.

Still waiting on the OP to see what the replacement processor looks like but it appears, at least, that Overclockers is stepping up and taking care of their customer. :)
 
Please explain what you mean by this. Was that a tongue in cheek comment? Or do you really mean they delidded and replaced the lid on a different processor die?

Well, I mean I don't actually mean/think they are delidding and replacing dies. I think they simply build these RMA'd cpus to order with better binned parts. But I could be wrong. The whole thing is an information vacuum which is half the issue.

I did ask AMD directly what was going on, but the previously quite talkative person helping me with my RMA went silent on that. (Not unexpected, mind you; he's probably not qualified to comment there.)

As for the rest of your comment, it sounds very much like what I got. Have him check his heatspreader label. I bet it's a week 25 or newer CPU. That would be an RMA-return at this point. They do look otherwise new, so maybe it went something like this:

Overclockers.co.uk gets returned CPU, RMA's. -> Gets replacement CPU, looks new, puts on shelf -> Customer gets replacement cpu, notices missing sticker and thermal paste, complains -> Overclockers support is clueless, as they don't handle RMAs.

EDIT: Scratch all that. You mean the lid had actually been removed? Like the processor heatspreader? If so, no, that's not at all what mine was like.
 
EDIT: Scratch all that. You mean the lid had actually been removed? Like the processor heatspreader?
That's exactly what I mean. It appeared the lid was removed and an excessive amount of TIM was applied that then squished out when the lid was replaced. And the retail box was still sealed, so it does appear Overclockers did not do anything funny here, as they too thought they were selling a "new" CPU.

Note this was (or was supposed to be) a new retail boxed CPU. Not an OEM. So I guess this was something totally different from your scenarios. Sorry for the OT sidetrack.
 
That's exactly what I mean. It appeared the lid was removed and an excessive amount of TIM was applied that then squished out when the lid was replaced. And the retail box was still sealed, so it does appear Overclockers did not do anything funny here, as they too thought they were selling a "new" CPU.

Note this was (or was supposed to be) a new retail boxed CPU. Not an OEM. So I guess this was something totally different from your scenarios. Sorry for the OT sidetrack.

No apology necessary. Makes me wonder what went on there, but you're correct, it's likely unrelated.
 
I do hope we will not see a large proportion of Ryzen owners RMA their stuff for a higher binned processor.
 