My research into AMD's Linux "Performance Marginality" issue:

Discussion in 'General Hardware' started by R-T-B, Sep 21, 2017.

  1. R-T-B

    R-T-B

    Joined:
    Aug 20, 2007
    Messages:
    7,107 (1.91/day)
    Thanks Received:
    6,397
    I've been doing some behind-the-scenes research into AMD's so-called Linux "Performance Marginality." When I first began researching this, I had big plans to write an independent test program to prove the crash can also happen in Windows. Unfortunately, I never quite got there, and it appears I may even have been off on my expected results. The crash is triggered by ASLR, and Windows doesn't use this, generally. JavaScript might, but find me any webpage that spawns a 16-thread JavaScript process that isn't mining coins malware-style and I'll be genuinely shocked.

    What did come of this is a document where I detailed my results with the RMA. If nothing else, there is heavy evidence indicating there is no new stepping, just improved binning to mitigate the issue for those who complain. It's circumstantial evidence at this point, but given AMD has repeatedly declined to comment when asked how they fix this, I strongly suspect they are simply gluing Threadripper-grade dies to Ryzen CPUs on request, and that standard Ryzen-grade CPUs simply don't have a fully functional ASLR function under load (at least, at the binning level they chose).

    I'm putting the document I typed up below, including evidence, in hopes you guys can do more research and maybe find enough to make this case a bit more than circumstantial. As it is, I'm out of time and energy to pursue this further, but it certainly seems suspect.

    BEGIN PM (Originally sent to W1zzard and company, advised to share with community):

    As a user of Gentoo Linux, I have been hit hard by the so-called Ryzen “Performance Marginality.” This manifests itself as an event in which several build jobs running concurrently will crash a random process on the system, usually (but not necessarily) one of the running build jobs. The problem is well documented, and AMD is offering RMAs to affected users. The thing is, that makes it sound like not everyone is affected. Truth be told, after a lot of online research, it is my opinion that anyone with a processor older than build week 25 is affected. Since anything newer than build week 20 has not made it into retail yet (at least, if user reports can be believed), this means nearly all Ryzen processors on the market at present time are affected by this issue.
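
    For anyone trying to reproduce this, the failure pattern described above (many jobs running at once, one of them dying at random) can be counted with a small harness. This is only a minimal Python sketch, not the script used by anyone in the AMD thread: `count_failures`, its parameters and the tally keys are made-up names, and the command to run is whatever build job you supply. It relies on the standard-library behaviour that `subprocess` reports a negative return code when a child is killed by a signal (a segfault shows up as -11 on Linux).

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_once(cmd):
    """Run one job; return its exit status (negative = killed by a signal on POSIX)."""
    done = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return done.returncode


def count_failures(cmd, jobs=16, rounds=10):
    """Run `jobs` copies of `cmd` concurrently, `rounds` times, and tally the exits."""
    tally = {"ok": 0, "nonzero_exit": 0, "killed_by_signal": 0}
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        for _ in range(rounds):
            for status in pool.map(run_once, [cmd] * jobs):
                if status == 0:
                    tally["ok"] += 1
                elif status < 0:
                    tally["killed_by_signal"] += 1
                else:
                    tally["nonzero_exit"] += 1
    return tally


if __name__ == "__main__" and len(sys.argv) > 1:
    # Everything after the script name is treated as the command to repeat.
    print(count_failures(sys.argv[1:]))
```

    Saved under a hypothetical name such as `harness.py`, it would be run as `python harness.py <your build command>`. A `killed_by_signal` count above zero on a job that normally succeeds is the symptom of interest; a plain `nonzero_exit` may just be an ordinary build error.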

    This is a big deal, and not just on Linux. Why?

    The issue vanishes in Linux for nearly all users when they turn off Kernel ASLR (Address Space Layout Randomization). This is a critical security feature that is not presently used much in Windows (and frankly, may never be) but is already being used inside web browsers in VMs like JavaScript and similar. I’d be very interested in how a loaded Ryzen VM performs with JavaScript long term, for example. I’m sure this issue can manifest itself elsewhere if ASLR is truly being corrupted under load.
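
    When comparing notes with other affected users it helps to state exactly which ASLR mode a machine was in. The Python sketch below assumes the setting in question is the user-space randomization exposed through the Linux sysctl file `/proc/sys/kernel/randomize_va_space`; the text above does not say which knob was used, so treat that as an assumption. The mode descriptions follow the kernel's sysctl documentation, while `describe_aslr` and `current_aslr` are illustrative names.

```python
from pathlib import Path

# Linux sysctl file that controls user-space address randomization.
ASLR_SYSCTL = Path("/proc/sys/kernel/randomize_va_space")

# Meanings as given in the Linux kernel sysctl documentation.
ASLR_MODES = {
    0: "disabled",
    1: "stack, mmap base and VDSO randomized",
    2: "as 1, plus heap (brk) randomized",
}


def describe_aslr(raw):
    """Translate the raw sysctl text (for example '2\\n') into a description."""
    value = int(raw.strip())
    return ASLR_MODES.get(value, f"unknown mode {value}")


def current_aslr(path=ASLR_SYSCTL):
    """Read the live setting; returns None where the file is absent (non-Linux)."""
    try:
        return describe_aslr(path.read_text())
    except FileNotFoundError:
        return None
```

    Calling `current_aslr()` before and after a test run and logging the result alongside any crash count makes it clear whether randomization was actually on during that run.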

    What else is newsworthy here? Well, the issue does not appear to be fixed. By that I mean, there is no new stepping. By all accounts, the most likely “fix” AMD is employing is to simply bin the processor better (that means picking better-performing dies off the wafer). This also explains why Threadripper and EPYC are “unaffected.” They are ALREADY binned higher.

    To test this theory, I submitted my processor for an RMA. All users are reportedly getting “fresh from the presses” Ryzens manufactured not too long ago. Personally, my theory is that they are being pulled straight from the assembly line binning process and used for RMAs. The fact that my CPU took nearly 2 weeks to “prepare” but got to me almost overnight only supports this theory. Anyhow, my new CPU was made in Week 33. You can see it compared with my old Week 9 Ryzen below:

    [​IMG]

    [​IMG]

    Note, in the images above, the older CPU container has a plastic shield that is much more “shiny” for some reason. It obscures the laser markings a bit but they should still be legible. I think it is just a packaging difference.

    The new CPU has been opened on the bottom (no sticker), as prior reports indicated. It was also shipped rather pathetically. Unfortunately, I forgot to photograph this fact in my excitement, but I can certify there was no bottom “security” sticker and online reports support this. Have a look at the poor packaging anyways for kicks:

    [​IMG]

    The CPU, as predicted, is much higher binned or otherwise a “golden” chip. It does 4.1 GHz on all cores at 1.425 V, where it took 1.475 V to attain 4.0 GHz on all cores with my old Ryzen. It also lets the IMC fly up to 3600 MHz where before, 3200 MHz was a struggle. Here are some relevant comparison shots.

    A basic overview of my old Ryzen. Lacking memory/voltage tabs, but this is all I could ever push out of it, and my “daily driver” clocks were lower. IMC was at 3200 MHz with 4 single-rank Samsung B-die DIMMs. Clock was 4 GHz at 1.475 V.

    [​IMG]


    My new Ryzen. Clocks higher, with less volts. Obviously better binned or otherwise golden. IMC goes outrageously high at 3600 MHz. Same memory/DIMMS as above.

    [​IMG][​IMG][​IMG]

    Oh, and yes, the issue is fixed.

    What does this all mean?

    I think AMD is binning run of the mill Ryzen CPUs so low that ASLR is effectively broken as soon as things get "hot" under load. I don't have direct confirmation of this yet, but a lot of circumstantial evidence, mostly found via myself and this thread here:

    https://community.amd.com/thread/215773

    It's a long read, but the evidence is there, if you look. I'd recommend the later/within last 2 month posts as they cover the RMA process and reports of binning/testing going on prior to chip arrival.
     
    Last edited: Sep 21, 2017
    Hugis, biffzinker, infrared and 5 others say thanks.
    10 Year Member at TPU
  2. R-T-B

    After editing / typing all that, please let me remind you I'd like to keep this thread an information/research thread, no fanboyism allowed.

    I'd like to start the discussion by asking if anyone knows a good JavaScript "stress test" of sorts one could run alongside, say, Prime95. If my theory is right, it should eventually crash, or something equally strange will happen.

    Right now I have JetStream 1.1 but I have no idea how to loop it long term.

    http://browserbench.org/JetStream/
     
    Last edited: Sep 21, 2017
    Hugis says thanks.
  3. Nuckles56

    Nuckles56

    Joined:
    Sep 10, 2016
    Messages:
    280 (0.68/day)
    Thanks Received:
    228
    Location:
    Riverwood, Skyrim
    Thanks for the information @R-T-B, it was an interesting read and I'm actually considering RMA'ing the Ryzen CPU in my brother's rig as a result.
     
  4. R-T-B

    If you do, be aware they make you go through a little song and dance routine of making sure your voltage/cooling settings are adequate and have you test a fairly crazy set of voltages. I personally (being this was before I resigned) just got fed up with it, posted my settings and voltages and flashed my press credentials, which got the process escalated immediately and had them overnight me a CPU (lulz). I'm told it "normally" takes a good few months, sadly.

    EDIT:

    Example:

     
    Last edited: Sep 21, 2017
    biffzinker, Nuckles56 and INSTG8R say thanks.
  5. Nuckles56

    Thanks for the heads up that it's a massive song and dance routine to go through.
     
    R-T-B says thanks.
  6. R-T-B

    What finally got me was when the rep asked if I "had a cooler attached." o_O

    I was like... you mean which cooler? No, just like do you, at all? I was like, no, not doing this anymore... summon supervisor! :laugh:
     
  7. Steevo

    Steevo

    Joined:
    Nov 4, 2005
    Messages:
    9,897 (2.26/day)
    Thanks Received:
    2,339
    A few users out of thousands experience this, and suddenly everyone has a problem, even when they experience none.
     
    10 Year Member at TPU 10 Million points folded for TPU
  8. R-T-B

    If you'd read this well-researched thread, this is basically due to the lack of usage of ASLR outside of Linux. It's similar to how no one "experienced" the old Prime95 AVX bug despite everyone having it, without, wait for it... running Prime95.

    This isn't a fanboy thread and I'd like to keep it free of that, thanks. The current best outcome would be to develop a Windows tool to prove you are affected, and I have come seeking help for that.
     
  9. INSTG8R

    INSTG8R

    Joined:
    Nov 26, 2004
    Messages:
    4,294 (0.91/day)
    Thanks Received:
    1,459
    Location:
    Canuck in Norway
    Well, that’s his point: you “can” create the problem, and easily, in Linux, just not as easily in Windows. It might not be an issue today, but next year, who knows, some ASLR functionality appears in Windows and you’re only now realizing you’re on a bad CPU.
     
    R-T-B says thanks.
    10 Year Member at TPU
  10. R-T-B

    Pretty much.

    I'm also slightly alarmed that their "fix" seems to be simply to throw better binned silicon to people who complain, and not globally change the binning process. Unless maybe they have? I don't know, week 25+ cpus have not hit the market yet.
     
  11. kn00tcn

    kn00tcn

    Joined:
    Feb 9, 2009
    Messages:
    1,442 (0.45/day)
    Thanks Received:
    376
    Location:
    Toronto
    wait, how did you conclude it's a 'heat' issue or that different bins should result in different failure rates/times? i dont remember heat being mentioned on phoronix & its user comments

    if week 25+, not to mention threadripper/epyc are 'permanently fixed', doesnt that mean it's more to do with physical microscopic manufacturing defects?

    for some reason i never thought of this aspect of virtualization, is ASLR of a client actually randomized on the non-ASLR host's memory (at least within the preallocated chunk of the VM process)?

    i want to know more about the ram limits, we really need to confirm if different cpus result in different memory support even after all the agesa updates

    guess it's a good thing i've still been waiting & waiting due to the ram+nand+gpu price inflations before building...
     
  12. Ahhzz

    Ahhzz

    Joined:
    Feb 27, 2008
    Messages:
    4,132 (1.17/day)
    Thanks Received:
    3,474
    not sure to whom you're replying, but I'd say with your tone, there's a reason for that... *hint hint*
     
  13. R-T-B

    I don't know it's heat for certain (actually, I more suspect it's load related since I wrote that). Frankly, all we really know 100% is that for some reason the RMA'd chips are binned better. Why is anyone's guess, but if we're going to conjecture, I would assume poor binning is what causes the issue.

    I was replying to the quoted party.

    Reason for what? Your comment is confusing. I'm not attempting any sort of tone, though maybe the old PM I copied and pasted to support these claims has one, I really didn't check... my bad there. I'm all about sorting out what makes this issue tick and how AMD is handling it, nothing more.

    For the record, AMD support deserves a gold star for how they treated me, though telling them I was a press member probably helped with that...
     
    Last edited: Sep 21, 2017
  14. Aquinus

    Aquinus Resident Wat-man

    Joined:
    Jan 28, 2012
    Messages:
    10,283 (4.91/day)
    Thanks Received:
    5,362
    Location:
    Concord, NH
    I wonder if running more volts through the IMC would result in ASLR becoming more stable. It's entirely possible that ASLR is doing something in a particular way that makes the CPU unstable, which doesn't sound too different from another Linux issue with the OCaml compiler where certain conditions could make the machine unstable. A lot like with AVX, there are a number of things happening within a given CPU cycle, and transistors that are more leaky are going to have more trouble switching at such high frequencies. If you're right and they're giving out better binned CPUs to get around it, it's entirely possible that a little more voltage in the right place might have the same effect, but result in more heat.
     
    R-T-B says thanks.
  15. R-T-B


    Pre-RMA, I nearly fixed the issue by upping SOC voltage to 1.2v (later it came back with a vengeance though), so you might be onto something.
     
  16. kn00tcn

    it's not adding up, how can week25 or ALL threadrippers/epycs not have the issue? binning isnt exact, there are still variances, how would some small difference in target voltage or temperature or stable clock result in a very specific calculation error being permanently fixed?

    the only logical way to test the bin hypothesis is by (running the errata scripts people made while) underclocking/overvolting/watercooling/timing loosening old ryzen cpus & overclocking/undervolting/overheating/timing tightening new ryzen cpus
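
    The test matrix proposed above (old vs new chips, each under different clock/voltage/cooling/timing settings) only says something if results are tallied per combination. A minimal Python sketch of that bookkeeping follows. `Run`, `crash_rates` and the label strings are hypothetical names, no real measurements are included, and the errata scripts themselves are not reproduced here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    chip: str      # tester's own label, e.g. "old" or "new"
    setting: str   # e.g. "stock", "underclock", "overvolt"
    crashed: bool  # did the errata script report a crash in this run?


def crash_rates(runs):
    """Return {(chip, setting): fraction of runs that crashed}."""
    totals = defaultdict(int)
    crashes = defaultdict(int)
    for run in runs:
        key = (run.chip, run.setting)
        totals[key] += 1
        crashes[key] += run.crashed  # True counts as 1, False as 0
    return {key: crashes[key] / totals[key] for key in totals}
```

    If the binning hypothesis holds, the rate for an old chip should drop as it is underclocked or overvolted and the rate for a new chip should rise as it is pushed the other way; flat rates in both directions would count against it.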
     
  17. R-T-B

    Threadripper/EPYC have always been top 5% binned.

    My current theory is that the reason all the rma'd cpus are "hot off the presses" is that they are essentially made to order with higher binned dies. Of course I could be wrong, but my build number was very close to when my RMA was approved.

    We'll only really know when week 25+ cpus make it to market. It will be interesting to see if all of them are higher binned as well. All I know is RMA requests, for whatever reason, seem to be higher binned. It could be that AMD is just doing that for "added insurance" against a re-rma.

    Oddly, however, contrary to my hypothesis, I can't seem to make my new Ryzen segfault by lowering SOC voltage to a very low level (I tried 0.8v). I may be completely off on this after all. I will fully admit a lot of this is my "best guess" for what is going on.
     
  18. Ferrum Master

    Ferrum Master

    Joined:
    Nov 18, 2010
    Messages:
    3,761 (1.49/day)
    Thanks Received:
    2,201
    Location:
    Rīga, Latvia
    The thing with bins is not only with desktop parts.

    Mobile does it too, and always has. You can buy two phones of the same model, but the difference between the worst and best voltage bin is HUGE, heat and battery life wise. The community often makes graphs of their samples, and the charts look pretty logical. There's also the cheating with NAND speeds and similar things... like screens with useless Gorilla Glass or not... there are batches...

    It is a lottery IMHO.
     
    Crunching for Team TPU
  19. kn00tcn

    how are they going to give old stock during rma? the old stock has been shipped to stores, there is no reason for them to keep some for rma since they are constantly manufacturing new ones, take some new ones as needed to fill rmas

    i thought some week25 did hit the market, but dont remember

    if TR/E is 5%, that's no guarantee, amd would have to be sure that something like top 30% are fine, but this goes against the official statement that week25+ is fine (unless they make a more convoluted binning process, but TR/E got released... around week25 didnt they, what's the oldest known week for one?)

    was this issue confirmed on the fewer core models or only the 8cores?
     
  20. R-T-B

    Was there ever an official statement from AMD that week 25+ are ok? I was under the impression that was just a Phoronix claim/guesstimate.
     
  21. Bill_Bright

    Bill_Bright

    Joined:
    Jul 25, 2006
    Messages:
    3,201 (0.78/day)
    Thanks Received:
    2,125
    Location:
    Nebraska, USA
    Please explain what you mean by this. Was that a tongue in cheek comment? Or do you really mean they delidded and replaced the lid on a different processor die?

    I only ask because I wonder if one of those Frankenstein processors escaped AMD and somehow got released into the retail distribution channel? That might explain why a poster I was helping on another site received a "brand new" :rolleyes: ??? Ryzen 1600 from Overclockers in the UK where the lid clearly had been removed and replaced as a "blue substance" (I am assuming TIM) was oozing out from all around the edges of the lid. The box was sealed with an ESD precaution label. Customer Support at Overclockers seemed shocked and puzzled and even paid for return shipping, suggesting ("guessing") it was a "warehouse/packing error" at AMD because it should have really been brand new.

    Still waiting on the OP to see what the replacement processor looks like but it appears, at least, that Overclockers is stepping up and taking care of their customer. :)
     
    10 Year Member at TPU
  22. R-T-B

    Well, I mean I don't actually mean/think they are delidding and replacing dies. I think they simply build these RMA'd cpus to order with better binned parts. But I could be wrong. The whole thing is an information vacuum which is half the issue.

    I did ask AMD directly what was going on, but the previously quite talkative person helping me with my RMA went silent on that. (Not unexpected mind you, he's probably not qualified to comment there).

    As for the rest of your comment, it sounds very much like what I got. Have him check his heatspreader label. I bet it's a week 25 or newer CPU. That would be an RMA-return at this point. They do look otherwise new, so maybe it went something like this:

    Overclockers.co.uk gets returned CPU, RMA's. -> Gets replacement CPU, looks new, puts on shelf -> Customer gets replacement cpu, notices missing sticker and thermal paste, complains -> Overclockers support is clueless, as they don't handle RMAs.

    EDIT: Scratch all that. You mean the lid had actually been removed? Like the processor heatspreader? If so, no, that's not at all what mine was like.
     
    Last edited: Sep 21, 2017
  23. Bill_Bright

    That's exactly what I mean. It appeared the lid was removed and an excessive amount of TIM was applied that then squished out when the lid was replaced. And the retail box was still sealed, so it does appear Overclockers did not do anything funny here, as they too thought they were selling a "new" CPU.

    Note this was (or was supposed to be) a new retail boxed CPU. Not an OEM. So I guess this was something totally different from your scenarios. Sorry for the OT sidetrack.
     
  24. R-T-B

    No apology necessary. Makes me wonder what went on there, but you're correct, it's likely unrelated.
     
    Bill_Bright says thanks.
  25. xkm1948

    xkm1948

    Joined:
    Mar 18, 2008
    Messages:
    2,470 (0.70/day)
    Thanks Received:
    1,540
    I do hope we will not see a large proportion of Ryzen owners RMA'ing their stuff for a higher binned processor.
     
    R-T-B says thanks.
