Hardware Bug in AMD CPU Family

Discussion in 'General Hardware' started by IlluminAce, Mar 7, 2012.

  1. IlluminAce

    IlluminAce New Member

    Joined:
    Aug 6, 2011
    Messages:
    46 (0.03/day)
    Thanks Received:
    40
    Location:
    UK
    AMD's CPU range has a problem updating its stack pointer. Ouch!

    Read all about it.

    This has been confirmed on one of the Phenom II X4 CPUs and even on a high-end Opteron. AMD's confirmation indicates it probably affects a fair range of their CPUs. The problem comes to light only in a very specific case: as best I can ascertain, it's when you're in a rather specific section of the stack (possibly requiring stack randomization) and a particular sequence of assembly instructions occurs, involving a run of pops (so really in deep recursion) and some NOPs.

    It was found by the man behind DragonFly BSD - a pretty nifty fork of FreeBSD which has undergone extensive kernel rewriting - so this guy knew his stuff, and put the requisite time in (over the course of a year) to track this blighter of a bug down. Kudos to Matthew Dillon for his efforts.

    Before everybody dashes out to buy Intels, AFAIK the bug has only exhibited (or been noticed, at least) on DragonFly BSD in a particular method called by the GCC implementation in use there, and only very irregularly at that. So you should be safe :cool:

    *segfault*
     
    Last edited: Mar 7, 2012
    qubit says thanks.
  2. erocker

    erocker Super Moderator Staff Member

    Joined:
    Jul 19, 2006
    Messages:
    40,623 (12.43/day)
    Thanks Received:
    15,442
  3. qubit

    qubit Overclocked quantum bit

    Joined:
    Dec 6, 2007
    Messages:
    10,849 (3.93/day)
    Thanks Received:
    4,203
    Location:
    Quantum well (UK)
    The articles came out the same day, yesterday, so it's not old news.

    I agree it's not the kind of bug to make your PC crash and burn, however. Still, it's a bug and will be fixed.
     
  4. theoneandonlymrk

    theoneandonlymrk

    Joined:
    Mar 10, 2010
    Messages:
    3,492 (1.80/day)
    Thanks Received:
    613
    Location:
    Manchester uk
    Both Intel and AMD keep updated lists of known faults for their CPUs; it's been posted on here before, I'm sure. Each CPU seems to have a massive list of known bugs and errors, but they chunder away. Odd, isn't it? :rolleyes:
     
  5. qubit

    qubit Overclocked quantum bit

    Joined:
    Dec 6, 2007
    Messages:
    10,849 (3.93/day)
    Thanks Received:
    4,203
    Location:
    Quantum well (UK)
    Yeah, that goes with what I said. Basically, the CPUs are pretty bug-free in all the usual operations they do, which leaves the errors in more obscure code sequences that don't get used very often. That, and the workarounds that developers use for known errata, ensures that the systems keep running OK. Occasionally, bad errors like the Phenom TLB bug crop up, which put a kink in a CPU.
     
  6. IlluminAce

    IlluminAce New Member

    Joined:
    Aug 6, 2011
    Messages:
    46 (0.03/day)
    Thanks Received:
    40
    Location:
    UK
    Quite right, the errata lists are surprisingly extensive (or unsurprisingly, if you consider the complexity). However, this fault was previously unreported in the errata, and will show up as a segfault given the right conditions. Moreover, it's almost impossible to track down. It's far from inconceivable that this issue could be behind a variety of unexplained segfaults on production systems. It certainly was in Matthew's case, and his usage was relatively lightweight, if slightly specific.

    Such an issue exhibiting on a home system could be put down to unstable hardware - too high OC/temps for example, or dodgy RAM, or an OS or userland software bug. We all know of them occurring; who knows, the odd one may have had just such a root cause. Ultimately it's not likely to cause us any major headaches.

    As for whether it's a big deal, I'd have to disagree, erocker. Just because something happens irregularly and under specific workloads doesn't make it unimportant, especially when an entire family of CPUs is affected. With Opterons, we're talking about the backbone of many a prod app/DB server and grid computation node. In the case of the former, a single segfault can be completely catastrophic; in the latter, occasional errors would often go largely uninvestigated, or be put down to unstable hardware. If it only affected one particular model, or was a fault in a keyboard for example, then fair enough; but (probably a large subset of) an entire CPU family is another matter completely. If you have a datacentre of Opterons and do experience occasional segfaults which you haven't managed to track down... you now have an interesting decision to make :)

    Whilst we shouldn't jump to conclusions, AMD's final errata statement will make for interesting reading for many infra teams and sysadmins, I'm sure.
     
    qubit says thanks.
  7. trickson

    trickson OH, I have such a headache

    Joined:
    Dec 5, 2004
    Messages:
    6,494 (1.68/day)
    Thanks Received:
    956
    Location:
    Planet Earth.
    This is not really a big deal at all. Intel has bugs, AMD has bugs. Maybe they should hire a good exterminator for their fab plants.
     
  8. erocker

    erocker Super Moderator Staff Member

    Joined:
    Jul 19, 2006
    Messages:
    40,623 (12.43/day)
    Thanks Received:
    15,442
    I said "doesn't seem to be a big deal". I've run AMD for years.. Still have a s754 system that's been running 24/7 for about 6 years now. No bugs to report. People can make this bug out to whatever they want it to be or mean to them. ;)
     
  9. qubit

    qubit Overclocked quantum bit

    Joined:
    Dec 6, 2007
    Messages:
    10,849 (3.93/day)
    Thanks Received:
    4,203
    Location:
    Quantum well (UK)
    There's an update to the story now over at tng, part of which is exclusive. ;)
     
  10. IlluminAce

    IlluminAce New Member

    Joined:
    Aug 6, 2011
    Messages:
    46 (0.03/day)
    Thanks Received:
    40
    Location:
    UK
    Quite, we end users are not likely to suffer much as a result of this - unless you do much compilation on DragonFly BSD ;) (but, seriously, I do like its tenets and the work that's gone into it. I might give it a spin soon). Perhaps on the odd occasion we 24/7'ers might encounter this bug without realising, but that's nothing too serious from our perspective. Thankfully it corrupts the sp rather than eax, for example - if it occasionally silently corrupted my computations, I'd be a lot more concerned.

    But as you say, it's what you make of it, and for those of us doing serious computing - large organisations with big datacentres - this sort of rare, intermittent problem can present pretty nasty real-world problems. Thankfully I deal with programming on grids as opposed to supporting them!
     
