
Bug in AMD EPYC "Rome" Processors Puts Them to Sleep After 34 Months of Uptime

Live patching exists in GNU/Linux, so the servers don't need to be rebooted.
No - because you still need to update firmware for hardware, chiefly BIOS. You ARE updating them, yes...?
 
No fix planned?
It's like AMD doesn't want to sell datacenter chips.
Plenty of servers don't get rebooted for a year.

No - because you still need to update firmware for hardware, chiefly BIOS. You ARE updating them, yes...?
Bruh, no.
IPMI and iDRAC exist.
 
While I agree, I would be surprised if a server part in a normal IT department went three years without some sort of upgrade, maintenance or check. Filters need cleaning, and three years is quite long to use the same server hardware (considering my own IT department). Maybe, though. :D :)

It's a surprising oversight for a server part.

@JAKra get a job at Cyberdyne, these firms need this kind of thinking :)
You'd be surprised, big projects I have worked on basically just add new hardware to the pool of servers, with the oldest stuff staying in place for long periods of time.

One year+ uptime is not unusual either.
 
"It's Not a Bug, It's a Feature."
In case of an AI uprising, all machines based on EPYC would automatically go to sleep after 34 months of uptime.
Humanity saved thanks to this "feature"! :D
Exactly + saving the planet from global warming, thank you, AMD!
 
We will soon see, I mean in 7 months, if Zen 3 Epycs inherited the exact same bug.
 
Bruh, no.
IPMI and iDRAC exist.

Sure they do - and what is the message when you update the BIOS in iLO? Reset power to the system. For HPE I can tell you that the same happens with things like RAID controller, network card, FC card and disk firmware (and none of those are iLO updatable).
You seem to be mistaking remote management chips for the actual BIOS.
 
One of the most harmless bugs ever. Servers don't go into sleep mode, and no PC stays on for more than 1 year (or it gets neglected and gets damaged without proper maintenance).
 
One of the most harmless bugs ever. Servers don't go into sleep mode, and no PC stays on for more than 1 year (or it gets neglected and gets damaged without proper maintenance).
On the other hand, an entire datacenter or a large part of it may go out in a timespan of a few minutes because all nodes were turned on at the same time. And it might be something mission-critical that never connects to the internet and does not need updates.
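
For anyone who wants to check a fleet for that scenario, here is a minimal sketch. It assumes you already have each node's boot time from somewhere, and it approximates the headline's "34 months" as 34 * 30 days, which is my assumption and not an official figure. The names `THRESHOLD`, `nodes_at_risk` and `warn_window` are made up for illustration.

```python
from datetime import datetime, timedelta

# Assumption: the headline's "34 months" approximated as 34 * 30 days.
THRESHOLD = timedelta(days=34 * 30)


def nodes_at_risk(boot_times, now, warn_window=timedelta(days=30)):
    """Return (name, time_left) pairs for nodes within warn_window of
    the threshold, soonest first. boot_times maps name -> boot datetime."""
    at_risk = []
    for name, booted in boot_times.items():
        time_left = THRESHOLD - (now - booted)
        if time_left <= warn_window:
            at_risk.append((name, time_left))
    # Nodes that were powered on together end up next to each other here.
    return sorted(at_risk, key=lambda item: item[1])


if __name__ == "__main__":
    now = datetime(2023, 6, 1)
    fleet = {
        "node-a": now - timedelta(days=1010),
        "node-b": now - timedelta(days=100),
        "node-c": now - timedelta(days=1000),
    }
    for name, left in nodes_at_risk(fleet, now):
        print(f"{name}: about {left.days} days left")
```

Nodes that show up with nearly identical time left are the ones that would drop out together, so their reboots would need to be staggered.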
 
On the other hand, an entire datacenter or a large part of it may go out in a timespan of a few minutes because all nodes were turned on at the same time. And it might be something mission-critical that never connects to the internet and does not need updates.
I would like to learn from an official source if that bug happened to any of the installed servers. I guess not.
 
I would like to learn from an official source if that bug happened to any of the installed servers. I guess not.

A late bug such as this was likely discovered and learned about in a production environment ;)
 
A late bug such as this was likely discovered and learned about in a production environment ;)
I'm not going to argue about that, but I'm not sure about it. Are you?
 
I'm not going to argue about that, but I'm not sure about it. Are you?

I can't prove it if that's what you're implying, but given it's a bug that takes literally 3 years to manifest it's not so hard to believe that was the case. Or at a bare minimum, AMD must have one running 24/7 in a QA lab and ran into it ;)
 
A late bug such as this was likely discovered and learned about in a production environment ;)
Yeah, most likely a customer (enterprise) hit it and complained to AMD.
 
Yeah, most likely a customer (enterprise) hit it and complained to AMD.
I can imagine that.

"Hey, I haven't restarted our server for 3 years and now it's acting weird."
"Huh? You what now? 3 years?"
:roll:
 
I can't prove it if that's what you're implying, but given it's a bug that takes literally 3 years to manifest it's not so hard to believe that was the case. Or at a bare minimum, AMD must have one running 24/7 in a QA lab and ran into it ;)
As a bare minimum, AMD (and everyone else for that matter) should at least evaluate what happens with counters and timers in border cases, and when.
 
As a bare minimum, AMD (and everyone else for that matter) should at least evaluate what happens with counters and timers in border cases, and when.

Agreed. Edge cases indeed must be accounted for, and this one seems like it's a bug related to the CPU's low power mode functionality. Disabling CC6/arguing that many boards have CC6 off by default is one thing but, let's be honest, we live in the age where people make a big deal out of a few watts in their tree-hugging "save the planet" craze, and such functionality would likely be enabled by company directive if it's available, so it should, to the extent that it's offered, work.
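
As a rough illustration of what "evaluating counters and timers in border cases" can look like, here is a small sketch. Nothing in this thread states the actual root cause, so the counter widths and tick rates below are hypothetical examples, and the function names are mine.

```python
SECONDS_PER_DAY = 86_400


def days_until_wrap(bits, tick_hz):
    """Days until a free-running counter of the given width wraps around
    when it is incremented tick_hz times per second."""
    return (2 ** bits) / tick_hz / SECONDS_PER_DAY


def elapsed_ticks(start, now, bits):
    """Wrap-safe elapsed ticks between two readings of the same counter.
    Correct as long as the counter wrapped at most once in between."""
    return (now - start) % (2 ** bits)


if __name__ == "__main__":
    # A 32-bit millisecond counter wraps in a bit under 50 days.
    print(f"{days_until_wrap(32, 1_000):.1f} days")
    # Hypothetical: 2**53 ticks at 100 MHz lands in the ~34-month range.
    print(f"{days_until_wrap(53, 100_000_000):.1f} days")
    # A reading taken just before the wrap and one just after it.
    print(elapsed_ticks(2 ** 32 - 5, 3, 32))
```

A naive `now - start` goes negative across the wrap, while the modular version keeps returning the right interval. Doing this arithmetic up front is cheap compared to waiting almost three years for the edge case to show up on real hardware.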
 
Uptime?
Naptime!

Definitely Errata, but also... highly unlikely to be a major issue for anyone.
 
Not sure you mean “errata” here, but this editor is wonky as hell
 
Running for a year without a reboot isn't abnormal for some servers. I should restate that. Running a year without rebooting isn't abnormal for some Linux servers. That being said, I have seen a server that ran for over 5 years without a reboot. It was not on purpose, just more like it got forgotten about and kept doing its thing without issues. I'm on the fence about this. Almost three years is a long time. Rome launched almost 4 years ago, I'm guessing that is why it isn't going to be fixed. I assume, since we are just hearing about it now, it wasn't in the first gen of EPYC. Is it still an issue? Is Milan or Genoa affected? Nice can of worms...
 
Hell, I once ran a Mac mini server for five years without a reboot (only rebooted to update to APFS, probably unneeded). Have had a Pi running for six now. Lucky not to have any power outages, only my NAS and main are attached to UPSs. Idk why I am sharing this, carry on.
 
Hell, I once ran a Mac mini server for five years without a reboot (only rebooted to update to APFS, probably unneeded). Have had a Pi running for six now. Lucky not to have any power outages, only my NAS and main are attached to UPSs. Idk why I am sharing this, carry on.
Don't forget to test your UPS batteries!
 
Agreed. Edge cases indeed must be accounted for, and this one seems like it's a bug related to the CPU's low power mode functionality. Disabling CC6/arguing that many boards have CC6 off by default is one thing but, let's be honest, we live in the age where people make a big deal out of a few watts in their tree-hugging "save the planet" craze, and such functionality would likely be enabled by company directive if it's available, so it should, to the extent that it's offered, work.
Think about a large server installation or an HPC/supercomputing cluster. Each 2-processor node consumes a couple hundred watts at idle. It makes a lot of sense to put some nodes to sleep when the system is not fully loaded. I don't know if C6 (deep power down state) is used for that purpose or not, but it seems to be the appropriate mechanism.
 
It's certainly impressive that this problem was discovered at all, as it suggests that more than a single system has been running for almost three years straight without a single restart.
Ehhhhh, when I first started my professional career I worked with a colo server with over 1000 days of uptime on it. It was a database server. That streak was broken when we switched from Rackspace to Google. Believe it or not, this is very normal for Linux, in particular with builds that can do live kernel patching while the system is hot.
 