Wednesday, November 27th 2019

HP Enterprise SSD Firmware Bug Causes them to Fail at 32,768 Hours of Use, Fix Released

HP issued a warning to its customers that some of its SAS SSDs come with a bug that causes them to fail at exactly 32,768 hours of use. For an always-on or high-uptime server, this translates to 3 years, 270 days and 8 hours of usage. The affected models of SSDs are shipped in many of HP's flagship server and storage products, spanning its HPE ProLiant, Synergy, Apollo, JBOD D3xxx, D6xxx, D8xxx, MSA, StoreVirtual 4335 and StoreVirtual 3200 product-lines.

HP has released an SSD firmware update that fixes this bug, and cannot stress enough the importance of deploying it. Once a drive hits the 32,768-hour deadline and breaks down, both the drive and the data on it become unrecoverable, and there is no mitigation for this bug other than the firmware update. HP has released easy-to-use online firmware update tools that let admins update the firmware of their drives from within their OS; the tools support Linux, Windows, and VMware. Below is a list of affected drives. Get the appropriate firmware update from this page.
Source: Bleeping Computer

27 Comments on HP Enterprise SSD Firmware Bug Causes them to Fail at 32,768 Hours of Use, Fix Released

#2
btarunr
Editor & Senior Moderator
DeathtoGnomes
why does that number look familiar?
It's 32 kibi hours.
Posted on Reply
#3
piloponth
DeathtoGnomes
why does that number look familiar?
2^15
Posted on Reply
#4
Easo
This is most likely the SSD manufacturer's fault. While HPE does use custom firmware, I really doubt they write it from scratch for, say, Samsung drives. It is possible that Dell and others may also come forward soon.
Posted on Reply
#5
R-T-B
Easo
This is most likely the SSD manufacturer's fault. While HPE does use custom firmware, I really doubt they write it from scratch for, say, Samsung drives. It is possible that Dell and others may also come forward soon.
Possibly, but then HP would certainly not be the only one noticing effects from this, I'd think?

I would be curious what controller the drives use. It sounds like the bug triggers an SSD controller reset. Judging from the value, I can only assume it is a "value wrap" situation which the controller detects and freaks out about, triggering a drive-wide reset including the onboard encryption keys.

If so... Much dumb, very dead, WOW.
Posted on Reply
#6
LocutusH
A fix for a bug... sure... who believes this?
There you have your proof that planned obsolescence on purpose exists.

I am sure they implemented this on purpose, just didn't expect anyone to find out why the drives die shortly after the warranty period expires...

I hope someone sues HP, and forces all the other manufacturers, too, to stop this practice.
Posted on Reply
#7
Yukikaze
LocutusH
A fix for a bug... sure... who believes this?
There you have your proof that planned obsolescence on purpose exists.

I am sure they implemented this on purpose, just didn't expect anyone to find out why the drives die shortly after the warranty period expires...

I hope someone sues HP, and forces all the other manufacturers, too, to stop this practice.
It is amazing what some people believe.
Posted on Reply
#8
Easo
LocutusH
A fix for a bug... sure... who believes this?
There you have your proof that planned obsolescence on purpose exists.

I am sure they implemented this on purpose, just didn't expect anyone to find out why the drives die shortly after the warranty period expires...

I hope someone sues HP, and forces all the other manufacturers, too, to stop this practice.
We are talking about enterprise drives here, Mr. Conspiracy. HPE is not stupid enough to screw with its primary clients, some of which are bigger than HPE itself. And it is not like there's any shortage of other storage manufacturers who would be more than happy to take HPE's share in that case.
Posted on Reply
#9
Nater
Easo
We are talking about enterprise drives here, Mr. Conspiracy. HPE is not stupid enough to screw with its primary clients, some of which are bigger than HPE itself. And it is not like there's any shortage of other storage manufacturers who would be more than happy to take HPE's share in that case.
Nothing says HP made these drives. But since they're keeping the manufacturer unnamed, that probably means they did. Otherwise I'd be throwing my source under the bus to protect my brand.

To further the conspiracies, though: I remember back in the college days (early 2000s) I had Wi-Fi routers from Netgear and D-Link that would die literally days after the 1-year warranty was up. 3 years straight, IIRC. I then bought a Linksys WRT54G (model?) and reflashed it to the third-party DD-WRT firmware, and it's still running to this day at my in-laws' as far as I know (~10+ years?).
Posted on Reply
#10
Rich Riedl
From the HPE bulletin (https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-a00092491en_us):
HPE was notified by a Solid State Drive (SSD) manufacturer of a firmware defect affecting certain SAS SSD models (reference the table below) used in a number of HPE server and storage products (i.e., HPE ProLiant, Synergy, Apollo, JBOD D3xxx, D6xxx, D8xxx, MSA, StoreVirtual 4335 and StoreVirtual 3200 are affected).

The issue affects SSDs with an HPE firmware version prior to HPD8 that results in SSD failure at 32,768 hours of operation (i.e., 3 years, 270 days 8 hours). After the SSD failure occurs, neither the SSD nor the data can be recovered. In addition, SSDs which were put into service at the same time will likely fail nearly simultaneously.
Posted on Reply
#11
yakk
"After the SSD failure occurs, neither the SSD nor the data can be recovered. In addition, SSDs which were put into service at the same time will likely fail nearly simultaneously."

:eek::eek::eek:

Backup/restore hell awaits...

Not acceptable at all from HP...
Posted on Reply
#12
Steevo
Nater
Nothing says HP made these drives. But since they're keeping the manufacturer unnamed, that probably means they did. Otherwise I'd be throwing my source under the bus to protect my brand.

To further the conspiracies, though: I remember back in the college days (early 2000s) I had Wi-Fi routers from Netgear and D-Link that would die literally days after the 1-year warranty was up. 3 years straight, IIRC. I then bought a Linksys WRT54G (model?) and reflashed it to the third-party DD-WRT firmware, and it's still running to this day at my in-laws' as far as I know (~10+ years?).
Those were caused by the plague of cheap electrolytic capacitors. More expensive, higher-quality caps existed, but almost everyone got bit by the cheap ones.
Posted on Reply
#13
gamefoo21
Steevo
Those were caused by the plague of cheap electrolytic capacitors. More expensive, higher-quality caps existed, but almost everyone got bit by the cheap ones.
The other dirty little thing they did was use low-quality flash. Flashing the ROM, anything that wrote to the NVRAM...

Brickity brick...
Posted on Reply
#14
efikkan
DeathtoGnomes
why does that number look familiar?
Every decent coder probably immediately understands what's going on here: this is an integer overflow, causing the firmware to crash. The maximum value for a signed 16-bit integer is 32,767; add 1 to this and you'll get -32768, which probably causes undefined behavior in the firmware.

Those who want to see what happens can run this:
[code=c]
#include <stdio.h>
#include <stdint.h>

int main(void) {
    int16_t test = 32767;  /* INT16_MAX */
    printf("Before: %d\n", test);
    test++;                /* wraps to -32768 on two's-complement hardware */
    printf("After: %d\n", test);
    return 0;
}
[/code]
This will output:
Before: 32767
After: -32768

This is a well-known rookie mistake, but there are in fact two mistakes here: 1) the small range of the integer, and 2) whatever caused the crash after the overflow. The second one is the serious one. This kind of bug is inexcusable in critical software like firmware.

So how did this mistake pass code review? Well, either the coder explicitly used a fixed-precision integer type like int16_t, which should have made the overflow pretty obvious, or used int and the compiler chose a 16-bit integer for the embedded platform. For native code, I usually recommend fixed-precision integer types over int whenever possible, as they make potential overflows much more obvious and force the coder to consciously choose an appropriate range.
Posted on Reply
#15
Readlight
Why does HP's Windows 10 only have the standard AHCI driver?
Posted on Reply
#16
Vayra86
LocutusH
A fix for a bug... sure... who believes this?
There you have your proof that planned obsolescence on purpose exists.

I am sure they implemented this on purpose, just didn't expect anyone to find out why the drives die shortly after the warranty period expires...

I hope someone sues HP, and forces all the other manufacturers, too, to stop this practice.
Isn't planned obsolescence always on purpose?
Posted on Reply
#17
R-T-B
efikkan
Every decent coder probably immediately understands what's going on here: this is an integer overflow, causing the firmware to crash. The maximum value for a signed 16-bit integer is 32,767; add 1 to this and you'll get -32768, which probably causes undefined behavior in the firmware.

Those who want to see what happens can run this:
[code=c]
#include <stdio.h>
#include <stdint.h>

int main(void) {
    int16_t test = 32767;  /* INT16_MAX */
    printf("Before: %d\n", test);
    test++;                /* wraps to -32768 on two's-complement hardware */
    printf("After: %d\n", test);
    return 0;
}
[/code]
This will output:
Before: 32767
After: -32768

This is a well-known rookie mistake, but there are in fact two mistakes here: 1) the small range of the integer, and 2) whatever caused the crash after the overflow. The second one is the serious one. This kind of bug is inexcusable in critical software like firmware.

So how did this mistake pass code review? Well, either the coder explicitly used a fixed-precision integer type like int16_t, which should have made the overflow pretty obvious, or used int and the compiler chose a 16-bit integer for the embedded platform. For native code, I usually recommend fixed-precision integer types over int whenever possible, as they make potential overflows much more obvious and force the coder to consciously choose an appropriate range.
I called that earlier. This is exactly what is going on. The fact that the firmware resets afterwards is probably an anti-tampering measure biting them.
Posted on Reply
#18
lexluthermiester
Easo
We are talking about enterprise drives here, Mr. Conspiracy. HPE is not stupid enough to screw with its primary clients, some of which are bigger than HPE itself. And it is not like there's any shortage of other storage manufacturers who would be more than happy to take HPE's share in that case.
While that is a bit "tinhat", what I find interesting (and simultaneously disturbing) is how easy such a scenario would be to pull off. And has it actually been done?
Posted on Reply
#19
R-T-B
lexluthermiester
While that is a bit "tinhat", what I find interesting (and simultaneously disturbing) is how easy such a scenario would be to pull off. And has it actually been done?
I mean, HP's own consumer ink division has been doing this for some time (incrementing a counter on print operations, unrelated to ink level)... so yeah.
Posted on Reply
#20
lexluthermiester
R-T-B
I mean, HP's own consumer ink division has been doing this for some time (incrementing a counter on print operations, unrelated to ink level)... so yeah.
Yeah, but that's printer ink. Not really a vital part of a system that can cause liability issues. SSDs are a critical component with legal liability potential.
Posted on Reply
#21
R-T-B
lexluthermiester
Yeah, but that's printer ink. Not really a vital part of a system that can cause liability issues. SSDs are a critical component with legal liability potential.
I thought you were asking whether it had technically been done before?

Not comparing the practices by any means, just saying yep, it has.
Posted on Reply
#22
rtwjunkie
PC Gaming Enthusiast
LocutusH
I am sure they implemented this on purpose, just didn't expect anyone to find out why the drives die shortly after the warranty period expires...
Maybe my math is off, but servers running these will reach that point in about 1,365 days. At 365 days per year, that is roughly 3.74 years until the bug's failure point is reached.

That's a pretty good run. But if you want to go the conspiracy route, I'll step out of your way. I'm not one to limit those who have a quest to tilt at windmills.
Posted on Reply
#23
rutra80
Reminds me of the VelociRaptor firmware bug which would spit out TLERs after a month.
I never understood how it became standard advice worldwide to use identical drives in mirrored arrays.
Posted on Reply
#24
Easo
Nater
Nothing says HP made these drives. But since they're keeping the manufacturer unnamed, that probably means they did. Otherwise I'd be throwing my source under the bus to protect my brand.

To further the conspiracies, though: I remember back in the college days (early 2000s) I had Wi-Fi routers from Netgear and D-Link that would die literally days after the 1-year warranty was up. 3 years straight, IIRC. I then bought a Linksys WRT54G (model?) and reflashed it to the third-party DD-WRT firmware, and it's still running to this day at my in-laws' as far as I know (~10+ years?).
HPE literally does not manufacture SSDs. They use rebranded ones - Samsung, Intel, I think Micron too.

R-T-B
I mean, HP's own consumer ink division has been doing this for some time (incrementing a counter on print operations, unrelated to ink level)... so yeah.
IIRC they got punished for that, no?
P.S.
HP and HPE split 4 years ago, just a reminder.
Posted on Reply
#25
R-T-B
Easo
IIRC they got punished for that, no?
Toner, I think. No idea whether it covered the similar practices in ink.


Easo
HP and HPE split 4 years ago, just a reminder.
Yep, was not implying correlation, just saying it's been done in the general industry.
Posted on Reply