Wednesday, November 27th 2019

HP Enterprise SSD Firmware Bug Causes them to Fail at 32,768 Hours of Use, Fix Released

HP issued a warning to its customers that some of its SAS SSDs come with a bug that causes them to fail at exactly 32,768 hours of use. For an always-on or high-uptime server, this translates to 3 years, 270 days and 8 hours of usage. The affected models of SSDs are shipped in many of HP's flagship server and storage products, spanning its HPE ProLiant, Synergy, Apollo, JBOD D3xxx, D6xxx, D8xxx, MSA, StoreVirtual 4335 and StoreVirtual 3200 product-lines.

HP has released an SSD firmware update that fixes this bug, and cannot stress enough the importance of deploying it. Once a drive hits the 32,768-hour deadline and breaks down, both the drive and the data on it become unrecoverable, and there is no mitigation for this bug other than the firmware update. HP has released easy-to-use online firmware update tools that let admins update the firmware of their drives from within their OS; the tools support Linux, Windows, and VMware. Below is a list of affected drives. Get the appropriate firmware update from this page.
Source: Bleeping Computer

27 Comments on HP Enterprise SSD Firmware Bug Causes them to Fail at 32,768 Hours of Use, Fix Released

#2
btarunr
Editor & Senior Moderator
DeathtoGnomes
why does that number look familiar?
It's 32 kibi hours.
Posted on Reply
#3
piloponth
DeathtoGnomes
why does that number look familiar?
2^15
Posted on Reply
#4
Easo
This is most likely the SSD manufacturer's fault. While HPE does use custom firmware, I really doubt they write it from scratch for, say, Samsung drives. It is possible that Dell and others may also come forward soon.
Posted on Reply
#5
R-T-B
Easo
This is most likely the SSD manufacturer's fault. While HPE does use custom firmware, I really doubt they write it from scratch for, say, Samsung drives. It is possible that Dell and others may also come forward soon.
Possibly, but then HP would certainly not be the only one noticing effects from this, I'd think?

I would be curious what controller the drives use. It sounds like the bug triggers an SSD controller reset. Judging from the value, I can only assume it is a "value wrap" situation which the controller detects and freaks out about, triggering a drive-wide reset including the onboard encryption keys.

If so... Much dumb, very dead, WOW.
Posted on Reply
#6
LocutusH
A fix for a bug... sure... who believes this?
There you have your proof that planned obsolescence on purpose exists.

I am sure they implemented this on purpose, just didn't expect anyone to find out why the drives die shortly after the warranty period expires...

I hope someone sues HP, and forces all the other manufacturers, too, to stop this practice.
Posted on Reply
#7
Yukikaze
LocutusH
A fix for a bug... sure... who believes this?
There you have your proof that planned obsolescence on purpose exists.

I am sure they implemented this on purpose, just didn't expect anyone to find out why the drives die shortly after the warranty period expires...

I hope someone sues HP, and forces all the other manufacturers, too, to stop this practice.
It is amazing what some people believe.
Posted on Reply
#8
Easo
LocutusH
A fix for a bug... sure... who believes this?
There you have your proof that planned obsolescence on purpose exists.

I am sure they implemented this on purpose, just didn't expect anyone to find out why the drives die shortly after the warranty period expires...

I hope someone sues HP, and forces all the other manufacturers, too, to stop this practice.
We are talking about enterprise drives here, Mr. Conspiracy. HPE is not stupid enough to screw with its primary clients, some of which are bigger than HPE itself. And it is not like there's any shortage of other storage manufacturers who would be more than happy to take HPE's share in that case.
Posted on Reply
#9
Nater
Easo
We are talking about enterprise drives here, Mr. Conspiracy. HPE is not stupid enough to screw with its primary clients, some of which are bigger than HPE itself. And it is not like there's any shortage of other storage manufacturers who would be more than happy to take HPE's share in that case.
Nothing says HP made these drives. But since they're keeping the manufacturer unnamed, that probably means they did. Otherwise I'd be throwing my source under the bus to protect my brand.

To further the conspiracies, though: I remember back in the college days (early 2000s) I had Wi-Fi routers from Netgear and D-Link that would die literally days after the 1-year warranty was up. 3 years straight, IIRC. I then bought a Linksys WRT54G (model?) and reflashed it to the third-party DD-WRT firmware, and it's still running to this day at my in-laws' as far as I know (~10+ years?).
Posted on Reply
#10
Rich Riedl
From the HPE bulletin (https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-a00092491en_us):
HPE was notified by a Solid State Drive (SSD) manufacturer of a firmware defect affecting certain SAS SSD models (reference the table below) used in a number of HPE server and storage products (i.e., HPE ProLiant, Synergy, Apollo, JBOD D3xxx, D6xxx, D8xxx, MSA, StoreVirtual 4335 and StoreVirtual 3200 are affected).

The issue affects SSDs with an HPE firmware version prior to HPD8 that results in SSD failure at 32,768 hours of operation (i.e., 3 years, 270 days 8 hours). After the SSD failure occurs, neither the SSD nor the data can be recovered. In addition, SSDs which were put into service at the same time will likely fail nearly simultaneously.
Posted on Reply
#11
yakk
"After the SSD failure occurs, neither the SSD nor the data can be recovered. In addition, SSDs which were put into service at the same time will likely fail nearly simultaneously."

:eek::eek::eek:

Backup/restore hell awaits...

Not acceptable at all from HP...
Posted on Reply
#12
Steevo
Nater
Nothing says HP made these drives. But since they're keeping the manufacturer unnamed, that probably means they did. Otherwise I'd be throwing my source under the bus to protect my brand.

To further the conspiracies, though: I remember back in the college days (early 2000s) I had Wi-Fi routers from Netgear and D-Link that would die literally days after the 1-year warranty was up. 3 years straight, IIRC. I then bought a Linksys WRT54G (model?) and reflashed it to the third-party DD-WRT firmware, and it's still running to this day at my in-laws' as far as I know (~10+ years?).
Those were caused by the plague of cheap electrolytic capacitors. More expensive, higher-quality caps existed, but almost everyone got bit by the cheap ones.
Posted on Reply
#13
gamefoo21
Steevo
Those were caused by the plague of cheap electrolytic capacitors. More expensive, higher-quality caps existed, but almost everyone got bit by the cheap ones.
The other dirty little thing they did was use low-quality flash. Flashing the ROM, anything that wrote to the NVRAM...

Brickity brick...
Posted on Reply
#14
efikkan
DeathtoGnomes
why does that number look familiar?
Every decent coder probably immediately understands what's going on here: this is an integer overflow, causing the firmware to crash. The maximum value for a signed 16-bit integer is 32,767; add 1 to this and you'll get -32768, which probably causes undefined behavior in the firmware.

Those who want to see what happens can run this:
[code=c]
#include <stdio.h>
#include <stdint.h>

int main(void) {
    int16_t test = 32767;  /* INT16_MAX */
    printf("Before: %d\n", test);
    test++;                /* wraps to -32768 on two's-complement hardware */
    printf("After: %d\n", test);
    return 0;
}
[/code]
This will output:
Before: 32767
After: -32768

This is a well-known rookie mistake, but there are in fact two mistakes here: 1) the small range of the integer, and 2) whatever caused the crash after the overflow. The second one is the serious one. This kind of bug is inexcusable in critical software like firmware.

So how did this mistake pass code review? Well, either the coder explicitly used a fixed-precision integer type like int16_t, which should have made the overflow pretty obvious, or used int and the compiler chose a 16-bit integer for the embedded platform. For native code, I usually recommend fixed-precision integer types over int whenever possible, as they make potential overflows much more obvious and force the coder to consciously choose an appropriate range.
Posted on Reply
#15
Readlight
Why does HP's Windows 10 only have the standard AHCI driver?
Posted on Reply
#16
Vayra86
LocutusH
A fix for a bug... sure... who believes this?
There you have your proof that planned obsolescence on purpose exists.

I am sure they implemented this on purpose, just didn't expect anyone to find out why the drives die shortly after the warranty period expires...

I hope someone sues HP, and forces all the other manufacturers, too, to stop this practice.
Isn't planned obsolescence always on purpose?
Posted on Reply
#17
R-T-B
efikkan
Every decent coder probably immediately understands what's going on here: this is an integer overflow, causing the firmware to crash. The maximum value for a signed 16-bit integer is 32,767; add 1 to this and you'll get -32768, which probably causes undefined behavior in the firmware.

Those who want to see what happens can run this:
[code=c]
#include <stdio.h>
#include <stdint.h>

int main(void) {
    int16_t test = 32767;  /* INT16_MAX */
    printf("Before: %d\n", test);
    test++;                /* wraps to -32768 on two's-complement hardware */
    printf("After: %d\n", test);
    return 0;
}
[/code]
This will output:
Before: 32767
After: -32768

This is a well-known rookie mistake, but there are in fact two mistakes here: 1) the small range of the integer, and 2) whatever caused the crash after the overflow. The second one is the serious one. This kind of bug is inexcusable in critical software like firmware.

So how did this mistake pass code review? Well, either the coder explicitly used a fixed-precision integer type like int16_t, which should have made the overflow pretty obvious, or used int and the compiler chose a 16-bit integer for the embedded platform. For native code, I usually recommend fixed-precision integer types over int whenever possible, as they make potential overflows much more obvious and force the coder to consciously choose an appropriate range.
I called that earlier. This is exactly what is going on. The fact that the firmware resets afterwards is probably an anti-tampering measure biting them.
Posted on Reply
#18
lexluthermiester
Easo
We are talking about enterprise drives here, Mr. Conspiracy. HPE is not stupid enough to screw with its primary clients, some of which are bigger than HPE itself. And it is not like there's any shortage of other storage manufacturers who would be more than happy to take HPE's share in that case.
While that is a bit "tinhat", what I find interesting (and simultaneously disturbing) is how easy such a scenario would be to pull off. And has it actually been done?
Posted on Reply
#19
R-T-B
lexluthermiester
While that is a bit "tinhat", what I find interesting (and simultaneously disturbing) is how easy such a scenario would be to pull off. And has it actually been done?
I mean, HP's own consumer ink division has been doing this for some time (incrementing a counter on print operations, unrelated to ink level)... so yeah.
Posted on Reply
#20
lexluthermiester
R-T-B
I mean, HP's own consumer ink division has been doing this for some time (incrementing a counter on print operations, unrelated to ink level)... so yeah.
Yeah, but that's printer ink. Not really a vital part of a system that can cause liability issues. SSDs are a critical component with legal liability potential.
Posted on Reply
#21
R-T-B
lexluthermiester
Yeah, but that's printer ink. Not really a vital part of a system that can cause liability issues. SSDs are a critical component with legal liability potential.
I thought you were asking whether it had technically been done before?

Not comparing the practices by any means, just saying yep, it has.
Posted on Reply
#22
rtwjunkie
PC Gaming Enthusiast
LocutusH
I am sure they implemented this on purpose, just didn't expect anyone to find out why the drives die shortly after the warranty period expires...
Maybe my math is off, but servers running these will reach that point in about 1,365 days. At 365 days per year, that is roughly 3.74 years until the bug's failure point is reached.

That's a pretty good run. But if you want to go the conspiracy route, I'll step out of your way. I'm not one to limit those who have a quest to tilt at windmills.
Posted on Reply
#23
rutra80
Reminds me of the VelociRaptor firmware bug which would spit out TLERs after a month.
I never understood how it became standard advice worldwide to use identical drives in mirrored arrays.
Posted on Reply
#24
Easo
Nater
Nothing says HP made these drives. But since they're keeping the manufacturer unnamed, that probably means they did. Otherwise I'd be throwing my source under the bus to protect my brand.

To further the conspiracies, though: I remember back in the college days (early 2000s) I had Wi-Fi routers from Netgear and D-Link that would die literally days after the 1-year warranty was up. 3 years straight, IIRC. I then bought a Linksys WRT54G (model?) and reflashed it to the third-party DD-WRT firmware, and it's still running to this day at my in-laws' as far as I know (~10+ years?).
HPE literally does not manufacture SSDs. They use rebranded ones - Samsung, Intel, I think Micron too.

R-T-B
I mean, HP's own consumer ink division has been doing this for some time (incrementing a counter on print operations, unrelated to ink level)... so yeah.
IIRC they got punished for that, no?
P.S.
HP and HPE split 4 years ago, just a reminder.
Posted on Reply
#25
R-T-B
Easo
IIRC they got punished for that, no?
Toner, I think. No idea whether it covered the similar practices in ink.


Easo
HP and HPE split 4 years ago, just a reminder.
Yep, was not implying correlation, just saying it's been done in the general industry.
Posted on Reply