
HP Enterprise SSD Firmware Bug Causes them to Fail at 32,768 Hours of Use, Fix Released

btarunr

Editor & Senior Moderator
Staff member
HP issued a warning to its customers that some of its SAS SSDs carry a firmware bug that causes them to fail at exactly 32,768 hours of use. For an always-on or high-uptime server, that translates to 3 years, 270 days and 8 hours of usage. The affected SSD models ship in many of HP's flagship server and storage products, spanning its HPE ProLiant, Synergy, Apollo, JBOD D3xxx, D6xxx, D8xxx, MSA, StoreVirtual 4335 and StoreVirtual 3200 product lines.

HP has released an SSD firmware update that fixes this bug, and it cannot stress the importance of deploying the update enough. Once a drive hits the literal 32,768-hour deadline and breaks down, both the drive and the data on it become unrecoverable; there is no mitigation for this bug other than the firmware update. HP has released easy-to-use online firmware update tools that let admins update the firmware of their drives from within their OS. The online tools support Linux, Windows, and VMware. Below is a list of affected drives. Get the appropriate firmware update from this page.



View at TechPowerUp Main Site
 
why does that number look familiar?
 
This is the SSD manufacturer's fault, most likely. While HPE does use custom firmware, I really doubt they write it from scratch for, say, Samsung drives. It is possible that Dell and others may also come forward soon.
 
This is the SSD manufacturer's fault, most likely. While HPE does use custom firmware, I really doubt they write it from scratch for, say, Samsung drives. It is possible that Dell and others may also come forward soon.

Possibly, but HP would certainly not be the only one noticing effects from this then, I'd think?

I would be curious what controller the drives use. It sounds like it triggers an SSD controller reset; judging from the value, I can only assume it is a "value wrap" situation which the controller detects and freaks out about, triggering a drive-wide reset including the onboard encryption keys.

If so... Much dumb, very dead, WOW.
 
A fix for a bug... sure... who believes this?
There you have your proof that planned obsolescence on purpose exists.

I am sure they implemented this on purpose, they just didn't expect anyone to find out why the drives die shortly after the warranty period expires...

I hope someone sues HP, and forces all the other manufacturers, too, to stop this practice.
 
A fix for a bug... sure... who believes this?
There you have your proof that planned obsolescence on purpose exists.

I am sure they implemented this on purpose, they just didn't expect anyone to find out why the drives die shortly after the warranty period expires...

I hope someone sues HP, and forces all the other manufacturers, too, to stop this practice.

It is amazing what some people believe.
 
A fix for a bug... sure... who believes this?
There you have your proof that planned obsolescence on purpose exists.

I am sure they implemented this on purpose, they just didn't expect anyone to find out why the drives die shortly after the warranty period expires...

I hope someone sues HP, and forces all the other manufacturers, too, to stop this practice.

We are talking about enterprise drives here, Mr. Conspiracy. HPE is not dumb enough to screw with its primary clients, some of which are bigger than HPE itself. It's not like there aren't a bunch of other storage manufacturers who would be more than happy to take HPE's share in that case.
 
We are talking about enterprise drives here, Mr. Conspiracy. HPE is not dumb enough to screw with its primary clients, some of which are bigger than HPE itself. It's not like there aren't a bunch of other storage manufacturers who would be more than happy to take HPE's share in that case.

Nothing says HP made these drives. But since they're keeping the manufacturer unnamed, it probably means they did. Otherwise I'd be throwing my source under the bus to protect my brand.

To further the conspiracies though: back in my college days (early 2000s) I had Wi-Fi routers from Netgear and D-Link that would die literally days after the 1-year warranty was up. Three years straight, IIRC. I then bought a Linksys WRT54G (model?) and reflashed it with the third-party DD-WRT firmware, and it's still running to this day at my in-laws', as far as I know (10+ years).
 
From the HPE bulletin: (https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-a00092491en_us)

HPE was notified by a Solid State Drive (SSD) manufacturer of a firmware defect affecting certain SAS SSD models (reference the table below) used in a number of HPE server and storage products (i.e., HPE ProLiant, Synergy, Apollo, JBOD D3xxx, D6xxx, D8xxx, MSA, StoreVirtual 4335 and StoreVirtual 3200 are affected).

The issue affects SSDs with an HPE firmware version prior to HPD8 that results in SSD failure at 32,768 hours of operation (i.e., 3 years, 270 days 8 hours). After the SSD failure occurs, neither the SSD nor the data can be recovered. In addition, SSDs which were put into service at the same time will likely fail nearly simultaneously.
 
"After the SSD failure occurs, neither the SSD nor the data can be recovered. In addition, SSDs which were put into service at the same time will likely fail nearly simultaneously."

:eek::eek::eek:

Backup Restore hell awaits...

Not acceptable at all from HP...
 
Nothing says HP made these drives. But since they're keeping the manufacturer unnamed, it probably means they did. Otherwise I'd be throwing my source under the bus to protect my brand.

To further the conspiracies though: back in my college days (early 2000s) I had Wi-Fi routers from Netgear and D-Link that would die literally days after the 1-year warranty was up. Three years straight, IIRC. I then bought a Linksys WRT54G (model?) and reflashed it with the third-party DD-WRT firmware, and it's still running to this day at my in-laws', as far as I know (10+ years).
Those were caused by the plague of cheap electrolytic capacitors. More expensive, higher-quality caps existed, but almost everyone got bit by the cheap ones.
 
Those were caused by the plague of cheap electrolytic capacitors. More expensive, higher-quality caps existed, but almost everyone got bit by the cheap ones.

The other dirty little thing they did was use low-quality flash. Flashing the ROM, anything that wrote to the NVRAM...

Brickity brick...
 
why does that number look familiar?
Every decent coder probably immediately understands what's going on here: this is an integer overflow causing the firmware to crash. The maximum value for a signed 16-bit integer is 32,767; add 1 to this and you'll get -32,768, which probably causes undefined behavior in the firmware.

Those who want to see what happens can run this:
C:
#include <stdio.h>
#include <stdint.h>

int main(int argc, char* argv[]) {
    int16_t test = 32767;   /* INT16_MAX */
    printf("Before: %d\n", test);
    test++;                 /* wraps around to INT16_MIN on typical targets */
    printf("After: %d\n", test);

    return 0;
}
This will output:
Before: 32767
After: -32768

This is a well-known rookie mistake, but there are in fact two mistakes here: 1) the small range for the integer, and 2) whatever caused the crash after the overflow, the second one being the serious one. This kind of bug is inexcusable in critical software like firmware.

So how did this mistake pass code review? Well, either the coder explicitly used a fixed-precision integer type like int16_t, which should have made the overflow pretty obvious, or used int and the compiler chose a 16-bit integer for the embedded platform. For native code, I usually recommend using fixed-precision integer types over int whenever possible, as it makes potential overflows much more obvious and forces the coder to consciously choose an appropriate range.
 
Why does HP on Windows 10 only have the standard AHCI driver?
 
A fix for a bug... sure... who believes this?
There you have your proof that planned obsolescence on purpose exists.

I am sure they implemented this on purpose, they just didn't expect anyone to find out why the drives die shortly after the warranty period expires...

I hope someone sues HP, and forces all the other manufacturers, too, to stop this practice.

Isn't planned obsolescence always on purpose?
 
Every decent coder probably immediately understands what's going on here: this is an integer overflow causing the firmware to crash. The maximum value for a signed 16-bit integer is 32,767; add 1 to this and you'll get -32,768, which probably causes undefined behavior in the firmware.

Those who want to see what happens can run this:
C:
#include <stdio.h>
#include <stdint.h>

int main(int argc, char* argv[]) {
    int16_t test = 32767;   /* INT16_MAX */
    printf("Before: %d\n", test);
    test++;                 /* wraps around to INT16_MIN on typical targets */
    printf("After: %d\n", test);

    return 0;
}
This will output:
Before: 32767
After: -32768

This is a well-known rookie mistake, but there are in fact two mistakes here: 1) the small range for the integer, and 2) whatever caused the crash after the overflow, the second one being the serious one. This kind of bug is inexcusable in critical software like firmware.

So how did this mistake pass code review? Well, either the coder explicitly used a fixed-precision integer type like int16_t, which should have made the overflow pretty obvious, or used int and the compiler chose a 16-bit integer for the embedded platform. For native code, I usually recommend using fixed-precision integer types over int whenever possible, as it makes potential overflows much more obvious and forces the coder to consciously choose an appropriate range.

I called that earlier. This is exactly what is going on. The fact that the firmware resets afterwards is probably an anti-tampering measure biting them.
 
We are talking about enterprise drives here, Mr. Conspiracy. HPE is not dumb enough to screw with its primary clients, some of which are bigger than HPE itself. It's not like there aren't a bunch of other storage manufacturers who would be more than happy to take HPE's share in that case.
While that is a bit "tinfoil hat", what I find interesting (and simultaneously disturbing) is how easy such a scenario would be to pull off. And has it actually been done?
 
While that is a bit "tinfoil hat", what I find interesting (and simultaneously disturbing) is how easy such a scenario would be to pull off. And has it actually been done?

I mean, HP's own consumer ink division has been doing this for some time (incrementing a counter on print operations, unrelated to ink level)... so yeah.
 
I mean, HP's own consumer ink division has been doing this for some time (incrementing a counter on print operations, unrelated to ink level)... so yeah.
Yeah, but that's printer ink, not really a vital part of a system that can cause liability issues. SSDs are a critical component with legal liability potential.
 
Yeah, but that's printer ink, not really a vital part of a system that can cause liability issues. SSDs are a critical component with legal liability potential.

I thought you were asking if it was technically done before?

Not comparing the practices by any means, just saying yep, it has.
 
I am sure they implemented this on purpose, they just didn't expect anyone to find out why the drives die shortly after the warranty period expires...
Maybe my math is off, but servers running these will reach that point in roughly 1,365 days. At 365 days per year, that would be about 3.74 years until that bug failure is reached.

That’s a pretty good run. But if you want to go the conspiracy route, I’ll step out of your way. I’m not one to limit those who have a quest to tilt at windmills.
 
Reminds me of the VelociRaptor firmware bug which would spit out TLER errors after a month.
I never understood how it became worldwide advice to use identical drives in mirrored arrays.
 
Nothing says HP made these drives. But since they're keeping the manufacturer unnamed, it probably means they did. Otherwise I'd be throwing my source under the bus to protect my brand.

To further the conspiracies though: back in my college days (early 2000s) I had Wi-Fi routers from Netgear and D-Link that would die literally days after the 1-year warranty was up. Three years straight, IIRC. I then bought a Linksys WRT54G (model?) and reflashed it with the third-party DD-WRT firmware, and it's still running to this day at my in-laws', as far as I know (10+ years).

HPE literally does not manufacture SSDs. They use rebranded ones - Samsung, Intel, I think Micron too.

I mean, HP's own consumer ink division has been doing this for some time (incrementing a counter on print operations, unrelated to ink level)... so yeah.
IIRC they got punished for that, no?
P.S.
HP and HPE split 4 years ago, just a reminder.
 