
Why do Solid State Drives fail so suddenly?

What is going on when solid state drives fail? If they run out of spare sectors they could go into read-only mode, so I assume it is something more than this.
 
I've had 4 fail, only one went read-only, and that one was recoverable. As for why: there are just so many things that can go wrong, and sometimes it's just a cap or a fuse and an easy fix.
 
I had a Crucial M4 256GB fail with 98% life left.

Guess which brand I won’t be buying :laugh:
 
If I had to hazard a guess: a combination of wear leveling, and the system area not doing wear leveling. Both with their own issues.
The main storage area of the NAND does wear leveling, so it wears more or less evenly. Which also means it all gets borked more or less evenly, so when one block goes, it all goes.
And the second part is that the system area of the NAND, which stores the firmware, wear-leveling data, metadata in general, etc., does not do wear leveling. If that section goes, well, then you've got a drive without firmware and it will just fail to boot.

Though it's been a while since I read about it, so I'm not sure that's the case, but it would be my guess at least. And mostly the second part being the larger issue with drives just going poof one day, even with plenty of "health" left reported by SMART.
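Just to make the first point concrete, here's a toy Python sketch (made-up numbers, nothing to do with any real controller's firmware) of why even wear leveling means the whole pool reaches its endurance limit at roughly the same time:

```python
BLOCKS = 256      # hypothetical block count for the toy model
PE_LIMIT = 1000   # hypothetical program/erase rating per block
erase_counts = [0] * BLOCKS
total_erases = 0

# Ideal wear leveling hands each erase to the least-worn block, which in
# this toy model degenerates to a simple round-robin across the pool.
while erase_counts[total_erases % BLOCKS] < PE_LIMIT:
    erase_counts[total_erases % BLOCKS] += 1
    total_erases += 1

spread = max(erase_counts) - min(erase_counts)
print(f"Total erases the pool absorbed: {total_erases:,}")
print(f"Wear spread between most- and least-worn block: {spread} cycles")
# The spread stays essentially zero: once one block is worn out, nearly
# every block is, so the drive tends to "age" in lockstep rather than
# degrading gracefully one sector at a time.
```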
 
A car ECM has two CPUs, one rather anemic, so the vehicle can limp back home in case of failure.

Why can't a solid state drive include a similar second CPU so it can limp along after failure and allow access to the data?
 
A car ECM has two CPUs, one rather anemic, so the vehicle can limp back home in case of failure.

Why can't a solid state drive include a similar second CPU so it can limp along after failure and allow access to the data?

That's a fairly easy question to answer... $$$$$
 
Depends on how you see it. Sure, the ECM might have redundant systems, but does everything else? Does it also come with extra cylinders if one implodes? How about a third axle if one falls off? Two steering wheels? Plenty of things can make sure no limping gets done.

Either way, just because a drive isn't detected, or shows up as "raw", doesn't mean the data is gone. Factory access mode exists for a reason: from there you can read the raw data from the blocks directly, so data recovery is certainly possible in most cases. The tools aren't exactly distributed freely, though, and as far as I recall there is no standard method for accessing said factory mode.
 
I had a Crucial M4 256GB fail with 98% life left.

Guess which brand I won’t be buying :laugh:
Hi,
I still have four very old Crucial MX100s in operation:
2 x 256GB
2 x 128GB
Only one died, and it was really Linux never running TRIM on it, although the Crucial firmware wasn't compatible with Linux as it turns out.
As long as Windows is the OS they are fine, and Crucial sent back an RMA replacement. Although it's a refurb, it's still working just fine; heck, it might be the same one I sent to them lol

 
Now that I think of it, it died while running in my X58 system, and I lost 3 out of 4 Raptor 150s and a RevoDrive with that board in the 8 or 9 years that I ran it as a daily.
 
Unless the information is made public by people who examine failures, we won't really know, but from what I have researched I think all of the following are possible causes.

1 - Early, more primitive firmware having poor wear levelling, so some cells fail way earlier than they would with wear levelling.
2 - The address mapping table (sorry if I got the name wrong): as I understand it, it is stuck on the same physical cells, and certain workloads might wear these down before the other cells. I think SATA SSDs with no DRAM would have accelerated wear on these cells as well. As I understand it, DRAM-less NVMe SSDs can utilize system RAM in place of a DRAM buffer (the host memory buffer feature), so they might not have this issue to the same degree, although it's still riskier than an onboard buffer. (See the sketch after this list.)
3 - Bad firmware, where a bug could brick an SSD.
4 - Blown capacitors or other failed circuitry, more likely on power cycles.
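Very roughly, the mapping table is the drive's logical-to-physical lookup: every write the OS makes gets redirected to whatever flash page is free, and the record of those redirections has to live in flash too, so it takes wear of its own. A toy Python sketch of the idea (nothing like a real controller, which caches and journals this far more cleverly):

```python
# Toy flash translation layer (FTL): maps logical block addresses (LBAs)
# to physical pages. Purely illustrative; real controllers batch and
# journal the table instead of rewriting it per update.
class ToyFTL:
    def __init__(self, physical_pages: int):
        self.free_pages = list(range(physical_pages))  # pages not yet written
        self.mapping = {}        # LBA -> physical page
        self.table_writes = 0    # metadata writes caused by remapping

    def write(self, lba: int):
        # Flash can't overwrite in place: grab a fresh page, remap the LBA,
        # and leave the old page for garbage collection later.
        self.mapping[lba] = self.free_pages.pop(0)
        self.table_writes += 1   # the mapping update itself costs a write

ftl = ToyFTL(physical_pages=1024)
for _ in range(100):
    ftl.write(lba=7)             # rewriting the same logical sector...
print(ftl.mapping[7], ftl.table_writes)
# ...lands on 100 different physical pages, and the mapping metadata was
# touched 100 times -- wear that the user-visible "health" percentage
# doesn't necessarily account for.
```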

Enterprise SSDs, because they have power-loss protection, don't have to honor synchronous writes (the data is already safe in their protected cache); this reduces the window of vulnerability to kernel panics, as the data is written quicker, and it also drastically reduces write amplification, since sync writes are very bad for that. I actually emulate enterprise SSD behavior on my consumer SSDs when possible, as I am now of the opinion it is safer. I have never blogged or posted about it before, though, as I am aware it's a controversial opinion. But as an example: on a FIO sync write test, one of my consumer SSDs in default mode took 46s to write all the data, so 46s of vulnerability to a kernel panic or power loss; with enterprise emulation (disabling sync on the SSD but not the filesystem), the same data is written in 7s, so a much smaller window of vulnerability. In this mode the filesystem is still sync, not async, so it isn't immediately reporting success to the software.

Fully async it writes in 4s (with the OS being told it's done immediately), provided there are no other writes at the same time, as sync writes on the filesystem nearly always take priority over async writes. So I still keep sync enabled on my filesystems.
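For anyone who wants to feel that difference on their own drive, here's a crude Python timing sketch (much rougher than a proper FIO job, and the absolute numbers depend entirely on the drive and OS): it writes the same data once with an fsync after every write and once fully buffered.

```python
import os
import time

PATH = "sync_test.bin"   # scratch file in the current directory
CHUNK = b"\0" * 4096     # 4 KiB per write
COUNT = 2000             # ~8 MiB total, small enough to be harmless

def timed_write(sync_each_write: bool) -> float:
    """Write COUNT chunks, optionally forcing each one to stable media."""
    start = time.perf_counter()
    with open(PATH, "wb") as f:
        for _ in range(COUNT):
            f.write(CHUNK)
            if sync_each_write:
                f.flush()
                os.fsync(f.fileno())  # don't return until the drive has it
    return time.perf_counter() - start

# The buffered run can still be sitting in the OS page cache when it
# "finishes" -- that unflushed window is the vulnerability being discussed.
print(f"fsync after every write: {timed_write(True):.2f} s")
print(f"fully buffered writes:   {timed_write(False):.2f} s")
os.remove(PATH)
```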
 
I remember over a decade ago, when they were new, a lot of failures were attributed to firmware; some drives wouldn't even come out of sleep.

OCZ had issues, and there were a few controllers back then too: SandForce, Indilinx Barefoot, etc. I would say that's long in the past, but who knows.

Chrcoluk covered it well.
 
I had a Crucial M4 256GB fail with 98% life left.

Guess which brand I won’t be buying :laugh:
I'm with you, because I caught the SMART data being wacky on my Crucial MX500 500 GB SATA SSD. It got taken out of my PC by February 4, 2021.
IIRC, February 4, 2021 was when I installed my Samsung 970 Pro 512 GB NVMe SSD.
 
The only SSDs that went bad on me were all M.2 drives.

I have/had 6x 1TB MX500s, which are all fine after years of constant usage (not just sitting there as a data grave).
My brother still runs a very old SanDisk Plus (120GB) which has exceeded its TBW rating by almost 2x and is still fine.
ATM I have two 4TB 870 QVOs, 1x 4TB SanDisk Ultra 3D (basically a WD Blue) and two 500GB 870 EVOs for backups.
Not a single one of those SSDs has even one dead sector or any kind of error.

But my HP EX900 (M.2 NVMe) dropped its write speed to 0.x MB/s and bluescreened non-stop after less than a week of usage.
My Corsair MP600 just froze my PC on the second day and never came back.
And now my MP600 Pro runs fine but has already dropped to 94% health with just around 2% of its TBW reached (no errors).

And I had a WD Green M.2 SATA SSD for an old laptop that went read-only before the Windows installation was even finished.
 
My guess is that SSDs are similar to thumb drives in that respect: the memory chip can just fail, and with thumb drives that happens quite often if they've got some generic/low-quality memory. Current SSDs all use TLC or QLC, which is both cheaper per GB and less reliable than, say, SLC or MLC, especially with the current trend of eliminating the DRAM even in midrange models (think WD SN550). I think it's the algorithms not doing well enough to hold that cheap rubbish hardware together.

Personally I've experienced a couple of SSD failures, and just to be a bit more fail-safe (at least in the main desktop rig) I chose to have a dedicated OS drive: a 32GB Optane module as a standalone SSD. So far it works as an above-average SSD, even though it's just got a basic Intel microcontroller and 2 banks of 3D XPoint memory. Windows treats it like an SSD and TRIM is working. Also got a cheapass WDS240 (only for games tho), a WD SN750 for some random stuff, and an old 2010 WD Blue 1TB with 15k hours and an issue-free SMART.
 
Mine is still good after 7 years, a SanDisk Extreme Pro 960GB

I have 2 original Intel SSDs that set me back nearly $500 for 2x 80GB, and they still work today. In the end there is a lot that can go wrong, and I would not be surprised if it's often just a cheap component on the SSD that gets shorted and makes it fail.
 
Linux never running TRIM on it
Ah yes, I love having to enable automatic TRIM myself on an OS by editing the file that automounts drives at boot (fstab) - another thing I had to do myself was setting my drives to actually be mounted on startup. Why?? I had to do all that while I was testing Linux. Needless to say, never again.

I had a Crucial M4 256GB fail with 98% life left.

Guess which brand I won’t be buying :laugh:
Jeez. Both my MX500s are manually updated to the latest firmware and are at 99% and 98% health respectively; I've been using them for over a year. Hope they last.

The real star in my PC is a SanDisk X400 that I've had for ages now and it's still at 95% life despite all the hell and formats I've put it through. It's still as quick as my MX500s as well.

My 980 Pro is at 94% and 20 TBW after over a year, so it should be fine to last until that 300 TBW rating. If it dies, I'll just replace it - my important backups are on the cloud anyway.
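A quick back-of-the-envelope check in Python, using those rough figures (about 20 TB written in roughly a year, against a 300 TBW rating):

```python
# Back-of-the-envelope only: assumes the write rate stays roughly constant.
tbw_rating = 300        # TB of host writes the endurance rating covers
written_so_far = 20     # TB written to date
years_elapsed = 1.0     # roughly a year of use

rate = written_so_far / years_elapsed                 # TB per year
years_left = (tbw_rating - written_so_far) / rate     # years to the rating

print(f"~{rate:.0f} TB/year -> about {years_left:.0f} more years "
      f"before reaching the 300 TBW rating")
```

At that pace the rating is well over a decade away, so (barring the sudden failures this thread is about) wear-out isn't the thing to worry about.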
 
I really like the idea of soft failure (warning before total failure)
 
With SSDs, expect the first symptom to suddenly be an error in the event log about a file being corrupted, or SFC failing and saying that it can't fix some files when SFC /scannow is run!
Along with extreme lag!

These symptoms are based on failures of 2 different SSDs.
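If you hit symptoms like that, it's worth pulling the drive's SMART report before things degrade further. A rough Python sketch (assuming smartmontools' smartctl is installed, and using /dev/sda as a stand-in for whatever the suspect drive is on your system; you may need elevated privileges):

```python
import subprocess

DEVICE = "/dev/sda"  # placeholder -- point this at the suspect drive

# -H asks for the overall health verdict, -A dumps the attribute table.
for flag in ("-H", "-A"):
    report = subprocess.run(["smartctl", flag, DEVICE],
                            capture_output=True, text=True)
    print(report.stdout)
```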
 
Why can't a solid state drive include a similar second CPU so it can limp along after failure and allow access to the data?
If the actual storage medium fails, that'll do you no good. DRAM and flash memory are far more likely to fail than the controller driving them. What SSDs need is more redundancy: more ECC bits give you more resilience at the cost of capacity. Not every bit has to be committed to novel data.
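To make that trade-off concrete, here's a toy Hamming(7,4) example in Python: 4 data bits carry 3 parity bits, so any single flipped bit in the 7-bit codeword can be corrected. Real controllers use far stronger codes (BCH/LDPC over whole pages), but the principle is the same: redundancy buys resilience and is paid for in raw capacity.

```python
# Hamming(7,4): encode 4 data bits into a 7-bit codeword that survives
# any single bit flip. Capacity cost: only 4 of every 7 raw bits are data.

def encode(d):                          # d = [d1, d2, d3, d4]
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]         # codeword positions 1..7

def decode(c):                          # c = 7-bit codeword, maybe corrupted
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]      # parity check over positions 1,3,5,7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]      # parity check over positions 2,3,6,7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]      # parity check over positions 4,5,6,7
    error_pos = 4 * s3 + 2 * s2 + s1    # 0 means no single-bit error seen
    if error_pos:
        c[error_pos - 1] ^= 1           # flip the offending bit back
    return [c[2], c[4], c[5], c[6]]     # recover d1..d4

data = [1, 0, 1, 1]
word = encode(data)
word[5] ^= 1                            # simulate one cell going bad
assert decode(word) == data             # data still comes back intact
print("corrected:", data)
```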

I had a Crucial M4 256GB fail with 98% life left.
I had two WD Black drives fail within a week of buying them. I still buy WD though. The reality is that hardware can fail early in its life, which is why a good stress test on new hardware is never a bad idea. I would have caught the two bad drives if I had properly vetted them before adding them to my RAID.
 