ZFS: "One or more devices has experienced an unrecoverable error."

Joined
Mar 28, 2018
Messages
1,794 (0.81/day)
Location
Arizona
System Name Space Heater MKIV
Processor AMD Ryzen 7 5800X
Motherboard ASRock B550 Taichi
Cooling Noctua NH-U14S, 3x Noctua NF-A14s
Memory 2x32GB Teamgroup T-Force Vulcan Z DDR4-3600 C18 1.35V
Video Card(s) PowerColor RX 6800 XT Red Devil (2150MHz, 240W PL)
Storage 2TB WD SN850X, 4x1TB Crucial MX500 (striped array), LG WH16NS40 BD-RE
Display(s) Dell S3422DWG (34" 3440x1440 144Hz)
Case Phanteks Enthoo Pro M
Audio Device(s) Edifier R1700BT, Samson SR850
Power Supply Corsair RM850x, CyberPower CST135XLU
Mouse Logitech MX Master 3
Keyboard Glorious GMMK 2 96%
Software Windows 10 LTSC 2021, Linux Mint
So I recently upgraded the drive controller in my server from an ancient SIL3132-based card to one with an ASMedia ASM1062 controller. First impressions were great; the device was recognized immediately and just worked, unlike the old card, which had about a 50% chance of being recognized on boot.

So I decided to run a scrub on my array (eight 3TB drives in a ZFS RAID-Z1 array) just to make sure everything was working. About halfway through, one of my drives started reporting write errors: first two, and now four.

root@*********:~# zpool status -v
  pool: Library
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub in progress since Tue Mar 30 15:23:53 2021
        9.41T scanned at 165M/s, 8.59T issued at 151M/s, 12.3T total
        0B repaired, 70.11% done, 07:03:46 to go
config:

        NAME                                          STATE     READ WRITE CKSUM
        Library                                       ONLINE       0     0     0
          raidz1-0                                    ONLINE       0     0     0
            ata-WDC_WD30EFRX-68AX9N0_WD-WMC1T1261593  ONLINE       0     4     0
            ata-WDC_WD30EFRX-68AX9N0_WD-WCC1T1278490  ONLINE       0     0     0
            ata-WDC_WD30EFRX-68AX9N0_WD-WCC1T1247533  ONLINE       0     0     0
            ata-WDC_WD30EFRX-68AX9N0_WD-WCC1T1193008  ONLINE       0     0     0
            ata-WDC_WD30EFRX-68AX9N0_WD-WCC1T1206440  ONLINE       0     0     0
            ata-WDC_WD30EFRX-68AX9N0_WD-WCC1T1276345  ONLINE       0     0     0
            ata-WDC_WD30EFRX-68AX9N0_WD-WCC1T1260261  ONLINE       0     0     0
            ata-WDC_WD30EFRX-68AX9N0_WD-WCC1T1261733  ONLINE       0     0     0

errors: No known data errors
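
For anyone who wants to watch these counters without reading the whole table each time, here is a minimal Python sketch that pulls out the rows of `zpool status` output with a non-zero READ, WRITE or CKSUM count. The function name `devices_with_errors` and the trimmed `SAMPLE` text are made up for illustration; the parsing assumes the five-column layout shown above and is not an official ZFS interface.

```python
import re

# Counters are plain integers, or abbreviated values such as "1.2K" on busy pools.
COUNT_RE = re.compile(r"^\d+(\.\d+)?[KMGT]?$")


def devices_with_errors(status_text):
    """Return {name: (read, write, cksum)} for config rows with a non-zero counter.

    Counters are kept as strings because zpool may abbreviate large values.
    """
    flagged = {}
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_config = True  # rows after this header belong to the config table
            continue
        if not in_config or len(fields) != 5:
            continue
        name, _state, *counts = fields
        if not all(COUNT_RE.match(c) for c in counts):
            continue  # e.g. the trailing "errors: No known data errors" line
        if any(c != "0" for c in counts):
            flagged[name] = tuple(counts)
    return flagged


# Trimmed copy of the output above (only two of the eight drives kept).
SAMPLE = """\
  pool: Library
 state: ONLINE
config:

        NAME                                          STATE     READ WRITE CKSUM
        Library                                       ONLINE       0     0     0
          raidz1-0                                    ONLINE       0     0     0
            ata-WDC_WD30EFRX-68AX9N0_WD-WMC1T1261593  ONLINE       0     4     0
            ata-WDC_WD30EFRX-68AX9N0_WD-WCC1T1278490  ONLINE       0     0     0

errors: No known data errors
"""

print(devices_with_errors(SAMPLE))
```

On the sample this reports only the WD-WMC1T1261593 drive with its four write errors. Feeding it live output (for example via `subprocess.run(["zpool", "status"], ...)`) is left out here because it depends on the local setup.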

So of course, the first thing I did was check the SMART status of the drive in question...

root@*********:~# smartctl -a /dev/sdi
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-5.10.0-0.bpo.3-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Red
Device Model: WDC WD30EFRX-68AX9N0
Serial Number: WD-WMC1T1261593
LU WWN Device Id: 5 0014ee 058cefa20
Firmware Version: 80.00A80
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Wed Mar 31 08:00:52 2021 MST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 36) The self-test routine was interrupted
by the host with a hard or soft reset.
Total time to complete Offline
data collection: (40320) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 404) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x70bd) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 181 177 021 Pre-fail Always - 5941
4 Start_Stop_Count 0x0032 099 099 000 Old_age Always - 1120
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 100 253 000 Old_age Always - 0
9 Power_On_Hours 0x0032 052 052 000 Old_age Always - 35637
10 Spin_Retry_Count 0x0032 100 100 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 92
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 54
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 1065
194 Temperature_Celsius 0x0022 121 109 000 Old_age Always - 29
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Interrupted (host reset) 40% 35633 -
# 2 Short offline Completed without error 00% 28559 -
# 3 Short offline Completed without error 00% 27816 -
# 4 Short offline Completed without error 00% 27073 -
# 5 Short offline Completed without error 00% 26353 -
# 6 Short offline Completed without error 00% 25610 -
# 7 Short offline Completed without error 00% 24891 -
# 8 Short offline Completed without error 00% 24148 -
# 9 Short offline Completed without error 00% 22733 -
#10 Short offline Completed without error 00% 21990 -
#11 Short offline Completed without error 00% 21270 -
#12 Short offline Completed without error 00% 20527 -
#13 Short offline Completed without error 00% 19808 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

The only thing that stuck out to me is that the extended self-test I started last night was interrupted by a host reset with 40% remaining. Everything else looks fine; pretty good for an eight-year-old drive (yes, I know it's an old drive).
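
As a rough way to repeat that check, the sketch below extracts the raw values of a handful of attributes from `smartctl -a` text. The choice of IDs 5, 197, 198 and 199 is an assumption on my part (they are the ones commonly watched for media and link problems), and `watched_raw_values` is a hypothetical helper, not part of smartmontools. It assumes the ten-column attribute layout shown above and that the raw value of the watched attributes is a plain integer.

```python
# Attribute IDs to watch; an assumption for this sketch, not an official list.
WATCHED_IDS = {5, 197, 198, 199}


def watched_raw_values(smart_text):
    """Return {attribute_name: raw_value} for the watched attribute IDs."""
    result = {}
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least ten columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        if int(fields[0]) in WATCHED_IDS:
            result[fields[1]] = int(fields[9])  # tenth column is RAW_VALUE
    return result


# A few rows copied from the report above.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033 200 200 140 Pre-fail Always  - 0
  9 Power_On_Hours          0x0032 052 052 000 Old_age  Always  - 35637
197 Current_Pending_Sector  0x0032 200 200 000 Old_age  Always  - 0
198 Offline_Uncorrectable   0x0030 200 200 000 Old_age  Offline - 0
199 UDMA_CRC_Error_Count    0x0032 200 200 000 Old_age  Always  - 0
"""

values = watched_raw_values(SAMPLE)
print(values)
print("all zero:", all(v == 0 for v in values.values()))
```

All four come back as zero for this drive, which matches the "no red flags" reading of the report. It does not prove the drive is healthy, only that the drive itself has not logged anything under these counters.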

Should I be replacing this drive? I'll be doing a memory test later to rule out the possibility of bad memory, and I'm hoping the new controller isn't defective. While the SIL3132 was extremely unreliable between system boots, it never spat out write errors like this.
 

TheLostSwede

News Editor
Joined
Nov 11, 2004
Messages
16,055 (2.26/day)
Location
Sweden
System Name Overlord Mk MLI
Processor AMD Ryzen 7 7800X3D
Motherboard Gigabyte X670E Aorus Master
Cooling Noctua NH-D15 SE with offsets
Memory 32GB Team T-Create Expert DDR5 6000 MHz @ CL30-34-34-68
Video Card(s) Gainward GeForce RTX 4080 Phantom GS
Storage 1TB Solidigm P44 Pro, 2 TB Corsair MP600 Pro, 2TB Kingston KC3000
Display(s) Acer XV272K LVbmiipruzx 4K@160Hz
Case Fractal Design Torrent Compact
Audio Device(s) Corsair Virtuoso SE
Power Supply be quiet! Pure Power 12 M 850 W
Mouse Logitech G502 Lightspeed
Keyboard Corsair K70 Max
Software Windows 10 Pro
Benchmark Scores https://valid.x86.fr/5za05v
Bad SATA cable or poorly attached cable?
 
Joined
Mar 28, 2018
Messages
1,794 (0.81/day)
Location
Arizona
I have the drives in an external eSATA enclosure. There are two eSATA cables running from the enclosure to the controller, so I'd think a bad cable would cause all four of the drives on it to spit out errors, not just one. Good guess though; I should have provided more information.
 

TheLostSwede

News Editor
Joined
Nov 11, 2004
Messages
16,055 (2.26/day)
Location
Sweden
Right, well, that sort of makes that less of a possible cause then.
Afraid I have no other suggestions, as I don't use ZFS. I run OMV on my NAS with UnionFS/SnapRAID, as I don't have ECC RAM.
 
Joined
Mar 28, 2018
Messages
1,794 (0.81/day)
Location
Arizona
Bit of an update:

I was able to rule out memory as the cause, since the server passed a memory test. I'm now using a brand-new (for me) system as the server, and I got another error on the same drive after around eight hours.

I've swapped the cables going to the drive enclosure to see if the same drive gets errors or if a different one does.

I ran dmesg after seeing the error and saw this in the output:

Code:
[39629.630199] ata15.00: exception Emask 0x0 SAct 0x400000 SErr 0x0 action 0x6 frozen
[39629.630291] ata15.00: failed command: WRITE FPDMA QUEUED
[39629.630352] ata15.00: cmd 61/08:b0:50:af:b8/00:00:8f:00:00/40 tag 22 ncq dma 4096 out
                        res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[39629.630485] ata15.00: status: { DRDY }
[39629.630528] ata15: hard resetting link
[39629.945062] ata15: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[39629.946474] ata15.00: configured for UDMA/133
[39629.946496] sd 14:0:0:0: [sdi] tag#22 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=31s
[39629.946501] sd 14:0:0:0: [sdi] tag#22 Sense Key : Illegal Request [current]
[39629.946505] sd 14:0:0:0: [sdi] tag#22 Add. Sense: Unaligned write command
[39629.946510] sd 14:0:0:0: [sdi] tag#22 CDB: Write(16) 8a 00 00 00 00 00 8f b8 af 50 00 00 00 08 00 00
[39629.946516] blk_update_request: I/O error, dev sdi, sector 2411245392 op 0x1:(WRITE) flags 0x700 phys_seg 1 prio class 0
[39629.946624] zio pool=Library vdev=/dev/disk/by-id/ata-WDC_WD30EFRX-68AX9N0_WD-WMC1T1261593-part1 error=5 type=2 offset=1234556592128 size=4096 flags=180880

Still not sure if this is a drive error, a controller error, or something else. The drive in question still doesn't have any red flags in the SMART report.

These errors only started after I switched from the SIL3132 controller to the ASM1062 one. I'm hoping the new controller isn't the cause.
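
The numbers in the dmesg excerpt above can be cross-checked with a little arithmetic. The Python sketch below decodes the WRITE(16) CDB from the log, confirms it matches the failing sector, and tests whether the request is aligned to the drive's 4096-byte physical sectors. All constants are copied from the log and the smartctl report; `decode_write16` is a hypothetical helper written for this post, and the partition comment at the end is a guess that nothing in the thread confirms.

```python
CDB_HEX = "8a 00 00 00 00 00 8f b8 af 50 00 00 00 08 00 00"  # from the dmesg line
FAILED_SECTOR = 2411245392   # from blk_update_request
ZIO_OFFSET = 1234556592128   # from the zio line (bytes, relative to -part1)
SACT = 0x400000              # from the ata15.00 exception line
LOGICAL_SECTOR = 512         # bytes, per smartctl "Sector Sizes"
PHYSICAL_SECTOR = 4096       # bytes, per smartctl "Sector Sizes"


def decode_write16(cdb_hex):
    """Return (lba, block_count) from a SCSI WRITE(16) CDB given as hex text."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != 0x8A:
        raise ValueError("not a WRITE(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")      # bytes 2-9: logical block address
    blocks = int.from_bytes(cdb[10:14], "big")  # bytes 10-13: transfer length
    return lba, blocks


lba, blocks = decode_write16(CDB_HEX)
byte_offset = lba * LOGICAL_SECTOR
length = blocks * LOGICAL_SECTOR

start_aligned = byte_offset % PHYSICAL_SECTOR == 0
length_aligned = length % PHYSICAL_SECTOR == 0
ncq_tag = SACT.bit_length() - 1  # the single set bit in SAct is the active tag

# Gap between the whole-disk byte offset and the offset ZFS reports for -part1.
gap = byte_offset - ZIO_OFFSET

print(lba, blocks, start_aligned, length_aligned, ncq_tag, gap)
```

With these inputs the CDB decodes to the same sector the block layer reports, the write is 8 sectors (4096 bytes) long, and both its start and length fall on 4 KiB boundaries. That would suggest, without proving it, that the "Unaligned write command" sense data is not describing the real problem, and that the timeout and hard link reset logged before it are the more interesting part. The 1 MiB gap would be consistent with the partition starting at sector 2048, but that is only an assumption.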
 

newtekie1

Semi-Retired Folder
Joined
Nov 22, 2005
Messages
28,472 (4.24/day)
Location
Indiana, USA
Processor Intel Core i7 10850K@5.2GHz
Motherboard AsRock Z470 Taichi
Cooling Corsair H115i Pro w/ Noctua NF-A14 Fans
Memory 32GB DDR4-3600
Video Card(s) RTX 2070 Super
Storage 500GB SX8200 Pro + 8TB with 1TB SSD Cache
Display(s) Acer Nitro VG280K 4K 28"
Case Fractal Design Define S
Audio Device(s) Onboard is good enough for me
Power Supply eVGA SuperNOVA 1000w G3
Software Windows 10 Pro x64
SMART is not the be-all and end-all for errors. I've had plenty of bad drives that never showed any SMART errors.
 
Joined
Mar 28, 2018
Messages
1,794 (0.81/day)
Location
Arizona
I guess I'll install the cold spare I have if I get another error and see what happens.
 