So I recently upgraded the drive controller in my server from an ancient SIL3132-based card to one built on the ASMedia ASM1062. First impressions were great; the card was recognized immediately and just worked, unlike the old one, which had about a 50% chance of being recognized on boot.
So I decided to run a scrub on my array (I'm running eight 3TB drives in a ZFS RAID-Z1 array) just to make sure everything is working. About halfway through, one of my drives started reporting write errors; the count started at two and is now at four.
root@*********:~# zpool status -v
pool: Library
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
scan: scrub in progress since Tue Mar 30 15:23:53 2021
9.41T scanned at 165M/s, 8.59T issued at 151M/s, 12.3T total
0B repaired, 70.11% done, 07:03:46 to go
config:
NAME STATE READ WRITE CKSUM
Library ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
ata-WDC_WD30EFRX-68AX9N0_WD-WMC1T1261593 ONLINE 0 4 0
ata-WDC_WD30EFRX-68AX9N0_WD-WCC1T1278490 ONLINE 0 0 0
ata-WDC_WD30EFRX-68AX9N0_WD-WCC1T1247533 ONLINE 0 0 0
ata-WDC_WD30EFRX-68AX9N0_WD-WCC1T1193008 ONLINE 0 0 0
ata-WDC_WD30EFRX-68AX9N0_WD-WCC1T1206440 ONLINE 0 0 0
ata-WDC_WD30EFRX-68AX9N0_WD-WCC1T1276345 ONLINE 0 0 0
ata-WDC_WD30EFRX-68AX9N0_WD-WCC1T1260261 ONLINE 0 0 0
ata-WDC_WD30EFRX-68AX9N0_WD-WCC1T1261733 ONLINE 0 0 0
errors: No known data errors
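For anyone who would rather not eyeball that table on every check, the per-device counters can be pulled out with a few lines of Python. This is only a rough sketch: the function name `devices_with_errors` and the row pattern are my own, the sample is a trimmed copy of the output above, and it assumes plain integer counters (as far as I know, zpool abbreviates very large counts with a suffix such as K, which this pattern would skip).

```python
import re

# Trimmed copy of the config section from the `zpool status -v` output above.
SAMPLE = """\
NAME STATE READ WRITE CKSUM
Library ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
ata-WDC_WD30EFRX-68AX9N0_WD-WMC1T1261593 ONLINE 0 4 0
ata-WDC_WD30EFRX-68AX9N0_WD-WCC1T1278490 ONLINE 0 0 0
"""

# A device row is: name, state, then three integer counters (READ WRITE CKSUM).
ROW = re.compile(r"^\s*(\S+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")


def devices_with_errors(status_text):
    """Return {name: (read, write, cksum)} for rows with any nonzero counter."""
    flagged = {}
    for line in status_text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header and prose lines do not fit the row pattern
        name, _state, read, write, cksum = match.groups()
        counts = (int(read), int(write), int(cksum))
        if any(counts):
            flagged[name] = counts
    return flagged


if __name__ == "__main__":
    for name, (read, write, cksum) in devices_with_errors(SAMPLE).items():
        print(f"{name}: read={read} write={write} cksum={cksum}")
```

In practice the text would come from running `zpool status` and capturing its output rather than from a pasted sample.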
So of course, the first thing I did was check the SMART status of the drive in question...
root@*********:~# smartctl -a /dev/sdi
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-5.10.0-0.bpo.3-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Red
Device Model: WDC WD30EFRX-68AX9N0
Serial Number: WD-WMC1T1261593
LU WWN Device Id: 5 0014ee 058cefa20
Firmware Version: 80.00A80
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Wed Mar 31 08:00:52 2021 MST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 36) The self-test routine was interrupted
by the host with a hard or soft reset.
Total time to complete Offline
data collection: (40320) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 404) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x70bd) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 181 177 021 Pre-fail Always - 5941
4 Start_Stop_Count 0x0032 099 099 000 Old_age Always - 1120
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 100 253 000 Old_age Always - 0
9 Power_On_Hours 0x0032 052 052 000 Old_age Always - 35637
10 Spin_Retry_Count 0x0032 100 100 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 92
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 54
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 1065
194 Temperature_Celsius 0x0022 121 109 000 Old_age Always - 29
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Interrupted (host reset) 40% 35633 -
# 2 Short offline Completed without error 00% 28559 -
# 3 Short offline Completed without error 00% 27816 -
# 4 Short offline Completed without error 00% 27073 -
# 5 Short offline Completed without error 00% 26353 -
# 6 Short offline Completed without error 00% 25610 -
# 7 Short offline Completed without error 00% 24891 -
# 8 Short offline Completed without error 00% 24148 -
# 9 Short offline Completed without error 00% 22733 -
#10 Short offline Completed without error 00% 21990 -
#11 Short offline Completed without error 00% 21270 -
#12 Short offline Completed without error 00% 20527 -
#13 Short offline Completed without error 00% 19808 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
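The only thing that stuck out to me is that the extended self-test I started last night was interrupted by a host reset with 40% remaining. Everything else looks fine: no reallocated, pending, or uncorrectable sectors and no CRC errors, which is pretty good for an eight-year-old drive (yes, I know it's an old drive).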
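To make the "looks fine" check repeatable, here is a minimal sketch that reads the raw values out of the attribute table and reports any watched attribute that is nonzero. The set of watched IDs (5, 197, 198, 199) is my own pick of the usual suspects for media and cabling trouble, not an official list, and the helper names `raw_values` and `nonzero_watched` are made up for this example. The sample rows are copied from the smartctl output above.

```python
# Attribute IDs to watch; this selection is an assumption, not an official list.
WATCHED = {
    5: "Reallocated_Sector_Ct",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
    199: "UDMA_CRC_Error_Count",
}

# A few rows copied from the SMART attribute table above.
SAMPLE = """\
  5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
  9 Power_On_Hours 0x0032 052 052 000 Old_age Always - 35637
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
"""


def raw_values(attr_text):
    """Map attribute ID -> integer raw value for each attribute row found."""
    values = {}
    for line in attr_text.splitlines():
        fields = line.split()
        # Row layout: ID, name, flag, value, worst, thresh, type, updated,
        # when_failed, raw value (first token only).
        if len(fields) >= 10 and fields[0].isdigit() and fields[2].startswith("0x"):
            try:
                values[int(fields[0])] = int(fields[9])
            except ValueError:
                continue  # skip raw values that are not plain integers
    return values


def nonzero_watched(attr_text):
    """Return {attribute name: raw value} for watched attributes above zero."""
    raw = raw_values(attr_text)
    return {WATCHED[i]: raw[i] for i in WATCHED if raw.get(i, 0) != 0}


if __name__ == "__main__":
    problems = nonzero_watched(SAMPLE)
    print(problems if problems else "watched attributes are all zero")
```

An empty result only says these particular counters are clean; it does not rule out a controller, cable, or power problem.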
Should I be replacing this drive? I'll be doing a memory test later to rule out the possibility of bad memory, and I'm hoping the new controller isn't defective. While the SIL3132 was extremely unreliable between system boots, it never spat out write errors like this.