We saw a kernel error in LogWatch and wanted to investigate it more closely. This is a dedicated CentOS 6.5 machine. The error we got was:
WARNING: Kernel Errors Present
    res 51/40:18:68:f2:8d/00:00:00:00:00/40 Emask 0x409 (media error) <F> ...: 1 Time(s)
    res 51/40:40:c0:28:8e/00:00:00:00:00/40 Emask 0x409 (media error) <F> ...: 6 Time(s)
    res 51/40:48:38:34:8e/00:00:00:00:00/40 Emask 0x409 (media error) <F> ...: 5 Time(s)
    ata2.00: error: { UNC } ...: 1 Time(s)
    ata2: SError: { 10B8B } ...: 1 Time(s)
    ata3.00: error: { UNC } ...: 11 Time(s)
    sd 2:0:0:0: [sdb] Add. Sense: Unrecovered read error - auto reallocat ...: 2 Time(s)
    sd 2:0:0:0: [sdb] Sense Key : Medium Error [current] [descr ...: 2 Time(s)
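LogWatch only summarises these messages. To see the full entries, the kernel log can be filtered directly. The helper below is a minimal sketch: the function name disk_errors is made up for illustration, and the pattern only covers the message types that appear in the report above.

```shell
# Keep only kernel log lines that look like the ATA/SCSI errors above.
# Reads log text on stdin; the name disk_errors is illustrative.
disk_errors() {
    grep -E 'ata[0-9]+(\.[0-9]+)?: (error|SError)|media error|Medium Error|Unrecovered read error'
}

# Typical use on the live system (run as root):
#   dmesg | disk_errors
#   disk_errors < /var/log/messages
```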
We wanted to check the hard drives, so we used smartctl. smartctl controls the Self-Monitoring, Analysis and Reporting Technology (SMART) system built into most modern hard drives. The machine runs RAID 1, so we checked each drive, one at a time. We started with the first disk by running a long smartctl self-test.
> smartctl -t long /dev/sg0
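The test runs in the background on the drive itself, and smartctl prints an estimated completion time when it starts. Progress can also be polled: smartctl -c prints the drive's capabilities, including a self-test execution status that reports how much of a running test is left. A small sketch, assuming the status text reads "NN% of test remaining" (the wording may differ between smartmontools versions) and using a made-up function name:

```shell
# Print the percentage of a running self-test that is still left,
# given `smartctl -c` output on stdin. Prints nothing if no test is running.
# Assumes the status text "NN% of test remaining".
test_remaining() {
    sed -n 's/^[[:space:]]*\([0-9][0-9]*\)% of test remaining.*/\1/p'
}

# Typical use (run as root):
#   smartctl -c /dev/sg0 | test_remaining
```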
After ~2 hours, we reviewed the output by printing all SMART information about the disk.
> smartctl -a /dev/sg0
This command outputs a lot of information. Here’s a snippet of our results, which shows that the long test completed without error, so this disk gave us nothing to worry about.
<snip...>
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      2924         -
<snip...>
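Since only the self-test log matters at this point, the rest of the -a output can be skipped. smartctl -l selftest /dev/sg0 prints just that log; alternatively, the section can be cut out of saved -a output. The sketch below does the latter, assuming the section starts with the "SMART Self-test log" header shown above and ends at the next blank line. The function name is made up.

```shell
# Print only the self-test log section from `smartctl -a` output on stdin.
selftest_section() {
    awk '
        /^SMART Self-test log/ { show = 1 }   # section header: start printing
        show && /^$/           { exit }       # blank line ends the section
        show                   { print }
    '
}

# Typical use (run as root):
#   smartctl -a /dev/sg0 | selftest_section
```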
We then repeated the process for the second disk.
> smartctl -t long /dev/sg1
> smartctl -a /dev/sg1
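The output showed a read failure at LBA 9316552. Note that the 90% in the Remaining column is not the drive's remaining life: it is the portion of the self-test that was still left to run when the test stopped, so the drive hit an unreadable sector after only about 10% of the scan. Together with the media errors in the kernel log, this means we need to look at replacing the second disk.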
<snip...>
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      2827         9316552
<snip...>
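The failing entry can also be picked out mechanically, which is handy when checking several disks. The sketch below reads the self-test log and prints the lifetime hours and first-error LBA of every entry whose status mentions a failure. It assumes the column layout shown above, with LifeTime(hours) and LBA_of_first_error as the last two fields; the function name is made up.

```shell
# List failed self-tests from smartctl self-test log text on stdin.
# Output: one line per failed entry, "hours=<LifeTime> lba=<LBA_of_first_error>".
# Returns non-zero if any entry reports a failure, zero otherwise.
failed_selftests() {
    awk '
        /^#/ && /failure/ { print "hours=" $(NF-1) " lba=" $NF; bad = 1 }
        END { exit bad + 0 }
    '
}

# Typical use (run as root):
#   smartctl -l selftest /dev/sg1 | failed_selftests
```

For the second disk's log entry above, this prints hours=2827 lba=9316552.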