Kernel Errors Present and smartctl

We saw a kernel error in LogWatch and wanted to investigate it more closely. This is a dedicated CentOS 6.5 machine. The error we got was:

WARNING: Kernel Errors Present
res 51/40:18:68:f2:8d/00:00:00:00:00/40 Emask 0x409 (media error)
<F> ...: 1 Time(s)
res 51/40:40:c0:28:8e/00:00:00:00:00/40 Emask 0x409 (media error)
<F> ...: 6 Time(s)
res 51/40:48:38:34:8e/00:00:00:00:00/40 Emask 0x409 (media error)
<F> ...: 5 Time(s)
ata2.00: error: { UNC } ...: 1 Time(s)
ata2: SError: { 10B8B } ...: 1 Time(s)
ata3.00: error: { UNC } ...: 11 Time(s)
sd 2:0:0:0: [sdb] Add. Sense: Unrecovered read error - auto reallocat ...: 2 Time(s)
sd 2:0:0:0: [sdb] Sense Key : Medium Error [current] [descr ...: 2 Time(s)

To check the hard drives, we used smartctl, which controls the Self-Monitoring, Analysis and Reporting Technology (SMART) system built into many newer hard drives. The machine's disks are in a RAID 1 array, so we checked each drive, one at a time. We started with the first disk by running a long (extended) self-test.

> smartctl -t long /dev/sg0

After ~2 hours, we reviewed the output by printing all SMART information about the disk.

> smartctl -a /dev/sg0

This command outputs a lot of information. Here’s a snippet of our results, which shows that the long test completed without error, so this disk gave us nothing to worry about.

<snip...>
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      2924         -
<snip...>
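Since `-a` prints everything, a narrower query can be easier to read. The sketch below wraps two standard smartctl options, `-H` (overall health assessment) and `-l selftest` (only the self-test log), in a small helper. The function name `selftest_summary` and the `SMARTCTL` variable are our own illustrative names, not part of smartctl, and the device path is the one used in this post.

```shell
# Path to the smartctl binary; overridable so the helper can be tried
# on a machine without smartctl installed (hypothetical variable name).
SMARTCTL=${SMARTCTL:-smartctl}

# Print the overall health verdict followed by only the self-test log
# for the device given as the first argument.
selftest_summary() {
    dev=$1
    "$SMARTCTL" -H "$dev"
    "$SMARTCTL" -l selftest "$dev"
}

# Example usage (run as root; adjust the device for your system):
# selftest_summary /dev/sg0
```

This keeps the output to the health verdict and the self-test table, which is usually all that is needed after a long test.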

We then repeated the process for the second disk.

> smartctl -t long /dev/sg1
> smartctl -a /dev/sg1

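The output showed a read failure: the extended test stopped with 90% of the test still remaining, and the first error was at LBA 9316552. Note that the Remaining column reports how much of the self-test was left when it stopped, not the remaining life of the disk. Together with the media errors in the kernel log, this means we need to look at replacing the second disk.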

<snip...>
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      2827         9316552
<snip...>
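To spot a failing entry without reading the whole dump, the self-test log can be filtered. This is a minimal sketch, assuming the log lines look like the ones shown above (entry number after `#`, first-error LBA in the last column). The function name `failed_selftest_lbas` is our own, not part of smartctl.

```shell
# Read smartctl self-test log output on stdin and print the first-error
# LBA of every entry whose status reports a read failure.
failed_selftest_lbas() {
    # Match log entries ("# 1 ...", "#10 ...") that mention a read failure;
    # the LBA of the first error is the last whitespace-separated field.
    awk '/^# *[0-9]+ / && /read failure/ { print $NF }'
}

# Example usage (run as root; adjust the device for your system):
# smartctl -l selftest /dev/sg1 | failed_selftest_lbas
```

An empty result means no entry in the log reported a read failure; it does not by itself prove the disk is healthy, so the full `smartctl -a` output is still worth reading.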