cat /proc/mdstat
returns:
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sde[4] sdd[2](F) sdc[1] sdb[0]
      3144192 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UU_U]

unused devices: <none>
NOTE: The (F) indicates that the /dev/sdd device has failed.
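If mdadm has not yet flagged the disk as failed but you already know it is bad (for example, SMART is reporting errors), you can mark it as failed manually before removing it. A sketch, assuming the suspect drive is /dev/sdd:
mdadm --manage --fail /dev/md0 /dev/sdd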
mdadm --detail /dev/md0
returns:
/dev/md0:
        Version : 1.2
  Creation Time : Tue Sep 6 18:31:41 2011
     Raid Level : raid5
     Array Size : 3144192 (3.00 GiB 3.22 GB)
  Used Dev Size : 1048064 (1023.67 MiB 1073.22 MB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

    Update Time : Thu Sep 8 16:14:21 2011
          State : clean, degraded
 Active Devices : 3
Working Devices : 3
 Failed Devices : 1
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

           Name : raidtest.loc:0  (local to host raidtest.loc)
           UUID : e0748cf9:be2ca997:0bc183a6:ba2c9ebf
         Events : 75

    Number   Major   Minor   RaidDevice State
       0       8       16        0      active sync   /dev/sdb
       1       8       32        1      active sync   /dev/sdc
       2       0        0        2      removed
       4       8       64        3      active sync   /dev/sde

       2       8       48        -      faulty spare   /dev/sdd
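You can also inspect the md superblock of an individual member directly, which is useful for cross-checking the event counter and per-drive state. A sketch, assuming the member you want to look at is /dev/sdd:
mdadm --examine /dev/sdd
Then hot-remove the failed drive from the array: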
mdadm --manage --remove /dev/md0 /dev/sdd
returns:
mdadm: hot removed /dev/sdd from /dev/md0
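If you intend to reuse the failed disk elsewhere later (for example, for testing), you can wipe its md superblock now, before pulling it, so it will not be auto-assembled into an array on another machine. A sketch; only run this on a drive you no longer want in any array:
mdadm --zero-superblock /dev/sdd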
Next, look up the serial number of the failed drive. This allows you to refer to your documentation and know which bay the failed disk is in.
smartctl -a /dev/sdd | grep -i serial
returns:
Serial Number: VB455d882e-8013d7c9
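If you are unsure which device names map to which physical drives, you can print the serial numbers of all members in one pass. A minimal sketch, assuming the members are /dev/sdb through /dev/sde and smartmontools is installed:
for d in /dev/sd[b-e]; do
    echo -n "$d: "
    smartctl -a "$d" | grep -i serial
done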
Physically remove the failed drive.
Replace this failed drive with a new one.
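Before adding the replacement, it is worth confirming that the new drive is at least as large as the one it replaces, since mdadm will refuse a device that is too small. A sketch, assuming the new drive appeared as /dev/sdd and /dev/sdb is a surviving member to compare against:
blockdev --getsize64 /dev/sdd
blockdev --getsize64 /dev/sdb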
Add the replacement drive to the array:
mdadm --add /dev/md0 /dev/sdd
returns:
mdadm: added /dev/sdd
mdadm --detail /dev/md0
returns:
/dev/md0:
        Version : 1.2
  Creation Time : Tue Sep 6 18:31:41 2011
     Raid Level : raid5
     Array Size : 3144192 (3.00 GiB 3.22 GB)
  Used Dev Size : 1048064 (1023.67 MiB 1073.22 MB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

    Update Time : Thu Sep 8 17:03:44 2011
          State : clean, degraded, recovering
 Active Devices : 3
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 512K

 Rebuild Status : 36% complete

           Name : raidtest.loc:0  (local to host raidtest.loc)
           UUID : e0748cf9:be2ca997:0bc183a6:ba2c9ebf
         Events : 85

    Number   Major   Minor   RaidDevice State
       0       8       16        0      active sync   /dev/sdb
       1       8       32        1      active sync   /dev/sdc
       5       8       48        2      spare rebuilding   /dev/sdd
       4       8       64        3      active sync   /dev/sde
NOTE: Wait for the RAID to rebuild.
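You can follow the rebuild as it progresses and, if it is crawling, raise the kernel's minimum resync speed (in KB/s per device). A sketch; the speed value is only an example:
watch -n 5 cat /proc/mdstat
echo 50000 > /proc/sys/dev/raid/speed_limit_min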
WARNING: If another drive fails before a new drive has been added and the rebuild has completed, all data in the array could be lost!
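To keep that degraded window as short as possible, it helps to be notified the moment a drive fails. mdadm can run as a monitoring daemon and send mail on events; a sketch, assuming local mail delivery to root is configured on the host:
mdadm --monitor --scan --mail=root --daemonise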
There are several ways to reduce this risk; the approach shown below is to add an extra drive and convert the array from RAID 5 to RAID 6, so that it can tolerate two failed drives:
mdadm --add /dev/md0 /dev/sdf
returns:
mdadm: added /dev/sdf
mdadm --grow /dev/md0 --level=6 --raid-devices=5 --backup-file=/root/backup
returns:
mdadm level of /dev/md0 changed to raid6
NOTE: Notice the --backup-file argument; mdadm uses this file to back up the critical section of the array during the reshape, so the operation can be restarted safely if it is interrupted.
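If the machine crashes or reboots mid-reshape, that same backup file is needed to restart the array. A sketch of what that might look like, assuming the same member devices and backup path as above:
mdadm --assemble /dev/md0 --backup-file=/root/backup /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf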
Check that the reshape is progressing:
cat /proc/mdstat
returns:
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdf[6] sdd[5] sde[4] sdc[1] sdb[0]
      3144192 blocks super 1.2 level 6, 512k chunk, algorithm 18 [5/4] [UUUU_]
      [==>..................]  reshape = 13.2% (139264/1048064) finish=4.4min speed=3373K/sec

unused devices: <none>
mdadm --detail /dev/md0
returns:
/dev/md0:
        Version : 1.2
  Creation Time : Tue Sep 6 18:31:41 2011
     Raid Level : raid6
     Array Size : 3144192 (3.00 GiB 3.22 GB)
  Used Dev Size : 1048064 (1023.67 MiB 1073.22 MB)
   Raid Devices : 5
  Total Devices : 5
    Persistence : Superblock is persistent

    Update Time : Thu Sep 8 18:54:26 2011
          State : clean
 Active Devices : 5
Working Devices : 5
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

           Name : raidtest.loc:0  (local to host raidtest.loc)
           UUID : e0748cf9:be2ca997:0bc183a6:ba2c9ebf
         Events : 2058

    Number   Major   Minor   RaidDevice State
       0       8       16        0      active sync   /dev/sdb
       1       8       32        1      active sync   /dev/sdc
       5       8       48        2      active sync   /dev/sdd
       4       8       64        3      active sync   /dev/sde
       6       8       80        4      active sync   /dev/sdf
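After changing the RAID level, you may also want to refresh the array definition used at boot. A sketch; the config file location varies by distribution (/etc/mdadm.conf or /etc/mdadm/mdadm.conf), so check yours first:
mdadm --detail --scan
Review the resulting ARRAY line and update the existing entry in your mdadm.conf accordingly.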
NOTE: If, for whatever reason, you want to go back from RAID 6 to RAID 5, you can also do that.
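A sketch of that reverse conversion, assuming the current reshape has finished, you accept dropping back to single-drive redundancy, and a fresh backup file path is used; the freed drive should be left in the array as a spare:
mdadm --grow /dev/md0 --level=5 --raid-devices=4 --backup-file=/root/backup2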