RAID - mdadm - Troubleshooting - Disk Failure

cat /proc/mdstat

returns:

Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sde[4] sdd[2](F) sdc[1] sdb[0]
      3144192 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UU_U]
 
unused devices: <none>

NOTE: The (F) marker indicates that the /dev/sdd device has failed. The [4/3] and [UU_U] fields show that only 3 of the 4 members are still up.
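The (F) check can also be scripted. The following is a minimal sketch, assuming the member list keeps the format shown above; failed_members is a hypothetical helper name, not part of mdadm.

```shell
# failed_members [FILE]: print "<array> <device>" for every member flagged (F).
# Reads mdstat-formatted text from FILE (default: /proc/mdstat).
# NOTE: failed_members is a hypothetical helper, not an mdadm command.
failed_members() {
    awk '
        # Array lines look like: md0 : active raid5 sde[4] sdd[2](F) ...
        $1 ~ /^md/ && $2 == ":" {
            for (i = 3; i <= NF; i++) {
                if ($i ~ /\[[0-9]+\].*\(F\)/) {
                    dev = $i
                    sub(/\[.*/, "", dev)   # strip the "[2](F)" suffix
                    print $1, "/dev/" dev
                }
            }
        }
    ' "${1:-/proc/mdstat}"
}
```

Calling failed_members with no argument reads the live /proc/mdstat; an empty result means no member is currently flagged as failed.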


Confirm the failure

mdadm --detail /dev/md0

returns:

/dev/md0:
        Version : 1.2
  Creation Time : Tue Sep  6 18:31:41 2011
     Raid Level : raid5
     Array Size : 3144192 (3.00 GiB 3.22 GB)
  Used Dev Size : 1048064 (1023.67 MiB 1073.22 MB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent
 
    Update Time : Thu Sep  8 16:14:21 2011
          State : clean, degraded
 Active Devices : 3
Working Devices : 3
 Failed Devices : 1
  Spare Devices : 0
 
         Layout : left-symmetric
     Chunk Size : 512K
 
           Name : raidtest.loc:0  (local to host raidtest.loc)
           UUID : e0748cf9:be2ca997:0bc183a6:ba2c9ebf
         Events : 75
 
    Number   Major   Minor   RaidDevice State
       0       8       16        0      active sync   /dev/sdb
       1       8       32        1      active sync   /dev/sdc
       2       0        0        2      removed
       4       8       64        3      active sync   /dev/sde
 
       2       8       48        -      faulty spare   /dev/sdd
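
To pick the faulty member out of this output in a script, the device table can be filtered. This is a rough sketch that assumes the table layout shown above; faulty_from_detail is a hypothetical name.

```shell
# faulty_from_detail: read `mdadm --detail` output on stdin and print the
# device path of every table row whose state contains "faulty".
# NOTE: faulty_from_detail is a hypothetical helper, not an mdadm command.
faulty_from_detail() {
    # Table rows start with three numeric columns (Number, Major, Minor);
    # the device path is the last field on the row.
    awk '/^ *[0-9]+ +[0-9]+ +[0-9]+ / && /faulty/ { print $NF }'
}
```

Typical use would be: mdadm --detail /dev/md0 | faulty_from_detail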

Remove the failed device from the array

mdadm --manage /dev/md0 --remove /dev/sdd

returns:

mdadm: hot removed /dev/sdd from /dev/md0
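
The removal above works because the kernel had already flagged /dev/sdd as faulty. A drive that is misbehaving but still active normally has to be marked failed with --fail before it can be removed. A minimal sketch of that two-step sequence follows; remove_member and the DRY_RUN variable are hypothetical names, and by default the commands are only printed so they can be reviewed first.

```shell
# remove_member ARRAY DEVICE: mark DEVICE as failed, then hot-remove it.
# NOTE: remove_member and DRY_RUN are hypothetical, not part of mdadm.
# With DRY_RUN=1 (the default) the commands are printed instead of executed.
remove_member() {
    array=$1
    dev=$2
    for action in --fail --remove; do
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "mdadm --manage $array $action $dev"
        else
            mdadm --manage "$array" "$action" "$dev" || return 1
        fi
    done
}
```

Running remove_member /dev/md0 /dev/sdd prints the two commands; setting DRY_RUN=0 would execute them, which requires root.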

Obtain the serial number of the failed device

This allows you to refer to your documentation to find out which bay the failed disk is in.

  • Some systems give no visible indication of which bay a specific disk is loaded into, so unless this was documented when the RAID was created, there is a risk of pulling the wrong physical disk instead of the broken one.

smartctl -a /dev/sdd | grep -i serial

returns:

Serial Number:    VB455d882e-8013d7c9
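
Since the serial number is only useful if it was recorded against a bay beforehand, it may help to dump the serials of all members while the array is healthy. A small sketch, assuming smartctl prints a "Serial Number:" line as above; serial_of and inventory are hypothetical helper names.

```shell
# serial_of: read smartctl output on stdin and print the serial number.
# NOTE: serial_of and inventory are hypothetical helpers.
serial_of() {
    # Match the label case-insensitively and print the value after ": ".
    awk -F': *' 'tolower($1) == "serial number" { print $2; exit }'
}

# inventory DEVICE...: print "<device><TAB><serial>" for each device.
# Requires smartctl (smartmontools) and usually root.
inventory() {
    for dev in "$@"; do
        printf '%s\t%s\n' "$dev" "$(smartctl -i "$dev" | serial_of)"
    done
}
```

For example, inventory /dev/sdb /dev/sdc /dev/sdd /dev/sde could be run once after creating the array and the result kept with the bay documentation.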

Replace the failed drive

Physically remove the failed drive.

Replace this failed drive with a new one.

Add the replacement drive to the array:

mdadm --add /dev/md0 /dev/sdd

returns:

mdadm: added /dev/sdd

Check the status of the RAID

mdadm --detail /dev/md0

returns:

/dev/md0:
        Version : 1.2
  Creation Time : Tue Sep  6 18:31:41 2011
     Raid Level : raid5
     Array Size : 3144192 (3.00 GiB 3.22 GB)
  Used Dev Size : 1048064 (1023.67 MiB 1073.22 MB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent
 
    Update Time : Thu Sep  8 17:03:44 2011
          State : clean, degraded, recovering
 Active Devices : 3
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 1
 
         Layout : left-symmetric
     Chunk Size : 512K
 
 Rebuild Status : 36% complete
 
           Name : raidtest.loc:0  (local to host raidtest.loc)
           UUID : e0748cf9:be2ca997:0bc183a6:ba2c9ebf
         Events : 85
 
    Number   Major   Minor   RaidDevice State
       0       8       16        0      active sync   /dev/sdb
       1       8       32        1      active sync   /dev/sdc
       5       8       48        2      spare rebuilding   /dev/sdd
       4       8       64        3      active sync   /dev/sde

NOTE: Wait for the RAID to rebuild.

  • This will result in a fully redundant array again.
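
Progress can also be followed from /proc/mdstat, which shows a progress line while a recovery, resync or reshape is running. The sketch below assumes the "action = NN.N%" format that mdstat uses for that line; rebuild_progress is a hypothetical helper name.

```shell
# rebuild_progress [FILE]: print "<array> <action> <percent>" for every array
# that has a running recovery, resync or reshape.
# Reads mdstat-formatted text from FILE (default: /proc/mdstat).
# NOTE: rebuild_progress is a hypothetical helper, not an mdadm command.
rebuild_progress() {
    awk '
        # Remember which array the following lines belong to.
        $1 ~ /^md/ && $2 == ":" { array = $1 }
        {
            # Progress lines contain: <action> = <number>%
            for (i = 3; i <= NF; i++)
                if ($i ~ /^[0-9.]+%$/ && $(i - 1) == "=")
                    print array, $(i - 2), $i
        }
    ' "${1:-/proc/mdstat}"
}
```

To simply block until the rebuild has finished, mdadm also provides a --wait option: mdadm --wait /dev/md0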

WARNING: If another drive fails before a new drive has been added and the rebuild has completed, all data could be lost!

There are several ways to reduce this risk:

  • Add a hot spare, which causes the rebuild to start right away if a drive fails, minimizing the “danger time”.
    • This simply means adding another drive to an array that already has enough. It is picked up automatically and a rebuild starts if a drive fails.
  • Use another RAID level, such as RAID 6, which can survive 2 failures.

Add another drive to the array:

mdadm --add /dev/md0 /dev/sdf

returns:

mdadm: added /dev/sdf

Grow the array

mdadm --grow /dev/md0 --level=6 --raid-devices=5 --backup-file=/root/backup

returns:

mdadm: level of /dev/md0 changed to raid6

NOTE: Notice the backup-file argument.

Ensure:

  • The backup-file location is NOT in the array.
  • There is sufficient space for this backup file.
    • This file will be deleted after the reshape is complete.
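
The first point can be roughly checked before starting the reshape. This sketch compares the device reported by df for the backup directory with the array device; backing_device and backup_path_ok are hypothetical names, and the check only catches a filesystem mounted directly from the array, not one layered on it through LVM or encryption.

```shell
# backing_device PATH: print the device (first df column) that holds PATH.
# NOTE: backing_device and backup_path_ok are hypothetical helpers.
backing_device() {
    df -P "$1" | awk 'NR == 2 { print $1 }'
}

# backup_path_ok ARRAY FILE: succeed if the directory of FILE is not on a
# filesystem mounted directly from ARRAY.
backup_path_ok() {
    dir=$(dirname "$2")
    [ "$(backing_device "$dir")" != "$1" ]
}
```

For example: backup_path_ok /dev/md0 /root/backup && echo "backup location looks safe"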

Check

cat /proc/mdstat

returns:

Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdf[6] sdd[5] sde[4] sdc[1] sdb[0]
      3144192 blocks super 1.2 level 6, 512k chunk, algorithm 18 [5/4] [UUUU_]
      [==>..................]  reshape = 13.2% (139264/1048064) finish=4.4min speed=3373K/sec
 
unused devices: <none>

Another Check

mdadm --detail /dev/md0

returns:

/dev/md0:
        Version : 1.2
  Creation Time : Tue Sep  6 18:31:41 2011
     Raid Level : raid6
     Array Size : 3144192 (3.00 GiB 3.22 GB)
  Used Dev Size : 1048064 (1023.67 MiB 1073.22 MB)
   Raid Devices : 5
  Total Devices : 5
    Persistence : Superblock is persistent
 
    Update Time : Thu Sep  8 18:54:26 2011
          State : clean
 Active Devices : 5
Working Devices : 5
 Failed Devices : 0
  Spare Devices : 0
 
         Layout : left-symmetric
     Chunk Size : 512K
 
           Name : raidtest.loc:0  (local to host raidtest.loc)
           UUID : e0748cf9:be2ca997:0bc183a6:ba2c9ebf
         Events : 2058
 
    Number   Major   Minor   RaidDevice State
       0       8       16        0      active sync   /dev/sdb
       1       8       32        1      active sync   /dev/sdc
       5       8       48        2      active sync   /dev/sdd
       4       8       64        3      active sync   /dev/sde
       6       8       80        4      active sync   /dev/sdf

NOTE: If for whatever reason you want to go from 6 drives to 5, you can also do that.

  • This will then end up with a hot spare.
raid/mdadm/troubleshooting/disk_failure.txt · Last modified: 2021/09/14 12:21 by peter
