RAID - mdadm - Troubleshooting - Disk Failure
cat /proc/mdstat
returns:
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sde[4] sdd[2](F) sdc[1] sdb[0]
      3144192 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UU_U]

unused devices: <none>
NOTE: The (F) indicates that the /dev/sdd device has failed.
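The failed member can also be extracted from /proc/mdstat with a few lines of shell. This is a minimal sketch that runs against the sample line above; on a live system, read the real file instead (the variable name and sample text are illustrative, not part of mdadm):

```shell
# Sketch: find the member flagged (F) in /proc/mdstat.
# On a live system use:  mdstat=$(cat /proc/mdstat)
mdstat='md0 : active raid5 sde[4] sdd[2](F) sdc[1] sdb[0]'
failed=''
for tok in $mdstat; do
    case $tok in
        *'(F)') failed="/dev/${tok%%\[*}" ;;   # strip the [n](F) suffix
    esac
done
echo "failed device: $failed"
```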
Confirm the failure
mdadm --detail /dev/md0
returns:
/dev/md0:
        Version : 1.2
  Creation Time : Tue Sep  6 18:31:41 2011
     Raid Level : raid5
     Array Size : 3144192 (3.00 GiB 3.22 GB)
  Used Dev Size : 1048064 (1023.67 MiB 1073.22 MB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

    Update Time : Thu Sep  8 16:14:21 2011
          State : clean, degraded
 Active Devices : 3
Working Devices : 3
 Failed Devices : 1
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

           Name : raidtest.loc:0  (local to host raidtest.loc)
           UUID : e0748cf9:be2ca997:0bc183a6:ba2c9ebf
         Events : 75

    Number   Major   Minor   RaidDevice State
       0       8       16        0      active sync   /dev/sdb
       1       8       32        1      active sync   /dev/sdc
       2       0        0        2      removed
       4       8       64        3      active sync   /dev/sde

       2       8       48        -      faulty spare   /dev/sdd
Remove the failed device from the array
mdadm --manage --remove /dev/md0 /dev/sdd
returns:
mdadm: hot removed /dev/sdd from /dev/md0
Obtain the serial number of the failed device
This allows you to refer to your documentation to know which bay the failed disk is in.
- Some systems have no visible indication of which bay a specific disk occupies, so unless this was documented when the RAID was created, there is a risk of pulling the wrong physical disk instead of the actual broken one.
smartctl -a /dev/sdd | grep -i serial
returns:
Serial Number: VB455d882e-8013d7c9
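It is worth recording the serial numbers of all remaining members at the same time, so the documentation stays complete. This sketch parses the member list out of the `mdadm --detail` output shown earlier; the sample text and variable names are assumptions standing in for the live command:

```shell
# Sketch: list every active member of the array so each serial can be
# recorded. On a live system use:  detail=$(mdadm --detail /dev/md0)
detail='    Number   Major   Minor   RaidDevice State
       0       8       16        0      active sync   /dev/sdb
       1       8       32        1      active sync   /dev/sdc
       4       8       64        3      active sync   /dev/sde'
members=$(printf '%s\n' "$detail" | awk '$NF ~ /^\/dev\// {print $NF}')
printf '%s\n' "$members"
# For each member, record the serial (requires smartmontools and root):
#   smartctl -a "$dev" | grep -i serial
```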
Replace the failed drive
Physically remove the failed drive.
Replace this failed drive with a new one.
Add the replacement drive to the array:
mdadm --add /dev/md0 /dev/sdd
returns:
mdadm: added /dev/sdd
Check the status of the RAID
mdadm --detail /dev/md0
returns:
/dev/md0:
        Version : 1.2
  Creation Time : Tue Sep  6 18:31:41 2011
     Raid Level : raid5
     Array Size : 3144192 (3.00 GiB 3.22 GB)
  Used Dev Size : 1048064 (1023.67 MiB 1073.22 MB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

    Update Time : Thu Sep  8 17:03:44 2011
          State : clean, degraded, recovering
 Active Devices : 3
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 512K

 Rebuild Status : 36% complete

           Name : raidtest.loc:0  (local to host raidtest.loc)
           UUID : e0748cf9:be2ca997:0bc183a6:ba2c9ebf
         Events : 85

    Number   Major   Minor   RaidDevice State
       0       8       16        0      active sync   /dev/sdb
       1       8       32        1      active sync   /dev/sdc
       5       8       48        2      spare rebuilding   /dev/sdd
       4       8       64        3      active sync   /dev/sde
NOTE: Wait for the RAID to rebuild.
- This will result in a fully redundant array again.
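The rebuild progress can be read programmatically, which is handy for scripts that should wait until redundancy is restored. A sketch against a sample of the detail output above (the sample string stands in for the live `mdadm --detail /dev/md0` call):

```shell
# Sketch: pull the rebuild percentage out of mdadm --detail output.
detail=' Rebuild Status : 36% complete'
pct=$(printf '%s\n' "$detail" | awk -F' : ' '/Rebuild Status/ {print $2}')
echo "rebuild: $pct"
# A simple wait loop on a live system could poll until the line disappears:
#   while mdadm --detail /dev/md0 | grep -q 'Rebuild Status'; do sleep 60; done
```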
WARNING: If another drive fails before the replacement has been added and the rebuild has completed, all data on the array may be lost!
There are several ways around this:
- Add a hot spare, which will cause the rebuild to start immediately when a drive fails, minimizing the "danger time".
- This simply means adding an extra drive while the array already has enough members. If a drive fails, the spare is picked up automatically and a rebuild starts.
- Use another RAID level, such as RAID 6, which can survive two simultaneous drive failures.
mdadm --add /dev/md0 /dev/sdf
returns:
mdadm: added /dev/sdf
Grow the array
mdadm --grow /dev/md0 --level=6 --raid-devices=5 --backup-file=/root/backup
returns:
mdadm: level of /dev/md0 changed to raid6
NOTE: Notice the --backup-file argument.
Ensure:
- The backup file is NOT located on the array itself.
- There is sufficient space for the backup file.
- The backup file is deleted automatically once the reshape completes.
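Before starting the reshape, it can be worth checking that the chosen backup path really lives on a different filesystem than the array. A hedged sketch, assuming GNU coreutils `df`; the path is a stand-in (the example above uses /root):

```shell
# Sketch: verify the backup location is not on the md array and has space.
backup_dir=/tmp   # stand-in path; substitute the directory of your backup file
fs_source=$(df --output=source "$backup_dir" | tail -n 1)
case $fs_source in
    /dev/md*) echo "WARNING: $backup_dir is on the array itself" ;;
    *)        echo "OK: $backup_dir is on $fs_source" ;;
esac
df -h --output=avail "$backup_dir" | tail -n 1   # free space on that filesystem
```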
Check
cat /proc/mdstat
returns:
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdf[6] sdd[5] sde[4] sdc[1] sdb[0]
      3144192 blocks super 1.2 level 6, 512k chunk, algorithm 18 [5/4] [UUUU_]
      [==>..................]  reshape = 13.2% (139264/1048064) finish=4.4min speed=3373K/sec

unused devices: <none>
Another Check
mdadm --detail /dev/md0
returns:
/dev/md0:
        Version : 1.2
  Creation Time : Tue Sep  6 18:31:41 2011
     Raid Level : raid6
     Array Size : 3144192 (3.00 GiB 3.22 GB)
  Used Dev Size : 1048064 (1023.67 MiB 1073.22 MB)
   Raid Devices : 5
  Total Devices : 5
    Persistence : Superblock is persistent

    Update Time : Thu Sep  8 18:54:26 2011
          State : clean
 Active Devices : 5
Working Devices : 5
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

           Name : raidtest.loc:0  (local to host raidtest.loc)
           UUID : e0748cf9:be2ca997:0bc183a6:ba2c9ebf
         Events : 2058

    Number   Major   Minor   RaidDevice State
       0       8       16        0      active sync   /dev/sdb
       1       8       32        1      active sync   /dev/sdc
       5       8       48        2      active sync   /dev/sdd
       4       8       64        3      active sync   /dev/sde
       6       8       80        4      active sync   /dev/sdf
NOTE: If for whatever reason you want to go back from RAID 6 to RAID 5, you can also do that.
- The surplus disk will then end up as a hot spare.
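A sketch of that reverse reshape. The mdadm call requires root and the live array, so it is shown as a comment; the device count and backup path mirror the example above and are assumptions, as is the sample output line used for the check:

```shell
# Sketch: convert the RAID 6 back to a 4-device RAID 5; mdadm keeps the
# now-surplus disk as a spare. On the live system:
#   mdadm --grow /dev/md0 --level=5 --raid-devices=4 --backup-file=/root/backup
#
# Afterwards, mdadm --detail should list one spare. Check against a
# captured sample line:
line='      6       8       80        -      spare   /dev/sdf'
spares=$(printf '%s\n' "$line" | grep -c 'spare')
echo "spare devices: $spares"
```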
raid/mdadm/troubleshooting/disk_failure.txt · Last modified: 2021/09/14 12:21 by peter