====== RAID - mdadm - Troubleshooting - Disk Failure ======
cat /proc/mdstat
returns:
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sde[4] sdd[2](F) sdc[1] sdb[0]
3144192 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UU_U]
unused devices: <none>
**NOTE:** The **(F)** indicates that the /dev/sdd device has failed.
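To keep an eye on the array state continuously while you work, you can simply poll this file:
watch -n 5 cat /proc/mdstat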
----
===== Confirm the failure =====
mdadm --detail /dev/md0
returns:
/dev/md0:
Version : 1.2
Creation Time : Tue Sep 6 18:31:41 2011
Raid Level : raid5
Array Size : 3144192 (3.00 GiB 3.22 GB)
Used Dev Size : 1048064 (1023.67 MiB 1073.22 MB)
Raid Devices : 4
Total Devices : 4
Persistence : Superblock is persistent
Update Time : Thu Sep 8 16:14:21 2011
State : clean, degraded
Active Devices : 3
Working Devices : 3
Failed Devices : 1
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 512K
Name : raidtest.loc:0 (local to host raidtest.loc)
UUID : e0748cf9:be2ca997:0bc183a6:ba2c9ebf
Events : 75
Number Major Minor RaidDevice State
0 8 16 0 active sync /dev/sdb
1 8 32 1 active sync /dev/sdc
2 0 0 2 removed
4 8 64 3 active sync /dev/sde
2 8 48 - faulty spare /dev/sdd
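You can also inspect the failed member's own superblock, which shows the array membership from the device's point of view:
mdadm --examine /dev/sdd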
----
===== Remove the failed device from the array =====
mdadm --manage --remove /dev/md0 /dev/sdd
returns:
mdadm: hot removed /dev/sdd from /dev/md0
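**NOTE:** mdadm will only hot-remove a device that is already failed or a spare. If the kernel still considers the device active, mark it as failed first:
mdadm --manage /dev/md0 --fail /dev/sdd
Both steps can also be combined into a single command:
mdadm --manage /dev/md0 --fail /dev/sdd --remove /dev/sdd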
----
===== Obtain the serial number of the failed device =====
This lets you check your documentation to find out which bay the failed disk is in.
* Some systems have no visible indication of which bay a given disk is loaded into, so unless this was documented when the RAID was created, there is a risk of pulling the wrong physical disk instead of the broken one.
smartctl -a /dev/sdd | grep -i serial
returns:
Serial Number: VB455d882e-8013d7c9
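If smartctl is not installed, the serial number is usually also embedded in the persistent device names that udev generates (the exact naming varies by system):
ls -l /dev/disk/by-id/ | grep sdd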
----
===== Replace the failed drive =====
Physically remove the failed drive and replace it with a new one.
Then add the replacement drive to the array:
mdadm --add /dev/md0 /dev/sdd
returns:
mdadm: added /dev/sdd
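The rebuild starts immediately. To block until it has finished, for example in a script, mdadm can wait on the array:
mdadm --wait /dev/md0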
----
===== Check the status of the RAID =====
mdadm --detail /dev/md0
returns:
/dev/md0:
Version : 1.2
Creation Time : Tue Sep 6 18:31:41 2011
Raid Level : raid5
Array Size : 3144192 (3.00 GiB 3.22 GB)
Used Dev Size : 1048064 (1023.67 MiB 1073.22 MB)
Raid Devices : 4
Total Devices : 4
Persistence : Superblock is persistent
Update Time : Thu Sep 8 17:03:44 2011
State : clean, degraded, recovering
Active Devices : 3
Working Devices : 4
Failed Devices : 0
Spare Devices : 1
Layout : left-symmetric
Chunk Size : 512K
Rebuild Status : 36% complete
Name : raidtest.loc:0 (local to host raidtest.loc)
UUID : e0748cf9:be2ca997:0bc183a6:ba2c9ebf
Events : 85
Number Major Minor RaidDevice State
0 8 16 0 active sync /dev/sdb
1 8 32 1 active sync /dev/sdc
5 8 48 2 spare rebuilding /dev/sdd
4 8 64 3 active sync /dev/sde
----
**NOTE:** Wait for the RAID to rebuild.
* Once the rebuild completes, the array is fully redundant again.
**WARNING:** If another drive fails before a replacement is added and the rebuild completes, all data could be lost!
There are several ways to reduce this risk:
* Add a hot spare: an extra drive beyond what the array needs. If a drive fails, the spare is picked up automatically and the rebuild starts right away, minimizing the "danger time".
* Use a RAID level that can survive two failures, such as RAID 6.
For example, to add /dev/sdf as a hot spare:
mdadm --add /dev/md0 /dev/sdf
returns:
mdadm: added /dev/sdf
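To find out about the next failure promptly rather than by accident, mdadm's monitor mode can run as a daemon and send mail. This assumes a working local mail setup; adjust the address to suit:
mdadm --monitor --scan --daemonise --mail=root@localhost
Most distributions wire this up through the MAILADDR line in mdadm.conf instead.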
----
===== Grow the array =====
mdadm --grow /dev/md0 --level=6 --raid-devices=5 --backup-file=/root/backup
returns:
mdadm level of /dev/md0 changed to raid6
**NOTE:** Notice the **backup-file** argument.
Ensure that:
* The backup file is NOT located on the array itself.
* There is sufficient free space for it.
* The file is only needed during the reshape and is deleted once the reshape is complete.
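A quick sanity check that the chosen location (here /root, as used above) is not on the array and has free space:
df -h /root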
----
===== Check the reshape progress =====
cat /proc/mdstat
returns:
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdf[6] sdd[5] sde[4] sdc[1] sdb[0]
3144192 blocks super 1.2 level 6, 512k chunk, algorithm 18 [5/4] [UUUU_]
[==>..................] reshape = 13.2% (139264/1048064) finish=4.4min speed=3373K/sec
unused devices: <none>
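If the reshape is slower than you would like, the kernel's rebuild speed floor can be raised. Values are in KiB/sec; the figure below is only an example:
sysctl -w dev.raid.speed_limit_min=50000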
----
===== Check the final state =====
mdadm --detail /dev/md0
returns:
/dev/md0:
Version : 1.2
Creation Time : Tue Sep 6 18:31:41 2011
Raid Level : raid6
Array Size : 3144192 (3.00 GiB 3.22 GB)
Used Dev Size : 1048064 (1023.67 MiB 1073.22 MB)
Raid Devices : 5
Total Devices : 5
Persistence : Superblock is persistent
Update Time : Thu Sep 8 18:54:26 2011
State : clean
Active Devices : 5
Working Devices : 5
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 512K
Name : raidtest.loc:0 (local to host raidtest.loc)
UUID : e0748cf9:be2ca997:0bc183a6:ba2c9ebf
Events : 2058
Number Major Minor RaidDevice State
0 8 16 0 active sync /dev/sdb
1 8 32 1 active sync /dev/sdc
5 8 48 2 active sync /dev/sdd
4 8 64 3 active sync /dev/sde
6 8 80 4 active sync /dev/sdf
**NOTE:** If for whatever reason you later want to reduce the number of drives the array uses, for example going from 6 drives back to 5, that is also possible; see the sketch below.
* The freed drive will then end up as a hot spare.
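With the array as it now stands (a five-device RAID 6), such a reduction could look like the following. This is only a sketch: /root/backup2 is just an example path, the exact steps depend on your mdadm and kernel versions, so check mdadm(8) and make a backup before trying it on real data.
mdadm --grow /dev/md0 --level=raid5 --raid-devices=4 --backup-file=/root/backup2
After the reshape completes, the device freed from the array is left as a hot spare.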