ZFS - Troubleshooting - Replace a Disk
Check the Pool
Verify that a disk is bad and that it needs to be replaced.
zpool status
returns:
  pool: testpool
 state: DEGRADED
status: One or more devices could not be used because the label is missing or invalid.
        Sufficient replicas exist for the pool to continue functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: scrub repaired 0 in 2h4m with 0 errors on Sun Jun 9 00:28:24 2013
config:

        NAME                         STATE     READ WRITE CKSUM
        testpool                     DEGRADED     0     0     0
          raidz1-0                   DEGRADED     0     0     0
            ata-ST3300620A_5QF0MJFP  ONLINE       0     0     0
            ata-ST3300831A_5NF0552X  UNAVAIL      0     0     0
            ata-ST3200822A_5LJ1CHMS  ONLINE       0     0     0
            ata-ST3200822A_3LJ0189C  ONLINE       0     0     0

errors: No known data errors
NOTE: This shows that one disk is unavailable.
- This is ata-ST3300831A_5NF0552X.
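On a pool with many devices it can help to filter the status output down to the entries that are not healthy. The following is a minimal sketch, not part of the original procedure: list_unhealthy is a hypothetical helper name, and it assumes the config table has the NAME/STATE layout shown above. It prints every entry whose state is not ONLINE, which includes the pool and vdev lines as well as the failed disk.

```shell
#!/bin/sh
# Hypothetical helper: read `zpool status` output on stdin and print
# the name and state of every config-table entry that is not ONLINE.
list_unhealthy() {
    awk '
        $1 == "NAME" && $2 == "STATE" { in_config = 1; next }  # table header
        in_config && NF == 0          { in_config = 0 }        # blank line ends the table
        in_config && $1 == "errors:"  { in_config = 0 }        # so does the errors line
        in_config && NF >= 2 && $2 != "ONLINE" { print $1, $2 }
    '
}

# Usage on a live system (pool name taken from the example above):
#   zpool status testpool | list_unhealthy
```

For the output shown above, this would list testpool and raidz1-0 as DEGRADED and ata-ST3300831A_5NF0552X as UNAVAIL.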
Add a New Disk
- Add a new disk.
- Optionally remove the old disk.
NOTE: The new disk is ata-ST3500320AS_9QM03ATQ.
- This can be seen at /dev/disk/by-id/ata-ST3500320AS_9QM03ATQ.
- Only remove the old drive at this point if it is a redundant setup.
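Before running any zpool command against the new disk, it is worth confirming that the system actually sees it under the expected ID. The sketch below assumes a Linux system with readlink -f available; resolve_disk is a hypothetical helper name, not a ZFS tool. It prints the kernel device that the by-id link points to, or fails if the link is missing.

```shell
#!/bin/sh
# Hypothetical helper: resolve a disk ID inside a directory of device
# links (normally /dev/disk/by-id) to the device node it points to.
resolve_disk() {
    dir=$1
    id=$2
    if [ -e "$dir/$id" ]; then
        readlink -f "$dir/$id"          # e.g. a /dev/sdX node
    else
        echo "not found: $dir/$id" >&2
        return 1
    fi
}

# Usage with the new disk from the text:
#   resolve_disk /dev/disk/by-id ata-ST3500320AS_9QM03ATQ
```

If the helper reports that the ID is not found, the disk has not been detected and the replace step that follows would fail.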
Replace the Old Device
zpool replace testpool /dev/disk/by-id/ata-ST3300831A_5NF0552X /dev/disk/by-id/ata-ST3500320AS_9QM03ATQ
zpool offline testpool /dev/disk/by-id/ata-ST3300831A_5NF0552X
zpool detach testpool /dev/disk/by-id/ata-ST3300831A_5NF0552X
NOTE: Here the old device is specified first followed by the new device.
- If the pool is a redundant configuration, data will be copied from other good disks to the new disk.
- If the pool is not redundant, data will be copied from the old device to the new device.
- Once that is complete, the old device can be physically removed.
Potential Issues
If the bad device has already been removed from the system, this might fail with the following error.
cannot offline /dev/disk/by-id/ata-ST3300831A_5NF0552X: no such device in pool
- This is because the label of the drive that died does not exist in the system any more.
- Therefore the bad device cannot be specified by ID.
- In this case, try specifying it by device name or by GUID.
There are various ways to determine a GUID:
zdb                  # Find GUID.
zdb -l /dev/sda1     # In case the 'zdb' command does not work.
zpool status -g      # Find GUID.
zpool status -L      # Find device name, resolving links.
Try to get the GUID using zdb:
zdb
returns:
testpool:
    version: 28
    name: 'testpool'
    state: 0
    txg: 162804
    pool_guid: 14829240649900366534
    hostname: 'BigMamba'
    vdev_children: 1
    vdev_tree:
        type: 'root'
        id: 0
        guid: 14829240649900366534
        children[0]:
            type: 'raidz'
            id: 0
            guid: 5355850150368902284
            nparity: 1
            metaslab_array: 31
            metaslab_shift: 32
            ashift: 9
            asize: 791588896768
            is_log: 0
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 11426107064765252810
                path: '/dev/disk/by-id/ata-ST3300620A_5QF0MJFP-part2'
                phys_path: '/dev/gptid/73b31683-537f-11e2-bad7-50465d4eb8b0'
                whole_disk: 1
                create_txg: 4
            children[1]:
                type: 'disk'
                id: 1
                guid: 15935140517898495532
                path: '/dev/disk/by-id/ata-ST3300831A_5NF0552X-part2'
                phys_path: '/dev/gptid/746c949a-537f-11e2-bad7-50465d4eb8b0'
                whole_disk: 1
                create_txg: 4
            children[2]:
                type: 'disk'
                id: 2
                guid: 7183706725091321492
                path: '/dev/disk/by-id/ata-ST3200822A_5LJ1CHMS-part2'
                phys_path: '/dev/gptid/7541115a-537f-11e2-bad7-50465d4eb8b0'
                whole_disk: 1
                create_txg: 4
            children[3]:
                type: 'disk'
                id: 3
                guid: 17196042497722925662
                path: '/dev/disk/by-id/ata-ST3200822A_3LJ0189C-part2'
                phys_path: '/dev/gptid/760a94ee-537f-11e2-bad7-50465d4eb8b0'
                whole_disk: 1
                create_txg: 4
    features_for_read:
NOTE: The GUID can be ascertained as 15935140517898495532.
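Reading the GUID out of a long zdb listing by eye is error-prone, so the lookup can be scripted. This is a sketch under the assumption that the zdb output has the shape shown above, with each disk's guid: line appearing before its path: line; guid_for_path is a hypothetical helper name. It prints the GUID of the first disk whose path contains the given ID.

```shell
#!/bin/sh
# Hypothetical helper: read `zdb` output on stdin and print the GUID
# of the first vdev whose path contains the string given as $1.
guid_for_path() {
    awk -v want="$1" -v q="'" '
        $1 == "guid:" { guid = $2 }          # remember the most recent guid
        $1 == "path:" {
            p = $2
            gsub(q, "", p)                   # strip the surrounding quotes
            if (index(p, want) > 0) { print guid; exit }
        }
    '
}

# Usage with the failed disk from the text:
#   zdb | guid_for_path ata-ST3300831A_5NF0552X
```

Nothing is printed if no path matches, so an empty result means the ID was not found in the pool configuration.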
Use the GUID to offline the old device:
zpool offline testpool 15935140517898495532
And check this has worked:
zpool status
returns:
  pool: testpool
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a degraded state.
action: Online the device using 'zpool online' or replace the device with 'zpool replace'.
  scan: scrub repaired 0 in 2h4m with 0 errors on Sun Jun 9 00:28:24 2013
config:

        NAME                         STATE     READ WRITE CKSUM
        testpool                     DEGRADED     0     0     0
          raidz1-0                   DEGRADED     0     0     0
            ata-ST3300620A_5QF0MJFP  ONLINE       0     0     0
            ata-ST3300831A_5NF0552X  OFFLINE      0     0     0
            ata-ST3200822A_5LJ1CHMS  ONLINE       0     0     0
            ata-ST3200822A_3LJ0189C  ONLINE       0     0     0

errors: No known data errors
Then replace the old device with the new one:
zpool replace testpool 15935140517898495532 /dev/disk/by-id/ata-ST3500320AS_9QM03ATQ
Check again that this has worked:
zpool status
returns:
  pool: testpool
 state: DEGRADED
status: One or more devices is currently being resilvered.
        The pool will continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Jun 9 01:44:36 2013
        408M scanned out of 419G at 20,4M/s, 5h50m to go
        101M resilvered, 0,10% done
config:

        NAME                            STATE     READ WRITE CKSUM
        testpool                        DEGRADED     0     0     0
          raidz1-0                      DEGRADED     0     0     0
            ata-ST3300620A_5QF0MJFP     ONLINE       0     0     0
            replacing-1                 OFFLINE      0     0     0
              ata-ST3300831A_5NF0552X   OFFLINE      0     0     0
              ata-ST3500320AS_9QM03ATQ  ONLINE       0     0     0  (resilvering)
            ata-ST3200822A_5LJ1CHMS     ONLINE       0     0     0
            ata-ST3200822A_3LJ0189C     ONLINE       0     0     0

errors: No known data errors
NOTE: If the old disk has already been removed from the system and a new device has taken its place under the same device name, the following commands can be used instead:
zpool offline testpool sdd
zpool remove testpool sdd
zpool attach -f testpool sdc sdd
Wait For Resilvering to Complete
Before the pool returns to normal, ZFS must sync data over to the new disk.
- It will remain in a degraded status while the data syncs.
- This data syncing process is called resilvering.
- It may take a very long time depending on the size of the disks and on how much data is on them.
The status of the resilvering can be checked:
zpool status testpool
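Because resilvering can run for hours, the check can be left running in a loop rather than repeated by hand. The following is a sketch only: both function names are hypothetical, the 60-second default interval is an arbitrary choice, and the loop relies on the "resilver in progress" text that appears in the scan line of the status output shown earlier. Newer OpenZFS releases also offer zpool wait -t resilver, which may be preferable where it is available.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the `zpool status` output on stdin
# reports a resilver that is still running.
resilver_in_progress() {
    grep -q 'resilver in progress'
}

# Hypothetical helper: poll the pool until no resilver is reported.
# $1 is the pool name, $2 an optional polling interval in seconds.
wait_for_resilver() {
    pool=$1
    interval=${2:-60}
    while zpool status "$pool" | resilver_in_progress; do
        # Show the progress lines of the scan section.
        zpool status "$pool" | grep -E 'resilvered|to go'
        sleep "$interval"
    done
    echo "no resilver in progress on $pool"
}

# Usage:
#   wait_for_resilver testpool 300
```

The loop also ends if zpool status itself fails, so run zpool status testpool once more afterwards to confirm that the pool is ONLINE.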
Physically Remove the Old Drive
Physically remove the old drive.
- If it is hot-swappable then just pull it out.
- Otherwise, shut down the system before removing the device.