====== ZFS - Troubleshooting - Replace a Disk ======
===== Check the Pool =====
Verify that a disk is bad and that it needs to be replaced.
zpool status
returns:
pool: testpool
state: DEGRADED
status: One or more devices could not be used because the label is missing or
invalid. Sufficient replicas exist for the pool to continue
functioning in a degraded state.
action: Replace the device using 'zpool replace'.
see: http://zfsonlinux.org/msg/ZFS-8000-4J
scan: scrub repaired 0 in 2h4m with 0 errors on Sun Jun 9 00:28:24 2013
config:
NAME STATE READ WRITE CKSUM
testpool DEGRADED 0 0 0
raidz1-0 DEGRADED 0 0 0
ata-ST3300620A_5QF0MJFP ONLINE 0 0 0
ata-ST3300831A_5NF0552X UNAVAIL 0 0 0
ata-ST3200822A_5LJ1CHMS ONLINE 0 0 0
ata-ST3200822A_3LJ0189C ONLINE 0 0 0
errors: No known data errors
**NOTE:** This shows that one disk is unavailable.
* This is ata-ST3300831A_5NF0552X.
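On a pool with many devices, the unhealthy rows can be filtered out of the status output rather than read by eye. The following is a minimal sketch, not part of the original procedure: the function name list_unhealthy is hypothetical, and it assumes the five-column device table shown above. Pool and vdev rows marked DEGRADED are listed as well as the failed disk.

```shell
#!/bin/sh
# Print the name and state of every row in the device table that is not ONLINE.
# Reads 'zpool status' output on stdin.
# NOTE: list_unhealthy is a hypothetical helper, not a ZFS command.
list_unhealthy() {
    awk '
        # Device rows have five fields: NAME STATE READ WRITE CKSUM.
        # Only rows whose second field is a non-healthy state are printed.
        NF >= 5 && $2 ~ /^(DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)$/ {
            print $1, $2
        }
    '
}

# Usage (requires ZFS to be installed):
#   zpool status testpool | list_unhealthy
```

For a quick health check, 'zpool status -x' limits the output to pools that have problems.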
----
===== Add a New Disk =====
* Add a new disk.
* Optionally remove the old disk.
**NOTE:** The new disk is ata-ST3500320AS_9QM03ATQ.
* This can be seen at /dev/disk/by-id/ata-ST3500320AS_9QM03ATQ.
* Only remove the old drive at this point if the pool is redundant.
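Before going further it is worth confirming that the system actually sees the new disk under its by-id name. The sketch below assumes a Linux system where 'readlink -f' is available; the function name resolve_disk is hypothetical.

```shell
#!/bin/sh
# Print the real device node behind a /dev/disk/by-id/ link,
# or fail with a message if the path does not exist.
# NOTE: resolve_disk is a hypothetical helper name.
resolve_disk() {
    if [ ! -e "$1" ]; then
        echo "missing: $1" >&2
        return 1
    fi
    # Follow the symlink chain to the underlying device node.
    readlink -f "$1"
}

# Usage:
#   resolve_disk /dev/disk/by-id/ata-ST3500320AS_9QM03ATQ
```

If the function reports the path as missing, the disk has not been detected and the replacement should not be attempted yet.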
----
===== Replace the Old Device =====
zpool replace testpool /dev/disk/by-id/ata-ST3300831A_5NF0552X /dev/disk/by-id/ata-ST3500320AS_9QM03ATQ
zpool offline testpool /dev/disk/by-id/ata-ST3300831A_5NF0552X
zpool detach testpool /dev/disk/by-id/ata-ST3300831A_5NF0552X
**NOTE:** Here the old device is specified first followed by the new device.
* If the pool is a redundant configuration, data will be copied from other good disks to the new disk.
* If the pool is not redundant, data will be copied from the old device to the new device.
* Once that is complete, the old device can be physically removed.
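Because swapping the two arguments would target the wrong disk, it can help to print the command for review before running it. This is only a sketch: the wrapper name replace_disk and the DRY_RUN variable are assumptions made for illustration, not part of ZFS.

```shell
#!/bin/sh
# Build a 'zpool replace' command with the old device first and the new device second.
# DRY_RUN defaults to 1, in which case the command is only printed, not executed.
# NOTE: replace_disk and DRY_RUN are hypothetical names.
replace_disk() {
    pool=$1
    old=$2
    new=$3
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "zpool replace $pool $old $new"
    else
        zpool replace "$pool" "$old" "$new"
    fi
}

# Usage:
#   replace_disk testpool /dev/disk/by-id/ata-ST3300831A_5NF0552X /dev/disk/by-id/ata-ST3500320AS_9QM03ATQ
#   DRY_RUN=0 replace_disk testpool <old-device> <new-device>
```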
==== Potential Issues ====
If the bad device has already been removed from the system, the offline command might fail with the following error:
cannot offline /dev/disk/by-id/ata-ST3300831A_5NF0552X: no such device in pool
* This is because the label of the drive that died does not exist in the system any more.
* Therefore the bad device cannot be specified by ID.
* In this case, try specifying it by device name or by GUID.
----
There are various ways to determine a GUID:
zdb # Find GUID.
zdb -l /dev/sda1 # In case the 'zdb' command does not work.
zpool status -g # Find GUID.
zpool status -L # Find device name, resolving links.
----
Try to get the GUID using zdb:
zdb
testpool:
version: 28
name: 'testpool'
state: 0
txg: 162804
pool_guid: 14829240649900366534
hostname: 'BigMamba'
vdev_children: 1
vdev_tree:
type: 'root'
id: 0
guid: 14829240649900366534
children[0]:
type: 'raidz'
id: 0
guid: 5355850150368902284
nparity: 1
metaslab_array: 31
metaslab_shift: 32
ashift: 9
asize: 791588896768
is_log: 0
create_txg: 4
children[0]:
type: 'disk'
id: 0
guid: 11426107064765252810
path: '/dev/disk/by-id/ata-ST3300620A_5QF0MJFP-part2'
phys_path: '/dev/gptid/73b31683-537f-11e2-bad7-50465d4eb8b0'
whole_disk: 1
create_txg: 4
children[1]:
type: 'disk'
id: 1
guid: 15935140517898495532
path: '/dev/disk/by-id/ata-ST3300831A_5NF0552X-part2'
phys_path: '/dev/gptid/746c949a-537f-11e2-bad7-50465d4eb8b0'
whole_disk: 1
create_txg: 4
children[2]:
type: 'disk'
id: 2
guid: 7183706725091321492
path: '/dev/disk/by-id/ata-ST3200822A_5LJ1CHMS-part2'
phys_path: '/dev/gptid/7541115a-537f-11e2-bad7-50465d4eb8b0'
whole_disk: 1
create_txg: 4
children[3]:
type: 'disk'
id: 3
guid: 17196042497722925662
path: '/dev/disk/by-id/ata-ST3200822A_3LJ0189C-part2'
phys_path: '/dev/gptid/760a94ee-537f-11e2-bad7-50465d4eb8b0'
whole_disk: 1
create_txg: 4
features_for_read:
**NOTE:** The GUID of the failed device (children[1], ata-ST3300831A_5NF0552X) is 15935140517898495532.
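Picking the GUID out of a long zdb listing by hand is error-prone, so the lookup can be scripted. The following sketch assumes the layout shown above, where each child's guid line comes before its path line; the function name guid_for_path is hypothetical.

```shell
#!/bin/sh
# Print the GUID of the vdev whose path contains the given string.
# Reads 'zdb' output on stdin.
# NOTE: guid_for_path is a hypothetical helper name.
guid_for_path() {
    awk -v want="$1" '
        # Remember the most recent guid line (pool_guid does not match).
        $1 == "guid:" { guid = $2 }
        # When a path line contains the wanted string, print the remembered guid.
        $1 == "path:" && index($2, want) > 0 { print guid }
    '
}

# Usage (requires ZFS to be installed):
#   zdb | guid_for_path ata-ST3300831A_5NF0552X
```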
Use the GUID to offline the old device:
zpool offline testpool 15935140517898495532
And check this has worked:
zpool status
pool: testpool
state: DEGRADED
status: One or more devices has been taken offline by the administrator.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Online the device using 'zpool online' or replace the device with
'zpool replace'.
scan: scrub repaired 0 in 2h4m with 0 errors on Sun Jun 9 00:28:24 2013
config:
NAME STATE READ WRITE CKSUM
testpool DEGRADED 0 0 0
raidz1-0 DEGRADED 0 0 0
ata-ST3300620A_5QF0MJFP ONLINE 0 0 0
ata-ST3300831A_5NF0552X OFFLINE 0 0 0
ata-ST3200822A_5LJ1CHMS ONLINE 0 0 0
ata-ST3200822A_3LJ0189C ONLINE 0 0 0
errors: No known data errors
Then replace the device:
zpool replace testpool 15935140517898495532 /dev/disk/by-id/ata-ST3500320AS_9QM03ATQ
And check again this has worked:
zpool status
pool: testpool
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Sun Jun 9 01:44:36 2013
408M scanned out of 419G at 20,4M/s, 5h50m to go
101M resilvered, 0,10% done
config:
NAME STATE READ WRITE CKSUM
testpool DEGRADED 0 0 0
raidz1-0 DEGRADED 0 0 0
ata-ST3300620A_5QF0MJFP ONLINE 0 0 0
replacing-1 OFFLINE 0 0 0
ata-ST3300831A_5NF0552X OFFLINE 0 0 0
ata-ST3500320AS_9QM03ATQ ONLINE 0 0 0 (resilvering)
ata-ST3200822A_5LJ1CHMS ONLINE 0 0 0
ata-ST3200822A_3LJ0189C ONLINE 0 0 0
errors: No known data errors
**NOTE:** If the old disk has already been removed from the system and a new device has taken its place under the same device name, the following commands can be used instead:
zpool offline testpool sdd
zpool remove testpool sdd
zpool attach -f testpool sdc sdd
* These commands assume a mirror, since 'zpool attach' adds a device to a mirror. For a raidz vdev such as the one in this example, 'zpool replace testpool sdd' (with no new device given) should rebuild onto the disk now present under that name.
----
===== Wait For Resilvering to Complete =====
Before the pool returns to normal, data must be synced to the new disk.
* It will remain in a degraded status while the data syncs.
* This data syncing process is called resilvering.
* It may take a __very__ long time depending on the size of the disks and on how much data is on them.
The status of the resilvering can be checked:
zpool status testpool
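Rather than re-running the status command by hand, a small polling loop can wait for the resilver to finish. This is a sketch under stated assumptions: the function names and the 60-second default interval are hypothetical, and the check relies on the "resilver in progress" text shown in the status output above.

```shell
#!/bin/sh
# Succeed if the 'zpool status' text on stdin reports a running resilver.
# NOTE: both function names are hypothetical helpers.
resilver_in_progress() {
    grep -q 'resilver in progress'
}

# Poll the pool until the resilver is no longer reported as running.
# $1 = pool name, $2 = seconds between checks (assumed default: 60).
wait_for_resilver() {
    pool=$1
    interval=${2:-60}
    while zpool status "$pool" | resilver_in_progress; do
        sleep "$interval"
    done
}

# Usage (requires ZFS to be installed):
#   wait_for_resilver testpool 300 && zpool status testpool
```

Recent OpenZFS releases also provide a 'zpool wait' subcommand that may make such a loop unnecessary; check the installed version's manual page before relying on it.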
----
===== Physically Remove the Old Drive =====
Physically remove the old drive.
* If it is hot-swappable then just pull it out.
* Otherwise, shut down the system before removing the device.
----
===== References =====
https://docs.joyent.com/private-cloud/troubleshooting/disk-replacement