ZFS - Troubleshooting - Replace a Disk

Check the Pool

Verify that a disk is bad and that it needs to be replaced.

zpool status

returns:

  pool: testpool
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: scrub repaired 0 in 2h4m with 0 errors on Sun Jun  9 00:28:24 2013
config:
 
        NAME                         STATE     READ WRITE CKSUM
        testpool                     DEGRADED     0     0     0
          raidz1-0                   DEGRADED     0     0     0
            ata-ST3300620A_5QF0MJFP  ONLINE       0     0     0
            ata-ST3300831A_5NF0552X  UNAVAIL      0     0     0
            ata-ST3200822A_5LJ1CHMS  ONLINE       0     0     0
            ata-ST3200822A_3LJ0189C  ONLINE       0     0     0
 
errors: No known data errors

NOTE: This shows that one disk in raidz1-0 is UNAVAIL, which leaves the pool DEGRADED.

  • The failed disk is ata-ST3300831A_5NF0552X.
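On a pool with many disks, the failed entries can be picked out of the status output with a small filter. The following is a minimal sketch, not part of the original procedure: failed_devices is a hypothetical helper name, and it assumes the config table has the NAME / STATE layout shown above.

```shell
#!/bin/sh
# Hypothetical helper: print every entry in the config table of
# `zpool status` whose STATE is not ONLINE. Reads the status text from
# stdin, so it can also be run against saved output.
failed_devices() {
    awk '
        $1 == "NAME" && $2 == "STATE" { in_config = 1; next }  # header row of the config table
        /^errors:/                    { in_config = 0 }        # the table ends before this line
        in_config && NF >= 2 && $2 != "ONLINE" { print $1, $2 }
    '
}

# Usage (requires ZFS):
#   zpool status testpool | failed_devices
```

For the status shown above this lists the pool and the raidz1-0 vdev (both DEGRADED) as well as the UNAVAIL disk, because the filter reports every entry that is not ONLINE.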

Add a New Disk

  • Add a new disk.
  • Optionally remove the old disk.

NOTE: The new disk is ata-ST3500320AS_9QM03ATQ.

  • This can be seen at /dev/disk/by-id/ata-ST3500320AS_9QM03ATQ.
  • Only remove the old drive at this point if it is a redundant setup.
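Before running any zpool command, it can help to confirm that the new disk is actually visible under its by-id name. A minimal sketch, assuming the by-id links live in /dev/disk/by-id as described above; disk_present is a hypothetical helper, and its optional second argument exists only so the check can be pointed at another directory:

```shell
#!/bin/sh
# Hypothetical helper: succeed if a by-id entry exists for the given disk ID.
# $1 = disk ID, $2 = directory to look in (defaults to /dev/disk/by-id).
disk_present() {
    id=$1
    dir=${2:-/dev/disk/by-id}
    [ -e "$dir/$id" ]
}

# Usage:
#   disk_present ata-ST3500320AS_9QM03ATQ && echo "new disk found"
```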

Replace the Old Device

zpool replace testpool /dev/disk/by-id/ata-ST3300831A_5NF0552X /dev/disk/by-id/ata-ST3500320AS_9QM03ATQ
zpool offline testpool /dev/disk/by-id/ata-ST3300831A_5NF0552X
zpool detach testpool /dev/disk/by-id/ata-ST3300831A_5NF0552X

NOTE: In zpool replace, the old device is specified first, followed by the new device.

  • If the pool is a redundant configuration, data will be copied from other good disks to the new disk.
  • If the pool is not redundant, data will be copied from the old device to the new device.
  • Once that is complete, the old device can be physically removed.

Potential Issues

If the bad device has already been removed from the system, the commands above might fail with the following error:

cannot offline /dev/disk/by-id/ata-ST3300831A_5NF0552X: no such device in pool

  • This is because the label of the failed drive no longer exists on the system.
  • Therefore the bad device cannot be specified by ID.
  • In this case, try specifying it by device name or by GUID.

Get the GUID:

root@zeus:/dev# zdb
hermes:
    version: 28
    name: 'hermes'
    state: 0
    txg: 162804
    pool_guid: 14829240649900366534
    hostname: 'zeus'
    vdev_children: 1
    vdev_tree:
        type: 'root'
        id: 0
        guid: 14829240649900366534
        children[0]:
            type: 'raidz'
            id: 0
            guid: 5355850150368902284
            nparity: 1
            metaslab_array: 31
            metaslab_shift: 32
            ashift: 9
            asize: 791588896768
            is_log: 0
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 11426107064765252810
                path: '/dev/disk/by-id/ata-ST3300620A_5QF0MJFP-part2'
                phys_path: '/dev/gptid/73b31683-537f-11e2-bad7-50465d4eb8b0'
                whole_disk: 1
                create_txg: 4
            children[1]:
                type: 'disk'
                id: 1
                guid: 15935140517898495532
                path: '/dev/disk/by-id/ata-ST3300831A_5NF0552X-part2'
                phys_path: '/dev/gptid/746c949a-537f-11e2-bad7-50465d4eb8b0'
                whole_disk: 1
                create_txg: 4
            children[2]:
                type: 'disk'
                id: 2
                guid: 7183706725091321492
                path: '/dev/disk/by-id/ata-ST3200822A_5LJ1CHMS-part2'
                phys_path: '/dev/gptid/7541115a-537f-11e2-bad7-50465d4eb8b0'
                whole_disk: 1
                create_txg: 4
            children[3]:
                type: 'disk'
                id: 3
                guid: 17196042497722925662
                path: '/dev/disk/by-id/ata-ST3200822A_3LJ0189C-part2'
                phys_path: '/dev/gptid/760a94ee-537f-11e2-bad7-50465d4eb8b0'
                whole_disk: 1
                create_txg: 4
    features_for_read:

NOTE: The GUID of the failed disk is 15935140517898495532. It is the guid of children[1], whose path ends in ata-ST3300831A_5NF0552X-part2.


zdb               # Find GUID.
zdb -l /dev/sda1  # In case the 'zdb' command does not work.
zpool status -g   # Find GUID.
zpool status -L   # Find device name, resolving links.

  • If zdb prints nothing, read the label directly from the device with zdb -l.
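Reading the GUID out of a long zdb listing by eye is error-prone, so the lookup can be scripted. A minimal sketch, assuming the layout shown above, where each child's guid line comes before its path line; guid_for_path is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: print the guid of the vdev whose path contains the
# given string. Reads zdb configuration text from stdin.
guid_for_path() {
    awk -v want="$1" '
        $1 == "guid:" { guid = $2 }                      # remember the most recent guid
        $1 == "path:" && index($2, want) { print guid }  # path matches: report that guid
    '
}

# Usage (requires ZFS):
#   zdb | guid_for_path ata-ST3300831A_5NF0552X
# The GUID can then stand in for the old device name, for example:
#   zpool replace testpool 15935140517898495532 /dev/disk/by-id/ata-ST3500320AS_9QM03ATQ
```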

NOTE: If the old disk has already been removed from the system and a new device has taken its place under the same device name, the following commands can be used instead:

zpool offline testpool sdd
zpool remove testpool sdd
zpool attach -f testpool sdc sdd

Wait For Resilvering to Complete

Before the pool returns to normal, ZFS must sync data over to the new disk.

  • It will remain in a degraded status while the data syncs.
  • This data syncing process is called resilvering.
  • It may take a very long time depending on the size of the disks and on how much data is on them.

The status of the resilvering can be checked:

zpool status testpool
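Rather than re-running the status command by hand, the check can be put in a loop. A minimal sketch, assuming the scan line of zpool status contains the text "resilver in progress" while resilvering is running; both function names and the 60-second default interval are hypothetical choices:

```shell
#!/bin/sh
# Hypothetical helper: succeed while the status text on stdin reports a
# resilver that is still running.
resilver_in_progress() {
    grep -q 'resilver in progress'
}

# Hypothetical helper: poll the pool until resilvering has finished,
# then print the final status.
# $1 = pool name, $2 = seconds between checks (defaults to 60).
wait_for_resilver() {
    pool=$1
    interval=${2:-60}
    while zpool status "$pool" | resilver_in_progress; do
        sleep "$interval"
    done
    zpool status "$pool"
}

# Usage (requires ZFS):
#   wait_for_resilver testpool 300
```

The loop only reports that resilvering has stopped. The final status output still needs to be read to confirm that the pool is ONLINE and that no errors were recorded.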

Physically Remove the Old Drive

Physically remove the old drive.

  • If it is hot-swappable then just pull it out.
  • Otherwise, shut down the system before removing the device.


zfs/troubleshooting/replace_a_disk.1634168913.txt.gz · Last modified: 2021/10/13 23:48 by peter
