Replace Drive

The replace drive operation substitutes the drive at a RAID logical position with another drive, or inserts a drive into that position if it has no Online drive.

To replace a drive in a RAID, use the command:

xnr_cli raid replace --name xnraid --position 1 --bdev 0000:06:0b.0n1

where “position” is the drive number reported by the “raid show” command and “bdev” is the drive name reported by the “device-manager show” command.

The drive replacement logic initiated by the “raid replace” command involves the following steps:

  1. If an Online drive is at the specified position, the drive is marked as Offline.
  2. Wait for any outstanding IO operations to the drive to be completed.
  3. Close the BDEV block device that corresponds to the RAID logical drive at the specified position.
  4. Open the BDEV block device specified by the “bdev” option of the command.
  5. Connect the specified BDEV as the RAID logical device at the specified position.
  6. Read the connected drive metadata to determine if the drive is brand new, a known drive at the correct position, a foreign RAID drive, or a known drive inserted at an incorrect position.
  7. Update the connected drive metadata to synchronize it with the latest RAID metadata.
  8. Enable the drive for user IO.
  9. Mark the drive as Online or Degraded, depending on the data state.

A new disk is always marked as Degraded. A known drive can be marked as Online if there has been no IO activity between the removal and reinsertion events (no data has changed on the drive), and marked as Degraded otherwise.

If the drive specified by the “bdev” option cannot be opened, is foreign, or is inserted at an incorrect position, the replace operation fails with a corresponding error message. The previous drive is not reinserted if the operation fails.
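The outcome rules above can be summarized as a small decision function. The following is only an illustrative sketch in Python, not the product's implementation: the names DriveKind, ReplaceError, and state_after_replace are hypothetical, and the metadata inspection from step 6 is reduced to a ready-made classification.

```python
from enum import Enum


class DriveKind(Enum):
    """Possible results of reading the connected drive's metadata (step 6)."""
    NEW = "brand new drive"
    KNOWN_CORRECT = "known drive at the correct position"
    KNOWN_WRONG_POSITION = "known drive at an incorrect position"
    FOREIGN = "foreign RAID drive"


class ReplaceError(Exception):
    """Hypothetical error type modeling a rejected replace operation."""


def state_after_replace(kind: DriveKind, io_since_removal: bool) -> str:
    """Return the state the inserted drive ends up in, or raise ReplaceError.

    io_since_removal is only meaningful for a known drive: it tells whether
    any IO activity happened between the removal and reinsertion events.
    """
    # Foreign and misplaced drives are rejected; the replace operation fails.
    if kind in (DriveKind.FOREIGN, DriveKind.KNOWN_WRONG_POSITION):
        raise ReplaceError(f"replace rejected: {kind.value}")
    # A new drive is always marked as Degraded.
    if kind is DriveKind.NEW:
        return "Degraded"
    # A known drive at the correct position is Online only if no data changed.
    return "Degraded" if io_since_removal else "Online"
```

For example, state_after_replace(DriveKind.KNOWN_CORRECT, io_since_removal=False) returns "Online", while the same drive reinserted after IO activity is reported as "Degraded".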

For debugging and experimental purposes, it is possible to replace a drive with nothing, effectively removing a logical RAID drive without replacing it. Logically, the operation is equivalent to physically removing the drive, but the physical device is not actually removed or detached. This operation serves no purpose in a production environment. It can be run with the following command:

xnr_cli raid replace --name xnraid --position 1 --bdev null

The replace operation is needed in the following cases:

  • A drive fails, and the damaged device must be replaced by another physical device, inserted into either the same or a different physical slot.
  • A device is unstable and needs to be proactively replaced by another physical device in the same or a different physical slot.
  • One or more devices are disconnected due to a hardware failure, such as a disconnected cable or a disk controller failure. In this case, the disconnected disks should be reconnected (replaced with the same drives used before) after the hardware problem is fixed.
  • One or more devices need to be migrated to other physical slots due to a controller failure or for another reason. This may change the PCIe address and the device name. In this case, the disks should be reconnected (replaced with the drives used before, in the same positions).

Drive replace recommendations:

  • If a device changes its PCIe address and the drive name changes, use the drive UUID to identify which drive should be reconnected to which RAID logical position. The UUID is persisted in the RAID configuration and in the drive metadata (the misc. area on the disk). Both the “device-manager show” and the “raid show” commands report the UUID, which makes it possible to match physical drives (reported by the device manager) with logical RAID drives (reported by the “raid show” command).
  • If this correspondence is lost for any reason and you do not remember the corresponding logical position, you can try reinserting the drive into any Offline position. The replace logic checks the UUID and blocks inserting the drive into an incorrect position (except when inserting a new disk with a zeroed metadata area). Therefore, you can safely try to reinsert the disk into each Offline position, one by one, until the operation succeeds.
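The two recommendations above can be combined into one reinsertion procedure: look the drive up by UUID first, and fall back to trying each Offline position. The Python sketch below is a minimal illustration under stated assumptions. The helper names (replace_command, position_by_uuid, reinsert) are hypothetical, the UUIDs and Offline positions are assumed to have been read manually from the “raid show” and “device-manager show” output (their format is not modeled here), and run is a caller-supplied function that executes the command and returns True on success. Only the “raid replace” options shown earlier in this section are used.

```python
from typing import Callable, Dict, Iterable, List, Optional


def replace_command(raid: str, position: int, bdev: str) -> List[str]:
    """Build the argument list for the documented 'raid replace' command."""
    return ["xnr_cli", "raid", "replace",
            "--name", raid, "--position", str(position), "--bdev", bdev]


def position_by_uuid(drive_uuid: Optional[str],
                     raid_drives: Dict[int, str]) -> Optional[int]:
    """Find the logical position whose recorded UUID matches the drive.

    raid_drives maps logical position -> UUID, as read from 'raid show'.
    """
    if drive_uuid is None:
        return None
    for position, uuid in raid_drives.items():
        if uuid == drive_uuid:
            return position
    return None


def reinsert(raid: str, bdev: str, drive_uuid: Optional[str],
             raid_drives: Dict[int, str],
             offline_positions: Iterable[int],
             run: Callable[[List[str]], bool]) -> Optional[int]:
    """Reinsert a drive and return the position that accepted it, or None.

    If the UUID identifies the position, only that position is tried.
    Otherwise each Offline position is tried in turn until one succeeds.
    """
    known = position_by_uuid(drive_uuid, raid_drives)
    candidates = [known] if known is not None else list(offline_positions)
    for position in candidates:
        # 'run' is an assumption: any callable that executes the command
        # and reports whether the replace operation succeeded.
        if run(replace_command(raid, position, bdev)):
            return position
    return None
```

In practice, run could be a thin wrapper around the CLI, but how xnr_cli signals failure to a script (for example, through its exit status) is not stated here and would need to be confirmed. Note also that the trial-and-error fallback relies on the UUID check, so it is not safe for a new disk with a zeroed metadata area, which is accepted at any position.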

Upcoming product releases will include a feature that scans drive states and automatically reinserts temporarily disconnected drives where possible.