Recovering From a Disk Failure with Software RAID

Recovering from a disk failure in a software RAID is a straightforward process. If your server, chipset and kernel module allow it, you may be able to replace the offending drive without downtime. This is generally not the case with cheap SATA setups, where the server has to be powered down for the drive replacement. Although it might be safe to remove a SATA drive while the system is running, your kernel module may not recognise the new drive until it is reloaded or the machine is rebooted.

The failed drive will have to be identified. This is easy with hardware controllers (typically an indicator light flashes on the failed drive) but not quite as simple with software RAID. The surefire way to identify a drive is by serial number: find the device node (through either cat /proc/mdstat or dmesg), then run either of:

# hdparm -i /dev/{disk node}

# smartctl -a /dev/{disk node}
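Once the kernel has kicked a member out of its array, /proc/mdstat flags it with (F) after the device name. A minimal sketch for pulling those names out, assuming the usual mdstat layout; failed_members is a hypothetical helper name, not something mdadm provides:

```shell
# Print every array member flagged (F) in an mdstat-format file.
# With no argument it reads the live /proc/mdstat.
failed_members() {
    grep -oE '[[:alnum:]_-]+\[[0-9]+\]\(F\)' "${1:-/proc/mdstat}" |
        sed -E 's/\[[0-9]+\]\(F\)$//'
}
```

Running failed_members with no arguments prints one partition node per line (for example sdb1), which can then be passed to hdparm or smartctl as above. It prints nothing if no member is currently flagged.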

If your configuration does not support hot swapping, power down the machine and replace the bad drive. If it does, you may first need to remove the other partitions on the target disk from their respective arrays, since only the partition that actually failed will have been dropped automatically:

# mdadm -f /dev/{md node} /dev/{partition node}

This marks the partition as failed, after which it can be removed from the set:

# mdadm -r /dev/{md node} /dev/{partition node}
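When a disk carries several partitions in several arrays, it can help to list the fail and remove commands before running any of them. The following is a rough sketch that only prints the commands; plan_detach is a hypothetical helper, and it assumes member names begin with the disk name (sdb1, sdb2 and so on for sdb):

```shell
# Print the mdadm fail/remove commands for every partition of disk $1
# that is a member of an array listed in an mdstat-format file ($2,
# default /proc/mdstat). Nothing is executed; the output is a plan.
plan_detach() {
    awk -v disk="$1" '
        $1 ~ /^md/ && $2 == ":" {
            for (i = 3; i <= NF; i++) {
                if ($i !~ /\[[0-9]+\]/) continue   # skip "active", "raid1", ...
                member = $i
                sub(/\[.*/, "", member)            # sdb2[2] -> sdb2
                if (index(member, disk) == 1) {
                    print "mdadm -f /dev/" $1 " /dev/" member
                    print "mdadm -r /dev/" $1 " /dev/" member
                }
            }
        }' "${2:-/proc/mdstat}"
}
```

Review the output of plan_detach sdb and, if it looks right, run the lines by hand or pipe them to sh as root.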

If you are running swap space on the failed drive, be sure to disable it with swapoff before removing the disk.
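To check whether any swap area lives on the disk you are about to pull, /proc/swaps can be filtered by device name. A small sketch, assuming swap sits on plain partitions of the disk (not on an md device or a swap file); swap_on_disk is a hypothetical helper name:

```shell
# Print a swapoff command for each active swap area on disk $1,
# reading a /proc/swaps-format file ($2, default /proc/swaps).
swap_on_disk() {
    awk -v prefix="/dev/$1" '
        NR > 1 && index($1, prefix) == 1 { print "swapoff " $1 }
    ' "${2:-/proc/swaps}"
}
```

An empty result from swap_on_disk sdb means no swap on that disk is in use.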

It should be possible to boot the machine off the remaining set, but if you run into trouble it is just as easy to perform these operations from a live CD.

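First we need to copy the partition table exactly from one of the healthy disks to the new one. Let sda represent a healthy disk and sdb the new one. Double-check the order before running this, because the partition table on the second device is overwritten: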

# sfdisk -d /dev/sda | sfdisk /dev/sdb
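It may be worth confirming that the two partition tables now match before touching the arrays. The sketch below compares two saved sfdisk -d dumps after stripping the device name from each partition line; layout and same_layout are hypothetical helper names, and the approach assumes both device names have the same length so that sfdisk's column spacing is identical:

```shell
# Reduce an "sfdisk -d" dump (file $2) of device $1 to its partition
# lines with the device name stripped, e.g. "1 : start=..., size=...".
layout() {
    sed -n "s|^$1||p" "$2"
}

# Succeed only if the dump of device $1 (file $2) and the dump of
# device $3 (file $4) describe the same partitions.
same_layout() {
    [ "$(layout "$1" "$2")" = "$(layout "$3" "$4")" ]
}
```

For example, save the dumps with sfdisk -d /dev/sda > sda.dump and sfdisk -d /dev/sdb > sdb.dump, then run same_layout /dev/sda sda.dump /dev/sdb sdb.dump && echo match.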

Now we need to add the new partition(s) to our existing RAID set. If you have booted from a live CD and your sets were not automatically assembled on boot (they should have been), use mdadm --assemble to assemble them. Then:

# mdadm /dev/md0 -a /dev/sdb1

for each set and partition which needs to be added.
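With more than a couple of arrays, the add commands can be derived from the healthy disk's memberships. This is a sketch only: plan_add is a hypothetical helper, it prints commands instead of running them, and it assumes the new disk uses the same partition numbers as the healthy one (which holds after the partition table copy above) and is not yet a member of any array:

```shell
# For every array in an mdstat-format file ($3, default /proc/mdstat)
# that has a partition of the healthy disk $1 as a member, print the
# mdadm command that adds the matching partition of the new disk $2.
plan_add() {
    awk -v healthy="$1" -v fresh="$2" '
        $1 ~ /^md/ && $2 == ":" {
            for (i = 3; i <= NF; i++) {
                if ($i !~ /\[[0-9]+\]/) continue   # skip "active", "raid1", ...
                member = $i
                sub(/\[.*/, "", member)            # sda2[1] -> sda2
                if (index(member, healthy) == 1)
                    print "mdadm /dev/" $1 " -a /dev/" fresh substr(member, length(healthy) + 1)
            }
        }' "${3:-/proc/mdstat}"
}
```

Check the output of plan_add sda sdb against your layout before running any of the printed lines as root.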

We can watch the resynchronization status with:

# watch cat /proc/mdstat

Personalities : [raid1] [raid6] [raid5] [raid4]
md1 : active raid1 sdb2[2] sda2[1]
      243922816 blocks [2/1] [_U]
      [================>....]  recovery = 83.8% (204426880/243922816) finish=58.3min speed=11271K/sec

md0 : active raid1 sdb1[0] sda1[1]
      272960 blocks [2/2] [UU]

The resync will slow down considerably under heavy disk I/O, and you can expect below-average performance until recovery has completed.
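If you would rather script the wait than keep watch open, the progress figure can be extracted from the same file. A minimal sketch, assuming the mdstat layout shown above; rebuild_progress is a hypothetical helper name:

```shell
# Print "<array> <percent>" for every array that is currently
# recovering or resyncing, reading an mdstat-format file
# ($1, default /proc/mdstat). Prints nothing when all arrays are idle.
rebuild_progress() {
    awk '
        $1 ~ /^md/ && $2 == ":" { array = $1 }
        match($0, /(recovery|resync) *= *[0-9.]+%/) {
            pct = substr($0, RSTART, RLENGTH)
            sub(/^[a-z]+ *= */, "", pct)           # "recovery = 83.8%" -> "83.8%"
            print array, pct
        }' "${1:-/proc/mdstat}"
}
```

A loop such as while [ -n "$(rebuild_progress)" ]; do sleep 60; done then returns once no array reports a rebuild in progress.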

foxpa.ws
