
Replacing failed RAID drive

Scenario / Question
A drive has failed in my RAID 1 configuration, and I need to replace it with a new drive.

Solution / Answer

1. Use mdadm to fail the drive's partition(s) and remove them from the RAID arrays.
2. Physically remove the old drive from the system and install the new drive.
3. Create the same partition table on the new drive that existed on the old drive.
4. Add the drive's partition(s) back into the RAID arrays.
In this example I have two drives, /dev/sda and /dev/sdb. Each drive has six RAID partitions, and each partition is configured into a RAID 1 array denoted by md#. We will assume that /dev/sdb has failed and needs to be replaced.
Note that Linux software RAID mirrors partitions, not necessarily entire disks, so a single disk can belong to several RAID Arrays.
Fail and Remove the failed partitions and disk:
Identify which RAID Arrays have failed:
To identify whether a RAID Array has failed, look at the status string, such as [UU]. Each U represents a healthy partition in the RAID Array.
If you see [UU] then the RAID Array is healthy. If a U is replaced by an underscore, as in [_U] or [U_], then the RAID Array is degraded or faulty.
$ cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sda1[0] sdb1[1]
      104320 blocks [2/2] [UU]

md2 : active raid1 sda3[0] sdb3[1]
      2048192 blocks [2/2] [UU]

md3 : active raid1 sda5[0]
      2048192 blocks [2/1] [U_]

md4 : active raid1 sda6[0] sdb6[1]
      2048192 blocks [2/2] [UU]

md5 : active raid1 sda7[0] sdb7[1]
      960269184 blocks [2/2] [UU]

md1 : active raid1 sda2[0] sdb2[1]
      10241344 blocks [2/2] [UU]
From the above output we can see that RAID Array “md3” is missing a “U” and is degraded or faulty.
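The check described above can be scripted. The following is a minimal sketch, not part of the original procedure: the function name degraded_arrays is made up for illustration, and it assumes the two-line mdstat layout shown in the output above (array line followed by a status line).

```shell
# List md arrays whose status string contains an underscore, e.g. [U_] or [_U].
# Usage: degraded_arrays [mdstat_file]   (defaults to /proc/mdstat)
# "degraded_arrays" is a hypothetical helper name, not an mdadm feature.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { array = $1; next }        # remember the current array name
        array != "" && /\[[U_]+\]/ {              # status line such as "... [2/1] [U_]"
            if ($0 ~ /\[U*_[U_]*\]/)              # at least one "_" means a missing member
                print array
            array = ""
        }
    ' "${1:-/proc/mdstat}"
}
```

Calling degraded_arrays with no argument reads the live /proc/mdstat; passing a file name lets you try it on saved output first.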
Removing the failed partition(s) and disk:

Before we can physically remove the hard drive from the system we must first “fail” the disk's partition(s) in all RAID Arrays that they belong to. Even though only partition /dev/sdb5 (RAID Array md3) has failed, we must manually fail every other /dev/sdb# partition that belongs to a RAID Array before we can remove the hard drive from the system.

To fail the partition we issue the following command:
# mdadm --manage /dev/md0 --fail /dev/sdb1
Repeat this command for each partition changing /dev/md# and /dev/sdb# to match the output from “cat /proc/mdstat”
# mdadm --manage /dev/md1 --fail /dev/sdb2
Removing:

Now that all the partitions are failed we can remove them from the RAID Arrays.
# mdadm --manage /dev/md0 --remove /dev/sdb1
Repeat this command for each partition changing /dev/md# and /dev/sdb# to match the output from “cat /proc/mdstat”
# mdadm --manage /dev/md1 --remove /dev/sdb2
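Repeating the fail and remove commands by hand is error-prone when a disk has many partitions. As a sketch only, the helper below prints (it does not run) one mdadm command per member partition of a disk, based on mdstat contents. The function name print_member_cmds is hypothetical; the mdadm options are the ones used above.

```shell
# Print, without executing, one mdadm command per array member on a given disk.
# Usage: print_member_cmds ACTION DISK [mdstat_file]
#   ACTION is "fail" or "remove"; DISK is a name such as sdb (no /dev/ prefix).
# "print_member_cmds" is a hypothetical helper name.
print_member_cmds() {
    awk -v action="$1" -v disk="$2" '
        /^md[0-9]+ :/ {
            for (i = 3; i <= NF; i++) {
                part = $i
                sub(/\[.*/, "", part)             # "sdb1[1]" or "sdb5[1](F)" becomes "sdb1" / "sdb5"
                if (part ~ ("^" disk "[0-9]+$"))
                    printf "mdadm --manage /dev/%s --%s /dev/%s\n", $1, action, part
            }
        }
    ' "${3:-/proc/mdstat}"
}
```

Run print_member_cmds fail sdb first, review the printed commands, execute them, and then do the same with remove. Partitions that the kernel has already dropped from an array do not appear in mdstat and therefore produce no command.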
Power off the system and physically replace the hard drive:
# shutdown -h now
Adding the new disk to the RAID Array:

Now that the new hard drive has been physically installed we can add it to the RAID Array.

In order to use the new drive we must create the exact same partition table structure that was on the old drive.

We can mirror the partition table of the existing drive onto the new drive. First make sure the new drive really is /dev/sdb, because device names can change when drives are removed and replaced: run “fdisk -l /dev/sdb” and confirm that no partitions exist on it. Then copy the partition table with one command:
# sfdisk -d /dev/sda | sfdisk /dev/sdb
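The "no partitions exist" check can also be done without reading fdisk output by eye. This is a small sketch that counts a disk's partitions in /proc/partitions; the function name count_partitions is an assumption made for illustration.

```shell
# Count the partitions of a disk as listed in a /proc/partitions style table.
# Usage: count_partitions DISK [partitions_file]   (DISK without /dev/, e.g. sdb)
# "count_partitions" is a hypothetical helper name.
count_partitions() {
    awk -v disk="$1" '
        $4 ~ ("^" disk "[0-9]+$") { n++ }         # rows named sdb1, sdb2, ...
        END { print n + 0 }                       # print 0 when nothing matched
    ' "${2:-/proc/partitions}"
}

# Example guard (commented out so nothing is written by accident):
# [ "$(count_partitions sdb)" -eq 0 ] && sfdisk -d /dev/sda | sfdisk /dev/sdb
```

A result of 0 for the new disk is what you want to see before copying the partition table onto it.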

Add the partitions back into the RAID Arrays:

Now that the partitions are configured on the newly installed hard drive, we can add the partitions to the RAID Array.
# mdadm --manage /dev/md0 --add /dev/sdb1

mdadm: added /dev/sdb1
Repeat this command for each partition, changing /dev/md# and /dev/sdb# accordingly.

# mdadm --manage /dev/md1 --add /dev/sdb2
mdadm: added /dev/sdb2
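Because the new disk received a copy of the surviving disk's partition table, the add commands can be derived from the members that are still present. The sketch below prints them for review instead of executing them. The name print_add_cmds is hypothetical, and the helper assumes the partition numbers on both disks are identical.

```shell
# Print, without executing, the --add command for each array that still has a
# member on the surviving disk, substituting the new disk's name.
# Usage: print_add_cmds SURVIVING_DISK NEW_DISK [mdstat_file]   e.g. sda sdb
# "print_add_cmds" is a hypothetical helper name.
print_add_cmds() {
    awk -v src="$1" -v dst="$2" '
        /^md[0-9]+ :/ {
            for (i = 3; i <= NF; i++) {
                part = $i
                sub(/\[.*/, "", part)             # strip the "[role]" suffix
                if (part ~ ("^" src "[0-9]+$"))
                    printf "mdadm --manage /dev/%s --add /dev/%s%s\n",
                           $1, dst, substr(part, length(src) + 1)
            }
        }
    ' "${3:-/proc/mdstat}"
}
```

The helper does not check whether the new partition is already a member, so skip any printed command for an array you have already repaired.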

Now we can check that the partitions are being synchronized by issuing:

# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sda1[0] sdb1[1]
      104320 blocks [2/2] [UU]

md2 : active raid1 sdb3[2] sda3[0]
      2048192 blocks [2/1] [U_]
        resync=DELAYED

md3 : active raid1 sdb5[2] sda5[0]
      2048192 blocks [2/1] [U_]
        resync=DELAYED

md4 : active raid1 sdb6[2] sda6[0]
      2048192 blocks [2/1] [U_]
        resync=DELAYED

md5 : active raid1 sdb7[2] sda7[0]
      960269184 blocks [2/1] [U_]
      [>....................]  recovery =  1.8% (17917184/960269184) finish=193.6min speed=81086K/sec

md1 : active raid1 sda2[0] sdb2[1]
      10241344 blocks [2/2] [UU]

Once all partitions have synchronized, every array shows [UU] again and the RAID is back to normal.
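To wait for synchronization without watching the screen, you can poll mdstat for the recovery and resync markers shown in the output above. This is only a sketch: the function name resync_active is invented here, and the 60-second interval is an arbitrary choice.

```shell
# Succeed (exit status 0) while any array is still rebuilding or queued,
# i.e. while mdstat contains a "recovery" or "resync" marker.
# Usage: resync_active [mdstat_file]   (defaults to /proc/mdstat)
# "resync_active" is a hypothetical helper name.
resync_active() {
    grep -Eq 'recovery|resync' "${1:-/proc/mdstat}"
}

# Example polling loop (commented out; the interval is an arbitrary choice):
# while resync_active; do sleep 60; done; echo "all arrays synchronized"
```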
Install Grub on new hard drive MBR:

We need to install grub on the MBR of the newly installed hard drive, so that if the other drive fails the new drive is able to boot the OS.

Enter the Grub command line:

# grub

Locate grub setup files:

grub> find /grub/stage1

On a RAID 1 with two drives present you should expect to get

(hd0,0)
(hd1,0)

Install grub on the MBR:

grub> device (hd0) /dev/sdb
grub> root (hd0,0)
grub> setup (hd0)
grub> quit

For IDE drives, use /dev/hdb instead of /dev/sdb.

We mapped the second drive, /dev/sdb, to device (hd0) because installing grub this way puts a bootable MBR on the second drive: when the first drive is missing, the second drive becomes the first disk the BIOS sees and will boot.

This ensures that if the first drive in the RAID Array fails, or has already failed, you can still boot the operating system from the second drive.
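The same grub commands can be fed to GRUB legacy non-interactively through its batch mode. The sketch below only generates the command script; the function name grub_setup_script is made up for illustration, and it assumes /boot is the first partition of the disk, as in the (hd0,0) example above.

```shell
# Emit the GRUB legacy command sequence that installs the boot loader on the
# MBR of the given disk, mapping that disk to (hd0).
# Usage: grub_setup_script /dev/sdb
# "grub_setup_script" is a hypothetical helper name.
grub_setup_script() {
    printf 'device (hd0) %s\n' "$1"               # treat this disk as the first BIOS disk
    printf 'root (hd0,0)\n'                       # assumes /boot is its first partition
    printf 'setup (hd0)\n'
    printf 'quit\n'
}

# Example (commented out; review the generated script before running it):
# grub_setup_script /dev/sdb | grub --batch
```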

Command history from a real replacement of /dev/sdb. On this host the partition-to-array mapping differs from the example above (see the mdstat output that follows):

cat /proc/mdstat
/sbin/mdadm --manage /dev/md0 --fail /dev/sdb1
/sbin/mdadm --manage /dev/md1 --fail /dev/sdb7
/sbin/mdadm --manage /dev/md5 --fail /dev/sdb2
/sbin/mdadm --manage /dev/md5 --remove /dev/sdb2
/sbin/mdadm --manage /dev/md0 --remove /dev/sdb1
/sbin/mdadm --manage /dev/md1 --remove /dev/sdb7
/sbin/mdadm --manage /dev/md2 --remove /dev/sdb3
/sbin/mdadm --manage /dev/md3 --remove /dev/sdb5
/sbin/mdadm --manage /dev/md4 --remove /dev/sdb6
/sbin/mdadm -D /dev/md1
# find out the disk's serial number so that the right drive gets pulled
/usr/sbin/smartctl --all -T permissive /dev/sdb
# after the replacement, copy the partition table onto the new disk
/sbin/sfdisk -d /dev/sda | /sbin/sfdisk /dev/sdb
# add the partitions to the RAID arrays
/sbin/mdadm -D /dev/md0
/sbin/mdadm --manage /dev/md0 --add /dev/sdb1
/sbin/mdadm -D /dev/md1
/sbin/mdadm --manage /dev/md1 --add /dev/sdb7
/sbin/mdadm -D /dev/md2
/sbin/mdadm --manage /dev/md2 --add /dev/sdb3
/sbin/mdadm -D /dev/md3
/sbin/mdadm --manage /dev/md3 --add /dev/sdb5
/sbin/mdadm -D /dev/md4
/sbin/mdadm --manage /dev/md4 --add /dev/sdb6
/sbin/mdadm -D /dev/md5
/sbin/mdadm --manage /dev/md5 --add /dev/sdb2
cat /proc/mdstat

[root@nslb4 ~]# cat /proc/mdstat
Personalities : [raid1]
md5 : active raid1 sdb2[2] sda2[0]
      55038592 blocks [2/1] [U_]
        resync=DELAYED

md2 : active raid1 sdb3[2] sda3[0]
      5245120 blocks [2/1] [U_]
        resync=DELAYED

md3 : active raid1 sdb5[2] sda5[0]
      3148608 blocks [2/1] [U_]
        resync=DELAYED

md4 : active raid1 sdb6[2] sda6[0]
      2096384 blocks [2/1] [U_]
        resync=DELAYED

md1 : active raid1 sdb7[2] sda7[0]
      2096384 blocks [2/1] [U_]
        resync=DELAYED

md0 : active raid1 sdb1[2] sda1[0]
      4192832 blocks [2/1] [U_]
      [====>................]  recovery = 24.5% (1031488/4192832) finish=24.7min speed=2124K/sec

unused devices: <none>




Last updated 2010-06-03 Thursday 14:25:00
