About a month ago, I started getting daily mail from mdadm (the utility that manages Linux software RAID arrays) telling me there was a degraded-array problem:
Subject: DegradedArray event on /dev/md0:myserver
This is an automatically generated mail message from mdadm
running on myserver
A DegradedArray event had been detected on md device /dev/md0.
Uh-oh. This is one reason why I don't use RAID for booting the system. I use a totally separate hard drive for the OS. My RAID is for data.
Let's log in as root and take a look:
# mdadm --detail /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Thu Mar 17 10:42:16 2011
Raid Level : raid1
Array Size : 1171873823 (1117.59 GiB 1200.00 GB)
Used Dev Size : 1171873823 (1117.59 GiB 1200.00 GB)
Raid Devices : 1
Total Devices : 3
Persistence : Superblock is persistent
Update Time : Fri Jan 4 07:25:43 2013
State : active, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 0
Spare Devices : 0
Name : myserver:0
UUID : b8d85003:03eb3cd0:fbb98516:361fb411
Events : 6596
Number Major Minor RaidDevice State
0 8 1 0 active sync /dev/sdb1
1 0 0 1 faulty removed
So one of my RAID drives has apparently desynced, gone bad, or otherwise fallen off the array beyond the ability of mdadm to fix. sdb1 is the remaining good drive. Which partitions belong to the array? blkid can tell us:
# blkid | grep linux_raid_member
/dev/sda1: UUID="b8d85003-03eb-3cd0-fbb9-8516361fb411" LABEL="myserver:0" TYPE="linux_raid_member"
/dev/sdb1: UUID="b8d85003-03eb-3cd0-fbb9-8516361fb411" LABEL="myserver:0" TYPE="linux_raid_member"
My two RAID drives are /dev/sda1 and /dev/sdb1. sdb1 is good. Therefore the faulty drive is sda1. Let's get the serial number of the bad drive, since that is printed on the outside of the drive.
# hdparm -I /dev/sda1 | grep "Serial Number"
Serial Number: WD-WMAZA3691591
Alternatively, I could look up the serial number of the good drive(s) instead.
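For example, the same hdparm query pointed at the known-good member (sdb1 in my case) tells me which serial number not to pull:
# hdparm -I /dev/sdb1 | grep "Serial Number"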
Let's try re-adding the bad drive. Perhaps it was a one-time anomaly:
# mdadm --add /dev/md0 /dev/sda1
Watch the progress of the sync using cat /proc/mdstat. For a 1200GB partition, my system takes about three hours for the mirroring to complete.
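If you'd rather not keep re-running that command by hand, watch gives a self-refreshing view (the interval is just personal preference):
# watch -n 60 cat /proc/mdstat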
Often, re-adding the drive to the array is all you need to do, and the not-actually-faulty drive will happily rejoin its friends in the RAID.
However, in this case, an hour later the mirroring failed. Another degraded-array email showed up in my inbox, and even a basic information command like hdparm -I /dev/sda1 failed with an input/output error. The drive is no longer trustworthy and must be replaced.
I ordered two slightly larger drives, to expand to a three-disk array. Replacement drives must be the same size or larger. I happen to know I have space in the server case for an additional drive, and that I have spare power connectors and motherboard data connections (and cables) available for the new drives.
With the new drives in hand, it's time to power off the server (er, don't forget to tell network users that you're doing this!). Since we're adding and removing drives, there is a good chance that we will need access to the BIOS...so that means moving the server to the desk and hooking up a keyboard and monitor.
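Before actually powering down, I like to make sure md has completely let go of the dead partition so it can't get picked up again on reboot. A sketch of the commands (mdadm may have already marked sda1 as failed, in which case the first one is redundant):
# mdadm --manage /dev/md0 --fail /dev/sda1
# mdadm --manage /dev/md0 --remove /dev/sda1
# poweroff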
First, remove the bad drive. Good thing we know the serial number!
Second, add the new drives. Test them by booting into BIOS and ensuring they are detected. While in BIOS, make sure the boot drive is correct. Mine had changed and I had to fix it.
Finally, boot the machine and gain root.
# ls /dev | grep sd
sda
sda1
sdb
sdb1
sdb2
sdb5
sdc
sdd
It looks like sda1 is our RAID partition, sdb* is our non-RAID system disk, and sdc and sdd are the new unpartitioned drives. We can verify that using parted:
# parted -l
Model: ATA WDC WD20EARS-00M (scsi)
Disk /dev/sda: 2000GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Number Start End Size File system Name Flags
1 17.4kB 1200GB 1200GB ext3 Server
Model: ATA WDC WD3200AVJB-6 (scsi)
Disk /dev/sdb: 320GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Number Start End Size Type File system Flags
1 1049kB 319GB 319GB primary ext3 boot
2 319GB 320GB 1546MB extended
5 319GB 320GB 1546MB logical linux-swap(v1)
In order to add the new drives to the RAID array, we must first create a partition table on each (I chose GPT because these are large drives). Next, we partition them to match the existing drive. Then we can add them to the array.
# parted /dev/sdc mklabel gpt
# parted -a optimal /dev/sdc mkpart 1 ext3 17.4kb 1200GB
# mdadm --add /dev/md0 /dev/sdc1
# parted /dev/sdd mklabel gpt
# parted -a optimal /dev/sdd mkpart 1 ext3 17.4kb 1200GB
# mdadm --add /dev/md0 /dev/sdd1
See how each drive gets the same three commands? See how we took the partition start and end directly from the existing partition above?
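One caveat worth a hedged mention: if the array's device count is still set lower than the number of members you now want, mdadm will hold a freshly added partition as a spare instead of syncing it. Should cat /proc/mdstat show a new partition sitting idle as a spare, growing the count brings it into the mirror:
# mdadm --grow /dev/md0 --raid-devices=3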
And now we are back in familiar territory; we have added drives to an array before. We can monitor the process using cat /proc/mdstat, and expect it to take a few hours.
And the array is humming along smoothly again.
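As a final sanity check, the same command we started with should now show every device as active and in sync, with no mention of "degraded" in the State line:
# mdadm --detail /dev/md0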