Setting up an encrypted LVM over RAID 5
What I wanted
All I wanted was a software RAID-5 on three disks with whole-disk encryption on Fedora 12. For some reason, I thought the installation script would do that for me.
The relevant part of the installation procedure was kind enough to let me set it all up in the GUI, but when I went for the actual installation, I got a window saying “An error was encountered while setting up device sdc1”. sdc1, by the way, is just a plain unencrypted partition. But who cares. I insisted on looking at the “details”, where it said “device has not been created”.
Hurray! Now I get it all. Not.
A quick tour in the command line console (I wonder why I always end up doing things with my bare hands) revealed that the partition tables were intact. Simply put, nothing was done.
The setup
The catch about software RAID is that its drivers have to be loaded from somewhere, so obviously a non-RAID boot partition is needed for that. My decision was to allocate ~250 MB on all three disks, exactly the same number of cylinders, and put the boot partition on one of them. I don’t know why, but it feels right to me that the disks will access the same geometrical points when running as RAID, even though I’m not forced to do so.
The rest of each disk (around 1000 GB) is allocated as one big software RAID partition. With three disks like this forming a RAID-5, I’ll get one big (fake) ~2 TB disk, which will be encrypted completely. On top of that, I’ll put one big LVM physical volume, on which I’ll have a 4 GB swap and then a root partition. The precise sizes don’t matter anymore, since I’m under LVM.
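To summarize the plan as a stack, bottom to top (the names candybox, vg_raid, lv_swap and lv_root are the ones that show up later on this page):

sda1 / sdb1 / sdc1: ~250 MB plain partitions (boot lives on one of them)
sda2 / sdb2 / sdc2: ~1 TB partitions of type 0xfd (RAID autodetect)
/dev/md0: RAID-5 over the three big partitions, ~2 TB
candybox: LUKS encryption on top of /dev/md0
vg_raid: LVM volume group on the encrypted device
lv_swap, lv_root: the 4 GB swap and the root filesystem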
Setting up the RAID
Since the LVM tools are not active in Fedora’s rescue mode, I went for booting Ubuntu 9.10 as a LiveCD. The catch is that it supports neither LVM nor mdadm out of the box, so both had to be installed (after setting up a network connection, of course):
# apt-get install lvm2
# apt-get install mdadm
(the latter forcing me to configure postfix. Yuck!)
On /dev/sda: for the boot partition I allocated cylinders 1 to 30. For Raid Autodetect (type 0xfd) I took all the rest. Then I brutally raw-copied the first 128 sectors to /dev/sdb and /dev/sdc. That was a bad idea, since the partition table contains the disk’s GUID. So I cleaned up both disks with some zeros, and ran fdisk on each.
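For the record, the raw copy was along these lines (a sketch from memory, not a recommendation, given that it turned out to be a bad idea):

# dd if=/dev/sda of=/dev/sdb bs=512 count=128
# dd if=/dev/sda of=/dev/sdc bs=512 count=128

and the cleanup was the same thing with /dev/zero as the source (the exact sector count I zeroed is a guess at this point), before running fdisk on each disk:

# dd if=/dev/zero of=/dev/sdb bs=512 count=128
# dd if=/dev/zero of=/dev/sdc bs=512 count=128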
Following this I created a software RAID:
# mdadm --create /dev/md0 --level=raid5 --raid-devices=3 --chunk=128 /dev/sda2 /dev/sdb2 /dev/sdc2
And the hard disks started to work. /dev/md0 was up and running pretty much immediately. To monitor the progress:
# mdadm --detail /dev/md0
(yey!)
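By the way, the rebuild progress can also be watched through the kernel’s own status file, whose output appears a few more times further down this page:

# cat /proc/mdstat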
The whole disk encrypted
# cryptsetup -v luksFormat /dev/md0
After saying “YES” to kill all data and entering my secret passphrase, cryptsetup said all was successful, and a window popped up saying that “gvfs-gdu-volume-monitor closed unexpectedly”. How I love it when everything is so automated and I don’t need to worry about anything technical.
But who cares? I opened my new secret candy box:
# cryptsetup luksOpen /dev/md0 candybox
and found /dev/mapper/candybox in place (yey II)
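Out of paranoia, the LUKS header can also be inspected at this point (an optional sanity check, nothing below depends on it):

# cryptsetup luksDump /dev/md0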
Setting up LVM
It’s worth mentioning that there’s an interactive shell-like environment for manipulating LVM volumes. Just go
# lvm
Regardless, following the same HOWTO (more or less) I went
root@ubuntu:/home/ubuntu# pvcreate /dev/mapper/candybox
  Physical volume "/dev/mapper/candybox" successfully created
root@ubuntu:/home/ubuntu# vgcreate vg_raid -s 32M /dev/mapper/candybox
  Volume group "vg_raid" successfully created
Noticed the “-s 32M”? That sets the physical extent size to 32 MB instead of the default 4 MB. Since the maximal number of extents for a volume is 65534 (more or less…?), and the whole disk is around 2 TB, that’s the smallest size which does the job (32 MB x 65534 ~ 2 TB).
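The extent size and count can be verified after the fact with vgdisplay (just a sanity check; the “PE Size” and “Total PE” lines are the interesting ones):

# vgdisplay vg_raid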
OK, now let’s put the swap and root in place:
root@ubuntu:/home/ubuntu# lvcreate --size 4G vg_raid -n lv_swap
  Logical volume "lv_swap" created
root@ubuntu:/home/ubuntu# lvcreate --size 10G vg_raid -n lv_root
  Logical volume "lv_root" created
root@ubuntu:/home/ubuntu# ls /dev/mapper/
candybox  control  vg_raid-lv_root  vg_raid-lv_swap
Installing…
To my delight (and somewhat to my surprise), the Fedora 12 installation machinery detected the software RAID and all that was underneath it, prompted me for my passphrase, and allowed me to allocate the mount points on the existing logical volumes. Which is the sensible thing to do, but I couldn’t believe it actually happened!
All in all, the installation went smoothly, so did the bootup, and everything seems to be OK (fingers crossed).
When bad gets worse
So what happens if a disk suddenly decides to commit suicide? The answer is nothing special. Due to the redundancy, the system will keep on working as usual. Even worse, nobody will be notified (except for an email to root from mdadm). The system just runs on. In a way, that’s good and pretty bad at the same time.
Here’s a typical mail, which is sent to root:
From root@localhost.localdomain Sat Jan 16 17:51:27 2010
Return-Path: <root@localhost.localdomain>
Date: Sat, 16 Jan 2010 17:51:27 +0200
From: mdadm monitoring <root@localhost.localdomain>
To: root@localhost.localdomain
Subject: DegradedArray event on /dev/md0:ocho.localdomain
Status: RO

This is an automatically generated mail message from mdadm
running on localhost.localdomain

A DegradedArray event had been detected on md device /dev/md0.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sda2[0] sdc2[3] sdb2[1]
      1953037824 blocks level 5, 128k chunk, algorithm 2 [3/2] [UU_]
      [==================>..]  recovery = 94.0% (918267264/976518912) finish=920.7min speed=1053K/sec

unused devices: <none>
It looks like there’s no dedicated software for sounding the alarm. The solution seems to be a simple cronjob script, which runs mdadm every hour or so and checks that all is OK. The word “degraded” in the “detail” report looks like a good indicator that something isn’t as it should be. My script is at the bottom of this page.
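As for the mail that mdadm does send, the destination address can be redirected with a MAILADDR line in mdadm.conf (my understanding is that Fedora’s mdmonitor service picks it up from /etc/mdadm.conf; the address below is obviously a placeholder):

MAILADDR someone@example.com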
I tried unplugging the spare disk’s SATA cable while the computer was up and running (which is, by all means, a violent thing to do). Nothing happened. A few lines in /var/log/messages told a short story about a disk which doesn’t respond, and about the RAID going down to two disks. The log of the boot afterwards (with two disks) is not more dramatic about it. RAID-5, only two disks detected, too bad, let’s go on. The disk is declared “removed” in the “detail” report, and that’s it.
So I turned the computer off, replugged the disk, and turned it on again. The system showed no particular interest in it. To get it back to the RAID array, I did
# mdadm /dev/md0 --add /dev/sdc2
This kicked off the rebuild of this disk. Thinking about it, it’s pretty clever that nothing happens without human intervention. But I’ll consider having the smartd service running.
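If I do go for smartd, my understanding is that a single DEVICESCAN line in /etc/smartd.conf is enough to have it watch all disks, check their SMART health and mail root when something looks bad (a sketch, not something I’ve actually configured):

DEVICESCAN -H -m root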
When worse turns into a catastrophe
Since I was about to wipe my disks soon anyhow, I figured I’d take the test to the extreme. After all, there’s no point in having a spare disk if it doesn’t work, is there?
So I let the spare disk recover up to 25% (so I know that the relevant disk area is indeed OK, without letting it finish). Then I pulled disk #2’s SATA plug. So now we have disk #1 which is OK, disk #2 missing, and disk #3 spare but not completely recovered. Don’t try this on real data.
The system lost its stability this time, but it’s not like that connector was intended for hot removal. The attempt to reboot failed with “no root device found”. No wonder. I couldn’t really expect the RAID array to rely on one disk and one spare which never got the time to recover, could I? Well, I tried.
So I went for Ubuntu again. Keep in mind that the former /dev/sdc2 is now /dev/sdb2. The general tune is “everything is clean, but forget it”:
root@ubuntu:/home/ubuntu# mdadm --assemble --scan
mdadm: /dev/md0 assembled from 1 drive and 1 spare - not enough to start the array.
root@ubuntu:/home/ubuntu/mnt# mdadm --run /dev/md0
mdadm: failed to run array /dev/md0: Input/output error
root@ubuntu:/home/ubuntu# cat /proc/mdstat
Personalities :
md0 : inactive sda2[0](S) sdb2[3](S)
      1953037952 blocks

root@ubuntu:/home/ubuntu# mdadm --examine /dev/sdb2
/dev/sdb2:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : fb16d869:ffd27a50:e368bf24:bd0fce41 (local to host ubuntu)
  Creation Time : Fri Jan 15 11:11:40 2010
     Raid Level : raid5
  Used Dev Size : 976518912 (931.28 GiB 999.96 GB)
     Array Size : 1953037824 (1862.56 GiB 1999.91 GB)
   Raid Devices : 3
  Total Devices : 3
Preferred Minor : 0

    Update Time : Fri Jan 15 14:34:58 2010
          State : clean
 Active Devices : 1
Working Devices : 2
 Failed Devices : 2
  Spare Devices : 1
       Checksum : 15648d3e - correct
         Events : 2500

         Layout : left-symmetric
     Chunk Size : 128K

      Number   Major   Minor   RaidDevice State
this     3       8       34        3      spare

   0     0       8        2        0      active sync   /dev/sda2
   1     1       0        0        1      faulty removed
   2     2       0        0        2      faulty removed
   3     3       8       34        3      spare
But I wouldn’t let this turn me off. There’s a guy who had this kind of problem for real, and was kind enough to document his findings. The bottom line was to tell mdadm to create the RAID array from scratch, only assuming that everything is already in place, with “--assume-clean”. Extremely dangerous. I would raw-copy all the data to a new hard disk and try it there, if this were for real. But it wasn’t. So I went:
root@ubuntu:/home/ubuntu/mnt# mdadm --create /dev/md0 --assume-clean --level=5 --verbose --chunk=128 --raid-devices=3 /dev/sda2 missing /dev/sdb2
mdadm: layout defaults to left-symmetric
mdadm: /dev/sda2 appears to be part of a raid array:
    level=raid5 devices=3 ctime=Fri Jan 15 11:11:40 2010
mdadm: /dev/sdb2 appears to be part of a raid array:
    level=raid5 devices=3 ctime=Fri Jan 15 11:11:40 2010
mdadm: size set to 976518912K
Continue creating array? y
mdadm: array /dev/md0 started.
root@ubuntu:/home/ubuntu/mnt# cryptsetup luksOpen /dev/md0 candybox
Enter LUKS passphrase:
key slot 0 unlocked.
Command successful.
But I didn’t get the LVM devices kicked off. So I went:
root@ubuntu:/dev/mapper# dmsetup resume /dev/mapper/candybox
root@ubuntu:/dev/mapper# ls
candybox  control  vg_raid-lv_root  vg_raid-lv_swap
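In hindsight, I suppose the textbook way to get the logical volumes up after luksOpen would have been to rescan and activate the volume group, along these lines (not what I actually did):

# vgscan
# vgchange -ay vg_raid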
And of course, in real life I would fsck the disk and such. But the bottom line is clear: If the data is there, it’s there.
Summary
Let’s hope I won’t ever need this stuff. Let’s hope that all three disks will live forever. But it’s comforting to know, that if one of those suddenly dies, there is a good chance the whole story will end with the purchase of some hardware. Nothing else.
Sort-of appendix
When the RAID doesn’t come up by itself
If the RAID array is known to be fine, but doesn’t come up:
# mdadm --assemble --scan
which worked under Ubuntu, since it was nice enough to create an /etc/mdadm/mdadm.conf file. Otherwise we need to be more explicit:
# mdadm --assemble /dev/md0 /dev/sda2 /dev/sdb2 /dev/sdc2
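Alternatively, if a config file is preferred over spelling out the devices each time, one can be generated from whatever the disks themselves report (this appends to Ubuntu’s path; on Fedora the file is /etc/mdadm.conf):

# mdadm --examine --scan >> /etc/mdadm/mdadm.conf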
Script for checking the RAID’s health
This is the script I’ve put as a cronjob. Note that it’s completely silent when all is OK, and starts to say things when it’s not. The point is that the cron daemon sends an email message (usually to root) whenever a cronjob produces output. This doesn’t solve the problem of the email going to root, but it’s a good guard if the mails to root are forwarded to someone attentive.
I’ve also made a log in /var/log/raidlog. The purpose of this log is to allow me to verify that the script is indeed running every now and then. After all, the whole issue is that I don’t expect a hard disk failure tomorrow, but rather when I’ve forgotten about this script altogether. But I hope I’ll have the sense to peek at the log every now and then.
#!/bin/bash

device=/dev/md0
now=`date`
report="";

checkraid() {
    mdadm --detail $device | grep State | {
        read s;

        if ! echo $s | grep -i -q state ; then
            echo Problem: mdadm gave bad output for $device
            return 1;
        fi

        if echo $s | grep -i -q degraded ; then
            echo Problem: Device $device is degraded
            return 1;
        fi

        return 0;
    }
}

if ! checkraid ; then
    echo ""
    echo mdadm output follows:
    echo ""
    mdadm --detail $device

    echo "Bad RAID at $now" >> /var/log/raidlog
    exit 1;
fi

echo "RAID OK at $now" >> /var/log/raidlog
exit 0;
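To actually run it every hour, a single line in a file under /etc/cron.d should do the job; the file name (/etc/cron.d/raidcheck) and the script’s location below are my own arbitrary choices, for illustration only:

0 * * * * root /usr/local/bin/raidcheck.sh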
A useless adventure
There is a basic flaw in the above: The LVM is generated on /dev/md0. If we view /dev/md0 as a hard disk, it means it has no partition table!
So somewhere in the middle of the route above, I tried this: with fdisk, I set up an LVM partition on /dev/md0, which should have shown up as /dev/md0p1.
So what I wanted to do was:
# pvcreate /dev/md0p1
# vgcreate lvm-raid -s 32M /dev/md0p1
The difference is that I went for /dev/md0p1 rather than /dev/md0, so that a decent partition table is in place.
But the first one failed with a “/dev/md0p1 not found (or ignored by filtering)”, because there is some kernel issue. Or is there? Maybe it’s the whole world telling me I should stop being so fussy.
What I needed was a kernel of 2.6.24 or lower, because whoever reported the kernel problem had things running on 2.6.24. I wanted to run an earlier Ubuntu (8.10 instead of 9.10), assuming that it was a kernel issue. I will never know, since that distro got stuck during boot.
So I went for a small rescue distro, namely SystemRescueCD version 1.0.0 (loading altker64, since the default kernel caused a kernel panic). And there I encountered a brand new problem: /dev/md0p1 never appeared in the /dev directory, even though the partition was there. Using mdadm to kick the RAID off did create /dev/md0, but not its sub-partitions.
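For what it’s worth, kpartx (from multipath-tools) is supposed to create device-mapper nodes for the partitions it finds on a block device, which might have conjured up the missing sub-partitions under /dev/mapper. I never tried it on this setup, so take it as a pointer only:

# kpartx -a /dev/md0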
At this point I realized that even if I managed to get it my way, odds are that not many others did it this way, meaning that nobody really tests things on my kind of setup. In other words, things are expected to go wrong in the long run. Which is why I dropped this.
Reader Comments
Tried the same thing today, also without success, on CentOS 5.4.
My main goal was to emulate a HUGE disk (6 TB), for experimenting with EFI/GPT partitions instead of a standard Master Boot Record. So LVM on an md-device is not an option.
Virtual disks are really cheap, so I created a VMware virtual machine (VMware Workstation) and added six 1 TB thin-provisioned disks to it. It takes hardly any disk space (approx. 100 MB for a 1 TB thin-provisioned disk).
Next step was to partition the disks (one fd-type primary partition on each disk) and to create a RAID array in the virtual machine: four 1 TB disks in RAID-0 with
“mdadm --create /dev/md0 --auto=part -n 4 -l 0 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1”.
So far so good. The option “--auto=part” does make it a partitionable array, according to the manpage.
BUT IT DOESN’T WORK!
The main problem is the device numbering, in particular the minor numbers.
On an ordinary disk, minor number 0 is the whole disk, minor number 1 is the first partition, and so on.
On an md-device, minor number 0 is /dev/md0, minor number 1 is /dev/md1, and so on.
Don’t know how to work around this.
Hi,
I desperately tried today to make RAID5 and LVM work too.
I have the following settings:
Four 2 TB drives on an Atom ION mobo, and I’m booting Debian Lenny from an SD card.
At first, I made GPT labels on the 2 TB drives, then created a RAID5 with all 4 drives, getting /dev/md0. So far so good… as you say.
Then, I tried to parted /dev/md0. So I wrote a GPT label on it too, and then, when I create partitions, it tells me that the kernel will not be aware of the changes until I reboot… I created 11 500 GB partitions and a last one with the remaining space, with the idea of staying flexible with LVM. Reboot… no device node for any of the partitions. I’ve been stuck on this for many hours now :-(
I tried to put msdos labels on the physical disks, and then GPT on the RAID volume. Same problem. Anyhow, as soon as anything is GPT, it fails…