So I have this EC2 server, which is a MySQL replication slave (henceforth known as “victim”). This server was originally running Alestic’s 64-bit paravirtual AMI for Ubuntu Oneiric 11.10 (ami-a8ec6098), but had been previously upgraded to Precise with do-release-upgrade.
Yesterday, I performed an apt-get dist-upgrade on victim in order to upgrade its apps and the kernel. Then I rebooted the server, but it refused to boot, hanging before SSH could come online. Checking the system log, I saw that it was unable to mount its root disk:
Begin: Mounting root file system ...
Begin: Running /scripts/local-top ... done.
Begin: Running /scripts/local-premount ... done.
[11081144.842424] EXT4-fs (xvdf): mounted filesystem with ordered data mode. Opts: (null)
Begin: Running /scripts/local-bottom ... done.
done.
Begin: Running /scripts/init-bottom ... done.
[11081145.506225] random: nonblocking pool is initialized
lxcmount stop/pre-start, process 179
 * Starting configure network device security              [ OK ]
 * Starting Mount network filesystems                      [ OK ]
 * Starting Mount filesystems on boot                      [ OK ]
 * Stopping Mount network filesystems                      [ OK ]
 * Starting Populate and link to /run filesystem           [ OK ]
 * Stopping Populate and link to /run filesystem           [ OK ]
 * Stopping Track if upstart is running in a container     [ OK ]
 * Starting Bridge socket events into upstart              [ OK ]
 * Starting configure network device                       [ OK ]
 * Starting Initialize or finalize resolvconf              [ OK ]
The disk drive for / is not ready yet or not present.
Continue to wait, or Press S to skip mounting or M for manual recovery
Since this is an Amazon server, pressing a key is not an option: there is no interactive console. Interestingly, though, telling the instance to stop causes more of the system log to be printed as it performs a clean shutdown.
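As an aside, you don’t need the web console to read that log: the same console output can be pulled with the AWS CLI, roughly like this (the instance ID is a placeholder):

aws ec2 get-console-output --instance-id i-000000 --output text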
The obvious question was, why couldn’t it find or mount the root disk?
At this point I took a snapshot of the root disk so that I didn’t mess it up further with any of my experiments. Then I started up a new server, “rescue”, from a pristine Alestic Precise AMI and dist-upgraded it, which should make it nearly identical to victim. But rescue restarted fine, without hanging. There must be some critical difference between my pristine rescue server and the hanging victim server; I just had to find it.
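Incidentally, the snapshot step is easy to script too; with the AWS CLI it’s roughly this (the volume ID here is a placeholder):

aws ec2 create-snapshot --volume-id vol-000000 --description "victim root disk, pre-surgery"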
In the AWS control panel, I noticed that victim was using an older Amazon “kernel” (which I think is just a PV-GRUB kernel used to boot the kernel installed on the instance’s disk). So I used the AWS CLI tool to set it to the same one as rescue’s, which should be fine since rescue has the same architecture and guest kernel version:
aws ec2 modify-instance-attribute --instance-id i-000000 --kernel "{\"Value\":\"aki-fc8f11cc\"}"
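To double-check that the change took, the kernel attribute can be read back, roughly like this:

aws ec2 describe-instance-attribute --instance-id i-000000 --attribute kernel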
No luck! It still had the same issue.
I detached the root disk (/dev/sda1) from victim, attached it to rescue (as /dev/xvdf) and mounted it at /mnt:
root@rescue# mount /dev/xvdf /mnt
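For completeness, the detach/attach shuffle can also be scripted with the AWS CLI, roughly like this (volume and instance IDs are placeholders; note that EC2 names the device /dev/sdf even though the guest kernel exposes it as /dev/xvdf):

aws ec2 detach-volume --volume-id vol-000000
aws ec2 attach-volume --volume-id vol-000000 --instance-id i-111111 --device /dev/sdf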
One reason for not being able to mount the root disk would be that its label had changed. The /etc/fstab file specifies how the root disk should be identified, so let’s take a look:
root@rescue# cat /mnt/etc/fstab
LABEL=cloudimg-rootfs / ext4 defaults 0 0
Okay, so it looks for a disk with the label “cloudimg-rootfs”. What label does the disk have?
root@rescue# e2label /dev/xvdf
cloudimg-rootfs
So the disk label is correct, rats. Just for fun, I tried replacing the “LABEL=cloudimg-rootfs” part with plain old “/dev/xvda1”, but it still wouldn’t boot.
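For reference, the substituted fstab line would have looked roughly like this:

/dev/xvda1 / ext4 defaults 0 0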
Maybe the new kernel or initrd was corrupt? I checked out /boot/grub/menu.lst, and both victim and rescue were trying to boot the exact same kernel version. So I just removed victim’s /boot, initrd.img and vmlinuz and replaced them with the pristine ones from rescue:
root@rescue# rm -rf /mnt/initrd.img* /mnt/vmlinuz* /mnt/boot
root@rescue# cp -a /boot /mnt/
root@rescue# cp -a /vmlinuz /mnt/
root@rescue# cp -a /initrd.img /mnt/
Now I tried booting victim, but it still hung for the same reason! So the kernel or initrd wasn’t the issue.
By now I was getting desperate, so I tried something really crazy: I deleted victim’s /etc directory and replaced it with the one from rescue:
root@rescue# rm -rf /mnt/etc
root@rescue# cp -a /etc /mnt
I booted up victim with this new fiddled root disk, and it worked!! It booted fine! So the reason that victim wouldn’t boot was some bad configuration in /etc. But the thing I most wanted to save from victim was its custom configuration, so I couldn’t just use rescue’s out-of-the-box defaults. So I deleted the victim disk I had mangled, restored it from its snapshot (back to its broken condition), and re-attached and mounted it on rescue.
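That restore is also scriptable: create a fresh volume from the snapshot and attach it, roughly like this (IDs and availability zone are placeholders; the zone must match rescue’s, and the new volume ID comes back from create-volume):

aws ec2 create-volume --snapshot-id snap-000000 --availability-zone us-east-1a
aws ec2 attach-volume --volume-id vol-222222 --instance-id i-111111 --device /dev/sdf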
Now I just had to find the difference between the pristine configuration and the broken one that caused it not to boot. I started by installing all the packages I could remember from victim onto rescue, in order to minimise the size of the diff between their configurations. Finally, I ran a recursive diff, ignoring any “*~” emacs backup files:
root@rescue# diff -r --exclude "*~" /mnt/etc /etc > /root/victim-diff
Since the system hung before anything exciting like the network or root disk was initialised, I knew I could ignore MySQL, Apache, Postfix, Nagios and friends: they are all started too late in the boot process to be the culprit. That didn’t leave many interesting changed files. There were some changed grub settings:
diff -r /mnt/etc/default/grub /etc/default/grub
7c7
< #GRUB_HIDDEN_TIMEOUT=0
---
> GRUB_HIDDEN_TIMEOUT=0
9c9
< GRUB_TIMEOUT=5
---
> GRUB_TIMEOUT=0
11c11
< GRUB_CMDLINE_LINUX_DEFAULT="console=ttyS0 nomdmonddf nomdmonisw nomdmonddf nomdmonisw"
---
> GRUB_CMDLINE_LINUX_DEFAULT="console=tty1 console=ttyS0"
But moving the newer grub config in didn’t fix it. There was a change to the hardware clock settings that I had seen mentioned a couple of times in the system log:
diff -r /mnt/etc/default/rcS /etc/default/rcS
27a28
> HWCLOCKACCESS=no
But porting that change over didn’t fix it either. Finally, I noticed this:
Only in /mnt/etc/init: lxcguest.conf
Only in /mnt/etc/init: lxcmount.conf
That’s odd: I don’t remember ever using LXC on this system, and it’ll never be an LXC guest. One of those configuration files has “mount” in its name, and both live in the /etc/init directory, so they could easily be related to mounting problems during init! Was I really using LXC? I chroot’d into the old drive in order to interrogate its dpkg catalog:
root@rescue# chroot /mnt
root@rescue# dpkg -l | grep lxc
rc  lxcguest  0.7.5-0ubuntu8.6  amd64  Linux container guest package
root@rescue# exit
The “rc” at the start indicates that the lxcguest package was installed at some point and has since been removed, but its configuration files were left behind. Great, that means I can blow away those old files:
root@rescue# rm /mnt/etc/init/lxcguest.conf /mnt/etc/init/lxcmount.conf
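In hindsight, a tidier way to do the same thing would probably have been to purge the package from inside the chroot, so that dpkg removes its own leftover config files:

root@rescue# chroot /mnt dpkg --purge lxcguest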
And now, glory of glories, the server booted!
[64304636.707613] random: nonblocking pool is initialized
 * Starting Mount filesystems on boot                      [ OK ]
 * Starting Populate and link to /run filesystem           [ OK ]
 * Stopping Populate and link to /run filesystem           [ OK ]
 * Stopping Track if upstart is running in a container     [ OK ]
[64304638.007641] EXT4-fs (xvda1): re-mounted. Opts: discard
 * Starting Initialize or finalize resolvconf              [ OK ]
 * Starting Signal sysvinit that the rootfs is mounted     [ OK ]
[64304638.205979] init: mounted-tmp main process (302) terminated with status 1
...
Just to check that the fix wasn’t down to some combination of the changes I had made, I deleted victim’s fiddled root disk, restored it from the snapshot once more, and removed only those two lxc config files. It booted fine, so nothing else I had changed was needed to solve it!
I hope that these debugging steps help someone out in repairing their own EC2 server. (And yes I realise that in most cases, building a new instance with Chef or Puppet would be a better solution, but this is what I had to work with!)