Fix for macOS [High] Sierra 10.12.4+ “don’t steal mac OS” error on boot on Proxmox 4

In Sierra 10.12.4, macOS added some extra copy protection which can tell that the SMC emulation QEMU provides doesn’t come from a real Mac. This causes a fatal error during boot on Proxmox 4 and earlier. Proxmox 5.1 now includes the fix for this problem in its regular QEMU package, so a patch is no longer necessary on 5.1.

One way of fixing this would be to remove the SMC device from the virtual machine’s arguments, and use FakeSMC.kext instead, like a regular Hackintosh, but this is inelegant.

Instead, we can patch QEMU to fix the SMC support, using the fixes from here. Continue reading Fix for macOS [High] Sierra 10.12.4+ “don’t steal mac OS” error on boot on Proxmox 4

Accelerate IO for macOS Sierra Proxmox guests by passing through an NVMe SSD

Recently I migrated my MacBook Pro into a Proxmox virtual machine to use as my daily-driver. This made for a rather large stepdown in IO performance, since my MacBook used an SSD, and Proxmox was using a RAIDZ1 array of spinning disks. On top of the IOPS penalty for spinning disks, there are currently no macOS drivers for the virtio SCSI paravirtual device, so we have to use IDE/SATA emulation instead, which is very slow (although this may change in the near future).

One way to improve things would be to use PCIe passthrough to pass through a whole physical SATA controller to the guest. This would eliminate almost all of the performance penalty of the virtualised SATA controller. But there’s a new option for drive passthrough: NVMe SSDs.

NVMe is a new standard for operating systems to communicate with a disk controller, which has been specifically designed to extract the most speed possible from SSDs. NVMe SSDs are PCIe devices (typically x4), so we can pass them straight through to macOS. I’m using the Samsung 950 Pro. You might also consider the faster 960 Pro.
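As a rough sketch of how that passthrough looks in Proxmox (the PCI address and VM ID below are placeholders — find your own with lspci, pick a free hostpciN index if you’re already passing through a GPU, and note that pcie=1 assumes a q35 machine type):

# Find the NVMe controller's PCI address:
lspci | grep -i nvme

# Pass it through to the guest:
qm set YOUR-VM-ID-HERE -hostpci1 02:00.0,pcie=1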

The only missing piece of the puzzle is NVMe support in macOS Sierra. Thankfully, modern Macs have begun shipping with NVMe SSDs inside, so we have an official Apple driver we can use. It just needs to be patched to accept our SSDs.

Note that in High Sierra, the built-in NVMe driver already supports most SSDs, and we don’t have to mess with it any more! Continue reading Accelerate IO for macOS Sierra Proxmox guests by passing through an NVMe SSD

Using Clover UEFI boot with Sierra on Proxmox

My previous Proxmox post described how to install Sierra into Proxmox using the Enoch bootloader (SeaBIOS boot). Since then, I’ve been using it as my daily-use desktop, and it has generally been working out great for me. However, I had some real struggles getting the graphics card passthrough to work reliably. I managed to fix these by updating to UEFI boot with Clover.

One of the problems with legacy BIOS boot and GPU passthrough is VGA arbitration. From what I understand, the video cards in the host and guest can end up both contending to own the VGA resources, which can cause a deadlock on boot. When a Sierra guest loads its video driver during boot, my Proxmox host hangs, and the screen fills with black and white bars.

UEFI boot doesn’t suffer from this problem, since it does away with the legacy VGA interface. So if your video card’s firmware supports UEFI/EFI boot (my R9 280X already does), you can switch the guest to boot using OVMF instead. This requires us to use a macOS bootloader that supports UEFI. I chose Clover. Continue reading Using Clover UEFI boot with Sierra on Proxmox
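For reference, the firmware switch itself is just a couple of Proxmox settings — a minimal sketch, assuming your storage is called local-lvm (the Clover boot image setup is covered in the full post):

# Switch the guest from SeaBIOS to OVMF (UEFI) and give it an EFI vars disk:
qm set YOUR-VM-ID-HERE -bios ovmf
qm set YOUR-VM-ID-HERE -efidisk0 local-lvm:1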

Creating a CrashPlan container on Proxmox to back up your files

I’m migrating from FreeNAS to Proxmox 4.3. On FreeNAS, there was a built-in plugin for CrashPlan support, which I was using to back up the files that FreeNAS was serving from ZFS over the network. However, keeping this plugin running was a chore: forced automatic CrashPlan updates frequently broke it and required manual intervention to fix, and headless operation required an unsupported, tedious procedure with plenty of opportunities to get it wrong.

On top of this, CrashPlan doesn’t actually support FreeBSD, and instead relies on the Linux emulation that the FreeNAS jail system provides, which puts the plugin at risk of being broken by CrashPlan relying on unsupported Linux kernel features.

By contrast, Proxmox provides the perfect environment for CrashPlan. Having a real Linux kernel available for the LXC container system to use means there’s no kernel incompatibility to worry about. Continue reading Creating a CrashPlan container on Proxmox to back up your files
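As a hedged sketch of the container setup (the template name, IDs, storage and sizes are just examples, and CrashPlan itself still has to be installed inside the container afterwards):

# Create and start a Debian container for CrashPlan with Proxmox's pct tool:
pct create 200 local:vztmpl/debian-8.0-standard_8.6-1_amd64.tar.gz \
  -hostname crashplan -memory 1024 -rootfs local-lvm:8 \
  -net0 name=eth0,bridge=vmbr0,ip=dhcp
pct start 200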

Installing macOS Sierra on Proxmox 4.4 / QEMU 2.7.1

With the release of macOS High Sierra 10.13, this guide is now outdated! Click here to view the new High Sierra guide!

This tutorial for installing macOS Sierra has been adapted for Proxmox 4.4 from this tutorial for Yosemite, and this GitHub project for installing into vanilla KVM.

Requirements

I’ll assume you already have Proxmox 4.4 installed. You also need a real Mac available in order to download Sierra from the App Store and build the installation ISO. Your host computer must have an Intel CPU at least as new as Penryn. I think you may need a custom Mac kernel to use an AMD CPU.

These installation instructions have been tested with Sierra 10.12.4. Although it’s been a while since I performed a fresh install, I’m currently running Sierra 10.12.6 on Proxmox 5 using a VM built with these instructions.

First step: Create an installation ISO

On a Mac machine, download the macOS Sierra installer from the App Store (this will download it into your Applications folder).

Download the contents of this repository to your Mac.

From inside that directory, run “sudo ./create_install_iso.sh” to create the install CD for you.

Once that’s done, connect to your Proxmox server using Transmit (or some other SCP/SFTP client) and upload the ISO you created to /var/lib/vz/template/iso.

While you’re there, upload the enoch_rev2902_boot bootloader file from the GitHub repository to /var/lib/vz/template/qemu/enoch_rev2902_boot.
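If you prefer the command line to a GUI client, plain scp from the Mac’s Terminal works too — a sketch, with the hostname and ISO filename as placeholders:

# Upload the installer ISO and the Enoch bootloader (create the qemu directory if it doesn't exist yet):
scp Install_macOS_Sierra.iso root@your-proxmox-host:/var/lib/vz/template/iso/
ssh root@your-proxmox-host mkdir -p /var/lib/vz/template/qemu
scp enoch_rev2902_boot root@your-proxmox-host:/var/lib/vz/template/qemu/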

Fetch the OSK authentication key

macOS checks that it is running on real Mac hardware, and refuses to boot on third-party hardware. You can get around this by reading an authentication key out of your real Mac hardware (the OSK key). Run the first bit of C code from this page (you’ll need Xcode installed) and it’ll print out the 64-character OSK for you. Make a note of it.

Create the VM

From the Proxmox web UI, create a new virtual machine as shown below.

In the Options page for the VM, change “Use tablet for pointer” to “No”.

In the Hardware page for the VM, change the Display to Standard VGA (std).

Don’t try to start the VM just yet. First, SSH into your Proxmox server so we can make some edits to the configuration files.

Edit /etc/pve/qemu-server/YOUR-VM-ID-HERE.conf (with nano or vim). Add these two lines, being sure to substitute the OSK you extracted earlier into the right place:

machine: pc-q35-2.4
args: -device isa-applesmc,osk="THE-OSK-YOU-EXTRACTED-GOES-HERE" -smbios type=2 -kernel /var/lib/vz/template/qemu/enoch_rev2902_boot -cpu Penryn,kvm=off,vendor=GenuineIntel

Find the line that specifies the ISO file, and remove the “,media=cdrom” part from the end of the line (otherwise you’ll get stuck at the bootloader).

On the net0 line, change “e1000” to “e1000-82545em”. This variant is supported by OS X.
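For reference, here’s roughly how those two lines should end up looking (the ISO filename, MAC address and bridge here are just examples — yours will differ):

ide2: local:iso/Install_macOS_Sierra.iso
net0: e1000-82545em=62:11:64:8A:35:05,bridge=vmbr0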

macOS doesn’t support the PS2 keyboard and mouse that QEMU will emulate, nor does it support the tablet, so edit /usr/share/qemu-server/pve-q35.cfg and add these USB input devices to the bottom of the file instead:

[device "mouse1"]
 driver = "usb-mouse"
 bus = "ehci.0"
 port = "1"

[device "keyboard1"]
 driver = "usb-kbd"
 bus = "ehci.0"
 port = "2"

We’ve added those to the config file instead of to the VM’s args directly. If we were to add them to the VM’s args, then when Proxmox constructs its call to KVM to launch the VM, those device definitions would appear before the pve-q35.cfg file is included, which defines the USB busses. However, the device definitions must appear after the definitions of the USB bus that they refer to.

Note that this file is whitespace-sensitive; make sure you don’t add any blank lines that have extraneous spaces on them.

Configure Proxmox

On Proxmox, run “echo 1 > /sys/module/kvm/parameters/ignore_msrs” to avoid a bootloop during macOS boot. To make this change persist across Proxmox reboots, run:

echo "options kvm ignore_msrs=Y" >>/etc/modprobe.d/kvm.conf && update-initramfs -k all -u

If you’re installing Sierra 10.12.4 or newer, you’ll also need to patch Proxmox’s copy of QEMU in order to be able to boot, until this patch is merged upstream.

Install Sierra

Now start up your VM.

If you get an error “file system may not support O_DIRECT / Could not open iso: invalid argument” when starting the VM, you may need to edit the CD drive on the hardware tab and change its cache setting to “writeback (unsafe)”.

Go to the Console tab.

Press enter to choose the “install macOS Sierra” entry and the installer should boot up.

If you are unable to move the mouse cursor at the Welcome screen, and a beachball-of-doom appears on the host, you might be using Safari. It seems to get overwhelmed with the number of screen updates on the animated Welcome screen and become unresponsive. Try Chrome instead.

Our virtual hard drive needs to be erased/formatted before we can install to it, so go to Utilities -> Disk Utility and do that now.

Before we start installation, we have some files to copy over to the newly-formatted drive. Choose Utilities -> Terminal, and copy the /Extra directory to your main volume (/Volumes/Main, for example) using “cp -av /Extra /Volumes/Main/”.

Quit terminal. Now you can begin installation to the Main drive.

After the first stage of installation, the VM should reboot itself and continue installation by booting from the hard drive. After answering the initial install questions, you’re ready to go!

Sleep management

I found that I was unable to wake Sierra from sleep using my mouse or keyboard. You can either disable system sleep in Sierra’s Energy Saver settings to avoid this, or you can manually wake the VM up from sleep from Proxmox by running:

qm monitor YOUR-VM-ID-HERE
system_wakeup
quit

USB passthrough

Using noVNC gets pretty annoying due to macOS’s lack of tablet support for absolute cursor positioning. You can solve this by turning on the Mac’s screen sharing feature and using that instead. But I want to use this as my primary computer, so I’m using USB input devices plugged directly into Proxmox.

Proxmox has good documentation for USB passthrough. Basically, run “qm monitor YOUR-VM-ID-HERE”, then “info usbhost” to get a list of the USB devices connected to Proxmox:

qm> info usbhost
 Bus 3, Addr 12, Port 6, Speed 480 Mb/s
 Class 00: USB device 8564:1000, Mass Storage Device
 Bus 3, Addr 11, Port 5.4, Speed 12 Mb/s
 Class 00: USB device 04d9:0141, USB Keyboard
 Bus 3, Addr 10, Port 5.1.2, Speed 12 Mb/s
 Class 00: USB device 046d:c52b, USB Receiver
 Bus 3, Addr 9, Port 14.4, Speed 12 Mb/s
 Class 00: USB device 046d:c227, G15 GamePanel LCD
 Bus 3, Addr 8, Port 14.1, Speed 1.5 Mb/s
 Class 00: USB device 046d:c226, G15 Gaming Keyboard
 Bus 3, Addr 6, Port 11, Speed 12 Mb/s
 Class e0: USB device 0b05:17d0,
 Bus 3, Addr 2, Port 1, Speed 1.5 Mb/s
 Class 00: USB device 068e:00f2, CH PRO PEDALS USB

In this case I can add my keyboard and mouse to USB passthrough by quitting qm, then running:

qm set YOUR-VM-ID-HERE -usb1 host=04d9:0141
qm set YOUR-VM-ID-HERE -usb2 host=046d:c52b

This saves the devices to the VM configuration for you. It’s possible to hot-add USB devices, but I just rebooted my VM to have the new settings apply.
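For the record, hot-adding through the QEMU monitor should look something like this — a sketch I haven’t tested (which is why I just rebooted), using the keyboard’s vendor/product ID from above:

qm monitor YOUR-VM-ID-HERE
device_add usb-host,vendorid=0x04d9,productid=0x0141,id=usbkbd1
quit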

PCIe GPU passthrough

For native graphics performance, I wanted to pass through my graphics card for the macOS VM’s exclusive use (driving a monitor connected to Proxmox). Follow the instructions from the Proxmox manual. Use the “GPU Seabios PCI EXPRESS PASSTHROUGH” section for this installation.

Note that your CPU and motherboard need to support VT-d (be sure to enable it in your BIOS as it’s often disabled by default), and your CPU needs to support IOMMU interrupt remapping.

After following the instructions to blacklist video drivers in the Proxmox manual, I found I had to run “update-initramfs -u” in order for the blacklist to be applied.

Check that your graphics card has been reserved correctly by running “lspci -k” on Proxmox and checking which driver is assigned to the graphics card (if done correctly, it should be “vfio-pci”).
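If everything is set up correctly, the relevant chunk of “lspci -k” output looks something like this (the bus address and card name will of course be your own):

01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Tahiti XT [Radeon HD 7970/8970 OEM / R9 280X]
        Kernel driver in use: vfio-pci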

After following through all the steps in that guide, I ended up with a new “hostpci0: 01:00,pcie=1,x-vga=on” line in my VM’s configuration, and after a reboot of Proxmox, my graphics card (Radeon R9 280X) was working! Only some cards are natively supported by macOS; check out the tonymacx86 Radeon compatibility list for your card. I also found a list of supported Nvidia cards (some using Nvidia’s Web Driver).

I have had success passing through my EVGA GeForce GTX 750Ti SC 2G, driving a 4K screen over DisplayPort and another display over HDMI. This required me to use Clover/UEFI boot, install the Nvidia web drivers, update my SMBIOS to “iMac14,2”, and enable “NvidiaWeb” in Clover Configurator.

Using Clover as a bootloader

I’ve also written up a guide on converting this VM to use Clover for booting instead of Enoch.

Raw images display corrupt in Lightroom when using FreeNAS 9.10 / afpd 3.1.8

I have so many raw Canon CR2 photos in my Lightroom library that they won’t fit on my MacBook, so I built myself a FreeNAS-based NAS to put them on. I access my NAS’s photo drive over the network from my MacBook. This worked great for many years.

However, a while back I noticed that many of my older photos displayed corrupt in Lightroom. They’d cut off halfway through and dissolve into a repeating pattern of stripes.

As you can imagine, this was pretty devastating. My newer photos all displayed fine, but a good percentage of my old photos displayed corrupted, as if the bits had rotted on the disk over time. So I restored those photos to my desktop using my Crashplan backup, and those restored photos were fine. Phew!

But how had the photos ended up getting corrupted? My FreeNAS runs ZFS, which detects and repairs/reports corruption. And regular ZFS scrubs on my drive array never turned up any problems. Corruption should be impossible.

I checked the MD5-sum of the perfect photos from the backup, and checked the MD5 of the corrupt photo on my NAS. They were identical. Huh? In fact, if I just copied the photos from my NAS to my laptop using Finder, then opened them in Lightroom, the previously corrupt-looking photos displayed perfectly.

So the problem wasn’t the image files at all, the problem was the connection between my MacBook and FreeNAS. Since I’m using a Mac, I figured I should use AFP (the Apple Filing Protocol) to connect the two together. This is provided by a package on FreeNAS called “netatalk”, and it turns out that this package received an update between FreeNAS 9.2 and 9.3. Rolling back to FreeNAS 9.2 fixed the AFP corruption issue for me.

With the recent release of FreeNAS 9.10, I decided it was probably time to track down the bug and get it fixed so I could upgrade. So with the help of Ralph Boehme at netatalk, I enabled debug logging in netatalk (now at version 3.1.8) to track down the cause of the issue. In FreeNAS, this can be achieved by:

nano /usr/local/etc/afp.conf

Add these lines to the [Global] section:

log level = default:maxdebug
log file = /tmp/netatalk.log

Save and exit, then:

service netatalk stop
service netatalk start

This will start logging a ton of information to /tmp/netatalk.log (don’t let it fill up your boot drive by leaving it enabled for too long!)

I then browsed around my photos in Lightroom until a photo displayed corrupted, and noted down the filename.
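A simple grep then narrows the log down to just that file (827C5343.CR2 happened to be the corrupted photo in my case):

grep -n "827C5343.CR2" /tmp/netatalk.log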

By searching the log for that file, I found the problem. Lightroom would read the photos by sending an AFP_OPENFORK message to open the file, then a series of AFP_READ_EXT messages to read the file contents, then an AFP_CLOSEFORK message to finish up. This normally worked fine. However, sometimes Lightroom would want to fetch some extra attributes from the file while it was reading it by sending an AFP_LISTEXTATTR message:

<== Start AFP command: AFP_LISTEXTATTR
dirlookup(did: 1339): START
…
sys_list_eas(827C5343.CR2): file is already opened
ad_close(HF): BEGIN: {d: 1, m: 0, r: 0} [dfd: 6 (ref: 1), mfd: 6 (ref: 1), rfd: -1 (ref: 0)]
ad_close(HF): END: 0 {d: 1, m: 0, r: 0} [dfd: -1 (ref: 0), mfd: -1 (ref: 0), rfd: -1 (ref: 0)]
Finished AFP command: AFP_LISTEXTATTR -> AFP_OK

That “file is already opened” message looked suspicious to me, since it implies that AFP_LISTEXTATTR was sharing the photo’s file handle with Lightroom’s reading process, and the final part of AFP_LISTEXTATTR was calling ad_close(), hmm.

After this, every subsequent call from Lightroom to continue reading the file would fail with an error of AFPERR_EOF. In other words, fetching attributes from a file that was already opened was wrongly causing the file to be closed, sending a premature EOF to Lightroom and so cutting off the image halfway through.

Ralph came up with a patch to netatalk to fix this problem, so now all I had to do was test it. This requires building FreeNAS from source, which was quite a learning experience. The build repo for FreeNAS 9.10 is here:

https://github.com/freenas/freenas-build

I installed FreeBSD 10.3 in a VirtualBox VM on my laptop, then grabbed a copy of a reasonable-looking tagged release of the build repo (some of the newer commits looked a bit experimental):

pkg install git
git clone https://github.com/freenas/freenas-build.git
cd freenas-build
git checkout 9c46f771d3467b2c2625c752bf51398903cb309b

Now follow the instructions to install the build pre-reqs:

make bootstrap-pkgs
pkg install devel/gmake

portsnap fetch extract
cd /usr/ports/textproc/py-sphinx_numfig
# Just keep hitting OK to accept the defaults for the dependencies:
make config-recursive
make install

# This package adds a hardlink to python in /usr/local/bin/python needed for build scripts:
cd /usr/ports/lang/python
make install

Now back in the freenas-build directory, have FreeNAS fetch its source (note we have to add PROFILE=freenas9 to get FreeNAS 9.10 instead of 10):

make PROFILE=freenas9 checkout

Save the patch for netatalk into freenas-build/_BE/ports/net/netatalk3/files.

Now it’s time to build FreeNAS (this takes many hours)!

make PROFILE=freenas9 release

Once done, you’ll have a FreeNAS ISO in freenas-build/_BE/release/FreeNAS-9.10-MASTER-*/x64. I downloaded my current FreeNAS configuration, then used the ISO to install a fresh new copy of FreeNAS to a new USB stick, booted it, uploaded my old configuration, and everything worked fine! The bug was gone, and all my photos displayed correctly.

You can track FreeNAS’s progress in patching this bug on their tracker, though it’ll probably already be included in the latest build by the time you read this (the patch is already in the repo!):

https://bugs.freenas.org/issues/10284

My EC2 server wouldn’t boot after apt-get dist-upgrade: how I fixed it

So I have this EC2 server, which is a MySQL replication slave (henceforth known as “victim”). This server was originally running Alestic’s 64-bit paravirtual AMI for Ubuntu Oneiric 11.10 (ami-a8ec6098), but had been previously upgraded to Precise with do-release-upgrade.

Yesterday, I performed an apt-get dist-upgrade on victim in order to upgrade its apps and the kernel. Then I rebooted the server, but it refused to boot, hanging before SSH could come online. Checking the system log, I saw that it was unable to mount its root disk:

Begin: Mounting root file system ... Begin: Running /scripts/local-top ... done.
Begin: Running /scripts/local-premount ... done.
[11081144.842424] EXT4-fs (xvdf): mounted filesystem with ordered data mode. Opts: (null)
Begin: Running /scripts/local-bottom ... done.
done.
Begin: Running /scripts/init-bottom ... done.

[11081145.506225] random: nonblocking pool is initialized
lxcmount stop/pre-start, process 179

 * Starting configure network device security[ OK ]
 * Starting Mount network filesystems[ OK ]
 * Starting Mount filesystems on boot[ OK ]
 * Stopping Mount network filesystems[ OK ]
 * Starting Populate and link to /run filesystem[ OK ]
 * Stopping Populate and link to /run filesystem[ OK ]
 * Stopping Track if upstart is running in a container[ OK ]
 * Starting Bridge socket events into upstart[ OK ]
 * Starting configure network device[ OK ]
 * Starting Initialize or finalize resolvconf[ OK ]

The disk drive for / is not ready yet or not present.
keys:Continue to wait, or Press S to skip mounting or M for manual recovery

Since this is an Amazon server, pressing a key is not an option; there is no interactive console. Though interestingly, telling the instance to stop causes more system log to be printed as it performs a clean shutdown.

The obvious question was, why couldn’t it find or mount the root disk?

At this point I took a snapshot of the root disk so that I didn’t mess it up further with any of my experiments. And I started up a new server, “rescue”, from a pristine Alestic Precise AMI, and dist-upgraded it, which should make it nearly identical to victim. But rescue restarted fine without hanging. There must be some critical difference between my pristine rescue server and the hanging victim server, I just had to find it.
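For reference, the CLI equivalent of taking that snapshot is a one-liner (the volume ID here is a placeholder):

aws ec2 create-snapshot --volume-id vol-00000000 --description "victim root disk before experiments"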

In the AWS control panel, I noticed that victim was using an older Amazon “kernel” (which I think is just a PV-GRUB kernel used to boot the kernel installed on the instance’s disk). So I used the AWS CLI tool to make it the same as rescue's, which should be fine since rescue is the same architecture and guest kernel version:

aws ec2 modify-instance-attribute --instance-id i-000000 --kernel "{\"Value\":\"aki-fc8f11cc\"}"

No luck! It still had the same issue.

I unmounted the root disk (/dev/sda1) from victim, attached it to rescue (as /dev/xvdf) and mounted it at /mnt:

root@rescue# mount /dev/xvdf /mnt

One reason for not being able to mount the root disk would be that its label had changed. The /etc/fstab file specifies how the root disk should be identified, so let’s take a look:

root@rescue# cat /mnt/etc/fstab
LABEL=cloudimg-rootfs	/	 ext4	defaults	0 0

Okay, so it looks for a disk with the label “cloudimg-rootfs”. What label does the disk have?

root@rescue# e2label /dev/xvdf
cloudimg-rootfs

So the disk label is correct, rats. Just for fun, I tried replacing the “LABEL=cloudimg-rootfs” part with plain old “/dev/xvda1”, but it still wouldn’t boot.

Maybe the new kernel or initrd was corrupt? I checked out /boot/grub/menu.lst and both victim and rescue were trying to boot the exact same kernel version. So I just removed victim‘s /boot, initrd.img and vmlinuz and replaced them with the pristine ones from rescue:

root@rescue# rm -rf /mnt/initrd.img* /mnt/vmlinuz* /mnt/boot
root@rescue# cp -a /boot /mnt/
root@rescue# cp -a /vmlinuz /mnt/
root@rescue# cp -a /initrd.img /mnt/

Now I tried booting victim, but it still hung for the same reason! So the kernel or initrd wasn’t the issue.

Now I’m getting desperate, so I tried something really crazy. I deleted victim's /etc directory and replaced it with the one from rescue:

root@rescue# rm -rf /mnt/etc
root@rescue# cp -a /etc /mnt

I booted up victim with this new fiddled root disk, and it worked!! It booted fine! So the reason that victim wouldn’t boot is due to some bad configuration in /etc. But the thing I most want to save from victim is its custom configuration, so I can’t just use rescue‘s out-of-the-box defaults. So I deleted the victim disk I had mangled and restored it from its snapshot (back to its broken condition) and remounted it to rescue.

Now I just had to find the difference between the pristine configuration and the broken one that causes it not to boot. I started by installing all the packages I could remember from victim on to rescue in order to minimise the size of the diff between their configurations. Finally, I ran a recursive diff, ignoring any “*~” emacs backup files:

root@rescue# diff -r --exclude "*~" /mnt/etc /etc > /root/victim-diff

Since the system hung before anything exciting like the network or root disk was initialised, I knew I could ignore MySQL, Apache, Postfix, Nagios, etc, since they are all started too late in the boot process. That didn’t leave many interesting changed files. There were some changed grub settings:

diff -r /mnt/etc/default/grub /etc/default/grub
7c7
< #GRUB_HIDDEN_TIMEOUT=0
---
> GRUB_HIDDEN_TIMEOUT=0
9c9
< GRUB_TIMEOUT=5
---
> GRUB_TIMEOUT=0
11c11
< GRUB_CMDLINE_LINUX_DEFAULT="console=ttyS0 nomdmonddf nomdmonisw nomdmonddf nomdmonisw"
---
> GRUB_CMDLINE_LINUX_DEFAULT="console=tty1 console=ttyS0"

But moving the newer grub config in didn’t fix it. There was a change to the hardware clock settings that I had seen mentioned a couple of times in the system log:

diff -r /mnt/etc/default/rcS /etc/default/rcS
27a28
> HWCLOCKACCESS=no

But porting that change over didn’t fix it either. Finally, I noticed this:

Only in /mnt/etc/init: lxcguest.conf
Only in /mnt/etc/init: lxcmount.conf

That’s odd; I don’t remember ever using LXC on this system, and it’ll never be an LXC guest. One of those configuration files contains “mount” in its name, and both are in the /etc/init directory, so these files could easily be related to mounting problems during init! Was I really using LXC? I chroot’d into the old drive in order to interrogate its dpkg catalog:

root@rescue# chroot /mnt
root@rescue# dpkg -l | grep lxc
rc  lxcguest    0.7.5-0ubuntu8.6   amd64  Linux container guest package
root@rescue# exit

The “rc” at the start indicates that the lxcguest package was installed at some point, was removed, but still has configuration files left behind. Great, that means that I can blow away those old files:

root@rescue# rm /mnt/etc/init/lxcguest.conf /mnt/etc/init/lxcmount.conf
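Purging the half-removed package from inside the chroot should achieve the same thing — an alternative I didn’t try, assuming both conf files belong to lxcguest:

# Untested alternative: let dpkg remove its own leftover conffiles
chroot /mnt dpkg --purge lxcguest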

And now, glory of glories, the server booted!

[64304636.707613] random: nonblocking pool is initialized
 * Starting Mount filesystems on boot[ OK ]
 * Starting Populate and link to /run filesystem[ OK ]
 * Stopping Populate and link to /run filesystem[ OK ]
 * Stopping Track if upstart is running in a container[ OK ]
[64304638.007641] EXT4-fs (xvda1): re-mounted. Opts: discard
 * Starting Initialize or finalize resolvconf[ OK ]
 * Starting Signal sysvinit that the rootfs is mounted[ OK ]
[64304638.205979] init: mounted-tmp main process (302) terminated with status 1

...

Just to check that it wasn’t fixed due to some combination of changes I made, I deleted victim's fiddled root disk, restored it from the snapshot, and only removed those two lxc config files. It booted fine, so nothing else I changed was required in solving it!

I hope that these debugging steps help someone out in repairing their own EC2 server. (And yes I realise that in most cases, building a new instance with Chef or Puppet would be a better solution, but this is what I had to work with!)

Solving incorrect exec_time stats for queries in MySQL’s binary log

Due to MySQL Bug #52704, if your server’s clock happens to tick backwards during a query’s execution, the Exec_time listed in the binary log will become a huge number like 4294967295 (which is -1 cast to a 32-bit unsigned quantity):

#140427 13:48:52 server id 1  end_log_pos 23475         Query   thread_id=7782750       exec_time=4294967295    error_code=0
SET TIMESTAMP=1398606532/*!*/;
INSERT INTO phpbb_privmsgs_to  (msg_id, user_id, author_id, folder_id, pm_new, pm_unread, pm_forwarded) VALUES (53165565, 65, 2, -3, 1, 1, 0)
/*!*/;
# at 23475

The clock ticking backwards is likely to be caused by your server’s time being adjusted by NTP.

Since it causes those queries to have stupidly large execution times, when you’re trying to examine the binary log with pt-query-digest, it completely throws off all the statistics and makes the tool unusable.

You can solve this issue by adding a custom “filter” to your pt-query-digest call which sets the exec_time of the query to zero if it looks too large to be real:

mysqlbinlog mysql-bin.000526 | pt-query-digest --type binlog --filter '(($event->{Query_time} || 0) > 2147483648 ? $event->{Query_time} = 0 : 1) || 1'

FreeNAS + rsync to ZFS + AFP filesharing = bad idea

Unicode (as UTF-8) is a very popular format for encoding filenames on disk, but there are some subtly incompatible variants around. In particular, different operating systems have different ideas about how accents should be handled.

Mac applies something called NFD to filenames before they are stored on disk (Normalization Form Canonical Decomposition). This means that a character like the “ū” in the word “Jingū” is stored as two Unicode characters – a plain old LATIN SMALL LETTER U character, followed by a COMBINING MACRON character:

ls | grep 2009_11_03 | od -c -tx1

0000000    2   0   0   9   _   1   1   _   0   3       M   e   i   j   i
           32  30  30  39  5f  31  31  5f  30  33  20  4d  65  69  6a  69
0000020        J   i   n   g   u    ̄  **  \n
           20  4a  69  6e  67  75  cc  84  0a

Windows does the exact opposite (NFC), combining the “u” and the macron together to produce a single LATIN SMALL LETTER U WITH MACRON character.

0000000    2   0   0   9   _   1   1   _   0   3       M   e   i   j   i
           32  30  30  39  5f  31  31  5f  30  33  20  4d  65  69  6a  69
0000020        J   i   n   g   ū  **  \n
           20  4a  69  6e  67  c5  ab  0a

On Linux, I’m not sure if there is a standard, but it’s typical to find filenames encoded with NFC. If anything, the standard in Linux is that a filename is just a series of bytes and the OS shouldn’t try to mangle them by converting the characters with NFC or NFD normalization forms.

I recently set up a second computer as a FreeNAS 9.2.0 storage device, and I wanted to migrate the files from my MacBook to it. The most straightforward way to do this is to enable the Apple “AFP” network filesharing service on FreeNAS, then on the Mac, copy the files to that network share however you like. This automatically takes care of any character conversion for you.

However, I tried this and only achieved about 2MB/s transfer speeds. I would have died of old age before I would have been able to copy my terabytes of data to FreeNAS.

So, instead I used rsync to send my files directly to the ZFS storage on FreeNAS, bypassing any of its network file system protocols. This achieved a steady ~110MB/s, which is wonderful. The issue came when I went to read those same files over AFP: I could briefly see folders that contained accents in the finder, but they would disappear after several seconds, then reappear 10 seconds later, then disappear again!

The issue was that rsync, being Linux-oriented, preserves the filename encoding when sending files, so the filenames on FreeNAS ended up still in Mac’s NFD format (with accents encoded as separate characters). This is a problem because netatalk, the AFP server on FreeNAS, expects filenames on FreeNAS volumes to be encoded in the “vol charset” which is defined in its configuration file:

http://netatalk.sourceforge.net/3.0/htmldocs/configuration.html#charsets

FreeNAS doesn’t set this explicitly, so it defaults to “UTF8”. Although the netatalk manual doesn’t say it, “UTF8” actually implies “UTF8 in NFC form”, so netatalk will be unable to serve the NFD-encoded filenames that originated on my Mac correctly.

What happens is that a MacOS X client lists a directory over AFP, so netatalk converts the filenames on disk (that it thinks are NFC) to NFD (a no-op, since we’ve actually put NFD filenames on there to start with). This allows the accented characters to show up properly in MacOS and the listing looks okay. But then MacOS’s finder asks for more information about a file specifically by name. Netatalk converts the NFD filename that MacOS provides to the vol charset (set to NFC form), and then tries to look it up on the filesystem. But the filename doesn’t exist on the disk in NFC form, so it can’t find the file. This causes the file to disappear again in MacOS’s Finder.

Here are three ways of solving the problem, in order from worst to best:

Option 1: Change vol charset to UTF8-MAC

Changing the vol charset to “UTF8-MAC” will fix the issue by letting netatalk know that the filenames on disk are in NFD form.

The “vol charset” setting is found in the afp.conf file at /usr/local/etc/afp.conf. But you shouldn’t edit this file, as it is automatically regenerated at various times by the script /usr/local/libexec/nas/generate_afpd_conf.py. Instead, edit that script:

# Remount root as writable so we can edit the script:
mount -uw /
nano /usr/local/libexec/nas/generate_afpd_conf.py

Find this section:

cf_contents.append("\tmax connections = %s\n" % afp.afp_srv_connections_limit)
cf_contents.append("\tmimic model = RackMac\n")
cf_contents.append("\tvol dbnest = yes\n")
cf_contents.append("\n")

Before the append("\n") line, add:

cf_contents.append("\tvol charset = UTF8-MAC\n")

Save and exit, then:

# Remount root as readonly and commit our changes to disk (takes ages on a USB flash drive, so be patient)
mount -ur /

Now on FreeNAS’s Services/Control Services page, turn AFP off and back on in order to regenerate its configuration file. You should see the new line added when you cat /usr/local/etc/afp.conf. Your Mac-encoded filenames will now serve correctly through AFP!

Option 2: Change the encoding of the filenames on FreeNAS’ disk to NFC

Alternatively, you could leave the AFP configuration alone and change the filename encoding on disk instead. This will make the files available to Mac, but with the caveat that the filenames will no longer be the same as your source files due to the encoding difference, and rsync that runs directly against ZFS will no longer consider them to be the same files.

I did this by creating a new plugin jail, then adding my ZFS volume as additional storage to that jail. From that jail’s console button, I used “pkg install convmv” to install the convmv package, which can change filename encodings. I changed the encoding to NFC like so:

convmv -f utf-8 -t utf-8 --nfc -r --no-test /mnt/my-files

(You should run without --no-test first so convmv can tell you exactly what it plans to do, before you accidentally mangle your filenames more!)

Option 3: Go back in time and copy the filenames correctly in the first place

You can avoid this whole issue in the first place by having rsync convert the filenames to NFC as they are copied to ZFS:

rsync -a --iconv=utf-8-mac,utf-8 my-files/ root@freenas.local:/mnt/my-files/

In fact, if you fix the filenames using Option 2, then do all your future rsyncs with the character conversion specified here, everything will be hunky-dory!

Search engine crawlers have a miserable cache hitrate

I’m looking to migrate some load off my main webserver/database, so I was looking into which of our pages render the slowest. While I was doing that, I discovered that most of our rendering time is due to just a few client IP addresses, and they turned out to be search indexers. If I group all requests by user agent (“robots”, which is only googlebot and bingbot, and “humans”, which is everybody else), I get:

Robot reqs:8062 total_sz:85MB avg_sz:10kB avg_upstr_time: 1925ms total_upstr_time: 15521s cache_hitrate: 51.3%
Human reqs:414898 total_sz:5132MB avg_sz:12kB avg_upstr_time: 32ms total_upstr_time: 13520s cache_hitrate: 98.7%
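Those numbers came out of the webserver’s access logs; the exact one-liner depends on your log format, but a rough awk sketch (assuming the upstream response time is the second-to-last field and the cache status is the last) looks like this:

# Group requests into robots vs humans, then report counts, average upstream time, and cache hitrate
awk '{
  group = (tolower($0) ~ /googlebot|bingbot/) ? "Robot" : "Human";
  reqs[group]++;
  time[group] += $(NF-1);
  if ($NF == "HIT") hits[group]++;
}
END {
  for (g in reqs)
    printf "%s reqs:%d avg_upstr_time: %dms cache_hitrate: %.1f%%\n",
      g, reqs[g], (time[g] / reqs[g]) * 1000, (hits[g] / reqs[g]) * 100;
}' access.log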

So the average robot request is some 60 times slower to render than the average human one. This is because they spend most of their time loading old pages that nobody else cares about, which are never in cache and incur heavy disk IO times from our database.

I plan to create a readonly database replica and second webserver which will be dedicated to handling requests from these search crawlers. That will stop the caches on our primary server from being wasted on old content that no humans want to see.