In mid-December I rebooted to upgrade my Proxmox kernel to pve-kernel-5.4.78-2-pve, but I immediately started having an issue where the kernel would trigger a GPF (general protection fault) and reset about 5-20 minutes after starting my macOS VM. I suspected that the new kernel was at fault, but I rolled back to the previous kernel and the problem persisted. I hadn’t experienced this fault before so I was a bit baffled about what change I made before that reboot could have triggered it.
To track down the issue, I built a version of Proxmox’s kernel with KASAN enabled. KASAN is the Kernel Address Sanitiser. It can detect kernel bugs like double frees or out-of-bounds reads and writes by instrumenting the kernel to add checks around every memory access. This adds a fair amount of CPU and memory overhead, but the impact is bearable so long as your guest doesn’t need much service from the host kernel.
Proxmox’s pve-kernel repository can be found here:
https://git.proxmox.com/?p=pve-kernel.git;a=summary
Enabling KASAN just requires adding some kernel config parameters to the debian/rules file, like so:
https://github.com/thenickdude/pve-kernel/commit/ed67c2118a32efdcaa27c877e5115e2a08f0591b
Note that I had to manually run “git fetch --all” in the pve-kernel/submodules/ubuntu-focal directory, because the kernel commit that pve-kernel is based on was only found within a tag in that repo, and tags aren’t fetched by default.
The end result was a set of debs I installed on Proxmox to replace the current kernel.
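Before settling in to wait for a crash, it is worth confirming that the kernel you booted really was built with KASAN. Here is a minimal sketch, assuming the kernel package installs its build config at /boot/config-<version> as Debian-style kernel packages normally do. The function name kasan_enabled is just an illustrative helper, not part of Proxmox:

```shell
# Hypothetical helper: succeeds if the given kernel config has KASAN built in.
# With no argument it checks the config of the running kernel, assuming the
# usual Debian-style location /boot/config-$(uname -r).
kasan_enabled() {
  config_file="${1:-/boot/config-$(uname -r)}"
  grep -q '^CONFIG_KASAN=y$' "$config_file"
}

# Usage:
#   kasan_enabled && echo "KASAN is enabled in the running kernel"
```

If the check fails after installing the debs, the most likely explanation is that the host booted a different kernel than the one you just built.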
And then for a month, silence… I didn’t have any memory errors detected by KASAN, and the kernel didn’t crash.
But today, I rebooted the system to boot up a Linux VM with passthrough, and finally KASAN filled the console with a log:
proxmox kernel: [ 110.491997] BUG: KASAN: double-free or invalid-free in hid_free_buffers.isra.14+0x14a/0x290 [usbhid]
proxmox kernel: [ 110.493893]
proxmox kernel: [ 110.495737] CPU: 9 PID: 20045 Comm: task UPID:proxm Tainted: P B O 5.4.78-2-pve #1
proxmox kernel: [ 110.499519] Call Trace:
proxmox kernel: [ 110.503340] print_address_description.constprop.6+0x20/0x220
proxmox kernel: [ 110.507194] kasan_report_invalid_free+0x69/0xb0
proxmox kernel: [ 110.510997] __kasan_slab_free+0x169/0x180
proxmox kernel: [ 110.514842] kasan_slab_free+0xe/0x10
proxmox kernel: [ 110.518658] hid_free_buffers.isra.14+0x14a/0x290 [usbhid]
proxmox kernel: [ 110.522507] hid_device_remove+0xce/0x200 [hid]
proxmox kernel: [ 110.526371] ? klist_put+0xcf/0x120
proxmox kernel: [ 110.530249] bus_remove_device+0x292/0x540
proxmox kernel: [ 110.534120] ? usb_hcd_flush_endpoint+0x70/0x3b0
proxmox kernel: [ 110.538027] ? __kasan_check_write+0x14/0x20
proxmox kernel: [ 110.541934] ? _raw_spin_lock+0xd0/0xd0
proxmox kernel: [ 110.545801] usbhid_disconnect+0xa7/0xd0 [usbhid]
proxmox kernel: [ 110.549656] ? rpm_idle+0x302/0x730
proxmox kernel: [ 110.553522] ? klist_put+0xcf/0x120
proxmox kernel: [ 110.557381] bus_remove_device+0x292/0x540
proxmox kernel: [ 110.561222] ? kobject_put+0x197/0x430
proxmox kernel: [ 110.565031] ? usb_remove_ep_devs+0x3c/0x80
proxmox kernel: [ 110.568892] usb_disable_device+0x19e/0x4d0
proxmox kernel: [ 110.572741] usb_disconnect+0x1f9/0x820
proxmox kernel: [ 110.576529] ? _raw_spin_lock+0xd0/0xd0
proxmox kernel: [ 110.580601] ? usb_hc_died+0x2d6/0x2d6
proxmox kernel: [ 110.584344] ? usb_hub_create_port_device.cold.9+0x19/0x19
proxmox kernel: [ 110.588027] ehci_pci_remove+0x1a/0x20 [ehci_pci]
proxmox kernel: [ 110.591635] ? pcibios_free_irq+0x10/0x10
proxmox kernel: [ 110.595159] device_release_driver_internal+0x1e0/0x4d0
proxmox kernel: [ 110.598610] unbind_store+0x19b/0x210
proxmox kernel: [ 110.602017] ? sysfs_kf_bin_read+0x2d0/0x2d0
proxmox kernel: [ 110.605380] ? drv_attr_show+0xa0/0xa0
proxmox kernel: [ 110.608743] kernfs_fop_write+0x223/0x410
proxmox kernel: [ 110.612068] ? _cond_resched+0x19/0x30
proxmox kernel: [ 110.615311] ksys_write+0x104/0x220
proxmox kernel: [ 110.618514] __x64_sys_write+0x73/0xb0
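The most useful part of a report like this is its first line, which names the routine where KASAN caught the bad free. As a rough sketch, the routine name can be pulled out of a saved log with sed. The helper name kasan_faulting_symbol is made up for illustration, and the pattern assumes the "BUG: KASAN: ... in symbol+offset" layout seen above. It also drops compiler-generated suffixes such as ".isra.14", which only get in the way when searching:

```shell
# Hypothetical helper: read kernel log text on stdin and print the routine
# named in each "BUG: KASAN: ..." line, without the +offset/size part and
# without compiler-generated suffixes like ".isra.14".
kasan_faulting_symbol() {
  sed -n 's/.*BUG: KASAN: .* in \([^+. ]*\)[^ ]*.*/\1/p'
}

# Usage, e.g. against the kernel ring buffer:
#   dmesg | kasan_faulting_symbol
```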
Searching for the faulting routine revealed this bug report:
https://bugzilla.kernel.org/show_bug.cgi?id=210241
And now I knew exactly why these symptoms were so intermittent, and why they had suddenly started in mid-December. My Magic Trackpad 2 is normally connected via Bluetooth, and my Proxmox host doesn’t have any Bluetooth config set up, so the trackpad normally never connects to Proxmox.
But when my trackpad battery went flat in December and I plugged it into USB to charge, I inadvertently set up half of the trigger condition for the fault. There was no issue until the next time I rebooted Proxmox. During that boot, Proxmox loaded its driver for the Magic Trackpad, since the trackpad was now connected to the host by USB. Then when my guest started, it grabbed the USB controller using PCIe passthrough, so Proxmox disconnected the Magic Trackpad from its own drivers, and the bad free during that teardown corrupted the kernel heap. The corruption would only cause a crash multiple minutes later, in some unrelated routine that happened to touch that part of the heap. But KASAN was able to pinpoint the corruption at the site where it occurred rather than at the victim location that crashed, saving the day.
As long as the trackpad was only plugged in while macOS was running, no fault would occur since the Linux driver for it would never be loaded.
Workaround
To solve this, create or append to the file /etc/modprobe.d/blacklist.conf, and add a line:
blacklist hid-magicmouse
Then reboot Proxmox, plug in the Magic Trackpad 2, and confirm with “lsmod” that the hid-magicmouse module didn’t get loaded (lsmod lists it with an underscore, as hid_magicmouse).
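The workaround can also be scripted. The sketch below uses two hypothetical helpers, blacklist_module and module_loaded, whose names are just for illustration. The first appends the blacklist line only if it is not already present, so it is safe to run more than once. The second checks /proc/modules, which is the list lsmod reads, and converts dashes to underscores because that is how loaded modules are listed there:

```shell
# Hypothetical helper: append "blacklist <module>" to a modprobe config file
# unless that exact line is already present. Needs root for the default path.
blacklist_module() {
  module="$1"
  conf="${2:-/etc/modprobe.d/blacklist.conf}"
  line="blacklist $module"
  if ! grep -qxF "$line" "$conf" 2>/dev/null; then
    printf '%s\n' "$line" >> "$conf"
  fi
}

# Hypothetical helper: succeeds if the module appears in the loaded-module
# list. Loaded modules are listed with underscores, so dashes are converted.
module_loaded() {
  name=$(printf '%s' "$1" | tr '-' '_')
  grep -q "^$name " "${2:-/proc/modules}"
}

# Usage (as root), then reboot and re-check with the trackpad plugged in:
#   blacklist_module hid-magicmouse
#   module_loaded hid-magicmouse && echo "still loaded" || echo "not loaded"
```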