The Linux kernel feature known as ‘kexec’ allows you to boot from the currently running kernel into a new kernel – effectively turning a Linux distribution into a feature-rich bootloader. This shouldn’t be confused with virtualisation technologies that allow you to run Linux as a guest. This capability has been around since 2005 (2.6.13) and is now available on most architectures, though you’d be forgiven for not being aware of its existence. In this post we’re going to give you a brief introduction.
The capability exists in the form of an additional system call in the kernel and some user space tools. The system call, enabled via CONFIG_KEXEC, is ‘kexec_load’, which loads a new kernel into memory. Interestingly, the new kernel is then booted via the existing ‘reboot’ system call when the LINUX_REBOOT_CMD_KEXEC command is passed to it. The user space tools come from the kexec-tools package, which provides utilities such as ‘kexec’ (unsurprisingly).
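To make the second half of that mechanism concrete, here is a minimal Python sketch of the ‘reboot’ call. It assumes glibc's reboot() wrapper is reachable through ctypes; the function name kexec_reboot and its dry_run flag are our own inventions for illustration, not part of kexec-tools. Only the command value comes from the kernel headers.

```python
import ctypes
import os

# Command value from <linux/reboot.h>; its bytes spell "EXEC" in ASCII.
LINUX_REBOOT_CMD_KEXEC = 0x45584543


def kexec_reboot(dry_run=True):
    """Ask the kernel to jump into an image previously loaded with kexec_load.

    With dry_run=True (the default) nothing is executed; the command value
    that would be passed to reboot() is simply returned.
    """
    if dry_run:
        return LINUX_REBOOT_CMD_KEXEC

    # Assumption: glibc's reboot() wrapper, which supplies the magic numbers
    # that the raw system call requires. Needs CAP_SYS_BOOT (typically root).
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.reboot(ctypes.c_int(LINUX_REBOOT_CMD_KEXEC)) != 0:
        # We only get here on failure, e.g. when no kernel has been loaded.
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

Calling kexec_reboot(dry_run=False) on a real system would jump straight into the loaded kernel without stopping services first, so treat this purely as an illustration of the interface.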
Let’s see how this works in practice on a Raspberry Pi (64-bit). This is a two-step process. First we need to load a kernel into memory using the ‘kexec’ utility:
$ kexec -l /mnt/Image
The ‘-l’ argument specifies the kernel we wish to load. On ARM64 this can be a ‘vmlinux’ ELF, a U-Boot ‘uImage’ file or a binary ‘Image’. In this case we’ve used the arch/arm64/boot/Image file.
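If you want to confirm that the load step worked before rebooting, the kernel exposes a flag in sysfs. The sketch below reads /sys/kernel/kexec_loaded, which contains 1 once an image is loaded; the helper name kexec_image_loaded is ours, and we assume sysfs is mounted in the usual place.

```python
from pathlib import Path


def kexec_image_loaded(sysfs_file="/sys/kernel/kexec_loaded"):
    """Return True if the kernel reports that a kexec image is loaded."""
    try:
        return Path(sysfs_file).read_text().strip() == "1"
    except FileNotFoundError:
        # The file is absent when the kernel was built without kexec support.
        return False
```

The path is a parameter only so that the helper can be exercised against an ordinary file.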
Next we will reboot into our new kernel – we do this via the ‘-e’ argument to ‘kexec’. There are a variety of architecture specific options available, though here we will use ‘--dtb’ to specify the DTB file and ‘--reuse-cmdline’ to use the same kernel command line that we used with the currently running kernel.
$ kexec -e --dtb /mnt/bcm2837-rpi-3-b-plus.dtb --reuse-cmdline
[ 4603.030872] kvm: exiting hardware virtualization
[ 4603.044176] kexec_core: Starting new kernel
[ 4603.055839] Bye!
[ 0.000000] Booting Linux on physical CPU 0x0000000000 [0x410fd034]
[ 0.000000] Linux version 5.10.0-rc4-00368-g27bba9c532a8-dirty (andy@big-machine) (aarch64-linux-gnu-gcc (Linaro GCC 7.5-2019.12) 7.5.0, GNU ld (Linaro_Binutils-2019.12) 22.214.171.12470706) #8 SMP PREEMPT Tue Nov 24 11:21:19 GMT 2020
[ 0.000000] Machine model: Raspberry Pi 3 Model B+
[ 0.000000] efi: UEFI not found.
[ 0.000000] Reserved memory: created CMA memory pool at 0x0000000037400000, size 64 MiB
[ 0.000000] OF: reserved mem: initialized node linux,cma, compatible id shared-dma-pool
[ 0.000000] NUMA: No NUMA configuration found
[ 0.000000] NUMA: Faking a node at [mem 0x0000000000000000-0x000000003b3fffff]
[ 0.000000] NUMA: NODE_DATA [mem 0x37211b00-0x37213fff]
[ 0.000000] Zone ranges:
[ 0.000000] DMA [mem 0x0000000000000000-0x000000003b3fffff]
[ 0.000000] DMA32 empty
[ 0.000000] Normal empty
[ 0.000000] Movable zone start for each node
After a slightly uncomfortable wait, we see the new kernel spring into life and eventually get a prompt. We can use ‘uname’ to verify the timestamp of the kernel we’ve just booted. Hooray!
Welcome to Buildroot
buildroot login: root
$ uname -a
Linux buildroot 5.10.0-rc4-00368-g27bba9c532a8-dirty #8 SMP PREEMPT Tue Nov 24 11:21:19 GMT 2020 aarch64 GNU/Linux
There are many reasons why this is useful. The most common use case is as a crash kernel: at start-of-day you load a second kernel into reserved memory, and if the running kernel panics or does something undesirable, the system automatically boots into the crash kernel, which takes a crash dump of the crashed kernel for analysis via gdb or the crash utility. There is plenty of documentation for kdump here.
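As a rough illustration of the crash-kernel prerequisites, the sketch below checks two things: that memory was reserved via the ‘crashkernel=’ kernel command line parameter, and that a crash kernel has actually been loaded (which the ‘kexec’ utility does with ‘-p’ rather than ‘-l’), as reported by /sys/kernel/kexec_crash_loaded. The helper names are hypothetical and the default paths assume a standard procfs and sysfs layout.

```python
def crashkernel_reservation(cmdline):
    """Return the value of the crashkernel= parameter, or None if absent."""
    for token in cmdline.split():
        if token.startswith("crashkernel="):
            return token.split("=", 1)[1]
    return None


def crash_kernel_ready(cmdline_file="/proc/cmdline",
                       loaded_file="/sys/kernel/kexec_crash_loaded"):
    """Return True if memory is reserved and a crash kernel is loaded."""
    try:
        with open(cmdline_file) as f:
            reserved = crashkernel_reservation(f.read()) is not None
        with open(loaded_file) as f:
            loaded = f.read().strip() == "1"
    except FileNotFoundError:
        # Missing files suggest the kernel lacks kexec or crash dump support.
        return False
    return reserved and loaded
```

This only tells you the pieces are in place; it says nothing about whether the crash kernel will successfully capture a dump.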
Another use for kexec is to achieve quicker reboots. Normally a reboot involves a request to the hardware to reset, followed by firmware loading, the bootloader starting and the bootloader initialising hardware, all of which takes time. Booting Linux with kexec skips these steps. However, because it skips them, the new kernel may find the hardware in a different state from the one the original kernel started with. On a kexec reboot the kernel attempts to cleanly shut down devices and leave them in a good state (e.g. it calls ->shutdown on device drivers), but bugs can occur as these shutdown paths may not be well tested.
Another use for kexec is as a bootloader – a quick Google search shows lots of examples including using kexec to boot Windows! The main benefit for using Linux as a bootloader is that Linux is feature-rich and has great support for a wide range of hardware. Quite often it will be less effort to use Linux than to port missing functionality to U-Boot (which borrows a lot from Linux anyway) – this slide deck provides some insight into why Linux may be used for trusted boot.
And finally, as a developer it may be easier to boot a new kernel from Linux than work with a limited, secure or locked-down bootloader.
It’s also worth pointing out that if you attempt to kexec on a Raspberry Pi, you may see an error like this:
$ kexec -l /mnt/Image
[ 54.311475] Can't kexec: CPUs are stuck in the kernel.
kexec_load failed: Device or resource busy
entry = 0x1f646b0 flags = 0xb70000
nr_segments = 3
segment.buf = 0xffff99781010
segment.bufsz = 0x1ed0200
segment.mem = (nil)
segment.memsz = 0x1f60000
segment.buf = 0x98d4040
segment.bufsz = 0x4000
segment.mem = 0x1f60000
segment.memsz = 0x4000
segment.buf = 0x98d83c0
segment.bufsz = 0x31e0
segment.mem = 0x1f64000
segment.memsz = 0x4000
A quick workaround for this is to add ‘nr_cpus=1’ to the kernel command line, which limits the kernel to using a single CPU. The error occurs because the kernel doesn’t have a way to park the secondary CPUs. Ideally it would use PSCI to do this, which is essentially a call to firmware running at a higher privilege level via an SMC instruction (thus allowing for some abstraction of the hardware from the kernel, helping make ARM platforms more server-ready and ‘boring’). However, it’s likely that there is no such firmware. This can be fixed in a variety of ways: by booting with ARM Trusted Firmware (ATF), which includes a PSCI service, by using a poor-man’s PSCI monitor, by using the ARM64 boot wrapper, or simply by parking the CPU directly from the kernel via this rejected patch.
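To check that the ‘nr_cpus=1’ workaround has taken effect after a reboot, you can count the CPUs the kernel reports as online. The sketch below parses the cpulist format used by /sys/devices/system/cpu/online (for example "0-3" or "0,2-3"); the function names are ours and the default path assumes a standard sysfs mount.

```python
def count_cpus(cpulist):
    """Count CPUs in a kernel cpulist string such as "0-3" or "0,2-3"."""
    total = 0
    for part in cpulist.strip().split(","):
        if not part:
            continue
        # A part is either a single CPU ("0") or an inclusive range ("2-3").
        first, _, last = part.partition("-")
        total += int(last or first) - int(first) + 1
    return total


def online_cpu_count(sysfs_file="/sys/devices/system/cpu/online"):
    """Return the number of CPUs the running kernel has online."""
    with open(sysfs_file) as f:
        return count_cpus(f.read())
```

With the workaround in place we would expect online_cpu_count() to return 1; anything higher means the secondary CPUs are still in use.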