ARM Morello with Linux

ARM has recently made their Morello development on the Linux kernel public, and since we are lucky enough to have access to the Morello board, we decided to give it a spin. This is in contrast to our last blog post, which used an Android stack and ran in a simulator.

Developing the kernel in the open is a great move: a Morello-enabled Linux increases the probability of widespread adoption of the new experimental architecture, thanks to the vast user base and pool of engineers associated with the OS. Linux is one of the most widely used operating systems in IoT. In a 2022 survey by the Eclipse Foundation, Linux was running on 43% of constrained devices and 51% of edge nodes and gateways, and in previous years the share was even higher. Interestingly, the leading industry for IoT is currently agriculture and, not surprisingly given that the technology is being adopted everywhere now, the main workload is artificial intelligence. One of the main concerns amongst IoT developers in this and other surveys is security, which supports the need for more secure compute. These surveys may have limited sample sizes; nevertheless, it is fair to say that Linux is at the forefront of IoT.

Briefly, and from a top-level view, the Morello hardware consists of two dual-core CHERI-enabled Armv8-A clusters, a Mali GPU and a Mali DPU. The Morello board also has the usual IO connectors such as USB, Ethernet, SATA and PCIe, among others, plus peripherals implemented on an FPGA (I2C, UART, I2S, etc.). The interconnect and dynamic memory controllers within Morello are capability aware and can handle tagged memory. In short, how the tag is handled depends on the operating mode of the controller and the type of memory installed. In server mode it requires Error Correction Code (ECC) DRAM, as the tag bit is stored in one of the ECC bits. In client mode the tags are stored in DRAM at the top of the physical address space and consume 1/128th of the system memory. Detailed information can be found here under appendix D. Finally, we have a Platform Controller Chip (board power up, temperature, etc.) and a Motherboard Configuration Controller (board management), both implemented on a Cortex-M4.
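
The 1/128th figure for client mode follows from one tag bit guarding each capability-sized granule of memory. The short Python sketch below just works through that arithmetic; the 128-bit granule size is our assumption based on the capability width, and the function names are made up for illustration.

```python
# Assumption: one tag bit protects each 128-bit (16-byte) capability-sized granule,
# which is where the 1/128th overhead quoted above comes from.
CAPABILITY_BITS = 128

GiB = 1024 ** 3
MiB = 1024 ** 2


def tag_storage_bytes(dram_bytes: int) -> int:
    """Bytes of DRAM set aside for tags in client mode (1 tag bit per 128 data bits)."""
    return dram_bytes // CAPABILITY_BITS


def usable_bytes(dram_bytes: int) -> int:
    """DRAM left over once the tag region at the top of memory is carved out."""
    return dram_bytes - tag_storage_bytes(dram_bytes)


if __name__ == "__main__":
    for size_gib in (8, 16, 32):
        total = size_gib * GiB
        print(f"{size_gib} GiB DRAM -> {tag_storage_bytes(total) // MiB} MiB of tags")
```

So on a hypothetical 16 GiB system, roughly 128 MiB would be reserved for tags in client mode, whereas server mode avoids the carve-out by using spare ECC bits.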

The available transitional kernel-user ABI is still a work in progress and mainly focuses on enabling capabilities for a limited number of commonly used syscalls. Capabilities are not enforced in the kernel itself, but the PCuABI (pure-capability kernel-user ABI) guarantees that capabilities passed from user space are returned untouched. So let’s dive into what enabling this transitional ABI entails. First of all, capability support is enabled in the kernel via CONFIG_CHERI_PURECAP_UABI. The Linux Morello kernel is backwards compatible with aarch64 via CONFIG_COMPAT, the same option that enabled 32-bit compatibility on 64-bit processors in the past. At the time of writing, this config option appears to have been repurposed for Morello purecap kernels (we presume because c64 is not backwards compatible with 32-bit ARM architectures anyway). This is a very useful feature, as one can start developing in a64 user space, test some aspects of c64 along the way, and then port the project to c64 once the Linux kernel is in a more advanced phase of development. A well-written C++ user space application should not require much work to port to c64; an interesting paper on porting desktop environments to CHERI can be found here. The paper states that, out of more than 6 million lines of code, only 0.026% had to be changed for CHERI, which equates to 1584 lines. To give a more specific example, in the case of KDE a single line of code had to be changed.

We start by checking out ARM’s fork of CHERI LLVM that supports Morello, as one needs a capability-aware compiler to build the capability-aware kernel. The Morello-enabled GNU toolchain is still in its alpha phase and supports only bare-metal targets. We can either build the compiler manually or use cheribuild and pass the llvm-morello option to the build script; in both cases clang is required. We tried both methods and both were successful. With the compiler built and in your path, and the kernel checked out, run the following commands within the kernel tree:

$ make CROSS_COMPILE=aarch64-none-linux-gnu- ARCH=arm64 CC=clang morello_transitional_pcuabi_defconfig
$ make CROSS_COMPILE=aarch64-none-linux-gnu- ARCH=arm64 CC=clang

ARM provides a default config for Morello. Great, so we have a purecap kernel… now we only need to prepare a bootable image for our hardware. We will step away from the Linux side of things for a moment to describe how to achieve this. Let’s start with the boot process.

To boot the hardware one first needs an SD card image containing the Motherboard Configuration Controller (MCC), Platform Controller Chip (PCC), System Control Processor (SCP) and Manageability Control Processor (MCP) binaries, a FIP binary with Trusted Firmware-A and UEFI as the non-trusted BL33, followed by the IOFPGA binary and some other config files. Luckily, one can just use a ready-made image. One can also use ARM’s Morello stack to build these, and it is possible to build SCP, MCP, TF-A and UEFI from scratch if required, as the source code is hosted on ARM’s git. For Morello in purecap mode to be operational one needs capability-aware firmware as well: the whole stack needs to know what a capability instruction is and how to handle it, so TF-A and UEFI must be compiled using a capability-aware LLVM.

In simplified terms, after power is switched on the MCC does the initial board setup, enables the UART and power supply, and takes the PCC out of reset. The MCC then enables the SCP, the IOFPGA image is loaded from the SD card, the SCP and MCP run their code, the SCP does the SoC setup and then the application code is executed. ARM Trusted Firmware runs and hands control to the UEFI bootloader, which in turn initialises the memory map and other drivers. As a consequence we end up at the boot screen, where we can choose where to go next…

To progress further we need a bootable USB stick containing the following: the bootable grub.efi, grub.cfg, our kernel and a rootfs. The first three items form the EFI system partition. One can refer to the Morello stack to assemble this correctly; the build sources are available in their git. With that out of the way, we are presented with the familiar console output as shown below.

In user space, applications can get rid of the shimming layer, as the pointers passed to (and from) the kernel during a context switch are now replaced with capabilities. This means that musl-libc and apps can now be compiled in purecap-only mode. The version of musl-libc used at the time of writing supported static linking only, which will be improved upon in future releases, so your *.crt objects and the library *.a need to be linked manually.

How does the system know what to do with your application? The answer is in the ELF header. The figures below are outputs obtained from running llvm-readelf on the same minimal application, which contains some operations on pointers, compiled with both the a64 and c64 ABI. The processor-specific e_flags bit EF_AARCH64_CHERI_PURECAP (0x10000) indicates that we are dealing with a c64-compatible ABI; if the flags are 0x0 we are dealing with good old a64 pointers. Moreover, we can see that the c64 file has the LSB of its entry address set to 0x1, while in the a64 file the LSB is 0x0. This bit is used to switch between the two instruction sets in the Morello architecture: if the LSB of the address is 0x1 the instruction set is switched to c64, and if it is 0x0 to a64. (This trick is very similar to how ARM Thumb mode is selected.) The actual entry point is the same in both cases. More info can be found in chapter 2 of the Morello extensions ISA or in the kernel documentation.

A64 ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           AArch64
  Version:                           0x1
  Entry point address:               0x2102C0
  Start of program headers:          64 (bytes into file)
  Start of section headers:          8616 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         8
  Size of section headers:           64 (bytes)
  Number of section headers:         20
  Section header string table index: 18
There are 20 section headers, starting at offset 0x21a8:

C64 ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           AArch64
  Version:                           0x1
  Entry point address:               0x2102C1
  Start of program headers:          64 (bytes into file)
  Start of section headers:          11888 (bytes into file)
  Flags:                             0x10000, purecap
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         8
  Size of section headers:           64 (bytes)
  Number of section headers:         22
  Section header string table index: 20
There are 22 section headers, starting at offset 0x2e70:
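
The two checks described above (the purecap bit in e_flags and the LSB of the entry address) can be reproduced with a few lines of Python. This is only a sketch that decodes the standard little-endian ELF64 header; the flag value is taken from the "Flags: 0x10000, purecap" line of the C64 dump, and the function names are our own.

```python
import struct

# Standard little-endian ELF64 header: e_ident[16], e_type, e_machine, e_version,
# e_entry, e_phoff, e_shoff, e_flags, then six 16-bit size/count fields (64 bytes).
ELF64_HEADER = struct.Struct("<16sHHIQQQIHHHHHH")
EM_AARCH64 = 183
# Value as reported by llvm-readelf for the c64 binary above.
EF_AARCH64_CHERI_PURECAP = 0x10000


def describe_elf_header(header: bytes) -> dict:
    """Report the ABI flag and the instruction set selected by the entry address."""
    (ident, _e_type, e_machine, _e_version, e_entry, _e_phoff, _e_shoff,
     e_flags, *_rest) = ELF64_HEADER.unpack(header[:ELF64_HEADER.size])
    if ident[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if ident[4] != 2 or ident[5] != 1:
        raise ValueError("this sketch only handles little-endian ELF64")
    if e_machine != EM_AARCH64:
        raise ValueError("not an AArch64 binary")
    return {
        "purecap": bool(e_flags & EF_AARCH64_CHERI_PURECAP),
        "entry_isa": "c64" if e_entry & 1 else "a64",  # LSB selects the ISA
        "entry_address": e_entry & ~1,                 # address with the ISA bit cleared
    }


def describe_elf_file(path: str) -> dict:
    """Convenience wrapper that reads just the header from a file on disk."""
    with open(path, "rb") as f:
        return describe_elf_header(f.read(ELF64_HEADER.size))
```

Fed the two headers shown above, the sketch reports the same entry address (0x2102C0) for both files, with only the ISA bit and the purecap flag differing.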

The next thing we can observe is that the c64 *.elf file has more sections in its segments, as shown in more detail below:

 A64 Section to Segment mapping:
  Segment Sections...
   00     
   01     .rodata .eh_frame_hdr .eh_frame 
   02     .text .init .fini 
   03     .ctors .dtors .got 
   04     .data .bss 
   05     .ctors .dtors .got 
   06     .eh_frame_hdr

 C64 Section to Segment mapping:
  Segment Sections...
   00     
   01     .rodata .eh_frame_hdr .eh_frame 
   02     .text .init .fini 
   03     .ctors .dtors .data.rel.ro __cap_relocs .got 
   04     .data .bss 
   05     .ctors .dtors .data.rel.ro __cap_relocs .got 
   06     .eh_frame_hdr

The extra sections are .data.rel.ro and __cap_relocs. They need to exist because capabilities cannot be statically initialised: the base and existing capabilities in the system are unknown until runtime. Static capabilities have to be derived from other valid capabilities already in the system, as required by the monotonicity principle of the CHERI specification. Therefore every static capability originating from the code incurs a relocation entry in the __cap_relocs section, which contains an array of all static capability descriptors. That table is iterated through at runtime, and the actual capabilities are created from base capabilities and only then stored, all according to the metadata found in the __cap_relocs table entries. This is done by the bootstrap function __morello_init_static, which is part of the purecap C runtime library. A more detailed discussion can be found in appendix C here.
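
To make the idea more concrete, here is a toy Python model of that start-up loop. It is not the real __morello_init_static, and the descriptor fields below are a simplified, hypothetical layout chosen for illustration rather than the actual __cap_relocs entry format. What it does show is the monotonicity rule: every static capability is derived from a root capability and can only narrow its bounds and permissions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capability:
    base: int
    length: int
    perms: frozenset
    address: int

    def derive(self, base, length, perms, address):
        """Return a narrower capability; refuse anything that would widen rights."""
        if base < self.base or base + length > self.base + self.length:
            raise ValueError("bounds exceed parent capability")
        if not frozenset(perms) <= self.perms:
            raise ValueError("permissions exceed parent capability")
        return Capability(base, length, frozenset(perms), address)


@dataclass(frozen=True)
class CapReloc:
    """Hypothetical, simplified descriptor for one static capability."""
    location: int      # where the finished capability must be stored
    target_base: int   # start of the object it points to
    offset: int        # initial offset of the pointer within the object
    size: int          # object size, which becomes the capability length
    perms: frozenset   # rights requested for the object


def init_static_caps(root, relocs):
    """Walk the descriptor table and build each capability from the root."""
    memory = {}
    for r in relocs:
        memory[r.location] = root.derive(
            r.target_base, r.size, r.perms, r.target_base + r.offset
        )
    return memory
```

A descriptor asking for bounds or permissions outside the root capability is rejected in this model, which mirrors why a capability cannot simply be written into the binary as constant data.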

The *.elf file can contain both types of instructions, and extra mapping symbols are used to indicate the start of each sequence: $c stands for a c64 sequence and $x for a64. Similarly, capability-enabled operands in the assembler are either prefixed with c or have their first letter replaced with c (sp becomes csp, x30 becomes c30).

To build user space applications we first need to build musl-libc in purecap-only mode. This can be achieved by downloading the source and running the following script, which installs the library to a sysroot folder of your choosing:

CC=$LLVM_PATH/clang LD=$LLVM_PATH/ld OBJCOPY=$LLVM_PATH/llvm-objcopy AR=$LLVM_PATH/llvm-ar \
OBJDUMP=$LLVM_PATH/llvm-objdump READELF=$LLVM_PATH/llvm-readelf NM=$LLVM_PATH/llvm-nm \
STRIP=$LLVM_PATH/llvm-strip LLVM=$LLVM_PATH LLVM_IAS=1 ./configure --prefix=${SYSROOT}/aarch64-linux-musl_purecap \
--target=aarch64-linux-musl_purecap --disable-shared --enable-static \
--disable-libshim --enable-morello
make
make install

Then one needs to build the crt objects and libraries; the steps are described here. We then compile the app using the following makefile (only the interesting bits are shown for brevity), which is based on what we found by digging through the stack. It links the crt objects and the C library with our app. The app’s ELF header is also patched with the EF_AARCH64_CHERI_PURECAP flag using the following elf tool.

TARGET=aarch64-linux-musl_purecap
ARCH=-march=morello+c64 -mabi=purecap

SYSROOT_LIB=$(SYSROOT)/$(TARGET)/lib
SYSROOT_INC=$(SYSROOT)/$(TARGET)/include
COMPILER_INC=$(CLANG_RESOURCE_DIR)/include

$(CC) -c -g -O0 -isystem $(SYSROOT_INC) -I$(COMPILER_INC) \
        $(ARCH) $(NAME).c -o $(OUT)/$(NAME).c.o \
        --target=$(TARGET)
$(CC) $(CFLAGS) --target=$(TARGET) -fuse-ld=lld $(ARCH) \
        $(SYSROOT_LIB)/crt1.o \
        $(SYSROOT_LIB)/crti.o \
        $(CLANG_LIB)/clang_rt.crtbegin.o \
        $(OUT)/$(NAME).c.o \
        $(CLANG_LIB)/libclang_rt.builtins.a \
        $(CLANG_LIB)/clang_rt.crtend.o \
        $(SYSROOT_LIB)/crtn.o \
        -nostdlib -L$(SYSROOT_LIB) -lc -o $(OUT)/$(NAME) -static
$(ELF_PATCH) $(OUT)/$(NAME)

Congratulations, you can now put your apps on the rootfs of the Linux image. We have confirmed that we get similar results to those in our Android blog post.

Moving on, let’s explore what happens to pointers created on the stack with the help of the example below, which shows a trivial and pointless C function compiled with the method shown above:

int main(int argc, char **argv)
{
    int k = 42;
    int * pk = &k;
    return *(pk) + k;
}

Unoptimised, this assembles into:

A64: 0000000000210458 <main>:
// Prologue and reserve stack
  210458: ff 83 00 d1   sub sp, sp, #32             // =32
  21045c: ff 1f 00 b9   str wzr, [sp, #28]
  210460: e0 1b 00 b9   str w0, [sp, #24]
  210464: e1 0b 00 f9   str x1, [sp, #16]
// Register and stack pointer juggling
  210468: e8 33 00 91   add x8, sp, #12             // =12
  21046c: 49 05 80 52   mov w9, #42
  210470: e9 0f 00 b9   str w9, [sp, #12]
  210474: e8 03 00 f9   str x8, [sp]
  210478: e8 03 40 f9   ldr x8, [sp]
  21047c: 08 01 40 b9   ldr w8, [x8]
  210480: e9 0f 40 b9   ldr w9, [sp, #12]
// Store for return
  210484: 00 01 09 0b   add w0, w8, w9
// Release the stack and return
  210488: ff 83 00 91   add sp, sp, #32             // =32
  21048c: c0 03 5f d6   ret

C64: 00000000002105a4 <main>:
// Prologue and reserve stack
  2105a4: ff 03 81 02   sub csp, csp, #64           // =64
  2105a8: 22 d0 c1 c2   mov c2, c1
  2105ac: e8 03 00 2a   mov w8, w0
// Create pointers to stack and set stack bounds for capabilities
  2105b0: e0 f3 00 02   add c0, csp, #60            // =60
  2105b4: 05 38 c2 c2   scbnds  c5, c0, #4              // =4
  2105b8: e0 e3 00 02   add c0, csp, #56            // =56
  2105bc: 04 38 c2 c2   scbnds  c4, c0, #4              // =4
  2105c0: e0 83 00 02   add c0, csp, #32            // =32
  2105c4: 03 38 c8 c2   scbnds  c3, c0, #16             // =16
  2105c8: e0 73 00 02   add c0, csp, #28            // =28
  2105cc: 00 38 c2 c2   scbnds  c0, c0, #4              // =4
  2105d0: e1 d3 c1 c2   mov c1, csp
  2105d4: 21 38 c8 c2   scbnds  c1, c1, #16             // =16
// Initialise stack pointers
  2105d8: e9 03 1f 2a   mov w9, wzr
  2105dc: a9 00 00 b9   str w9, [c5]
  2105e0: 88 00 00 b9   str w8, [c4]
  2105e4: 62 00 00 c2   str c2, [c3, #0]
// Register and stack pointer juggling
  2105e8: 48 05 80 52   mov w8, #42
  2105ec: 08 00 00 b9   str w8, [c0]
  2105f0: 20 00 00 c2   str c0, [c1, #0]
  2105f4: 21 00 40 c2   ldr c1, [c1, #0]
  2105f8: 28 00 40 b9   ldr w8, [c1]
  2105fc: 09 00 40 b9   ldr w9, [c0]
// Store for return
  210600: 00 01 09 0b   add w0, w8, w9
// Release the stack and return
  210604: ff 03 01 02   add csp, csp, #64           // =64
  210608: c0 53 c2 c2   ret c30

The cn CPU registers are the 129-bit capability-enabled general purpose registers, the xn registers are their lower 64-bit alias and the wn registers are the 32-bit alias. An instruction operating on c registers is therefore an operation on a capability, whereas one operating on x or w registers is effectively an operation on a number. If the instruction set is c64, as determined by the LSB of the entry address, branch and link will write the return address to register c30, otherwise to register x30, and so on. We can see that the capability-enabled assembly is larger, with a64 requiring 14 instructions and c64 requiring 26. This is mainly because bounds for the capabilities that point into the stack are set during the prologue using the scbnds instruction. We can also see that the stack pointer sp is replaced with csp, so the stack pointer has its own permissions and all of the other capability metadata.
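
The stack layout can be read straight off the c64 listing: each add/scbnds pair carves one bounded slot out of the 64-byte frame. The Python sketch below records those slots and checks a few properties of them. The slot names are our own reading of the listing (what each register is later used for), and the bounds check is a plain software model, not a description of how the hardware evaluates a capability.

```python
FRAME_SIZE = 64  # from "sub csp, csp, #64"

# (offset from csp, size in bytes), taken from the add/scbnds pairs in the listing.
STACK_SLOTS = {
    "retval": (60, 4),   # c5: zero-initialised with wzr
    "argc":   (56, 4),   # c4: receives w0
    "argv":   (32, 16),  # c3: holds a capability, hence 16 bytes
    "k":      (28, 4),   # c0: receives the constant 42
    "pk":     (0, 16),   # c1: holds the capability pointing at k
}


def slots_fit_frame() -> bool:
    """Every slot must lie inside the frame reserved by the prologue."""
    return all(0 <= off and off + size <= FRAME_SIZE for off, size in STACK_SLOTS.values())


def slots_overlap() -> bool:
    """True if any two slots share a byte."""
    spans = sorted(STACK_SLOTS.values())
    return any(a_off + a_size > b_off for (a_off, a_size), (b_off, _) in zip(spans, spans[1:]))


def access_allowed(slot: str, offset: int, width: int) -> bool:
    """Model of a bounds check: the access must stay inside the slot's own bounds."""
    _, size = STACK_SLOTS[slot]
    return 0 <= offset and offset + width <= size


def instruction_count(first: int, last: int, width: int = 4) -> int:
    """Number of fixed-width instructions between two addresses, inclusive."""
    return (last - first) // width + 1
```

In this model a 4-byte load through the capability for k is fine, but the same load one word further along falls outside its 4-byte bounds, even though that address is still inside the function's own stack frame. The helper at the end also recovers the 14 and 26 instruction counts from the first and last addresses of each listing.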

To sum up, we have shown that you can boot into Linux and briefly gone over the structure and workings of a user space application. We will come back to the subject to explore how dynamic linking works and to exercise a few syscalls. It is nice to see progress and we are looking forward to playing more with the hardware.

We’re currently working on creating a Yocto distribution for the Morello board, which we plan to make public soon. This includes building the firmware, kernel and userspace; you can follow our progress on our GitHub repo and the Morello Linux Distros mailing list.