May 2016 – Work Of Ard

Booting a big-endian kernel from UEFI

One recurring question I get regarding UEFI on ARM systems is when we will introduce support for booting big-endian kernels. If you think of UEFI as simply a bootloader, this sounds like a reasonable question, but when you take a closer look, this is actually much more complicated than it sounds.

UEFI is a specification, not an implementation

UEFI originated in the Intel world, which is little-endian only. This means that from the specification side, no attention or effort whatsoever has been spent on making the interfaces, data structures and other software visible objects deal with endianness. Also, the PE/COFF executable format that UEFI heavily relies on does not take endianness into account at all.

This means that it is impossible to recompile a UEFI implementation in big-endian mode, and still adhere to the specification. Whether you could get away with it in practice is irrelevant, since the reason we like UEFI is the fact that it is a specification, not an implementation, and every UEFI compliant OS should be able to interact with every UEFI compliant firmware (provided that they were built for the same architecture).

One possible approach could be to introduce BE-AArch64 as a completely new architecture both in PE/COFF and in UEFI, but that would result in BE firmwares that can only boot BE kernels, which does not sound that appealing either.

Running a big-endian OS on little-endian firmware

So if building a big-endian UEFI firmware is out of the question, can we boot a big-endian kernel from a little-endian UEFI? Again, if you think of UEFI as a bootloader with a single handover point to the OS, this does not sound unreasonable, but there are still a couple of concerns.

1. UEFI exposes firmware tables to the OS, such as the System Table, the memory map and other data structures containing multibyte quantities that need to be endian swabbed before consumption. In Linux, none of the existing code that handles these tables takes endianness into account.
2. The UEFI stub in Linux makes the kernel executable pose as a PE/COFF binary, and the UEFI stub code is called in the execution context of the firmware. In order to support big-endian kernels, we would have to build some objects in LE mode, some in BE mode, and objects that are shared between the stub and the kernel proper would need to be built twice. It is unlikely that the ARM and arm64 Linux maintainers will be eager to adopt such changes, since they complicate the build system configuration considerably.
3. Invoking UEFI Runtime Services will require an endianness switch at EL1. This involves endian swabbing the in-memory representation of by-reference arguments, but this is the easy part. The hard part is taking exceptions, not only faults, but interrupts as well (since v4.6, UEFI runtime services execute with interrupts enabled). None of the exception handling machinery is set up to deal with exceptions raised in the wrong endianness, and complicating those code paths to deal with this is unlikely to be upstreamable.
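
To make the first concern a bit more tangible, here is a minimal Python sketch of what happens when a big-endian consumer reads a little-endian firmware field without swabbing it. The signature constant is the System Table signature from the UEFI specification; the helper names read_u64 and swab64 are made up for this illustration and do not correspond to any kernel code.

```python
import struct

# EFI_SYSTEM_TABLE_SIGNATURE as defined by the UEFI specification.
EFI_SYSTEM_TABLE_SIGNATURE = 0x5453595320494249

def read_u64(buf, offset, big_endian):
    """Read a 64-bit field the way a CPU of the given endianness would load it."""
    fmt = ">Q" if big_endian else "<Q"
    return struct.unpack_from(fmt, buf, offset)[0]

def swab64(value):
    """Reverse the byte order of a 64-bit quantity."""
    return int.from_bytes(value.to_bytes(8, "little"), "big")

# The firmware always stores the field in little-endian byte order.
header = struct.pack("<Q", EFI_SYSTEM_TABLE_SIGNATURE)

le_view = read_u64(header, 0, big_endian=False)  # what an LE kernel sees
be_view = read_u64(header, 0, big_endian=True)   # what a BE kernel sees

print(hex(le_view), hex(be_view), hex(swab64(be_view)))
```

The BE view only matches the expected signature after an explicit swab, and the same applies to every multibyte field in every table the firmware hands over.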

A standalone stub

If we assume that point #1 above is something that we can address and upstream, either by making the code deal with endianness, or by disabling some UEFI related features when building a BE kernel, and if we deal with point #3 above by accepting the fact that such a kernel will not have access to UEFI Runtime Services, we still need to address point #2.

Since the UEFI stub executes in the context of the firmware, while the kernel proper executes in its own context, there is a handover protocol that is described in Documentation/arm/uefi.txt in the Linux kernel tree. This handover protocol basically comes down to populating some DT nodes under /chosen with a description of the firmware context, and there is no reason we cannot implement the same handover protocol in a separate UEFI OS loader application.

So what we will need to support BE boot under UEFI is a standalone stub. This UEFI application should load the kernel image at an appropriate location in system memory, populate the DT /chosen node with the kernel command line, potentially an initrd, and information about the location of the UEFI system table and the UEFI memory map. Then it can branch straight into the core kernel entry point, and boot the BE kernel with full access to UEFI features (to the extent that they were made endianness agnostic).
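
As a rough sketch of the data such a standalone stub would have to hand over, the Python below models the /chosen node as a plain dictionary. The property names are the ones I understand Documentation/arm/uefi.txt and the generic DT bindings to use, so they should be checked against the kernel tree; the function name build_chosen and all addresses and sizes are hypothetical.

```python
def build_chosen(cmdline, system_table, mmap_start, mmap_size,
                 desc_size, desc_ver, initrd=None):
    """Collect the /chosen properties a loader would pass to the kernel."""
    chosen = {
        "bootargs": cmdline,
        # Physical address of the UEFI System Table.
        "linux,uefi-system-table": system_table,
        # Location and layout of the UEFI memory map.
        "linux,uefi-mmap-start": mmap_start,
        "linux,uefi-mmap-size": mmap_size,
        "linux,uefi-mmap-desc-size": desc_size,
        "linux,uefi-mmap-desc-ver": desc_ver,
    }
    if initrd is not None:
        start, size = initrd
        chosen["linux,initrd-start"] = start
        chosen["linux,initrd-end"] = start + size
    return chosen

# Hypothetical values, for illustration only.
chosen = build_chosen(
    cmdline="console=ttyAMA0",
    system_table=0x13F4_0018,
    mmap_start=0x13E0_0000,
    mmap_size=0x1200,
    desc_size=48,
    desc_ver=1,
    initrd=(0x4800_0000, 0x0080_0000),
)
```

A real loader would serialize these into the flattened device tree, where all cells are stored big-endian regardless of the endianness of the kernel.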

If anyone is interested in implementing this, and needs a hand, don’t hesitate to contact me.

Memory protection in UEFI

One of the most important principles of secure system design is distinguishing between code and data, where ‘code’ means sequences of CPU instructions, and ‘data’ means the data manipulated by those instructions. In some cases, ‘data’ is promoted to ‘code’ in a program, for instance by a shared library loader or a JIT, but in most cases, they are completely disjoint, and a program that manipulates its own code as if it were data is misbehaving, either due to a bug or due to the fact that it is under attack.

The typical approach to address this class of attacks is to use permission attributes in the page tables, on the one hand to prevent a program from manipulating its own code, and to prevent it from executing its data on the other. This is usually referred to as W^X, i.e., any memory region belonging to a program may have either the writable attribute or the executable attribute, but never both (W xor X).
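
The W^X rule is simple enough to express as a check over a list of mappings. The following is a minimal sketch, with made-up permission bits and region names, and is not taken from any real page table code.

```python
# Illustrative permission bits; the values are arbitrary.
PROT_R, PROT_W, PROT_X = 1, 2, 4

def violates_wxorx(perms):
    """A mapping violates W^X if it is both writable and executable."""
    return (perms & PROT_W) != 0 and (perms & PROT_X) != 0

def find_violations(regions):
    """Return the names of all regions that break the W^X rule."""
    return [name for name, perms in regions if violates_wxorx(perms)]

regions = [
    (".text", PROT_R | PROT_X),
    (".data", PROT_R | PROT_W),
    ("jit_buffer", PROT_R | PROT_W | PROT_X),
]
violations = find_violations(regions)
print(violations)
```

Only the region that is mapped with both attributes at once is reported.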

UEFI implementations typically map all of memory as both writable and executable, both during boot and at runtime. This makes UEFI vulnerable to this kind of attack, especially the memory regions that are retained by the OS at runtime.

Runtime memory protection in UEFI

Booting via UEFI consists of two distinct phases, the boot phase and the runtime phase. During the boot phase, the UEFI firmware owns the system, i.e., the interrupt controller, the MMU and all other core resources and devices. Once an OS loader calls the ExitBootServices() boot service, the UEFI firmware relinquishes ownership to the OS.

This means that, if we want to apply the W^X principle to UEFI runtime services regions (the memory regions that contain the code and data that implement the firmware services that UEFI exposes to the OS), the firmware needs to tell the OS which attributes it can use when mapping those regions into its address space. For this purpose, version 2.6 of the UEFI specification introduces a new configuration table, the Memory Attributes Table, that breaks down each RuntimeServicesCode and RuntimeServicesData region in the UEFI memory map into sub-regions that can be mapped with strict permissions. (Note that, while RuntimeServicesData regions contain strictly data, RuntimeServicesCode regions describe PE/COFF executables in memory that consist of both code and data, and so the latter cannot be simply mapped with R-X attributes.)
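
As an illustration of how an OS might consume such a table, the sketch below turns the attribute bits of each sub-region into a permission string. EFI_MEMORY_XP and EFI_MEMORY_RO are the attribute bits from the UEFI specification; the breakdown of the RuntimeServicesCode region, its addresses and the helper name perms_for are assumptions made up for this example.

```python
# Memory attribute bits as defined by the UEFI specification.
EFI_MEMORY_XP = 0x4000    # region must not be executed
EFI_MEMORY_RO = 0x20000   # region must not be written

def perms_for(attribute):
    """Translate Memory Attributes Table bits into an R/W/X string."""
    w = "-" if attribute & EFI_MEMORY_RO else "W"
    x = "-" if attribute & EFI_MEMORY_XP else "X"
    return "R" + w + x

# Hypothetical breakdown of a single RuntimeServicesCode region:
# (physical start, number of 4 KB pages, attributes)
entries = [
    (0x8000_0000, 1, EFI_MEMORY_XP),  # PE/COFF header
    (0x8000_1000, 8, EFI_MEMORY_RO),  # code
    (0x8000_9000, 3, EFI_MEMORY_XP),  # data
]

mappings = [(hex(start), pages, perms_for(attr))
            for start, pages, attr in entries]
for mapping in mappings:
    print(mapping)
```

Without the table, the whole region would have to be mapped with the equivalent of perms_for(0), which is RWX.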

In Linux on ARM and arm64, as an additional layer of protection, the page tables that describe the UEFI runtime services regions are only live when necessary, which is during the time that a UEFI runtime service call is in progress. At all other times, the regions are left unmapped.

Support for the memory attributes table in the ARM and arm64 ports of Linux is queued for the v4.7 release. The x86 implementation is currently in development.

Boot time memory protection in UEFI

NOTE: As of 24 March 2017, this blog post is out of date. I have collaborated with Jiewen Yao of the Intel Firmware team to get full memory protection implemented in upstream EDK2, both for PE/COFF images, based on section attributes, and for all remaining memory regions, using a policy PCD.

At boot time, it is up to UEFI itself to manage the permission attributes of its page tables. Unfortunately, most (all?) implementations based on EDK2/Tianocore simply map all of memory both writable and executable, and the only enhancement that was made recently in this area is to map the stack of the boot CPU non-executable during the DXE phase.

As a proof of concept, I implemented strict memory protections for ArmVirtQemu, the UEFI build for the QEMU AArch64 mach-virt platform, which maps all of memory non-executable, and remaps code regions read-only/executable when required. Since EDK2 heavily relies on PE/COFF internally, this is simply a matter of using existing hooks in the PE/COFF loader to set the permissions bits according to the section attributes in the PE/COFF header.

Since such permissions can only be applied at page granularity, it does require that we increase the PE/COFF section alignment to 4 KB. Since most of the PE/COFF executables that make up the firmware live in a compressed firmware volume, this does not affect the memory footprint of the boot image significantly, but it is something to take into account when porting this to a bare metal platform with limited flash space.
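
The mapping from section attributes to page permissions can be sketched as follows. The IMAGE_SCN_* values are the standard PE/COFF section characteristics; the function itself, its name and its policy of leaving every non-code section writable and non-executable are assumptions for illustration and are not the EDK2 code.

```python
# Standard PE/COFF section characteristics.
IMAGE_SCN_MEM_EXECUTE = 0x20000000
IMAGE_SCN_MEM_READ = 0x40000000
IMAGE_SCN_MEM_WRITE = 0x80000000

PAGE_SIZE = 4096

def section_permissions(virtual_address, virtual_size, characteristics):
    """Return (permissions, page count) for one PE/COFF section."""
    if virtual_address % PAGE_SIZE != 0:
        # Permissions apply to whole pages, so a section that shares a
        # page with its neighbour cannot be protected independently.
        raise ValueError("section is not aligned to the page size")
    num_pages = -(-virtual_size // PAGE_SIZE)  # round up
    if characteristics & IMAGE_SCN_MEM_EXECUTE:
        if characteristics & IMAGE_SCN_MEM_WRITE:
            raise ValueError("section is both writable and executable")
        return "R-X", num_pages
    # Everything else keeps the non-executable default mapping.
    return "RW-", num_pages

text = section_permissions(0x1000, 0x2345, 0x60000020)  # typical .text flags
data = section_permissions(0x4000, 0x0100, 0xC0000040)  # typical .data flags
print(text, data)
```

With the smaller section alignments commonly used to save space, the first check would fail for most sections, which is why the alignment has to be raised.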

With the above changes in place, we can update the default attributes used for the 1:1 mapping of system memory to include the XN bits, completing our W^X implementation for ArmVirtQemu.

KASLR in the arm64 Linux kernel

Kernel Address Space Layout Randomization (KASLR) is a hardening feature that aims to make it more difficult to take advantage of known exploits in the kernel, by placing kernel data structures at a random address at each boot. The Linux kernel currently implements this feature for 32-bit and 64-bit x86, and an implementation for the 64-bit ARM architecture (arm64) is queued for the v4.6 release which is due in a couple of weeks.

For the arm64 implementation, the kernel address space layout is randomized in the following ways:

loading the core kernel at a random physical address
mapping the core kernel at a random virtual address in the vmalloc area
loading kernel modules at a random virtual address in the vmalloc area
mapping system memory at a random virtual address in the linear area

Physical address randomization

Since the physical address at which the kernel executes is decided strictly by the bootloader (or on UEFI systems, by the UEFI stub), and not by the kernel itself, implementing physical address randomization consists primarily of removing assumptions in the core kernel that it has been loaded at the base of physical memory. Since the kernel text mapping, the virtual mapping of RAM and the physical mapping of RAM are all tightly coupled, the first step is to decouple those, and move the kernel into the vmalloc region. Once that is done, the bootloader is free to choose any area in physical RAM to place the kernel at boot.

Note that this move of the kernel VA space into the vmalloc region is by far the most intrusive change in the KASLR patch set, and some other patch sets that were under review at the same time required non-trivial rework to remain compatible with the new VA layout configuration.

For v4.7, some enhancement work has been queued to relax the alignment requirement of the core kernel from ‘2 MB aligned base + 512 KB’ to any 64 KB aligned physical offset. The actual number of random bits in the physical address of the kernel depends on the size of system memory, but for a system with 4 GB, it adds up to around 15 bits.

Virtual randomization of the core kernel

The virtual address the core kernel executes at is typically fixed, and thus the kernel binary is a non-relocatable binary where all memory addresses are calculated and emitted into the executable image at build time. With virtual randomization, these memory addresses need to be recalculated at runtime, and updated inside the running image. This means the kernel binary needs to be converted into a relocatable binary, and one that is able to relocate itself (in the absence of a loader such as the one used under the OS to load shared libraries). When the random virtual mapping is created at early boot time, the self relocation routines can take this random virtual offset into account when applying the relocation fixups, after which the kernel will be able to execute from this random virtual address.

The above is supported by the standard binutils toolchain. By linking ordinary (non-PIC) small model code (i.e., relative symbol references with a +/- 4 GB range) in PIE mode, we end up with a binary that has a .rela section consisting of standard ELF64 RELA entries, which are processed by the early startup code.

The RELA relocation format keeps the addend in the relocation entry rather than in the memory location that the relocation targets, and for R_AARCH64_ABS64 relocations, this memory location itself is filled with zeroes until the relocation code executes. This has a couple of downsides:

The executable image needs to be relocated to its runtime address even if this address is equal to the link time address.
The EXTABLE entries cannot be sorted at build time. This was addressed by switching to relative EXTABLE entries, which, as a bonus, reduces the size of the exception table by 50%.
Static Image header fields can no longer rely on 64-bit relocations to be populated by the linker at build time. Instead, we need to split them into 32-bit halves.
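
The relocation processing described above can be modelled in a few lines of Python. This is a sketch under simplifying assumptions: it handles only R_AARCH64_RELATIVE entries, treats the image as a flat byte buffer, and uses hypothetical addresses; the kernel does this in its early startup code, not in anything resembling this function.

```python
import struct

R_AARCH64_RELATIVE = 1027              # relocation type from the AArch64 ELF ABI
RELA_ENTRY = struct.Struct("<QQq")     # r_offset, r_info, r_addend (24 bytes)
MASK64 = (1 << 64) - 1

def apply_relative_relocs(image, rela, link_base, load_base):
    """Patch every R_AARCH64_RELATIVE target with addend + displacement."""
    displacement = load_base - link_base
    for r_offset, r_info, r_addend in RELA_ENTRY.iter_unpack(rela):
        if r_info & 0xFFFFFFFF != R_AARCH64_RELATIVE:
            continue  # other relocation types are ignored in this sketch
        place = r_offset - link_base   # position of the target in the buffer
        value = (r_addend + displacement) & MASK64
        struct.pack_into("<Q", image, place, value)

# One pointer at link address 0x1008 that should refer to address 0x1000.
rela = RELA_ENTRY.pack(0x1008, R_AARCH64_RELATIVE, 0x1000)

unmoved = bytearray(16)                # target slot starts out as zeroes
apply_relative_relocs(unmoved, rela, link_base=0x1000, load_base=0x1000)

moved = bytearray(16)
apply_relative_relocs(moved, rela, link_base=0x1000, load_base=0x5000)
```

The unmoved case illustrates the first downside: since the addend lives in the entry and the target slot is zero until it is patched, the relocations must be applied even when the displacement is zero.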

Since the .rela section can grow fairly big, an additional optimization has been implemented that turns the kallsyms symbol table into a relative table as well. This saves a 24 byte RELA entry per kernel symbol, which adds up to around 1.5 MB for an arm64 defconfig kernel. Due to the obvious benefit, this optimization was enabled by default for all architectures except IA-64 and Tile in 64-bit mode (whose address space is too sparse to support this feature).

With the enhancement mentioned above, a 48-bit VA kernel (the default for arm64 defconfig) can reside at any 64 KB offset in the first half of the vmalloc space, which means the addresses allow for 30 bits of entropy to be used in randomization.
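
The 30 bit figure follows from simple arithmetic, sketched below. The sketch assumes that the kernel half of a 48-bit VA configuration spans 2^48 bytes and approximates the vmalloc space as half of that; both are simplifications, and the helper name entropy_bits is made up.

```python
def entropy_bits(window_bytes, alignment_bytes):
    """Number of whole random bits available when picking an aligned
    offset inside a window of the given size."""
    slots = window_bytes // alignment_bytes
    return slots.bit_length() - 1      # floor(log2(slots))

VA_BITS = 48
kernel_va_space = 1 << VA_BITS         # assumption: size of the kernel half
vmalloc_space = kernel_va_space // 2   # assumption: roughly half of it
window = vmalloc_space // 2            # first half of the vmalloc space

virt_bits = entropy_bits(window, 64 * 1024)
print(virt_bits)
```

In other words, a 2^46 byte window divided into 64 KB slots leaves 2^30 candidate positions.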

Virtual randomization of the module region

To prevent modules leaking the virtual address of core kernel data structures, the module region can be randomized fully independently from the core kernel. To this end, a 128 MB virtual region is chosen at boot time, and all module allocations are served from this area. Since the modules and the core kernel are likely to be loaded far away from each other (more than 128 MB, which is the maximum range of relative jump instructions), we also need to implement support for module PLTs, which contain veneers (i.e., trampolines) to bridge the distance between the jump instructions and their targets. Since module PLTs may have a performance impact, it is also possible to choose the module region such that it intersects the .text section of the core kernel, so that jumps via PLT veneers are only required in the unlikely event that the module region runs out of space.
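
Whether a given call needs a veneer comes down to a range check on the branch displacement. The sketch below assumes the AArch64 B/BL encoding, which holds a signed 26-bit word offset and therefore reaches from -128 MB up to just under +128 MB; the function name and addresses are hypothetical.

```python
BRANCH_RANGE = 1 << 27   # 128 MB, from a signed 26-bit offset in units of 4 bytes

def needs_veneer(branch_address, target_address):
    """True if a direct B/BL at branch_address cannot reach target_address."""
    offset = target_address - branch_address
    return not (-BRANCH_RANGE <= offset < BRANCH_RANGE)

module_text = 0x1000_0000
near_target = module_text + (64 << 20)    # 64 MB away: direct branch works
far_target = module_text + (512 << 20)    # 512 MB away: must go via a PLT entry

print(needs_veneer(module_text, near_target),
      needs_veneer(module_text, far_target))
```

Choosing the module region so that it overlaps the core kernel's .text keeps most calls on the first path.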

Virtual randomization of the linear region

The linear mapping covers all RAM pages in the order that they appear in the physical address space. Since the virtual area reserved for the linear mapping is typically much larger than the actual physical footprint of RAM (i.e., the distance between the first and the last usable RAM pages, including all holes between them), the placement of those pages inside the linear mapping can be randomized as well. This will make heap allocations via the linear mapping (i.e., kmalloc()) less predictable. Since there is a performance concern associated with the relative alignment between physical and virtual mappings (e.g., on 4 KB pages, RAM will be mapped at 1 GB granularity if the virtual and physical addresses modulo 1 GB are equal), this randomization is coarse grained, but still an improvement over a fully deterministic one. (The size of the linear region is typically at least 256 GB.)
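
A coarse grained placement of this kind can be sketched as picking one of a limited number of 1 GB slots. This is not the kernel's actual algorithm: the function name, the way the seed is reduced to a slot index and the example sizes are all assumptions for illustration.

```python
GIB = 1 << 30

def pick_linear_offset(seed, linear_size, ram_footprint, granule=GIB):
    """Pick an offset for RAM inside the linear region, in whole granules
    so that the relative alignment of physical and virtual addresses
    is preserved."""
    slots = (linear_size - ram_footprint) // granule
    if slots <= 0:
        return 0                       # no room to randomize
    return (seed % slots) * granule

# 4 GB of RAM placed inside a 256 GB linear region.
offset = pick_linear_offset(seed=12345, linear_size=256 * GIB,
                            ram_footprint=4 * GIB)
print(offset // GIB)
```

With these numbers there are only 252 possible slots, which is under 8 bits of entropy, but it is still better than a fixed placement.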

How to enable it

Randomization requires a good source of entropy, and arm64 does not have an architected means of obtaining entropy (e.g., via an instruction), nor does its early execution environment have access to platform specific peripherals that can supply such entropy. This means it is left to the bootloader to generate a KASLR seed, and pass it to the core kernel via the /chosen/kaslr-seed DT property.
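
As a small illustration of the bootloader side, the sketch below produces the raw bytes of a 64-bit kaslr-seed property. Device tree cells are stored big-endian, hence the byte order; Python's secrets module merely stands in for whatever entropy source the bootloader really has, and the function name is hypothetical.

```python
import secrets
import struct

def make_kaslr_seed_property(seed=None):
    """Encode a 64-bit seed as the raw value of /chosen/kaslr-seed."""
    if seed is None:
        # Stand-in for a firmware entropy source.
        seed = secrets.randbits(64)
    return struct.pack(">Q", seed)     # device tree data is big-endian

fixed = make_kaslr_seed_property(0x0123456789ABCDEF)
fresh = make_kaslr_seed_property()
print(fixed.hex(), len(fresh))
```

A seed of zero, or a missing property, leaves the kernel without entropy, so a real loader should make sure its source actually delivers random bits.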

For platforms that boot via UEFI, the UEFI stub in the arm64 kernel will attempt to locate the EFI_RNG_PROTOCOL, and invoke it to supply a kaslr-seed. On top of that, it will use this protocol to randomize the physical load address of the kernel Image.
QEMU in UEFI mode supports this protocol if the virtio-rng-pci device is made available. Bare metal platforms like the Celloboard or QDF2432 implement this protocol natively as well.

To enable the KASLR feature, the kernel needs to be built with CONFIG_RANDOMIZE_BASE=y.