Upstream support for AMD Overdrive in EDK2

Mainline EDK2 used to carry support for a number of ARM development platforms, such as TC2 and Juno (both of which are based on Versatile Express). These have been moved to OpenPlatformPkg, a separate platforms tree that is intended to complement the EDK2 mainline tree, and carries support for a number of platforms based on the ARM architecture (although non-ARM platforms are more than welcome as well).

Recently, EDK2 has gone back to only supporting various emulators (custom emulators built for Windows or X11, but also the QEMU system emulator in x86 or ARM mode) in its mainline tree, but the intention is to merge the entirety of OpenPlatformPkg back into EDK2 once a reorganization of the directory structure is completed. Until then, OpenPlatformPkg can be considered ‘upstream’ for all intents and purposes, as far as bare metal ARM platforms are concerned.

Upstream support for AMD Overdrive in EDK2

AMD is widely recognized for its efforts in open source, and as one of the founding members of the Linaro Enterprise Group (LEG), it has put its weight behind the work Linaro is doing to improve support for ARMv8 based servers in the enterprise.

As part of this effort, the UEFI engineers in LEG have been collaborating with AMD engineers to get support for AMD’s Overdrive platform into the EDK2 upstream. Due to its similarity to Overdrive, UEFI support for the upcoming LeMaker Celloboard is now public as well.

Special sauce

Unlike the Linux kernel community, which has a strict, GPL-based open source policy, the EDK2 community is lax about mixing open and closed source modules, and the fact that the EDK2 upstream by itself only runs on emulators attests to that. One way to combine the open source core components of EDK2 with closed source special sauce is to mix sources and binaries at the module level.

Binary modules in EDK2

AmdSataInitLib.inf is a good example: it shows how a static library that was built separately from the platform is included as a binary module in the Overdrive build. The Binaries section appears in place of the usual Sources section, and contains the static library that makes up the module. (The .lib file in question was simply taken from a build that includes the module in source form, i.e., a .inf file listing the various sources in a Sources section, and a .dsc file that lists the .inf in a Components section.) The trailing asterisk means that the same file is used for both DEBUG and RELEASE builds.

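The original snippet is not reproduced here, but such a binary library .inf looks roughly like the sketch below; the GUID, file name and PCD name are placeholders, not the actual contents of AmdSataInitLib.inf:

    [Defines]
      INF_VERSION    = 0x00010019
      BASE_NAME      = AmdSataInitLib
      FILE_GUID      = 12345678-1234-1234-1234-1234567890ab   # placeholder
      MODULE_TYPE    = BASE
      VERSION_STRING = 1.0
      LIBRARY_CLASS  = AmdSataInitLib

    [Binaries.AARCH64]
      # <type>|<file>|<target>: the trailing '*' selects this .lib for both
      # DEBUG and RELEASE builds
      LIB|AmdSataInit.lib|*

    [FixedPcd]
      # every fixed PCD the prebuilt .lib refers to must be listed here
      # (placeholder name)
      gAmdTokenSpaceGuid.PcdExampleSataPortCount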


Note the FixedPcd section: a static EDK2 library will contain symbol references to the exact name/type combinations of these PCDs, and so it is recommended to use a strict match here (FixedPcd rather than Pcd).

In a similar way, complete PEI or DXE phase PE/COFF executables can be distributed in binary form as well, with the caveat that dynamic PCDs should be avoided (they simply don’t work).

Gionb.inf is another interesting example.

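Again, the original snippet is not reproduced here; a binary PEIM .inf of this kind looks roughly as follows, with all names, values and offsets being placeholders:

    [Defines]
      INF_VERSION    = 0x00010019
      BASE_NAME      = Gionb
      FILE_GUID      = abcdef01-2345-6789-abcd-ef0123456789   # placeholder
      MODULE_TYPE    = PEIM

    [Binaries.AARCH64]
      PE32|Gionb.efi|*
      # dispatch dependencies, produced by the same build that produced the .efi
      PEI_DEPEX|Gionb.depex|*

    [PatchPcd]
      # <PcdName>|<default value>|<offset of the value inside the .efi>
      gAmdTokenSpaceGuid.PcdExamplePcieCoreConfig|0x00000000|0x1240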


The first thing that stands out is the PEI_DEPEX line. The .depex file it refers to was taken from the same build that produced the .efi file, and is required by the PEI dispatcher to decide when the Gionb PEI module can be dispatched.

What is especially interesting about this module is the non-standard looking PCD references in the PatchPcd section. It lists the patchable PCDs that are referenced by the module, their default values, and their offsets into the binary (the .efi file from the Binaries section). If this module is incorporated into a platform .dsc that uses different values for these PCDs, the EDK2 build system will patch the desired values into a copy of the binary before incorporating it into the final firmware image. This is an especially powerful feature that allows us to share the Gionb module, which performs the PCIe link training, between the Overdrive and Cello platforms, which have different PCIe slot configurations.

In addition to AmdSataInitLib and Gionb, there are a few other modules that are distributed as binaries: IscpPei and IscpDxe, which produce the protocols to communicate with the SCP, and SnpDxePort0 and SnpDxePort1, which drive the two 10GigE ports.

Pre-UEFI firmware

The Overdrive platform is based on the AMD Seattle SOC, which combines a 32-bit ARM Cortex-A5 based System Control Processor (SCP) with up to eight 64-bit Cortex-A57 cores. The firmware that runs on the A5, and the secure world (EL3) firmware that runs on the A57s, have not been published as source code, and are incorporated into the firmware image as a single binary blob. This means that only the code that executes in the same context as UEFI (EL2) has been released (modulo the binary modules mentioned above).

Call for collaboration

Apart from the pieces described above, the Overdrive UEFI firmware is completely open, and can be built and studied by anyone who is interested. This means anyone can upgrade their EDK2 core components if they want to enable things like new hardening features or HTTP boot. It also means people can contribute improvements and enhancements to the existing platform. One thing that is particularly high on my wish list is support for the Overdrive/Cello SD slot, which is simply an SD slot wired to an ARM standard PL022 SPI controller (and the Linux kernel already supports it). If anyone is interested in contributing that, please contact me with a proposal, and I will try to arrange support for it.




Booting a big-endian kernel from UEFI

One recurring question I get regarding UEFI on ARM systems is when we will introduce support for booting big-endian kernels. If you think of UEFI as simply a bootloader, this sounds like a reasonable question, but when you take a closer look, this is actually much more complicated than it sounds.

UEFI is a specification, not an implementation

UEFI originated in the Intel world, which is little-endian only. This means that from the specification side, no attention or effort whatsoever has been spent on making the interfaces, data structures and other software visible objects deal with endianness. Also, the PE/COFF executable format that UEFI heavily relies on does not take endianness into account at all.

This means that it is impossible to recompile a UEFI implementation in big-endian mode, and still adhere to the specification. Whether you could get away with it in practice is irrelevant, since the reason we like UEFI is the fact that it is a specification, not an implementation, and every UEFI compliant OS should be able to interact with every UEFI compliant firmware (provided that they were built for the same architecture).

One possible approach could be to introduce BE-AArch64 as a completely new architecture both in PE/COFF and in UEFI, but that would result in BE firmwares that can only boot BE kernels, which does not sound that appealing either.

Running a big-endian OS on little-endian firmware

So if building a big-endian UEFI firmware is out of the question, can we boot a big-endian kernel from a little-endian UEFI? Again, if you think of UEFI as a bootloader with a single handover point to the OS, this does not sound unreasonable, but there are still a couple of concerns.

  1. UEFI exposes firmware tables to the OS, such as the System Table, the memory map and other data structures containing multibyte quantities that need to be endian swabbed before consumption. In Linux, none of the existing code that handles these tables takes endianness into account.
  2. The UEFI stub in Linux makes the kernel executable pose as a PE/COFF binary, and the UEFI stub code is called in the execution context of the firmware. In order to support big-endian kernels, we would have to build some objects in LE mode, some in BE mode, and objects that are shared between the stub and the kernel proper would need to be built twice. It is unlikely that the ARM and arm64 Linux maintainers will be eager to adopt such changes, since they complicate the build system configuration considerably.
  3. Invoking UEFI Runtime Services will require an endianness switch at EL1. This involves endian swabbing the in-memory representation of by-reference arguments, but this is the easy part. The hard part is taking exceptions, not only faults, but interrupts as well (since v4.6, UEFI runtime services execute with interrupts enabled). None of the exception handling machinery is set up to deal with exceptions raised in the wrong endianness, and complicating those code paths to deal with this is unlikely to be upstreamable.

A standalone stub

If we assume that point #1 above is something that we can address and upstream, either by making the code deal with endianness, or by disabling some UEFI related features when building a BE kernel, and if we deal with point #3 above by accepting the fact that such a kernel will not have access to UEFI Runtime Services, we still need to address point #2.

Since the UEFI stub executes in the context of the firmware, while the kernel proper executes in its own context, there is a handover protocol that is described in Documentation/arm/uefi.txt in the Linux kernel tree. This handover protocol basically comes down to populating some properties under the /chosen DT node with a description of the firmware context, and there is no reason we cannot implement the same handover protocol in a separate UEFI OS loader application.

So what we need to support BE boot under UEFI is a standalone stub. This UEFI application should load the kernel image at an appropriate location in system memory, populate the DT /chosen node with the kernel command line, the location of an initrd (if any), and information about the location of the UEFI system table and the UEFI memory map. Then it can branch straight into the core kernel entry point, and boot the BE kernel with full access to UEFI features (to the extent that they were made endianness agnostic).
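
As a rough illustration of the handover, the sketch below (plain C using libfdt; property names as documented in Documentation/arm/uefi.txt, error handling mostly omitted) shows the UEFI specific /chosen properties such a stub would populate before branching to the kernel; the kernel command line (bootargs) and initrd properties would be set in the same way:

    #include <stdint.h>
    #include <libfdt.h>

    /* Populate the UEFI handover properties under /chosen, as described in
     * Documentation/arm/uefi.txt. Returns 0 or a libfdt error code. */
    int populate_uefi_chosen_node(void *fdt, uint64_t system_table,
                                  uint64_t mmap_start, uint32_t mmap_size,
                                  uint32_t desc_size, uint32_t desc_ver)
    {
        int chosen = fdt_path_offset(fdt, "/chosen");

        if (chosen < 0)
            chosen = fdt_add_subnode(fdt, 0, "chosen");
        if (chosen < 0)
            return chosen;

        fdt_setprop_u64(fdt, chosen, "linux,uefi-system-table", system_table);
        fdt_setprop_u64(fdt, chosen, "linux,uefi-mmap-start", mmap_start);
        fdt_setprop_u32(fdt, chosen, "linux,uefi-mmap-size", mmap_size);
        fdt_setprop_u32(fdt, chosen, "linux,uefi-mmap-desc-size", desc_size);
        fdt_setprop_u32(fdt, chosen, "linux,uefi-mmap-desc-ver", desc_ver);
        return 0;
    }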

If anyone is interested in implementing this, and needs a hand, don’t hesitate to contact me.



Memory protection in UEFI

One of the most important principles of secure system design is distinguishing between code and data, where ‘code’ means sequences of CPU instructions, and ‘data’ means the data manipulated by those instructions. In some cases, ‘data’ is promoted to ‘code’ in a program, for instance by a shared library loader or a JIT, but in most cases, they are completely disjoint, and a program that manipulates its own code as if it were data is misbehaving, either due to a bug or due to the fact that it is under attack.

The typical approach to address this class of attacks is to use permission attributes in the page tables, on the one hand to prevent a program from manipulating its own code, and to prevent it from executing its data on the other. This is usually referred to as W^X, i.e., the permission attributes of any memory region belonging to a program may either have the writable attribute, or the executable attribute, but never both (W xor X).

UEFI implementations typically map all of memory as both writable and executable, both during boot and at runtime. This makes UEFI vulnerable to this kind of attack, especially in the memory regions that are retained by the OS at runtime.

Runtime memory protection in UEFI

Booting via UEFI consists of two distinct phases, the boot phase and the runtime phase. During the boot phase, the UEFI firmware owns the system, i.e., the interrupt controller, the MMU and all other core resources and devices. Once an OS loader calls the ExitBootServices() boot service, the UEFI firmware relinquishes ownership to the OS.

This means that, if we want to apply the W^X principle to UEFI runtime services regions (the memory regions that contain the code and data that implement the firmware services that UEFI exposes to the OS), the firmware needs to tell the OS which attributes it can use when mapping those regions into its address space. For this purpose, version 2.6 of the UEFI specification introduces a new configuration table, the Memory Attributes Table, which breaks down each RuntimeServicesCode and RuntimeServicesData region in the UEFI memory map into sub-regions that can be mapped with strict permissions. (Note that, while RuntimeServicesData regions contain strictly data, RuntimeServicesCode regions describe PE/COFF executables in memory, which consist of both code and data, and so the latter cannot simply be mapped with R-X attributes.)
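
For illustration, here is a rough sketch of how an OS-side consumer could walk this table and derive the permissions for each sub-region (EDK2-style C; the MapRegion() helper is hypothetical and stands in for the OS's own page table code):

    #include <Uefi.h>
    #include <Guid/MemoryAttributesTable.h>

    //
    // Hypothetical helper that (re)maps a physical range with the given
    // read-only/non-executable attributes.
    //
    VOID MapRegion (EFI_PHYSICAL_ADDRESS Base, UINT64 Size, BOOLEAN ReadOnly, BOOLEAN NoExec);

    VOID
    ApplyRuntimePermissions (
      IN EFI_MEMORY_ATTRIBUTES_TABLE  *Table
      )
    {
      EFI_MEMORY_DESCRIPTOR  *Desc;
      UINT32                 Index;

      Desc = (EFI_MEMORY_DESCRIPTOR *)(Table + 1);
      for (Index = 0; Index < Table->NumberOfEntries; Index++) {
        //
        // EFI_MEMORY_RO means the sub-region may be mapped read-only,
        // EFI_MEMORY_XP means it may be mapped non-executable.
        //
        MapRegion (Desc->PhysicalStart,
                   Desc->NumberOfPages * EFI_PAGE_SIZE,
                   (BOOLEAN)((Desc->Attribute & EFI_MEMORY_RO) != 0),
                   (BOOLEAN)((Desc->Attribute & EFI_MEMORY_XP) != 0));
        Desc = (EFI_MEMORY_DESCRIPTOR *)((UINT8 *)Desc + Table->DescriptorSize);
      }
    }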

In Linux on ARM and arm64, as an additional layer of protection, the page tables that describe the UEFI runtime services regions are only live when necessary, which is during the time that a UEFI runtime service call is in progress. At all other times, the regions are left unmapped.

Support for the memory attributes table in the ARM and arm64 ports of Linux is queued for the v4.7 release. The x86 implementation is currently in development.

Boot time memory protection in UEFI

At boot time, it is up to UEFI itself to manage the permission attributes of its page tables. Unfortunately, most (all?) implementations based on EDK2/Tianocore simply map all of memory as both writable and executable, and the only enhancement that was made recently in this area is to map the stack of the boot CPU non-executable during the DXE phase.

As a proof of concept, I implemented strict memory protections for ArmVirtQemu, the UEFI build for the QEMU AArch64 mach-virt platform, which maps all of memory non-executable, and remaps code regions read-only/executable when required. Since EDK2 heavily relies on PE/COFF internally, this is simply a matter of using existing hooks in the PE/COFF loader to set the permission bits according to the section attributes in the PE/COFF header.
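
A condensed sketch of the idea (EDK2-style C; SetCodePermissions() and SetDataPermissions() are hypothetical stand-ins for the page table manipulation routines):

    #include <Uefi.h>
    #include <IndustryStandard/PeImage.h>

    //
    // Hypothetical helpers that update the page table attributes of a region.
    //
    VOID SetCodePermissions (EFI_PHYSICAL_ADDRESS Base, UINT64 Size);  // read-only + executable
    VOID SetDataPermissions (EFI_PHYSICAL_ADDRESS Base, UINT64 Size);  // writable + non-executable

    VOID
    ProtectImageSection (
      IN EFI_PHYSICAL_ADDRESS      ImageBase,
      IN EFI_IMAGE_SECTION_HEADER  *Section
      )
    {
      EFI_PHYSICAL_ADDRESS  Start;
      UINT64                Size;

      //
      // Assumes the image was built with 4 KB section alignment, so that
      // each section starts and ends on a page boundary.
      //
      Start = ImageBase + Section->VirtualAddress;
      Size  = ALIGN_VALUE (Section->Misc.VirtualSize, EFI_PAGE_SIZE);

      if ((Section->Characteristics & EFI_IMAGE_SCN_MEM_EXECUTE) != 0) {
        SetCodePermissions (Start, Size);
      } else {
        SetDataPermissions (Start, Size);
      }
    }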

Since such permissions can only be applied at page granularity, it does require that we increase the PE/COFF section alignment to 4 KB. Since most of the PE/COFF executables that make up the firmware live in a compressed firmware volume, this does not affect the memory footprint of the boot image significantly, but it is something to take into account when porting this to a bare metal platform with limited flash space.

With the above changes in place, we can update the default attributes used for the 1:1 mapping of system memory to include the XN bits, completing our W^X implementation for ArmVirtQemu.



KASLR in the arm64 Linux kernel

Kernel Address Space Layout Randomization (KASLR) is a hardening feature that aims to make it more difficult to exploit vulnerabilities in the kernel, by placing the kernel and its data structures at random addresses at each boot. The Linux kernel currently implements this feature for 32-bit and 64-bit x86, and an implementation for the 64-bit ARM architecture (arm64) is queued for the v4.6 release, which is due in a couple of weeks.

For the arm64 implementation, the kernel address space layout is randomized in the following ways:

  • loading the core kernel at a random physical address
  • mapping the core kernel at a random virtual address in the vmalloc area
  • loading kernel modules at a random virtual address in the vmalloc area
  • mapping system memory at a random virtual address in the linear area

Physical address randomization

Since the physical address at which the kernel executes is decided strictly by the bootloader (or, on UEFI systems, by the UEFI stub), and not by the kernel itself, implementing physical address randomization primarily consists of removing assumptions in the core kernel that it has been loaded at the base of physical memory. Since the kernel text mapping, the virtual (linear) mapping of RAM and the physical placement of the kernel were all tightly coupled, the first step is to decouple them, and move the kernel into the vmalloc region. Once that is done, the bootloader is free to choose any area in physical RAM to place the kernel at boot.

Note that this move of the kernel VA space into the vmalloc region is by far the most intrusive change in the KASLR patch set, and some other patch sets that were under review at the same time required non-trivial rework to remain compatible with the new VA layout configuration.

For v4.7, some enhancement work has been queued to relax the alignment requirement of the core kernel from ‘2 MB aligned base + 512 KB’ to any 64 KB aligned physical offset. The actual number of random bits in the physical address of the kernel depends on the size of system memory, but for a system with 4 GB of RAM, it adds up to around 15 bits (4 GB divided into 64 KB slots gives at most 16 bits, some of which are lost to the kernel’s own footprint and regions that are off limits).

Virtual randomization of the core kernel

The virtual address the core kernel executes at is typically fixed, and thus the kernel binary is a non-relocatable binary where all memory addresses are calculated and emitted into the executable image at build time. With virtual randomization, these memory addresses need to be recalculated at runtime, and updated inside the running image. This means the kernel binary needs to be converted into a relocatable binary, and one that is able to relocate itself (in the absence of a loader such as the one the OS uses to load shared libraries). When the random virtual mapping is created at early boot time, the self relocation routines take this random virtual offset into account when applying the relocation fixups, after which the kernel is able to execute from this random virtual address.

The above is supported by the standard binutils toolchain. By linking ordinary (non-PIC) small model code (i.e., relative symbol references with a +/- 4 GB range) in PIE mode, we end up with a binary that has a .rela section consisting of standard ELF64 RELA entries, which are processed by the early startup code.
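
In simplified C, and assuming only R_AARCH64_RELATIVE entries need to be processed (which is what a PIE-linked image mostly contains; the real fixup loop lives in the early assembly startup code), the relocation pass looks roughly like this:

    #include <elf.h>
    #include <stdint.h>

    /*
     * Apply RELA relocations to an image that runs 'offset' bytes away
     * from the address it was linked at.
     */
    static void apply_relocations(Elf64_Rela *rela, Elf64_Rela *rela_end,
                                  uint64_t offset)
    {
        for (; rela < rela_end; rela++) {
            if (ELF64_R_TYPE(rela->r_info) != R_AARCH64_RELATIVE)
                continue;
            /*
             * RELA keeps the addend in the entry itself; the target location
             * starts out zero-filled and receives the relocated value here.
             */
            *(uint64_t *)(rela->r_offset + offset) = rela->r_addend + offset;
        }
    }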

The RELA relocation format keeps the addend in the relocation entry rather than in the memory location that the relocation targets, and for R_AARCH64_ABS64 relocations, this memory location itself is filled with zeroes until the relocation code executes. This has a couple of downsides:

  • The executable image needs to be relocated to its runtime address even if this address is equal to the link time address.
  • The EXTABLE entries cannot be sorted at build time. This was addressed by switching to relative EXTABLE entries, which, as a bonus, reduces the size of the exception table by 50%.
  • Static Image header fields can no longer rely on 64-bit relocations to be populated by the linker at build time. Instead, we need to split them into 32-bit halves.

Since the .rela section can grow fairly big, an additional optimization has been implemented that turns the kallsyms symbol table into a relative table as well. This saves a 24 byte RELA entry per kernel symbol, which adds up to around 1.5 MB for an arm64 defconfig kernel. Due to the obvious benefit, this optimization was enabled by default for all architectures except IA-64 and Tile in 64-bit mode (whose address spaces are too sparse to support this feature).

With the enhancement mentioned above, a 48-bit VA kernel (the default for arm64 defconfig) can reside at any 64 KB aligned offset in the first half of the vmalloc space, which allows for around 30 bits of entropy in the virtual address of the kernel.

Virtual randomization of the module region

To prevent modules from leaking the virtual addresses of core kernel data structures, the module region can be randomized fully independently of the core kernel. To this end, a 128 MB virtual region is chosen at boot time, and all module allocations are served from this area. Since the modules and the core kernel are likely to be loaded far away from each other (more than 128 MB apart, which is the maximum range of relative branch instructions), we also need to implement support for module PLTs, which contain veneers (i.e., trampolines) to bridge the distance between the branch instructions and their targets. Since module PLTs may have a performance impact, it is also possible to choose the module region such that it intersects the .text section of the core kernel, so that jumps via PLT veneers are only required in the unlikely event that the module region runs out of space.
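
Conceptually, each PLT veneer is a tiny four-instruction trampoline; a sketch of the slot layout (loosely modeled on the arm64 module PLT support, with the instructions shown as comments rather than encoded):

    #include <stdint.h>

    /*
     * One module PLT slot: load the full 48-bit branch target into x16 and
     * jump to it, bridging distances an ordinary +/- 128 MB branch cannot.
     */
    struct plt_entry {
        uint32_t mov0;   /* movz x16, <target[15:0]>            */
        uint32_t mov1;   /* movk x16, <target[31:16]>, lsl #16  */
        uint32_t mov2;   /* movk x16, <target[47:32]>, lsl #32  */
        uint32_t br;     /* br   x16                            */
    };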

Virtual randomization of the linear region

The linear mapping covers all RAM pages in the order that they appear in the physical address space. Since the virtual area reserved for the linear mapping is typically much larger than the actual physical footprint of RAM (i.e., the distance between the first and the last usable RAM pages, including all holes between them), the placement of those pages inside the linear mapping can be randomized as well. This makes heap allocations served from the linear mapping (i.e., kmalloc()) less predictable. Since there is a performance concern associated with the relative alignment between physical and virtual mappings (e.g., with 4 KB pages, RAM can be mapped using 1 GB blocks only if the virtual and physical addresses are equal modulo 1 GB), this randomization is coarse grained, but still an improvement over a fully deterministic placement. (The size of the linear region is typically at least 256 GB.)

How to enable it

Randomization requires a good source of entropy, and arm64 does not have an architected means of obtaining entropy (e.g., via an instruction), nor does its early execution environment have access to platform specific peripherals that can supply such entropy. This means it is left to the bootloader to generate a KASLR seed, and pass it to the core kernel via the /chosen/kaslr-seed DT property.

For platforms that boot via UEFI, the UEFI stub in the arm64 kernel will attempt to locate the EFI_RNG_PROTOCOL, and invoke it to supply a kaslr-seed. On top of that, it will use this protocol to randomize the physical load address of the kernel Image.
QEMU in UEFI mode supports this protocol if the virtio-rng-pci device is made available. Bare metal platforms like the Celloboard or QDF2432 implement this protocol natively as well.
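
As a rough sketch of the firmware-side interaction (EDK2-style C, error handling trimmed), obtaining such a seed from the RNG protocol comes down to the following:

    #include <Uefi.h>
    #include <Library/UefiBootServicesTableLib.h>
    #include <Protocol/Rng.h>

    EFI_STATUS
    GetKaslrSeed (
      OUT UINT64  *Seed
      )
    {
      EFI_RNG_PROTOCOL  *Rng;
      EFI_STATUS        Status;

      Status = gBS->LocateProtocol (&gEfiRngProtocolGuid, NULL, (VOID **)&Rng);
      if (EFI_ERROR (Status)) {
        return Status;    // no RNG protocol: boot proceeds without randomization
      }
      //
      // Passing NULL selects whatever default algorithm the implementation offers.
      //
      return Rng->GetRNG (Rng, NULL, sizeof (*Seed), (UINT8 *)Seed);
    }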

To enable the KASLR feature, the kernel needs to be built with CONFIG_RANDOMIZE_BASE=y.