Due to the way the stack of a thread (or task in kernelspeak) is shared between control flow data (frame pointer, return address, caller saved registers) and temporary buffers, overflowing such buffers can completely subvert the control flow of a program, and the stack is therefore a primary target for attacks. Such attacks are referred to as Return Oriented Programming (ROP), and typically consist of a specially crafted array of forged stack frames, where each return from a function is directed at another piece of code (called a gadget) that is already present in the program. By piecing together gadgets like this, powerful attacks can be mounted, especially in a big program such as the kernel where the supply of gadgets is endless.
One way to mitigate such attacks is the use of stack canaries, which are known values that are placed inside each stack frame when entering a function, and checked again when leaving the function. This forces the attacker to craft their buffer overflow attack in a way that puts the correct stack canary value inside each stack frame. That by itself is rather trivial, but it does require the attacker to discover the value first.
GCC implements support for stack canaries, which can be enabled using the various -fstack-protector[-xxx] command line switches. When enabled, each function prologue will store the value of the global variable __stack_chk_guard inside the stack frame, and each epilogue will read the value back and compare it, and branch to the function __stack_chk_fail if the comparison fails.
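As a rough illustration of the prologue/epilogue logic described above, here is a minimal C model. It is not what GCC actually emits: the frame layout, the names model_stack_chk_guard, struct model_frame and guarded_copy, and the fixed guard value are all assumptions made for the sketch, and a return code stands in for the call to __stack_chk_fail.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for the global guard variable. The name differs from the real
 * __stack_chk_guard so this model does not clash with the C library. */
uintptr_t model_stack_chk_guard = 0x5ca1ab1e;

/* Hypothetical frame layout: the buffer sits below the canary, so a
 * linear overflow of buf has to cross the canary first. */
struct model_frame {
    char      buf[16];
    uintptr_t canary;
};

/* Returns 0 if the canary survived, and -1 at the point where compiled
 * code would branch to __stack_chk_fail(). */
int guarded_copy(const void *src, size_t len)
{
    struct model_frame frame;

    /* "Prologue": place the known value inside the frame. */
    frame.canary = model_stack_chk_guard;

    /* Function body: a copy that is not checked against sizeof(frame.buf).
     * It is clamped to the whole model frame to keep the demo well defined. */
    if (len > sizeof(frame))
        len = sizeof(frame);
    memcpy(&frame, src, len);

    /* "Epilogue": compare the stored value against the guard again. */
    return frame.canary == model_stack_chk_guard ? 0 : -1;
}
```

A copy that fits in buf leaves the canary alone and returns 0, while a copy that runs past buf overwrites the canary and is reported as -1.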
This works fine for user programs, with the caveat that all threads will use the same value for the stack canary. However, each program will pick a random value at program start, and so this is not a severe limitation. Similarly, for uniprocessor (UP) kernels, where only a single task will be active at the same time, we can simply update the value of the __stack_chk_guard variable when switching from one task to the next, and so each task can have its own unique value.
However, on SMP kernels, this model breaks down. Each CPU will be running a different task, and so any combination of tasks could be active at the same time. Since each will refer to __stack_chk_guard directly, its value cannot be changed until all tasks have exited, which only occurs at a reboot. Given that servers don’t usually reboot that often, leaking the global stack canary value can seriously compromise security of a running system, as the attacker only has to discover it once.
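The difference between the UP and SMP cases can be sketched with a small C model. Everything here is hypothetical (struct model_task, model_guard, model_switch_to and the prologue/epilogue helpers are invented for the illustration), and the "CPUs" are just successive calls rather than real concurrency.

```c
#include <stdint.h>

/* Hypothetical task descriptor holding a per-task canary value. */
struct model_task {
    uintptr_t stack_canary;
};

/* The single global that all compiled code would refer to. */
uintptr_t model_guard;

/* Context switch as a UP kernel could do it: overwrite the global. */
void model_switch_to(const struct model_task *next)
{
    model_guard = next->stack_canary;
}

/* Prologue: returns the value that would be stored in the stack frame. */
uintptr_t model_prologue(void)
{
    return model_guard;
}

/* Epilogue: nonzero if the saved value still matches the global. */
int model_epilogue(uintptr_t saved)
{
    return saved == model_guard;
}
```

On a single CPU, a task's prologue and epilogue both run between two switches, so the check passes. If a second CPU calls model_switch_to() for a different task while the first task is still inside a function, the first task's epilogue compares against the wrong value, which is why the global cannot be updated per task on SMP.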
x86: per-CPU variables
To work around this issue, Linux/x86 implements support for stack canaries using the existing Thread-local Storage (TLS) support in GCC, which replaces the reference to __stack_chk_guard with a reference to a fixed offset in the TLS block. This means each CPU has its own copy, which is set to the stack canary value of that CPU’s current task when it switches to it. When the task migrates, it just takes its stack canary value along, and so all tasks can use a unique value. Problem solved.
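The per-CPU idea can be modelled in a few lines of C. This is only a sketch under stated assumptions: an array indexed by a CPU number stands in for the per-CPU storage, and the names MODEL_NR_CPUS, model_percpu_guard and model_switch_to are made up for the example rather than taken from the x86 code.

```c
#include <stdint.h>

#define MODEL_NR_CPUS 2

/* Hypothetical task descriptor holding a per-task canary value. */
struct model_task {
    uintptr_t stack_canary;
};

/* One guard slot per CPU, standing in for the per-CPU copy. */
uintptr_t model_percpu_guard[MODEL_NR_CPUS];

/* Context switch: the CPU's slot takes the incoming task's canary. */
void model_switch_to(int cpu, const struct model_task *next)
{
    model_percpu_guard[cpu] = next->stack_canary;
}

/* Prologue: value read from the slot of the CPU the task runs on. */
uintptr_t model_prologue(int cpu)
{
    return model_percpu_guard[cpu];
}

/* Epilogue: nonzero if the saved value matches the current CPU's slot. */
int model_epilogue(int cpu, uintptr_t saved)
{
    return saved == model_percpu_guard[cpu];
}
```

Because the slot is rewritten on every switch, a task that enters a function on CPU 0 and leaves it on CPU 1 still sees its own value, and two tasks running side by side never share one.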
On arm64, we are not that lucky, unfortunately. GCC only supports the global stack canary value, although discussions are underway to decide how this is best implemented for multitask/thread environments, i.e., in a way that works for userland as well as for the kernel.
Per-CPU variables and preemption
Loading the per-CPU version of __stack_chk_guard could look something like this on arm64:
    adrp    x0, __stack_chk_guard
    add     x0, x0, :lo12:__stack_chk_guard
    mrs     x1, tpidr_el1
    ldr     x0, [x0, x1]
There are two problems with this code:
- the arm64 Linux kernel implements support for Virtualization Host Extensions (VHE), and uses code patching to replace all references to tpidr_el1 with tpidr_el2 on VHE capable systems,
- the access is not atomic: if this code is preempted after reading the value of tpidr_el1 but before loading the stack canary value, and is subsequently migrated to another CPU, it will load the wrong value.
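The second problem can be made concrete with a deterministic C model of the two-step access. The names are hypothetical, the per-CPU offset is reduced to an array index, and the preemption is simulated by the caller doing work between the two steps.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_NR_CPUS 2

/* Per-CPU guard slots; each holds the canary of that CPU's current task. */
uintptr_t model_percpu_guard[MODEL_NR_CPUS];

/* Step 1, modelling "mrs x1, tpidr_el1": fetch this CPU's offset.
 * In this sketch the offset is simply the CPU index. */
size_t model_read_percpu_offset(int cpu)
{
    return (size_t)cpu;
}

/* Step 2, modelling "ldr x0, [x0, x1]": load the guard using whatever
 * offset was obtained earlier, even if it has gone stale. */
uintptr_t model_load_guard(size_t offset)
{
    return model_percpu_guard[offset];
}
```

If a task reads offset 0 while on CPU 0, is then migrated to CPU 1, and CPU 0 moves on to another task, the subsequent load with the stale offset returns the other task's canary.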
In kernel code, we can deal with this easily: every emitted reference to tpidr_el1 is tagged so we can patch it at boot, and on preemptible kernels we put the code in a non-preemptible block to make it atomic. However, this is impossible to do in GCC generated code without putting elaborate knowledge of the kernel’s per-CPU variable implementation into the compiler, and doing so would severely limit our future ability to make any changes to it.
One way to mitigate this would be to reserve a general purpose register for the per-CPU offset, and ensure that it is used as the offset register in the ldr instruction. This addresses both problems: we use the same register regardless of VHE, and the single ldr instruction is atomic by definition.
However, as it turns out, we can do much better than this. We don’t need per-CPU variables if we can load the task’s stack canary value directly, and each CPU already keeps a pointer to the task_struct of the current task in system register sp_el0. So if we replace the above with
    movz    x0, :abs_g0:__stack_chk_guard_offset
    mrs     x1, sp_el0
    ldr     x0, [x0, x1]
we dodge both issues, since all of the values involved are per-task values which do not change when migrating to another CPU. Note that the same sequence could be used in userland for TLS if you swap out sp_el0 for tpidr_el0 (and use the appropriate relocation type), so adding support for this to GCC (with a command line configurable value for the system register) would be a flexible solution to this problem.
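In C terms, the sequence amounts to reading a field at a fixed offset from the current task pointer. The sketch below is a simplified model: struct model_task_struct is a two-field stand-in rather than the kernel's task_struct, MODEL_GUARD_OFFSET plays the role of __stack_chk_guard_offset, and the function argument stands in for the value read from sp_el0.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for task_struct; the real layout is far larger. */
struct model_task_struct {
    int       pid;
    uintptr_t stack_canary;
};

/* Models the constant materialised by the movz instruction. */
#define MODEL_GUARD_OFFSET offsetof(struct model_task_struct, stack_canary)

/* 'current' models the task pointer held in sp_el0. The load depends
 * only on per-task data, so it gives the same answer on any CPU. */
uintptr_t model_load_task_canary(const struct model_task_struct *current)
{
    uintptr_t canary;

    memcpy(&canary, (const char *)current + MODEL_GUARD_OFFSET,
           sizeof(canary));
    return canary;
}
```

Nothing in this load refers to a CPU, so a preemption and migration between reading the task pointer and loading the canary cannot change the result.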
Proof of concept implementation
I implemented support for the above, using a GCC plugin to replace the default sequence
    adrp    x0, __stack_chk_guard
    add     x0, x0, :lo12:__stack_chk_guard
    ldr     x0, [x0]
with
    mrs     x0, sp_el0
    add     x0, x0, :lo12:__stack_chk_guard_offset
    ldr     x0, [x0]
This limits __stack_chk_guard_offset to 4 KB, but this is not an issue in practice unless struct randomization is enabled. Another caveat is that it only works with GCC’s small code model (the one that uses adrp instructions) since the plugin works by looking for those instructions and replacing them.
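The 4 KB limit follows from the :lo12: relocation, which only covers the low 12 bits of the value used as the add immediate. A build-time check along the following lines could catch a layout that moves the canary too far into the structure. This is a sketch only: the structure, its padding field and the names MODEL_LO12_LIMIT and model_fits_in_lo12 are assumptions for the example and are not taken from the plugin.

```c
#include <stddef.h>
#include <stdint.h>

/* Largest offset plus one that a 12-bit unsigned immediate can hold. */
#define MODEL_LO12_LIMIT 4096u

/* Simplified stand-in for task_struct with some fields ahead of the canary. */
struct model_task_struct {
    char      other_fields[128];
    uintptr_t stack_canary;
};

/* Fails the build if the canary moves beyond what the add can encode. */
_Static_assert(offsetof(struct model_task_struct, stack_canary) <
               MODEL_LO12_LIMIT,
               "stack_canary offset does not fit in a 12-bit immediate");

/* Run-time form of the same check, nonzero if the offset is encodable. */
int model_fits_in_lo12(size_t offset)
{
    return offset < MODEL_LO12_LIMIT;
}
```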
Code can be found here.