GHASH for high-end ARM cores

After years of using Cortex-A57 or A53 based systems as both my development machines and my optimization targets, I was recently given a ThunderX2 workstation. After moving my development environment to it, I started looking into whether the existing accelerated crypto drivers in the Linux kernel perform optimally on this micro-architecture.

Marvell’s ThunderX2 core (née Broadcom Vulcan) is a server class CPU with a deep pipeline. This typically means that it takes longer for the result of an instruction to become available to subsequent instructions, and instead of stalling the pipeline waiting for that result, we should try to do some other useful work in the meantime.

So for the simpler AES modes such as CBC and CTR, I have increased the interleave factor from 4x to 5x, resulting in a speedup of around 11% on ThunderX2. This result fueled my suspicion that the GCM code, which was using an interleave factor of only 2x, could benefit even more from this approach. Since I had already added an implementation of GHASH (one of the constituent parts of GCM) that uses aggregation to operate on 4 blocks of input at a time, it should simply be a matter of combining that code with a 4-way interleaved implementation of CTR.

GHASH aggregation

GHASH is not a general purpose cryptographic hash function, but one that was specifically designed for (and should only be used in) AES in GCM mode. It is based on multiplication in GF(2^128), and basically comes down to the following

X[0] = (H * I[0]) mod M
X[1] = (H * (X[0] ^ I[1])) mod M
X[n] = (H * (X[n-1] ^ I[n])) mod M

where I[] is an input array of 16 byte blocks, and the output of the hash function is the last value in the array X[]. H is a quantity that is derived from the AES key by encrypting a single AES block consisting of NUL bytes, and M is GHASH’s characteristic polynomial x^128 + x^7 + x^2 + x + 1 over GF(2^128).
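
As a minimal illustration of the structure of this computation (not an interoperable implementation, since GHASH’s bit-reflection convention is glossed over, and all names below are of my own choosing), the recurrence can be modelled in Python by treating 16 byte blocks as 128-bit integers:

M = (1 << 128) | 0x87                  # x^128 + x^7 + x^2 + x + 1

def clmul(a, b):
    """Carry-less (polynomial) multiplication of two 128-bit values."""
    p = 0
    for i in range(128):
        if (b >> i) & 1:
            p ^= a << i
    return p

def gf_reduce(p):
    """Reduce a carry-less product (at most 255 bits) modulo M."""
    for i in range(254, 127, -1):
        if (p >> i) & 1:
            p ^= M << (i - 128)
    return p

def ghash(h, blocks):
    """X[n] = (H * (X[n-1] ^ I[n])) mod M, starting from X = 0."""
    x = 0
    for blk in blocks:
        x = gf_reduce(clmul(h, x ^ int.from_bytes(blk, 'big')))
    return x.to_bytes(16, 'big')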

Each line involves a multiplication in GF(2^128), which means that it consists of a multiplication step and a modular reduction step, and it is possible to amortize the cost of the latter step over multiple multiplications, increasing performance. For example, we can process two blocks at a time like so

X[n] = (H * (((H * (X[n-2] ^ I[n-1])) mod M) ^ I[n])) mod M
     = ((H^2 * (X[n-2] ^ I[n-1])) ^ (H * I[n])) mod M

This is called GHASH aggregation, and was implemented for the core GHASH transform here. Note that this requires powers of H to be precomputed, and so we cannot apply this transformation an arbitrary number of times.
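
In the Python model above, 2-way aggregation with a precomputed H^2 could be sketched as follows (assuming an even number of input blocks); note how each pair of blocks needs two carry-less multiplications but only a single reduction:

def ghash_2x(h, blocks):
    h2 = gf_reduce(clmul(h, h))                 # precomputed H^2
    x = 0
    for i in range(0, len(blocks), 2):
        a = int.from_bytes(blocks[i], 'big')
        b = int.from_bytes(blocks[i + 1], 'big')
        x = gf_reduce(clmul(h2, x ^ a) ^ clmul(h, b))  # one reduction per pair
    return x.to_bytes(16, 'big')

For any even number of blocks, ghash_2x() produces the same digest as the serial ghash() above.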

Another thing to note is that this algorithm combines the highest power of H with the first input block, which means that if we want to reuse the same code sequence to process fewer than n blocks, we need to populate the input blocks from right to left, and jump to the appropriate point in the routine so that only the lower powers of H are used in the computation.
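
Sticking with the Python model, and reusing clmul() and gf_reduce() from above, a 4-way routine that also accepts 1, 2 or 3 blocks could look roughly like this, with the slicing of h_powers playing the role of the jump into the assembly routine:

def ghash_up_to_4(h_powers, x, blocks):
    """h_powers = [H^4, H^3, H^2, H]; 1 <= len(blocks) <= 4."""
    n = len(blocks)
    vals = [int.from_bytes(b, 'big') for b in blocks]
    vals[0] ^= x                                # fold the running hash into the
                                                # first (leftmost used) block
    acc = 0
    for hp, v in zip(h_powers[4 - n:], vals):   # skip the unused higher powers
        acc ^= clmul(hp, v)
    return gf_reduce(acc)                       # one reduction per call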

Handling the tail

GCM is an AEAD (Authenticated Encryption with Associated Data) algorithm which can operate on inputs of any size. Since GCM is mainly used in the networking space, we should expect to be dealing mostly with inputs of the typical size of a network packet, which is ~1500 bytes. This means that our 4-way aggregated code, operating on 4 blocks or 64 bytes at a time, will always have to deal with a tail of less than 64 bytes in size.

Since out-of-order cores with deep pipelines perform poorly when running code with lots of branches and tight loops that load small quantities at a time, we should try to handle the tail with as few branches as possible, and use large, potentially overlapping loads to get the data into SIMD registers.

This is what the sequence below implements. It unconditionally performs four 16-byte loads, and manipulates the input address and the post-increment values so that we end up with I[] populated right to left with between 1 and 63 bytes of remaining input. (x2 contains the source address, and x0 the size of the remaining input minus 64.)

mov     x15, #16
ands    x19, x0, #0xf                 // x19 := #bytes in the final block,
csel    x19, x19, x15, ne             //        or 16 if a multiple of 16 remains
adr_l   x17, .Lpermute_table + 16

sub     x11, x15, x19                 // x11 := 16 - #bytes in final block
add     x12, x17, x11
ld1     {T1.16b}, [x12]               // permute vector for the final block
sub     x11, x2, x11                  // source address minus (16 - #bytes)

cmp     x0, #-16                      // more than 48 bytes left?
csel    x14, x15, xzr, gt             // post-increment for I0: 16 or 0
cmp     x0, #-32                      // more than 32 bytes left?
csel    x15, x15, xzr, gt             // post-increment for I1: 16 or 0
cmp     x0, #-48                      // more than 16 bytes left?
csel    x16, x19, xzr, gt             // post-increment for I2: #bytes or 0
csel    x2, x2, x11, gt               // rewind the source address if not, so
                                      // the loads end at the last input byte

ld1     {I0.16b}, [x2], x14
ld1     {I1.16b}, [x2], x15
ld1     {I2.16b}, [x2], x16
ld1     {I3.16b}, [x2]
tbl     I3.16b, {I3.16b}, T1.16b      // move the final bytes into position

Since we always load at least 16 bytes, it is up to the caller to present the input in a way that permits doing so if the entire input is less than 16 bytes in size (e.g., by copying the input to the end of a 16 byte buffer).
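
As a rough model of what the sequence above achieves (a hypothetical Python helper, not the kernel code), the same right-to-left population could be expressed as below; the exact contents of the unused leading blocks do not matter, since those lanes are never consumed:

def load_tail(buf, rem):
    """buf holds the input; rem (1..63) counts the unprocessed bytes at its end."""
    assert 1 <= rem <= 63 and len(buf) >= 16    # at least 16 loadable bytes
    last = rem % 16 or 16                       # bytes in the final partial block
    full = (rem - last) // 16                   # number of full blocks in the tail
    off = len(buf) - rem
    blocks = [b'\0' * 16] * (3 - full)          # unused leading slots
    for _ in range(full):
        blocks.append(buf[off:off + 16])        # full 16-byte 'loads', in order
        off += 16
    tail16 = buf[-16:]                          # overlapping 16-byte 'load'
    blocks.append(tail16[16 - last:] + b'\0' * (16 - last))  # the tbl shuffle
    return blocks                               # [I0, I1, I2, I3]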

CTR encryption

While the 4-way GHASH code requires taking another code path through the 4-way routine when it is invoked with fewer than 4 blocks of input, the CTR code can be modified more easily to support this: we increment the overall counter by the remaining number of blocks, and then subtract (4 - n) from it for each keystream block n. This naturally populates the keystream blocks from right to left as well, and instead of using a different encryption routine depending on the remaining input size, we can just unconditionally generate a 64 byte keystream and discard the parts that we don’t need.
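
The counter arithmetic can be illustrated with a trivial Python helper (hypothetical, for illustration only); only the keystream blocks belonging to the rightmost counters end up being used:

def tail_counters(ctr, rem_blocks):
    """ctr: block counter before the tail; rem_blocks: remaining blocks (1..4)."""
    ctr += rem_blocks                           # advance the overall counter first
    return [ctr - (4 - n) for n in range(4)]    # counter for each keystream block

# e.g. tail_counters(10, 2) == [8, 9, 10, 11]: keystream blocks 2 and 3 carry
# counters 10 and 11 and get used; blocks 0 and 1 are generated and discarded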

Generating the tag

The final step in performing GCM is generating the authentication tag, which involves one final round of GHASH on a block of input containing the input size, and a final pass of CTR encryption.

The existing code does not handle the tail or the tag directly, but jumps back to C code to copy the tail into a stack buffer and process it via the split GHASH and CTR routines. Now that the tail handling has been incorporated into the core GCM transform, we can go one step further, and fold the tag generation in as well.

ld1     {INP3.16b}, [x10]             // load lengths[]
mov     w9, #1                        // one block of input
bl      pmull_gcm_ghash_4x

mov     w11, #(0x1 << 24)             // BE '1U'
ld1     {KS0.16b}, [x5]               // load upper counter
mov     KS0.s[3], w11                 // set lower counter

enc_block KS0, x7, x6, x12

ext     XL.16b, XL.16b, XL.16b, #8    // byte swap the GHASH state
rev64   XL.16b, XL.16b
eor     XL.16b, XL.16b, KS0.16b       // fold in the encrypted counter block
st1     {XL.16b}, [x10]               // store tag
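
For reference, a high-level sketch of the same tag computation in the Python model used earlier; encrypt_block() is a hypothetical stand-in for a single AES block encryption, and the byte swapping performed by the ext/rev64 instructions is glossed over:

import struct

def gcm_tag(encrypt_block, h, x, aad_len, msg_len, counter_block):
    """x: running GHASH state; counter_block: the IV with its lower counter set to 1."""
    lengths = struct.pack('>QQ', aad_len * 8, msg_len * 8)       # lengths[] in bits
    x = gf_reduce(clmul(h, x ^ int.from_bytes(lengths, 'big')))  # final GHASH round
    ks0 = int.from_bytes(encrypt_block(counter_block), 'big')    # final CTR block
    return (x ^ ks0).to_bytes(16, 'big')                         # the tag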

Conclusion

When combining all the changes above, we end up with a routine that can consume the entire encryption input buffer, including the tail and the input block for the tag, and process it in a way that is up to 25% faster than the old code for 1500 byte blocks. Note that 4 KiB blocks are only 9-16% faster, and that on other micro-architectures, performance regresses slightly for large inputs. This is not a use case worth obsessing about, though. More performance numbers are listed below; the full implementation can be found here.

              AES-128             AES-192             AES-256
#bytes     512  1500    4k     512  1500    4k     512  1500    4k
TX2        35%   23%   11%     34%   20%    9%     38%   25%   16%
EMAG       11%    6%    3%     12%    4%    2%     11%    4%    2%
A72         8%    5%   -4%      9%    4%   -5%      7%    4%   -5%
A53        11%    6%   -1%     10%    8%   -1%     10%    8%   -2%