haselkern

Compiler Optimizations

While solving this year's Advent of Code puzzle for day 22, I had written this function (which is a minor spoiler for the puzzle):

fn mix_prune(mut secret: usize) -> usize {
    const MOD: usize = 16777216;

    let n = secret * 64;
    secret = (secret ^ n) % MOD;

    let n = secret / 32;
    secret = (secret ^ n) % MOD;

    let n = secret * 2048;
    secret = (secret ^ n) % MOD;

    secret
}

Starting from a given number, called secret, we do three "mix" and "prune" operations to calculate a new number.

Mixing is the process of deriving a new number from the secret and performing an XOR operation with the secret. Pruning is a modulo operation that will make sure the secret does not grow too large.

I noticed that all constants that are involved here are a power of two: 64=2^6, 32=2^5, 2048=2^11 and 16777216=2^24. This opens some room for optimization. Instead of performing a multiplication or division by a power of two, we can do a bit shift left or right. Just before obliterating the readability of my code by replacing all operations with bit shifts, I was able to stop myself and think about it a bit. The code was fast enough and a runtime profile showed that not much time was spent in this function. Optimizing this function would not be time well spent.

However, I was still curious. Do I even need to optimize it?

LLVM, the compiler backend that Rust uses to emit machine code, is a beast. It does all sorts of interesting optimizations. Scalar Evolution is an impressive one, where it can detect specific loops and transform them into constant time operations.

What does LLVM make of the function from above? Let's fire up Godbolt (with -C opt-level=3) and see what the optimized assembly looks like.

mix_prune:
    ; let n = secret * 64;
    ; secret = (secret ^ n) % MOD;
    mov     eax, edi
    shl     eax, 6
    xor     eax, edi
    and     eax, 16777215

    ; let n = secret / 32;
    ; secret = (secret ^ n) % MOD;
    mov     ecx, eax
    shr     ecx, 5
    xor     ecx, eax

    ; let n = secret * 2048;
    ; secret = (secret ^ n) % MOD;
    mov     eax, ecx
    shl     eax, 11
    and     eax, 16775168
    xor     eax, ecx

    ret

The initial function had three blocks of mixing and pruning which I separated by newlines here. Each block has something interesting in it, which we will now take a look at.

First Block

; let n = secret * 64;
; secret = (secret ^ n) % MOD;
mov     eax, edi
shl     eax, 6
xor     eax, edi
and     eax, 16777215

The first block moves our secret, which I suppose is in edi, into the working register eax. It then shifts the bits left by six places, which is equivalent to a multiplication by 2^6=64, as we discussed earlier. So the compiler does this optimization for us! That's great, as we can keep our code nice and readable, but still know that the compiler will bring us maximum performance.

The first block ends with an and instruction on eax with 16777215 that is one lower than our modulus 16777216 from earlier. This is a clever optimization when computing modulo with a power of two. Calculating modulo 2^n is equivalent to keeping the rightmost n-1 bits, discarding all higher bits. To keep the rightmost n-1 bits we can use a bit mask with n-1 ones, which is 2^n-1, or one lower than the modulus. Applying this bit mask to our working register with and will perform the modulo operation and end the first block.

Second Block

; let n = secret / 32;
; secret = (secret ^ n) % MOD;
mov     ecx, eax
shr     ecx, 5
xor     ecx, eax

Since the first block wrote its result into eax, the second block uses ecx as its working register. Since 32=2^5 we can do a division with 32 by shifting the bits to the right five places. This is similar to the multiplication from earlier. The difference being that the bits are shifted right to perform division, instead of left to perform multiplication. Then the "mixing" happens with a xor instruction.

What about the modulo operation? It is missing completely. Why? eax from the first block is smaller than our modulus, since we just calculated the modulo there. By shifting the value 5 bits to the right, it will only ever be smaller still. XOR-ing these two values will never make them bigger than the modulus. LLVM has noticed that taking the modulo of a value that is strictly smaller than the modulus has no effect and omitted the operation. This optimization marks the end of the second block.

Third Block

; let n = secret * 2048;
; secret = (secret ^ n) % MOD;
mov     eax, ecx
shl     eax, 11
and     eax, 16775168
xor     eax, ecx

The third block looks similar to the first one, except for two things. One, the order of and and xor is swapped. Two, the value used for and is different. Here are the last two instructions from the first block as a reminder:

xor     eax, edi
and     eax, 16777215

I do not know why LLVM reordered the instructions here. I think it would work fine if the xor and and instructions from the first block were used. Nevertheless, the number 16775168 used as a bit mask can be explained when put into context of being applied after the left shift. The left shift by 11 bits guarantees that the rightmost 11 bits of eax are zero. Since they are zero, we do not have to use a mask to set them to zero. In our modulo trick from the first block, LLVM used 16777215. When we set the lowest 11 bits in this number to zero, we get 16775168. It remains a mystery to me why LLVM reordered these instructions.

Compilers Are Smart

I think it is interesting to peek behind the curtains of the "low level" sometimes and see what the computer is actually doing with the code you write. It is very comforting to know that I do not have to put effort into micro-optimizations. I can write readable code and be reasonably sure that I get the fastest possible machine code. This is all thanks to a lot of smart people who put a whole lot of effort into making sure that LLVM knows how to optimize a lot of things.

Tags: #programming