While solving this year's Advent of Code puzzle for day 22, I had written this function (which is a minor spoiler for the puzzle):
fn mix_prune(mut secret: usize) -> usize {
const MOD: usize = 16777216;
let n = secret * 64;
secret = (secret ^ n) % MOD;
let n = secret / 32;
secret = (secret ^ n) % MOD;
let n = secret * 2048;
secret = (secret ^ n) % MOD;
secret
}
Starting from a given number, called secret
,
we do three "mix" and "prune" operations
to calculate a new number.
Mixing is the process of deriving a new number from the secret and performing an XOR operation with the secret. Pruning is a modulo operation that will make sure the secret does not grow too large.
I noticed that all constants that are involved here are a power of two:
64=2^6
, 32=2^5
, 2048=2^11
and 16777216=2^24
.
This opens some room for optimization.
Instead of performing a multiplication or division by a power of two,
we can do a bit shift left or right.
Just before obliterating the readability of my code
by replacing all operations with bit shifts,
I was able to stop myself and think about it a bit.
The code was fast enough and a runtime profile showed
that not much time was spent in this function.
Optimizing this function would not be time well spent.
However, I was still curious. Do I even need to optimize it?
LLVM, the compiler backend that Rust uses to emit machine code, is a beast. It does all sorts of interesting optimizations. Scalar Evolution is an impressive one, where it can detect specific loops and transform them into constant time operations.
What does LLVM make of the function from above?
Let's fire up Godbolt (with -C opt-level=3
)
and see what the optimized assembly looks like.
mix_prune:
; let n = secret * 64;
; secret = (secret ^ n) % MOD;
mov eax, edi
shl eax, 6
xor eax, edi
and eax, 16777215
; let n = secret / 32;
; secret = (secret ^ n) % MOD;
mov ecx, eax
shr ecx, 5
xor ecx, eax
; let n = secret * 2048;
; secret = (secret ^ n) % MOD;
mov eax, ecx
shl eax, 11
and eax, 16775168
xor eax, ecx
ret
The initial function had three blocks of mixing and pruning which I separated by newlines here. Each block has something interesting in it, which we will now take a look at.
First Block
; let n = secret * 64;
; secret = (secret ^ n) % MOD;
mov eax, edi
shl eax, 6
xor eax, edi
and eax, 16777215
The first block moves our secret
,
which I suppose is in edi
,
into the working register eax
.
It then shifts the bits left by six places,
which is equivalent to a multiplication by 2^6=64
,
as we discussed earlier.
So the compiler does this optimization for us!
That's great, as we can keep our code nice and readable,
but still know that the compiler will bring us maximum performance.
The first block ends with an and
instruction on eax
with 16777215 that is one lower than our modulus 16777216 from earlier.
This is a clever optimization when computing modulo with a power of two.
Calculating modulo 2^n
is equivalent to keeping the rightmost n-1
bits,
discarding all higher bits.
To keep the rightmost n-1
bits
we can use a bit mask with n-1
ones,
which is 2^n-1
,
or one lower than the modulus.
Applying this bit mask to our working register with and
will perform the modulo operation
and end the first block.
Second Block
; let n = secret / 32;
; secret = (secret ^ n) % MOD;
mov ecx, eax
shr ecx, 5
xor ecx, eax
Since the first block wrote its result into eax
,
the second block uses ecx
as its working register.
Since 32=2^5
we can do a division with 32
by shifting the bits to the right five places.
This is similar to the multiplication from earlier.
The difference being that the bits are shifted right to perform division,
instead of left to perform multiplication.
Then the "mixing" happens with a xor
instruction.
What about the modulo operation?
It is missing completely. Why?
eax
from the first block is smaller than our modulus,
since we just calculated the modulo there.
By shifting the value 5 bits to the right,
it will only ever be smaller still.
XOR-ing these two values will never make them bigger than the modulus.
LLVM has noticed that taking the modulo of a value
that is strictly smaller than the modulus has no effect
and omitted the operation.
This optimization marks the end of the second block.
Third Block
; let n = secret * 2048;
; secret = (secret ^ n) % MOD;
mov eax, ecx
shl eax, 11
and eax, 16775168
xor eax, ecx
The third block looks similar to the first one, except for two things.
One, the order of and
and xor
is swapped.
Two, the value used for and
is different.
Here are the last two instructions from the first block as a reminder:
xor eax, edi
and eax, 16777215
I do not know why LLVM reordered the instructions here.
I think it would work fine if the xor
and and
instructions from the first block were used.
Nevertheless, the number 16775168 used as a bit mask can be explained
when put into context of being applied after the left shift.
The left shift by 11 bits guarantees that the rightmost 11 bits of eax
are zero.
Since they are zero, we do not have to use a mask to set them to zero.
In our modulo trick from the first block, LLVM used 16777215.
When we set the lowest 11 bits in this number to zero,
we get 16775168.
It remains a mystery to me why LLVM reordered these instructions.
Compilers Are Smart
I think it is interesting to peek behind the curtains of the "low level" sometimes and see what the computer is actually doing with the code you write. It is very comforting to know that I do not have to put effort into micro-optimizations. I can write readable code and be reasonably sure that I get the fastest possible machine code. This is all thanks to a lot of smart people who put a whole lot of effort into making sure that LLVM knows how to optimize a lot of things.