How do Computers Compute Modulus?
by Matt Torrence
A while back, in a mathematics course on cryptography, my professor asked me an offhand question. We had been discussing certain cryptographic algorithms (which ones I don’t remember) and breaking down their steps. As with most cryptographic algorithms, one of the steps computed some integer modulo some other integer. His question, which no doubt was rhetorical in some way, was “how do computers compute mods, by the way?”
Since it was directed at me, and since I had a small knowledge of the DIV instruction in assembly, I of course answered “it’s complicated”, knowing that it was probably more complicated than I knew.
For those who aren’t completely familiar: if you were coding assembly by hand, the way you’d compute a mod b is by using the x86 DIV instruction, which computes a / b, and then looking in the rdx register for the remainder of the division.
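On x86-64, that looks something like the following (a sketch using today’s stable Rust inline assembly, not the older nightly syntax that appears later in this post):

```rust
use std::arch::asm;

// A sketch, x86-64 only: DIV divides the 128-bit value RDX:RAX by its
// operand, leaving the quotient in RAX and the remainder in RDX.
fn mod_via_div(a: u64, b: u64) -> u64 {
    let rem: u64;
    unsafe {
        asm!(
            "xor rdx, rdx", // zero the high half of the dividend
            "div {d}",      // RAX = RDX:RAX / b, RDX = RDX:RAX % b
            d = in(reg) b,
            inout("rax") a => _,
            out("rdx") rem,
        );
    }
    rem
}

fn main() {
    assert_eq!(mod_via_div(17, 5), 17 % 5);
    assert_eq!(mod_via_div(1_000_000_007, 3), 1_000_000_007 % 3);
}
```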
Now, if you’re not writing assembly by hand, as most programmers never do, and you’re using a compiled programming language, then the answer to the question “what does the computer do when I type a % b?” is: it depends. If a and b truly can be any value, and are not constants, then the best way is probably the one outlined above, and most compilers will use that technique to compute a % b. Instead, let’s consider the case, which comes up often, where b is a constant.
If b is a power of 2, all compilers will be clever enough to realize that a mod 2^n is the same as the last n bits of a, which can be extracted with a single bitwise AND.
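Concretely (a sketch; the compiler emits the equivalent AND directly, with no division in sight):

```rust
fn main() {
    let a: u64 = 123_456_789;
    // a mod 2^4 is just the low 4 bits of a.
    assert_eq!(a % 16, a & 0b1111);
    // In general, a mod 2^n == a & (2^n - 1).
    for n in 0..20 {
        assert_eq!(a % (1u64 << n), a & ((1u64 << n) - 1));
    }
}
```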
In particular, we’ll be examining the following code in the Rust programming language:
```rust
// ...
let mut x = 0;
for i in 0..1_000_000_000u64 {
    if i % 3 == 0 {
        x += 1;
    }
}
// ...
```
Compiling this code on nightly (we’ll need nightly for inline assembly later) with full --release optimizations, we can find the relevant section of assembly:
```asm
    ...
    movl    $0, 48(%rsp)
    movl    $-1, %r11d
    movl    $1, %ecx
    xorl    %edx, %edx
    movl    $2863311531, %r8d
    xorl    %edi, %edi
    xorl    %r9d, %r9d
    .p2align    4, 0x90
.LBB8_15:
    movl    %ecx, %esi
    imulq   %r8, %rsi
    shrq    $33, %rsi
    leal    (%rsi,%rsi,2), %r10d
    movl    %edx, %ebx
    imulq   %r8, %rbx
    shrq    $33, %rbx
    leal    (%rbx,%rbx,2), %ebx
    leal    1(%r9), %esi
    movl    %edi, %eax
    addl    %ebx, %eax
    cmovnel %r9d, %esi
    leal    1(%rsi), %r9d
    addl    %r11d, %r10d
    cmovnel %esi, %r9d
    addl    %edi, %ebx
    je      .LBB8_17
    testl   %r10d, %r10d
    jne     .LBB8_18
.LBB8_17:
    movl    %r9d, 48(%rsp)
.LBB8_18:
    addl    $-2, %r11d
    addl    $2, %ecx
    addq    $-2, %rdi
    addl    $2, %edx
    cmpq    $-1000000000, %rdi
    jne     .LBB8_15
    ...
```
If seeing this assembly makes you nervous and confused, don’t worry, me too! A few things stand out immediately:
- There’s no div instruction! The compiler does a lot of random-looking stuff, but we’ll see later this is all just to avoid an expensive div instruction.
- There are some super strange-looking magical bit operations: notice the constant $2863311531 which gets multiplied into rsi, and the shrq $33, %rsi step after that.
In fact, what’s happening here is a trick which turns a modulus operation into a multiplication instead of a division. I won’t go into the full details of what this optimization does, but you can read the full gory details in this fantastic breakdown. I’m also not going to break down all the assembly for this example, because I assume a lot of it is just fluff to make sure that values get carried around correctly. The question I want to answer is: is this optimization actually worth it? After all, to my eye, this assembly looks like a whole lot of spaghetti, and surely a single div instruction might clean it up, or maybe even speed it up?
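As a sanity check, we can reproduce the trick in plain Rust (a sketch: 2863311531 is the ceiling of 2^33 / 3, and the multiply-and-shift recovers the quotient for any 32-bit value):

```rust
// The compiler's constant: multiplying by ceil(2^33 / 3) = 2863311531
// and shifting right by 33 computes x / 3 without a div instruction.
fn div3(x: u32) -> u32 {
    ((x as u64 * 2863311531) >> 33) as u32
}

fn main() {
    for x in [0u32, 1, 2, 3, 100, 1_000_000_000, u32::MAX] {
        let q = div3(x);
        assert_eq!(q, x / 3);
        // The remainder is then x - 3 * q, which is what the
        // leal (%rsi,%rsi,2) step above is building towards.
        assert_eq!(x - 3 * q, x % 3);
    }
}
```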
As it turns out, the answer is that it is definitely worth it. Let’s create a version of the same function which runs the loop the same way but uses the div instruction instead. Unfortunately, we can’t selectively turn off optimizations for the inside of the loop, so we’ll have to write the assembly for the inside of the loop ourselves. After digging through some documentation of inline assembly in Rust and a few segfaults, I came up with the following code:
```rust
let mut x = 0u32;
for i in 0..1_000_000_000u64 {
    unsafe {
        asm!("mov rdx, 0
              mov rcx, 3
              div rcx
              cmp rdx, 0
              jne .out
              add rbx, 1
              .out:
              nop"
             : "={rbx}"(x)
             : "{rax}"(i), "{rbx}"(x)
             : "rcx", "rax", "rbx", "rdx", "cc"
             : "intel");
    }
}
```
This code compiles into the much nicer looking assembly:
```asm
    ...
    movl    $0, 48(%rsp)
    xorl    %ebx, %ebx
    xorl    %esi, %esi
    .p2align    4, 0x90
.LBB8_35:
    movq    %rsi, %rax
    leaq    1(%rsi), %rsi
    #APP
    movq    $0, %rdx
    movq    $3, %rcx
    divq    %rcx
    cmpq    $0, %rdx
    jne     .out
    addq    $1, %rbx
.out:
    nop
    #NO_APP
    cmpq    $1000000000, %rsi
    jne     .LBB8_35
    ...
```
This also, thankfully, doesn’t have much copying and gluing in the loop (starting at .LBB8_35) to bridge to my assembly, so it should perform as well as the div bottleneck permits. The full code for the binary used can be found here. With this in mind, let’s see how it does! In the test below (run using hyperfine), a is the compiler’s version without div, and b is my version:
```
Benchmark #1: ./optim_tests a
  Time (mean ± σ):     690.2 ms ±   3.2 ms    [User: 690.1 ms, System: 0.0 ms]
  Range (min … max):   686.0 ms … 694.6 ms    5 runs

Benchmark #2: ./optim_tests b
  Time (mean ± σ):      6.305 s ±  0.064 s    [User: 6.302 s, System: 0.002 s]
  Range (min … max):    6.253 s …  6.414 s    5 runs

Summary
  './optim_tests a' ran
    9.14 ± 0.10 times faster than './optim_tests b'
```