A while back, in a mathematics course on cryptography, my professor asked an offhand question directed at me. We had been discussing certain cryptographic algorithms (which ones I don’t remember) and breaking down their steps. At a certain point, as with most cryptographic algorithms, there was a step that computed some integer modulo some other integer. His question, which was no doubt rhetorical in some way, was: “how do computers compute mods, by the way?”

Since it was directed at me, and since I had a small amount of knowledge about the DIV instruction in assembly, I of course answered “it’s complicated”, suspecting that it was probably even more complicated than I knew.

For those who aren’t completely familiar: if one were coding assembly by hand, the way you’d compute a mod b on x86 is by using the DIV instruction, which computes a / b, and then looking in the d register (rdx, in 64-bit code) for the remainder of the division.

Now, if you’re not writing assembly by hand, as most programmers never do, and you’re using a compiled programming language, then the answer to the question “what does the computer do when I type a % b?” is: it depends. If a and b truly can be any values, and are not constants, then the approach outlined above really is about the best available, and most compilers will emit that div to compute a % b. Instead, let’s consider the case that often happens in practice: b is a constant.

If b is a power of 2, any optimizing compiler will be clever enough to realize that a mod 2^n is just the last n bits of a.

In particular, we’ll be examining the following code in the Rust programming language:

// ...
let mut x = 0;
for i in 0..1_000_000_000u64 {
  if i % 3 == 0 {
    x += 1;
  }
}
// ...

Compiling this code on nightly (we’ll need nightly for inline assembly later) with full --release optimizations, we can find the relevant section of assembly:

...
  movl	$0, 48(%rsp)
  movl	$-1, %r11d
  movl	$1, %ecx
  xorl	%edx, %edx
  movl	$2863311531, %r8d
  xorl	%edi, %edi
  xorl	%r9d, %r9d
.p2align	4, 0x90
.LBB8_15:
  movl	%ecx, %esi
  imulq	%r8, %rsi
  shrq	$33, %rsi
  leal	(%rsi,%rsi,2), %r10d
  movl	%edx, %ebx
  imulq	%r8, %rbx
  shrq	$33, %rbx
  leal	(%rbx,%rbx,2), %ebx
  leal	1(%r9), %esi
  movl	%edi, %eax
  addl	%ebx, %eax
  cmovnel	%r9d, %esi
  leal	1(%rsi), %r9d
  addl	%r11d, %r10d
  cmovnel	%esi, %r9d
  addl	%edi, %ebx
  je	.LBB8_17
  testl	%r10d, %r10d
  jne	.LBB8_18
.LBB8_17:
  movl	%r9d, 48(%rsp)
.LBB8_18:
  addl	$-2, %r11d
  addl	$2, %ecx
  addq	$-2, %rdi
  addl	$2, %edx
  cmpq	$-1000000000, %rdi
  jne	.LBB8_15
...

If seeing this assembly makes you nervous and confused, don’t worry, me too! A few things stand out immediately:

  • There’s no div instruction! The compiler does a lot of random-looking stuff, but we’ll see later this is all just to avoid an expensive div instruction
  • There are some strange-looking magical bit operations: notice the constant $2863311531, which gets multiplied into rsi, and the shrq $33, %rsi step after that

In fact, what’s happening here is a trick that turns a modulus operation into a multiplication instead of a division. I won’t go into the full details of how this optimization works, but you can read the full gory details in this fantastic breakdown. I’m also not going to break down all the assembly for this example, since I assume a lot of it is just fluff to make sure values get carried around correctly. The question I want to answer is: is this optimization actually worth it? After all, to my eye, this assembly looks like a whole lot of spaghetti, and surely a single div instruction might clean it up, or maybe even speed it up?

As it turns out, the answer is that it is definitely worth it. Let’s create a version of the same function that runs the loop in the same way but uses the div instruction instead. Unfortunately, we can’t selectively turn off optimizations for the inside of the loop, so we’ll have to write the assembly for the loop body ourselves. After digging through some documentation on inline assembly in Rust, and a few segfaults, I came up with the following code:

let mut x = 0u32;
for i in 0..1_000_000_000u64 {
    unsafe {
        asm!("mov rdx, 0
              mov rcx, 3
              div rcx
              cmp rdx, 0
              jne .out
              add rbx, 1
        .out:
              nop" : "={rbx}"(x) : "{rax}"(i), "{rbx}"(x) : "rcx", "rax", "rbx", "rdx", "cc" : "intel");
    }
}

This code compiles into the much nicer looking assembly:

  ...
	movl	$0, 48(%rsp)
	xorl	%ebx, %ebx
	xorl	%esi, %esi
	.p2align	4, 0x90
.LBB8_35:
	movq	%rsi, %rax
	leaq	1(%rsi), %rsi
	#APP

	movq	$0, %rdx
	movq	$3, %rcx
	divq	%rcx
	cmpq	$0, %rdx
	jne	.out
	addq	$1, %rbx
.out:
	nop

	#NO_APP
	cmpq	$1000000000, %rsi
	jne	.LBB8_35
  ...

Thankfully, the generated loop (starting at .LBB8_35) also doesn’t do much copying and gluing to bridge to my assembly, so it should perform about as well as the div bottleneck permits. The full code for the binary used can be found here. With this in mind, let’s see how it does! In the test below (run using hyperfine), a is the compiler’s version without div, and b is my version:

Benchmark #1: ./optim_tests a
  Time (mean ± σ):     690.2 ms ±   3.2 ms    [User: 690.1 ms, System: 0.0 ms]
  Range (min … max):   686.0 ms … 694.6 ms    5 runs

Benchmark #2: ./optim_tests b
  Time (mean ± σ):      6.305 s ±  0.064 s    [User: 6.302 s, System: 0.002 s]
  Range (min … max):    6.253 s …  6.414 s    5 runs

Summary
  './optim_tests a' ran
    9.14 ± 0.10 times faster than './optim_tests b'