Euclidean Division

Truncated, Floored, and Euclidean Division

Haskell provides two kinds of integral division: quotRem and divMod. It’s common for beginners to ask what the difference is and when you might prefer one over the other. If you came here just trying to Google the difference between quotRem and divMod, here it is: quotRem always rounds the quotient toward zero (upward if it’s less then zero and downward if it’s greater than zero) and is more efficient than divMod, but divMod rounds the quotient toward negative infinity (always downward).

I’ve been writing a library for binary fixed-point arithmetic, and when implementing the (/) operation I was forced to consider the latter question. I already knew that quotRem is more efficient than divMod, which made me really hope that quotRem was an appropriate choice. I also already knew the difference between the two. Regardless, I had no idea whether div was a better choice then quot for my purposes, so I asked and Googled around.

First, I discovered cdsmith’s blog post “Learning Number Theory and Haskell: The Division Algorithm.” He cites Gareth and Mary Jones’ Elementary Number Theory for the “division algorithm,” which goes as follows:

∀ ab ∈ ℤ : b ≠ 0 ⇒ ∃ qr ∈ ℤ : a = qb + r ∧ 0 ≤ r < ∣b

He then goes on to demonstrate that neither quotRem nor divMod satisfy this property. rem always gives the remainder the same sign as the dividend, and mod always gives the remainder the same sign as the divisor. That means that if the dividend is negative then quotRem gives the incorrect result, and if the divisor is negative then divMod gives the incorrect result. Both functions satisfy one part of the above theorem: a = qb + r. It’s just that they don’t necessarily choose q and r such that 0 ≤ r < ∣b.

How do other languages handle this? According to the Wikipedia article on the modulo operation, there is very little consistency among programming languages, although most of them follow the rules either of quotRem or divMod, sometimes providing both as in the case of Haskell. According to the article, quotRem is classified as truncated division, and divMod is classified as floored division. A function that satisfies the division algorithm in its entirety is called Euclidean division. The Wikipedia article cites two interesting papers. One is Raymond T. Boute’s “The Euclidean definition of the functions div and mod,” which analyzes various definitions for quotient and remainder, highlighting areas where some are more appropriate than others. You can probably tell from the title that Boute heavily favors Euclidean division. I won’t go into any depth here, but if you’re curious, it’s an interesting paper and pretty easy to read. The other paper is Daan Leijen’s “Division and Modulus for Computer Scientists,” which among other things includes a simple implementation for Euclidean division and a proof of its correctness.

Deriving a More Efficient Euclidean Division Function

The implementation given by Leijen looks less efficient than it could be, but we can derive a more efficient version of it. Here is the implementation proven correct in the paper, which I have translated directly into Haskell:

euclidean :: Integral a => a -> a -> (a, a)
a `euclidean` b =
  case a `quotRem` b of
    (q, r) -> (q - i, r + i*b)
      where i = if r >= 0
                then 0
                else ( if b > 0
                       then 1
                       else (-1)
                     )

The first time I went through this, I ended up substituting a couple subexpressions with signum and abs, but it turns out that those are not primitive operations in GHC, so they weren’t actually any faster. So, despite the fact that right off the bat we could substitute if b > 0 then 1 else (-1) with signum (since we know that b /= 0 anyway), let’s resist that temptation.

First, let’s inline i:

a `euclidean` b =
  case a `quotRem` b of
    (q, r) -> ( q - if r >= 0 then 0 else (if b > 0 then 1 else (-1))
              , r + (if r >= 0 then 0 else (if b > 0 then 1 else (-1)))*b
              )

Applying an operation to the result of an if expression is the same as applying it in both the branches of the if expression. We can use this knowledge to move the subtraction from q into the first if statement and the addition to r and multiplication with b into the second.

a `euclidean` b =
  case a `quotRem` b of
    (q, r) -> ( if r >= 0 then q - 0 else q - if b > 0 then 1 else (-1)
              , if r >= 0 then r + 0*b else r + (if b > 0 then 1 else (-1))*b
              )

We can simplify parts where we subtract zero and multiply by zero:

a `euclidean` b =
  case a `quotRem` b of
    (q, r) -> ( if r >= 0 then q else q - if b > 0 then 1 else (-1)
              , if r >= 0 then r else r + (if b > 0 then 1 else (-1))*b
              )

Let’s reformat this to use the outer (,) in prefix notation, in order to make another change more obvious:

a `euclidean` b =
  case a `quotRem` b of
    (q, r) -> (,) (if r >= 0 then q else q - if b > 0 then 1 else (-1))
              (if r >= 0 then r else r + (if b > 0 then 1 else (-1))*b)

Now we can put the (,) into the branches of the first if statement:

a `euclidean` b =
  case a `quotRem` b of
    (q, r) -> (if r >= 0 then (,) q else (,) (q - if b > 0 then 1 else (-1)))
              (if r >= 0 then r else r + (if b > 0 then 1 else (-1))*b)

Next, we can treat the entire first if expression as a function and move it to the branches of the next if statement:

a `euclidean` b =
  case a `quotRem` b of
    (q, r) -> if r >= 0
              then (if r >= 0 then (,) q else (,) (q - if b > 0 then 1 else (-1))) r
              else (if r >= 0 then (,) q else (,) (q - if b > 0 then 1 else (-1)))
                   (r + (if b > 0 then 1 else (-1))*b)

Once we are inside either of the branches of the outer if expression, the value of r >= 0 is known already, so we can simplify some of the inner if expressions by picking the appropriate branch:

a `euclidean` b =
  case a `quotRem` b of
    (q, r) -> if r >= 0
              then (,) q r
              else (,) (q - if b > 0 then 1 else (-1))
                   (r + (if b > 0 then 1 else (-1))*b)

Now let’s push the \x -> q - x and \x -> r + x*b operations into the inner if expressions:

a `euclidean` b =
  case a `quotRem` b of
    (q, r) -> if r >= 0
              then (,) q r
              else (,) (if b > 0 then q - 1 else q + 1)
                   (if b > 0 then r + b else r - b)

Playing the same game with (,) that we did earlier, we’ll push the (,) into the first if expression:

a `euclidean` b =
  case a `quotRem` b of
    (q, r) -> if r >= 0
              then (,) q r
              else (if b > 0 then (,) (q - 1) else (,) (q + 1))
                   (if b > 0 then r + b else r - b)

Then push the first if expression into the branches of the second:

a `euclidean` b =
  case a `quotRem` b of
    (q, r) -> if r >= 0
              then (,) q r
              else if b > 0
                   then (if b > 0 then (,) (q - 1) else (,) (q + 1)) (r + b)
                   else (if b > 0 then (,) (q - 1) else (,) (q + 1)) (r - b)

Then, since we know the result of b > 0 in each major branch, pick the correct inner branch:

a `euclidean` b =
  case a `quotRem` b of
    (q, r) -> if r >= 0
              then (,) q r
              else if b > 0
                   then (,) (q - 1) (r + b)
                   else (,) (q + 1) (r - b)

Now we can make the (,)s infix again:

a `euclidean` b =
  case a `quotRem` b of
    (q, r) -> if r >= 0
              then (q, r)
              else if b > 0
                   then (q - 1, r + b)
                   else (q + 1, r - b)

Noticing that the first branch is the same as the result of the case match, we can just reuse that result instead of constructing a new tuple:

a `euclidean` b =
  case a `quotRem` b of
    qr@(q, r) -> if r >= 0
                 then qr
                 else if b > 0
                      then (q - 1, r + b)
                      else (q + 1, r - b)

Now let’s just tidy it up a little:

a `euclidean` b =
  case a `quotRem` b of
    qr@(q, r) | r >= 0    -> qr
              | b >  0    -> (q-1, r+b)
              | otherwise -> (q+1, r-b)

It’s not quite as pretty as the implementation I had using signum and abs, but at least it’s a bit tighter.

Benchmarking Euclidean Division

So, how efficient is it? Of course it will not be as cheap as quotRem alone, but it may be faster than divMod. Here’s the default definition of divMod, rewritten to fit the style of the definition I wrote for euclidean:

a `divMod` b =
  case a `quotRem` b of
    qr@(q, r) | (r > 0 && b < 0) || (r < 0 && b > 0) -> (q-1, r+b)
              | otherwise                            -> qr

Well, divMod looks a little more complicated, but it’s close enough that I can’t say just from looking at it whether that means anything, though. I’d look at the core, but even that doesn’t tell us as much as benchmarks.

So, I wrote a simple benchmark using the vector and criterion packages. It pregenerates two random vectors containing Ints and two random vectors containing Words, making sure that the second of each pair contains no zero elements; the vectors were generated with 100,000 elements each. For the benchmarks, it zips divMod, quotRem, and euclidean over each vector pair. I built the executable on x86_64 Linux using GHC 7.0.2 with -O and ran it with -s 1000 +RTS -H. I was surprised by the results:

benchmarking Int/divMod
mean: 39.97829 ms, lb 39.80771 ms, ub 40.14524 ms, ci 0.950
std dev: 2.727351 ms, lb 2.663118 ms, ub 2.795931 ms, ci 0.950

benchmarking Int/quotRem
mean: 36.32100 ms, lb 36.15058 ms, ub 36.49007 ms, ci 0.950
std dev: 2.735966 ms, lb 2.680588 ms, ub 2.816686 ms, ci 0.950

benchmarking Int/euclidean
mean: 35.98929 ms, lb 35.84691 ms, ub 36.13137 ms, ci 0.950
std dev: 2.315703 ms, lb 2.284974 ms, ub 2.364711 ms, ci 0.950

benchmarking Word/divMod
mean: 34.19098 ms, lb 34.00258 ms, ub 34.38643 ms, ci 0.950
std dev: 3.095565 ms, lb 3.019508 ms, ub 3.177908 ms, ci 0.950

benchmarking Word/quotRem
mean: 33.49599 ms, lb 33.31483 ms, ub 33.68272 ms, ci 0.950
std dev: 2.965439 ms, lb 2.883252 ms, ub 3.065997 ms, ci 0.950

benchmarking Word/euclidean
mean: 32.24446 ms, lb 32.09422 ms, ub 32.39927 ms, ci 0.950
std dev: 2.469073 ms, lb 2.402462 ms, ub 2.548382 ms, ci 0.950

I was afraid these results were wrong because for a while they fluctuated wildly, but I tracked the issue down to my laptop’s thermal management system; having changed a few settings, the results are now extremely consistent.

So what the heck is going on? It’s understandable that euclidean could be faster than divMod, but being faster than quotRem seems impossible, considering that quotRem is a part of its implementation! I can’t even fathom why this would happen. I tried to compare the core output of these functions, but I can’t figure out a way to get ghc-core to dump anything useful about divMod or quotRem, so instead I’ll just go poking around in base’s source to find the exact implementations used for Int and Word so I can compare those with the core output for euclidean.

Looking in GHC.Num, we find the implementations of divMod and quotRem for Int:

quotRemInt :: Int -> Int -> (Int, Int)
quotRemInt a@(I# _) b@(I# _) = (a `quotInt` b, a `remInt` b)

divModInt ::  Int -> Int -> (Int, Int)
divModInt x@(I# _) y@(I# _) = (x `divInt` y, x `modInt` y)

I find it interesting that they would be defined this way. It seems to me that it would make more sense for quotInt and remInt to share some work. Let’s look at their definitions in GHC.Base; maybe GHC can optimize quotRemInt by inlining them:

{-# INLINE quotInt #-}
{-# INLINE remInt #-}
quotInt, remInt :: Int -> Int -> Int
(I# x) `quotInt`  (I# y) = I# (x `quotInt#` y)
(I# x) `remInt`   (I# y) = I# (x `remInt#`  y)

It seems they are just primops, so if GHC is doing anything fancy, it’s at a lower level than core. Is the same true of divInt and modInt?

divInt, modInt :: Int -> Int -> Int
(I# x) `divInt`   (I# y) = I# (x `divInt#`  y)
(I# x) `modInt`   (I# y) = I# (x `modInt#`  y)

Well, I didn’t see any INLINE pragmas for them, but their unboxed versions are not primops, so at least they’re easy to look at:

divInt# :: Int# -> Int# -> Int#
x# `divInt#` y#
    | (x# ># 0#) && (y# <# 0#) = ((x# -# 1#) `quotInt#` y#) -# 1#
    | (x# <# 0#) && (y# ># 0#) = ((x# +# 1#) `quotInt#` y#) -# 1#
    | otherwise                = x# `quotInt#` y#

modInt# :: Int# -> Int# -> Int#
x# `modInt#` y#
    | (x# ># 0#) && (y# <# 0#) ||
      (x# <# 0#) && (y# ># 0#)    = if r# /=# 0# then r# +# y# else 0#
    | otherwise                   = r#
    where
    !r# = x# `remInt#` y#

These appear to perform essentially the same conditional branches, so it might be pretty easy for GHC to merge these operations into one if it happens to decide that it’s willing to perform some common subexpression elimination (which it’s capable of doing, but only does so rarely).

These definitions seem to imply that euclidean might have an advantage over divMod and quotRem in the Int case since it is specialized for this case. Perhaps it would be penalized by only computing the quotient or the remainder instead of both by composing with fst or snd respectively. Here are the benchmark results for that (also doing the same with Word):

benchmarking Int/div
mean: 16.28924 ms, lb 16.22815 ms, ub 16.35012 ms, ci 0.950
std dev: 983.7511 us, lb 955.7222 us, ub 1.022347 ms, ci 0.950

benchmarking Int/quot
mean: 15.25413 ms, lb 15.18767 ms, ub 15.32052 ms, ci 0.950
std dev: 1.074821 ms, lb 1.043818 ms, ub 1.121362 ms, ci 0.950

benchmarking Int/euclideanQuot
mean: 21.14791 ms, lb 21.03288 ms, ub 21.26672 ms, ci 0.950
std dev: 1.885141 ms, lb 1.808788 ms, ub 2.046627 ms, ci 0.950

benchmarking Int/mod
mean: 17.38129 ms, lb 17.32322 ms, ub 17.44116 ms, ci 0.950
std dev: 952.2632 us, lb 917.6375 us, ub 1.001158 ms, ci 0.950

benchmarking Int/rem
mean: 15.33842 ms, lb 15.27551 ms, ub 15.40213 ms, ci 0.950
std dev: 1.020424 ms, lb 992.9151 us, ub 1.064465 ms, ci 0.950

benchmarking Int/euclideanRem
mean: 19.37936 ms, lb 19.25804 ms, ub 19.50230 ms, ci 0.950
std dev: 1.963617 ms, lb 1.910733 ms, ub 2.038009 ms, ci 0.950

benchmarking Word/div
mean: 15.57217 ms, lb 15.50029 ms, ub 15.64809 ms, ci 0.950
std dev: 1.190803 ms, lb 1.136332 ms, ub 1.250223 ms, ci 0.950

benchmarking Word/quot
mean: 15.05417 ms, lb 14.99211 ms, ub 15.12062 ms, ci 0.950
std dev: 1.030228 ms, lb 988.1618 us, ub 1.077946 ms, ci 0.950

benchmarking Word/euclideanQuot
mean: 18.37285 ms, lb 18.26637 ms, ub 18.48592 ms, ci 0.950
std dev: 1.774987 ms, lb 1.682966 ms, ub 1.908387 ms, ci 0.950

benchmarking Word/mod
mean: 15.25315 ms, lb 15.19449 ms, ub 15.31481 ms, ci 0.950
std dev: 968.6585 us, lb 935.6003 us, ub 1.006045 ms, ci 0.950

benchmarking Word/rem
mean: 15.06085 ms, lb 15.00381 ms, ub 15.11985 ms, ci 0.950
std dev: 936.9703 us, lb 905.6734 us, ub 976.3766 us, ci 0.950

benchmarking Word/euclideanRem
mean: 16.12448 ms, lb 16.05018 ms, ub 16.20500 ms, ci 0.950
std dev: 1.245996 ms, lb 1.177754 ms, ub 1.354666 ms, ci 0.950

This seems fairly consistent with my prediction; this was enough of a penalty to bring euclidean into dead last for efficiency. This also verifies that GHC isn’t doing anything special with divMod or quotRem as far as combining div with mod or quot with rem; these operations are each taking roughly half the time that their combined counterparts are taking. I had heard that quotRem was itself a primop; I wonder if that was never true in the first place or if it has simply changed since that was stated (in which case, I wonder why). All this still doesn’t explain why euclidean manages to beat quotRem, but at least it was an enlightening bit of exploration. The only remaining avenue I can think of is to explore the implementation of GHC’s primops and also trace them through some of GHC’s optimization passes, but that is more trouble than I feel like going through right now. Maybe a knowledgable commenter can more easily explain the true reason for this.

I think these numbers make me less interested in how it’s specialized for Word; I expect it’s fairly predictable, anyway.

Specializing for Only the Quotient or Remainder

Let’s see what happens when we try manually specializing euclidean for just the quotient or just the remainder.

Quotient

Let’s start with the quotient first; I’ve taken the liberty to go ahead and naively drop the remainder from the results:

a `eucQuot` b =
  case a `quotRem` b of
    (q, r) | r >= 0    -> q
           | b >  0    -> q-1
           | otherwise -> q+1

We’ve already seen that quot and rem in isolation are faster than quotRem combined, so if we can arrange to only use quot then we might be better off. That means we need to find a way to get rid of the dependency on r in the first guard. We know that if r is not zero then it has the same sign as a. We also know that quotRem is acceptable when a == 0. This means we can split the first guard into two; the first one being a >= 0 and the second being r == 0:

a `eucQuot` b =
  case a `quotRem` b of
    (q, r) | a >= 0    -> q
           | r == 0    -> q
           | b >  0    -> q-1
           | otherwise -> q+1

Now let’s go ahead and remove the guard bodies’ dependence on the result of quotRem by using quot in the bodies themselves:

a `eucQuot` b =
  case a `quotRem` b of
    (q, r) | a >= 0    -> a `quot` b
           | r == 0    -> a `quot` b
           | b >  0    -> (a `quot` b) - 1
           | otherwise -> (a `quot` b) + 1

We had simplified the guard that uses r, but we still haven’t removed the r entirely. In order to do that, we will have to modify the last two cases such that the case that r == 0 is handled appropriately. Let’s look at the b > 0 case first. We want to write it such that when r == 0 it is equivalent to q, but otherwise it is equivalent to q - 1. This can be achieved by shifting the input to quot in the original guard body by one; when r == 0, this will shift the output by one, but otherwise it will be the same:

a `eucQuot` b =
  case a `quotRem` b of
    (q, r) | a >= 0    -> a `quot` b
           | r == 0    -> a `quot` b
           | b >  0    -> ((a + 1) `quot` b) - 1
           | otherwise -> (a `quot` b) + 1

The b > 0 case is now independent of the r == 0 case, so let’s move it above the latter to make that clear:

a `eucQuot` b =
  case a `quotRem` b of
    (q, r) | a >= 0    -> a `quot` b
           | b >  0    -> ((a + 1) `quot` b) - 1
           | r == 0    -> a `quot` b
           | otherwise -> (a `quot` b) + 1

All that’s left is the otherwise case. At this point, we know a < 0 and b < 0. To make this account for the r == 0 case, we want to simply shift a the exact same way:

a `eucQuot` b =
  case a `quotRem` b of
    (q, r) | a >= 0    -> a `quot` b
           | b >  0    -> ((a + 1) `quot` b) - 1
           | r == 0    -> a `quot` b
           | otherwise -> ((a + 1) `quot` b) + 1

Now we can drop the r == 0 case entirely:

a `eucQuot` b =
  case a `quotRem` b of
    (q, r) | a >= 0    -> a `quot` b
           | b >  0    -> ((a + 1) `quot` b) - 1
           | otherwise -> ((a + 1) `quot` b) + 1

And the case expression is entirely unnecessary now:

a `eucQuot` b
  | a >= 0    = a `quot` b
  | b >  0    = ((a + 1) `quot` b) - 1
  | otherwise = ((a + 1) `quot` b) + 1

Promisingly, aside from working with boxed instead of unboxed values, this looks even simpler than the implementation of divInt#!

Remainder

Let’s compute the remainder, now. We’ll start with the original euclidean function, but where the guard bodies only give the remainder instead of a pair:

a `eucRem` b =
  case a `quotRem` b of
    (_, r) | r >= 0    -> r
           | b >  0    -> r + b
           | otherwise -> r - b

We can see that the quotient from quotRem is never used. This gives us an opportunity to replace the use of quotRem with rem:

a `eucRem` b =
  case a `rem` b of
    r | r >= 0    -> r
      | b >  0    -> r + b
      | otherwise -> r - b

And we’re done already! Again, aside from the boxed values, this looks a lot simpler than modInt# to me. The gains from this are likely to be smaller than in the case of eucQuot since it’s easier for GHC to specialize euclidean to something like this in the first place.

Benchmarking Euclidean Quotient and Remainder

Let’s see how these stack up. I’m using the same benchmark setup as earlier, just using different operations:

benchmarking Int/eucQuot
mean: 16.11784 ms, lb 16.05454 ms, ub 16.18123 ms, ci 0.950
std dev: 1.023141 ms, lb 993.6663 us, ub 1.057743 ms, ci 0.950

benchmarking Int/eucRem
mean: 18.20244 ms, lb 18.14797 ms, ub 18.25797 ms, ci 0.950
std dev: 891.3623 us, lb 856.9801 us, ub 969.7699 us, ci 0.950

benchmarking Word/eucQuot
mean: 15.22592 ms, lb 15.16014 ms, ub 15.29432 ms, ci 0.950
std dev: 1.084259 ms, lb 1.042233 ms, ub 1.133132 ms, ci 0.950

benchmarking Word/eucRem
mean: 15.22092 ms, lb 15.16380 ms, ub 15.27993 ms, ci 0.950
std dev: 941.7308 us, lb 907.7359 us, ub 1.001557 ms, ci 0.950

It appears to be quite a substantial improvement, especially for the Euclidean quotient. eucQuot is faster than div, as I hoped, and slower than quot, as I expected. For Int, eucRem is slower than both mod and rem, though I doubt it is insurmountable; I won’t bother with it right now. For Word, at least, it’s still in its proper place between mod and rem.

Conclusions

eucQuot is faster than div, at least for the types I tested it on, and assuming the arguments for its well-behavedness are convincing enough, I think I know what I’ll be using for my fixed-point library. Oddly, the combined euclidean operation is benchmarking faster even than quotRem, but I suspect that a more complex benchmark might give the speed advantage back to quotRem.

About these ads

Posted on April 28, 2011, in Computer Arithmetic and tagged , , , , , , , , , , . Bookmark the permalink. 1 Comment.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.