Efficiently Generating a Number in a Range

Imagine this scenario:

For homework Huan and Sasha both sit down to implement the same randomized algorithm in C++, running it on the same machine at the University on the same data sets. Their code is almost identical except for random number generation. Huan is eager to get to music practice and so just uses the Mersenne Twister. Sasha, on the other hand, spends several extra hours doing lots of investigation. Sasha benchmarks several fast PRNGs that have been touted lately on social media, and picks the fastest. When Huan and Sasha meet up, Huan can't wait to show off, asking Sasha, “What did you use as your PRNG?”.

“Oh, I just used the Mersenne Twister, it's built in and it seemed to work well.”, says Huan.

“Ha!”, says Sasha, “I used jsf32. It goes much faster than something old and slow like the Mersenne Twister. My program was done in a little over a minute!”.

“Hmm, well mine gone done in 40 seconds”, says Sasha and shrugs. “Anyhow, I'm off to a concert. Want to come…?”

“No”, says Sasha, “I, uh, want to take another look at my code…”

This embarrassing scenario is not particularly contrived. If your randomized algorithm isn't running as fast as you like and the bottleneck seems to be generating random numbers, somewhat counterintuitively the problem might not be with the your random number generator.

Background: Random Numbers in Practice

Most modern good-quality random number generators produce machine words filled with random bits, thus typically generating numbers in the range [0..232) or [0..264). But in many use cases users need numbers in a particular range—simple examples like rolling a die or choosing a random playing card require numbers in small fixed ranges, but many algorithms from shuffling to reservoir sampling to randomized BSTs all need numbers drawn from different ranges.

Methods

We'll cover a number of different methods. To simplify the discussion, rather than generate in the range [i..j) or [i..j] we will generate numbers in the range [0..k). With such a scheme in hand, we can, for example, generate numbers in the range [i..j) by setting k = j - i, generating a number in the range [0..k) and then adding i to it.

C++ Built-In Techniques

In many languages there are built-in facilities for getting a random number in a range. To draw a card from a 52-card deck, in scripting languages like Perl and Python, for example, we can say int(rand(52) and random.randint(0,52) respectively. In C++, we can similarly use `uniform_int_distribution.

The C++ code to implement this approach is just

uint32_t bounded_rand(rng_t& rng, uint32_t range) {
    std::uniform_int_distribution<uint32_t> dist(0, range-1);

    return dist(rng);
}

Typically built-in approaches use one of the other techniques we will describe below, but most users will simply use the facilities provided without thinking too deeply about what is happening behind the scenes, assuming that they are well enough engineered and sufficiently efficient. In C++, the built-in techniques are more complex because they must be able to deal with quite arbitrary generation engines—a generator that produces values between -3 and 17 would quite legal and can be used with std::uniform_int_distribution to produce numbers in any range, such as [0..1000). C++'s built-in facilities are thus over-engineered for most of the uses to which they are put.

Classic Modulo (biased)

Let's move from an over-engineered approach to an under-engineered one.
When I was learning to program, we generated numbers in a range (e.g., to draw a card from a 52-card deck) by writing something like rand() % 52 to get a random number in the range [0..52).

In C++, we could write this approach as

uint32_t bounded_rand(rng_t& rng, uint32_t range) {
    return rng() % range;
}

Although this approach is simple, it also illustrates why getting numbers within a range is typically slow—it requires division (to calculate the remainder produced the % operation). Division is typically at least an order of magnitude slower than other arithmetic operations, and so this single arithmetic operation takes longer than all the work done by a fast PRNG.

But in addition being slow, it is also biased. To understand why rand() % 52 produces biased numbers, if we assume that rand() produces numbers in the range [0..232), observe that 52 does not perfectly divide 232, it divides it 82595524 times with remainder 48. Meaning that if we use rand() % 52, there will be 82595525 ways to select the first 48 cards from our 52-card deck and only 82595524 ways to select the final four cards . In other words, there is a 0.00000121% bias against these last four cards (perhaps these are the kings!). Back when I was a student writing homework assignments rolling dice or drawing cards, no one really worried about these tiny biases, but as the number increases, the bias increases linearly. For a 32-bit PRNG, a bounded range of less than 224 has a bias of less than 0.5% but above 231 the bias is 50%—some numbers will occur half as often as others.

Almost all of the remaining techniques adopt strategies to avoid bias, but it is probably worth realizing that for a 64-bit PRNG the amount of bias in typical use cases is likely to be negligible.

One other concern we might have is that some generators have weak low-order bits. For example, the Xoroshiro+ and Xoshiro+ families of PRNGs have low-order bits that fail statistical tests. When we perform % 52 (because 52 is even) we pass through the lowest bit straight through into the output.

FP Multiply (biased)

Another common technique is to use a PRNG that generates floating point numbers in the unit interval, [0..1), and then convert those numbers into the desired range. This is the approach taken by Perl, which recommends using int(rand(10)) to generate an integer in the range [0..10).

In C++, we would write this approach as:

static uint32_t bounded_rand(rng_t& rng, uint32_t range) {
    double zeroone = 0x1.0p-32 * rng();
    return range * zeroone;
}

(Note that we could also write 0x1.0p-32 * rng() as ldexp(rng(), -32) but doing so appears to be a slower approach.)

This approach is just as biased as the Classic Modulo approach, but the bias falls differently. For example, if we were choosing numbers in the range [0..52), the numbers 0, 13, 26 and 39 appear once less often than the others.

This version is also more annoying to generalize to 64 bits because it requires a floating point type with at least 64 bits of mantissa, which is true for the x86 extended precision floating point used for long double on x86 Linux and macOS but it is not universally portable—on some systems long double is equivalent to double.

On the positive side this approach is that it works better than modulo-based approaches on PRNGs that have weak low order bits.

Integer Multiplication (biased)

The previous method can be adjusted to used fixed-point arithmetic rather than floating point. Essentially we just multiply throughout by 232.

uint32_t bounded_rand(rng_t& rng, uint32_t range) {
    uint32_t x = rng();
    uint64_t m = uint64_t(x) * uint64_t(range);
    return m >> 32;
}

It might appear that this version requires 64-bit arithmetic, but on x86 CPUs, a good compiler will compile this code to a 32-bit mult instruction (which yields two 32-bit outputs, one of which is the return value). We should expect this version to be fast, but biased in exactly the same way as the previous multiplication method.

Division with Rejection (unbiased)

We can rearrange the above multiplication-based scheme into a division based scheme. Instead of calculating x * range / 2**32, we calculate x / (2**32 / range). Because we are doing integer arithmetic, this version has different rounding behavior and will sometimes generate values outside of the desired range. If we reject those values, we end up with an unbiased technique.

This version uses a trick to divide 232 by range without using any 64-bit math. For unsigned integers, unary negation of r calculates the positive value 232r.

Thus C++ code to generate numbers via division with rejection is

uint32_t bounded_rand(rng_t& rng, uint32_t range) {
    // calculates divisor = 2**32 / range
    uint32_t divisor = ((-range) / range) + 1;
    if (divisor == 0) // overflow, it's really 2**32
        return 0;
    for (;;) {
        uint32_t val = rng() / divisor;
        if (val < range)
            return val;
    }
}

This approach also requires two slow division-based operations which are typically slower than other arithmetic operations, so we should not expect it to be quick.

Debiased Modulo (twice)

Returning to our example of choosing a card from a 52-card deck using a 32-bit PRNG, if there are 82595525 ways to select the first 48 cards from our deck and one fewer ways for the remaining four cards, we can consider one of those 82595525 ways disallowed. Whenever our underlying generator asks generates a disallowed number, we discard it and ask for another.

The code for OpenBSD's arc4random_uniform (which is also used on OS X and iOS) adopts this strategy. For many use code looked was as follows:

/*
 * Calculate a uniformly distributed random number less than upper_bound
 * avoiding "modulo bias".
 *
 * Uniformity is achieved by generating new random numbers until the one
 * returned is outside the range [0, 2**32 % upper_bound).  This
 * guarantees the selected random number will be inside
 * [2**32 % upper_bound, 2**32) which maps back to [0, upper_bound)
 * after reduction modulo upper_bound.
 */
u_int32_t
arc4random_uniform(u_int32_t upper_bound)
{
    u_int32_t r, min;

    if (upper_bound < 2)
        return (0);

#if (ULONG_MAX > 0xffffffffUL)
    min = 0x100000000UL % upper_bound;
#else
    /* Calculate (2**32 % upper_bound) avoiding 64-bit math */
    ... ugly code omitted ...
#endif

    /*
     * This could theoretically loop forever but each retry has
     * p > 0.5 (worst case, usually far better) of selecting a
     * number inside the range we need, so it should rarely need
     * to re-roll.
     */
    for (;;) {
        r = arc4random();
        if (r >= min)
            break;
    }

    return (r % upper_bound);
}

The code looks a little complicated, but it is basically doing what we state previously, calculating the remainder of 232 divided by the range to work out what to discard. In the code above, it resorts to 64-bit math to calculate this remainder; I have omitted some rather ugly code that would be used on 32-bit systems. In 2012, the to calculate min was revised to something much simpler that only requires 32-bit math (and a little insight).

    /* 2**32 % x == (2**32 - x) % x */
    min = (-upper_bound) % upper_bound;

Because upper_bound is an unsigned 32-bit int, -upper_bound calculates the 32-bit value 0x100000000UL - upper_bound which has the same remainder after dividing by upper_bound as 0x100000000UL.

This technique eliminates bias, but it requires two time consuming modulo operations for each random number (and may require the underlying generator to produce more than one number). Thus we should expect it to be about half the speed of the classic biased approach.

Here's my C++ implementation of this approach:

uint32_t bounded_rand(rng_t& rng, uint32_t range) {
    // calculates 2**32 % range
    uint32_t t = (-range) % range;
    for (;;) {
        uint32_t r = rng();
        if (r >= t)
            return r % range;
    }
}

Debiased Modulo (once)

Java adopts a different approach to generating a number in a range that manages to use only one modulo operation except in the relatively rare occasions when rejection occurs. The code is

static uint32_t bounded_rand(rng_t& rng, uint32_t range) {
    uint32_t x, r;
    do {
        x = rng();
        r = x % range;
    } while (x - r > (-range));
    return r;
}

It may take a little thought to figure out why this variant works. Unlike the previous version based on remainders which removes bias by removing some of the lowest values from the underlying generation engine, this version filters out values from the top part of the engine's range.

Debiased Integer Multiplication

Much as we removed bias from the modulo technique, we can also remove bias from the integer multiplication method. This method is due to Daniel Lemire, who has an excellent recent paper on this topic—he mentions most of the above techniques and includes more detailed explanations than the ones I have provided.

Except that this version exhibits bias because it has different rounding behavior from our earlier code (the former can produce values greater than range, the latter cannot). Lemire's full approach thus needs to adopt a different strategy to remove bias; details are in his paper.

uint32_t bounded_rand(rng_t& rng, uint32_t range) { uint32_t t = (-range) % range; do { uint32_t x = rng(); uint64_t m = uint64_t(x) * uint64_t(range); uint32_t l = uint32_t(m); } while (l < t); return m >> 32; }

Here's my implementation of Lemire's method:

uint32_t bounded_rand(rng_t& rng, uint32_t range) {
    uint32_t x = rng();
    uint64_t m = uint64_t(x) * uint64_t(range);
    uint32_t l = uint32_t(m);
    if (l < range) {
        uint32_t t = (-range) % range;
        while (l < t) {
            x = rng();
            m = uint64_t(x) * uint64_t(range);
            l = uint32_t(m);
        }
    }
    return m >> 32;
}

Lemire's approach is unlike the others because it relies on multiplication doing most of the work and seems like it won't use the modulus operator unless the range is large.

Except that this version exhibits bias because it has different rounding behavior from our earlier code (the former can produce values greater than range, the latter cannot). Lemire's full approach thus needs to adopt a different strategy to remove bias; details are in his paper.

Daniel Lemire has an excellent recent paper on this topic. He mentions most of the above techniques and includes more detailed explanations than the ones I have provided.

The essence of Lemire's method can understood by recalling our earlier division-based method and rearranging it. That code calculated rng() / (2**32 / range) which we can rearrange to rng()*range / (2**32). This rearrangement would seem to imply the following code:

An Division-Free Approach

Our final approach avoids division and remainder operations entirely. Instead it uses a simple masking operation to get a random number in the range [0..2^k) where k is the smallest value such that 2^k is greater than the range. If the value is too large for our range, we discard and try another. Code below:

uint32_t bounded_rand(rng_t& rng, uint32_t range) {
    uint32_t mask = ~uint32_t(0);
    --range;
    mask >>= __builtin_clz(range|1);
    uint32_t x;
    do {
        x = rng() & mask;
    } while (x > range);
    return x;
}

This approach was adopted by Apple when (in the macOS Sierra release) they made their own revision to the code for arc4random_uniform to fix the ugly code we mentioned earlier.

Benchmarking the Basic Approaches

We now have several approaches to evaluate. Unfortunately, benchmarking is tricky when we're concerned about the cost of a single division operation. No benchmark is going to capture all the factors that might be in play in your application and no guarantee that what is best for your application is necessarily best for mine.

We'll use three benchmarks, and test with a plethora of different PRNGs.

Large Shuffle Benchmark

Perhaps the most obvious benchmark to run is a shuffle benchmark. In this benchmark we will simulate performing a large shuffle. To sort an array of size N, we must generate numbers in the ranges [0..N), [0..(N-1)), … [0…1]. For this benchmark, we will assume N is as large as possible (232-1 for uint32_t). The code is

for (uint32_t i = 0xffffffff; i > 0; --i) {
    uint32_t bval = bounded_rand(rng, i);
    assert(bval < i);
    sum += bval;
}

Note that we “use” each number by adding it to sum (to avoid it being optimized away) but do not actually do any shuffling to keep the focus on number generation.

Mersenne Twister Results

The results below show performance of this benchmark for the methods discussed above, using the Mersenne Twister testing at both 32-bit (with libstdc++'s std::mt19937) as discussed in this article and analogous 64-bit code (with libstdc++'s std:mt19937_64). Results are the geometric mean of 15 runs with different seeds, then normalized to make the classic biased mod method have unit time.

Large Shuffle Benchmark, Mersenne Twister

It might seem like we have some clear answers about performance—we seem to have a clear progression of better techniques and also perhaps wonder what the designers of libstdc++ were thinking when they wrote such a terrible implementation for 32-bit sizes—but as is so often the case with benchmarking, the situation is more complex than a these results might seem to show. First, there is a risk that our results might somehow be specific to the Mersenne Twister, so we will broaden the set of PRNGs tested. Second, there may be a subtle issue with the benchmark itself. Let's address the first problem first.

Multiple PRNGs Results

For 32-bit PRNGs, we'll test with arc4_rand32, chacha8r, gjrand32, jsf32, mt19937, pcg32, pcg32_fast, sfc32, splitmix32, xoroshiro64+, xorshift*64/32, xoshiro128+, xoshiro128**, and for 64-bit PRNGs we'll use gjrand64, jsf64, mcg128, mcg128_fast, mt19937_64, pcg64, pcg64_fast, sfc64, splitmix64, xoroshiro128+, xorshift*128/64, xoshiro256+, xoshiro256*. These give us some slow PRNGs and a number very fast ones. Here are the results:

Large Shuffle Benchmark, Many PRNGs

These results do have some key differences. Faster PRNGs will shift the balance towards the bounding code and thus differences between approaches become more pronounced, especially for 64 bit PRNGs. With this broader collection of PRNGs the libstc++ implementation now no longer seems terrible.

Conclusions

In this benchmark, the biased multiplication-based approach is fastest by a significant margin. There are many situations where bounds will be small relative to the size of the PRNG and performance is absolutely critical. In these situations a negligible amount of bias may be unlikely to have noticeable effect but PRNG speed will. One such example is random-pivot Quicksort. For unbiased approaches, the bitmask technique also seems promising.

But before we draw too many conclusions, we should recognize that a significant issue with this benchmark is that the bulk of the time is spent with very high bounds, which likely over-emphasizes large ranges. Thus we shall move on to a second benchmark.

Small Shuffle Benchmark

This benchmark is similar to the previous one, but reduces the size of the array shuffled a much smaller array. The code is

for (uint32_t j = 0; j < 0xffff; ++j) {
    for (uint32_t i = 0xffff; i > 0; --i) {
        uint32_t bval = bounded_rand(rng, i);
        assert(bval < i);
        sum += bval;
    }
}
Mersenne Twister Results

Small Shuffle Benchmark, Mersenne Twister

Multiple PRNGs Results

Small Shuffle Benchmark, Many PRNGs

Conclusions

This benchmark avoids over-emphasis on larger bounds and more accurately reflects real use cases, but now omits larger bounds entirely.

All Ranges Benchmark

This benchmark aims to avoid the deficiencies in the previous two by doing some testing at each power of two size, ensuring that each size is represented but not overrepresented.

for (uint32_t bit = 1; bit != 0; bit <<= 1) {
    for (uint32_t i = 0; i < 0x1000000; ++i) {
        uint32_t bound = bit | (i & (bit - 1));
        uint32_t bval = bounded_rand(rng, bound);
        assert(bval < bound);
        sum += bval;
    }
}
Mersenne Twister Results

Small Shuffle Benchmark, Mersenne Twister

Multiple PRNGs Results

Small Shuffle Benchmark, Many PRNGs

Conclusions

Many of our conclusions remain. The biased multiply is fast if we can tolerate bias, and the bitmask scheme seems like a good all rounder.

We would be done at this point if we weren't willing to go back, look critically at our code, and make adjustments.

Making Improvements

Faster Threshold-Based Discarding

A few of our algorithms have code involving a threshold, such as

uint32_t bounded_rand(rng_t& rng, uint32_t range) {
    // calculates 2**32 % range
    uint32_t t = (-range) % range;
    for (;;) {
        uint32_t r = rng();
        if (r >= t)
            return r % range;
    }
}

When range is small compared to the output range of the PRNG, most of the time the number will be much greater than the threshold. Thus if we can have an estimate of the threshold that might be a little high, we can save an expensive modulo operation.

The code below achieves this goal:

uint32_t bounded_rand(rng_t& rng, uint32_t range) {
    uint32_t r = rng();
    if (r < range) {
        uint32_t t = (-range) % range;
        while (r < t)
            r = rng();
    }
    return r % range;
}

This change can be applied to both the “Debiased Mod (twice)” method (above) and the “Debiased Mult” approach. The idea is due to Lemire

Small-Shuffle Benchmark Results

This change significantly speeds the small-shuffle benchmark for both the integer-multiply approach and the two-modulo approach, in each case moving their performance close to that

Optimizing Modulo

Bonus Results, PRNGs compared