A Birthday Test: Quickly Failing Some Popular PRNGs

M.E. O'Neill

2018-06-09 15:35

My two favorite statistical test suites for PRNGs are PractRand and TestU01, but while they detect statistical flaws of many kinds, detecting those flaws often takes a long time to run, and for various reasons some flaws take a very long time to detect (if they are detected at all). In this post, we'll put together a test based around a very classic example from the world of probability theory, the Birthday Problem.

In doing so, we'll develop a test that can detect deviations from random behavior in some widely used PRNGs. We'll eventually be able to detect flaws in std::minstd_rand and XorShift 32 in 0.01 seconds, and SplitMix, XorShift 64 and Xoroshiro64* in less than half an hour.

The Birthday Problem

In the classic Birthday Problem, we ask how many people we need to have in a room before it is more likely than not that we have two people who share a birthday. Many people are surprised at how low the number is, only 23. And by the time we have 58 people, the chance is more than 99%.

The Birthday Problem generalizes to the formula, \[ p \approx 1-e^{\frac{-n^2}{2d}}, \] which rearranges to \[ n \approx \sqrt{- 2 d \log(1-p)}, \] where \(p\) is the probability of success (e.g., 0.5 (or 0.99) in the preceding standard example above), \(d\) is the number of possible values (e.g., 365 in the standard birthday problem), and \(n\) is the number of samples observed.

Also, using a Poisson approximation, we can say that the expected number of repeats, \(r\), will be \[ r \approx -\log(1-p) \approx \frac{n^2}{2d}, \] but we instead use an estimate based on a binomial distribution (which is usually cited when discussing number of collisions in a hash table), \[ r \approx n - d \left(1-\left(1-\frac{1}{d}\right)^n\right). \] This second formula is more accurate when the expected number is large and the range of the generator is relatively small.

We can use the birthday problem as the basis of a test for random number generators. If our generator can produce \(d\) different outputs (e.g., 2³² or 2⁶⁴), how many outputs should we see before there is, say, a 99% chance that we ought to have observed a repeat? Plugging in the numbers, we find that for a generator with 2³² (i.e., 4,294,967,296) possible outputs, after only 198,893 outputs, we'll expect to see on average 4.6 repeats with about a 1% chance of having observed none at all. If we scan 926,820 outputs, the expected number of repeats rises to 100 and the chance of observing none at all is vanishingly small, a hard to imagine 3.7 × 10^-44.

Building a Simple Test

Here's the outline of a test. Given either a desired p-value (e.g., 50%) or a desired expected number of repeats (e.g., 100), and knowledge of the output range of the generator, we

Calculate how many outputs to look at and how many repeats we expect to see,
Collect that many outputs, and,
Count how many repeats we saw.

We can then determine how plausible that number is, given what we were expecting.

The Driver

Here's the driver program in C++, assuming that RNG_TYPE is a PRNG following the usual C++11 conventions.

int main(int argc, const char* argv[]) {
    long double desired = 0.01l;
    if (argc >= 2) {
        desired = strtold(argv[1], nullptr);
        assert(desired > 0.0l);
    }

    double factor = (desired < 1.0)
                      ? sqrt(-2.0l*log(desired))        // p-value
                      : sqrt(2.0l*desired);             // expected repeats

    long double rng_range = 1.0l + RNG_TYPE::max() - RNG_TYPE::min();
    size_t sample_size = ceil(factor * sqrt(rng_range));
    long double expected =
        sample_size - rng_range * (- expm1(sample_size*log1p(-1/rng_range)));
    float p_zero   = exp(-expected);

    std::cout << "Testing " << STRINGIFY(RNG_TYPE) << "\n";
    std::cout << "\t" << RNG_TYPE::min() << ".."
              << RNG_TYPE::max() << " is range of generator\n";
    std::cout << "\t" << sample_size << " outputs to scan for duplicates ["
              << factor << "*sqrt(range)]\n";
    std::cout << "\t" << expected << " duplicates expected on average\n";
    std::cout << "\t" << p_zero << " is probability of zero duplicates\n";
    uint32_t seed = std::random_device{}();
    std::cout << "\t0x" << std::hex << seed << std::dec
              << " is seed for this run\n";

    std::cout << "---\n";

    RNG_TYPE rng(seed);
    size_t repeats = count_repeats(rng, sample_size);

    std::cout << "---\n";

    std::cout << repeats << " repeats, p-value = ";

    long double p_value = 0;
    for (size_t k = 0; k <= repeats; ++k) {
        long double pdf_value =
            exp(log(expected)*k - expected - lgamma(1.0l+k));
        p_value += pdf_value;
        if (p_value > 0.5)
            p_value = p_value - 1;
    }
    if (p_value < 0)
        std::cout << "1 - " << -p_value << "\n";
    else
        std::cout << p_value << "\n";

    return 0;
}

We can imagine getting results like

unix% ./birthday 0.01
Testing std::mt19937
    0..4294967295 is range of generator
    198893 outputs to scan for duplicates [3.03485*sqrt(range)]
    4.60511 duplicates expected on average
    0.0100006 is probability of zero duplicates
    0xd533c680 is seed for this run
---
[... counting occurs ...]
---
3 repeats, p-value = 0.324873

unix% ./birthday 0.01
Testing std::mt19937
    0..4294967295 is range of generator
    198893 outputs to scan for duplicates [3.03485*sqrt(range)]
    4.60511 duplicates expected on average
    0.0100006 is probability of zero duplicates
    0x6365cd1f is seed for this run
---
[... counting occurs ...]
---
6 repeats, p-value = 1 - 0.182648

Notice that rerunning the algorithm with a different seed allows a different result. In these two cases, the p-values are reasonable (not very close to zero or 1).

We could also run more expansive tests. Below, we've run a test where we would expect 100 repeats.

unix% ./birthday 100
Testing std::mt19937
    0..4294967295 is range of generator
    926820 outputs to scan for duplicates [14.1421*sqrt(range)]
    99.9929 duplicates expected on average
    3.78351e-44 is probability of zero duplicates
    0x474a271e is seed for this run
---
[... counting occurs ...]
---
116 repeats, p-value = 1 - 0.0521379

Again, the p-value above is reasonable.

What I've not shown yet is how to implement count_repeats.

Counting Repeats #1: Store and Sort

The simplest way to count repeats is to store all the output in an array, sort the array, and then scan linearly through it looking for repeats—because items are in order, identical items will be next to each other.

Here's the code:

size_t count_repeats(RNG_TYPE& rng, size_t expanse)
{
    using rngval_t = RNG_TYPE::result_type;

    // Allocate an array to hold all the output.
    rngval_t* values = new rngval_t[expanse];

    // Store all the output from the generator.
    for (size_t i = 0; i < expanse; ++i)
        values[i] = rng();

    // Sort the array
    std::sort(values, values+expanse);

    // Scan the array for duplicates
    rngval_t prev = values[0];
    size_t duplicates = 0;
    for (size_t i = 1; i < expanse; ++i) {
        rngval_t curr = values[i];
        if (prev == curr) {
            std::cout << "    - repeat of " << curr << " found\n";
            ++duplicates;
        }
        prev = curr;
    }

    return duplicates;
}

Results for 32-Bit PRNGs

Here are the results, with a little extra instrumentation added to print what is going on and time how long each part takes.

unix% ./birthday.mt19937.sort 0.01
Testing std::mt19937
    0..4294967295 is range of generator
    198893 outputs to scan for duplicates [3.03485*sqrt(range)]
    4.60511 duplicates expected on average
    0.0100006 is probability of zero duplicates
    0xb0856615 is seed for this run
---
- Testing for repeats started...
  - Allocating memory started...
    - allocated array of size 198893.
  - Allocating memory completed (1.8799e-05 seconds)
  - Storing PRNG output started...
  - Storing PRNG output completed (0.00156919 seconds)
  - Sorting started...
  - Sorting completed (0.0153883 seconds)
  - Scanning for duplicates started...
    - repeat of 44686935 found
    - repeat of 1119175552 found
    - repeat of 2307769765 found
    - repeat of 3839874543 found
    - repeat of 4249473610 found
  - Scanning for duplicates completed (0.000235218 seconds)
- Testing for repeats completed (0.0172872 seconds)
---
5 repeats, p-value = 1 - 0.315123

We see that the 32-bit Mersenne Twister passes the test in less than a thirtieth of a second. Or we could try out a longer test:

unix% ./birthday.mt19937.sort 20
Testing std::mt19937
    0..4294967295 is range of generator
    414487 outputs to scan for duplicates [6.32456*sqrt(range)]
    19.9994 duplicates expected on average
    2.06239e-09 is probability of zero duplicates
    0x54c21e82 is seed for this run
---
- Testing for repeats started...
  - Allocating memory started...
    - allocated array of size 414487.
  - Allocating memory completed (3.0691e-05 seconds)
  - Storing PRNG output started...
  - Storing PRNG output completed (0.00466165 seconds)
  - Sorting started...
  - Sorting completed (0.0540854 seconds)
  - Scanning for duplicates started...
    - repeat of 474221760 found
    - repeat of 670096292 found
    - repeat of 780413921 found
    - repeat of 895436278 found
    - repeat of 1042072169 found
    - repeat of 1267229417 found
    - repeat of 1457725784 found
    - repeat of 1509219941 found
    - repeat of 1740429513 found
    - repeat of 2019745416 found
    - repeat of 2373928640 found
    - repeat of 2399448522 found
    - repeat of 2676526217 found
    - repeat of 3709805762 found
    - repeat of 3754906861 found
    - repeat of 3915683171 found
    - repeat of 4010268093 found
    - repeat of 4030683995 found
  - Scanning for duplicates completed (0.000699004 seconds)
- Testing for repeats completed (0.059584 seconds)
---
18 repeats, p-value = 0.381473

Again, in less than a sixteenth of a second, we've confirmed that the 32-bit Mersenne twister produces the right number of repeats. Cool.

Let's test a few more PRNGs and summarize the results:

Scheme	Storing (seconds)	Sorting (seconds)	Scanning (seconds)	Total (seconds)	Repeats (count)	p-value
`std::mt19937`	0.00491	0.04936	0.00046	0.05444	19	0.47025
`gjrand32`	0.00406	0.05736	0.00049	0.06193	19	0.47025
`jsf32`	0.00211	0.05677	0.00050	0.05940	20	0.55909
`pcg32`	0.00258	0.05774	0.00050	0.06089	20	0.55909
`pcg32_fast`	0.00202	0.05635	0.00049	0.05892	19	0.47025
`sfc32`	0.00230	0.05520	0.00049	0.05815	20	0.55909
`xoroshiro64+`	0.00235	0.05625	0.00049	0.05923	21	0.64369
`xoroshiro64*`	0.00256	0.05570	0.00049	0.05885	20	0.55909
`xoroshiro64**`	0.00281	0.05873	0.00049	0.06182	21	0.64369
`xorshift*64/32`	0.00320	0.05897	0.00050	0.06267	19	0.47031
`pcg32_once_insecure`	0.00262	0.05861	0.00049	0.06173	0	2.06e-9
`xorshift32`	0.00318	0.05884	0.00049	0.06260	0	2.06e-9
`std::minstd_rand`	0.00431	0.03838	0.00038	0.04357	0	2.06e-9

These results represent median values for 31 runs (because each run is very quick!). The timings for storing the PRNG output provide a way to compare all the generators except std::minstd_rand (which has a smaller output range and thus requires fewer outputs). We can see that std::mt19937 is rather slower than the others (although std::minstd_rand would be even slower if it were recording the same number of outputs). It may initially seem anomalous that the std::mt19937 implementation sorts faster than other generators with the same 32-bit output range, but that is because the C++ implementation returns a uint_fast32_t (which is actually a 64-bit integer) rather than a uint32_t—it turns out that uint_fast32_t really is faster (but also more wasteful of memory).

But most notable is the fact that std::minstd_rand, (untruncated) Xorshift 32 and pcg32_once_insecure fail this statistical test due to their wildly improbable p-values. This test quickly detects their lack of repeats. One of the reasons pcg32_once_insecure has the name it does is to telegraph to users the fact that it is not intended for general-purpose use but for specialized situations where lack of repeats might be desirable. In contrast, XorShift 32 and std::minstd_rand are usually presented as a general-purpose PRNGs. I would argue that their failure here shows that they do not meet the requirements of a general-purpose PRNG. (In contrast, XorShift* 64/32, which adds a multiplicative output function and returns only the high-order 32-bits, is fine.)

It's worth noting how quickly we can fail these generators. For MINSTD, the 99% test only needs 140,639 × 4 = 562,556 bytes and the more 20-repeats test only needs just over double that, 293,086 outputs. In 1987 when Steve Park and Keith Miller wrote the paper Random Number Generators: Good Ones Are Hard to Find, which introduced this generator, you could buy a Mac II with 4 MB of RAM which could have run either test in a reasonable time (as could any available Unix workstation, even a 1983 vintage Sun Workstation!). In other words, this generator could have been failed at launch in 1987.

It might seem strange to even bother to test MINSTD or XorShift 32 in this way, given that we can tell by simply inspecting their designs that they will never repeat any outputs. But these tests show that how few outputs we need to detect this design choice as a statistical flaw. After only 54,563 outputs from MINSTD (the 50% probability point for repeats), its output is increasingly statistically anomalous.

Results for 64-Bit PRNGs

We can also run the test on 64-bit PRNGs. Here it is running on the 64-bit Mersenne Twister:

unix% ./birthday.mt19937_64.sort 0.01
Testing std::mt19937_64
    0..18446744073709551615 is range of generator
    13034599790 outputs to scan for duplicates [3.03485*sqrt(range)]
    4.60517 duplicates expected on average
    0.01 is probability of zero duplicates
    0x37a0a001 is seed for this run
---
- Testing for repeats started...
  - Allocating memory started...
    - allocated array of size 13034599790.
  - Allocating memory completed (3.6719e-05 seconds)
  - Storing PRNG output started...
  - Storing PRNG output completed (78.9106 seconds)
  - Sorting started...
  - Sorting completed (1719.31 seconds)
  - Scanning for duplicates started...
    - repeat of 3097887744857163366 found
    - repeat of 5135732003217760496 found
    - repeat of 5958311621514904927 found
  - Scanning for duplicates completed (14.3795 seconds)
- Testing for repeats completed (1812.6 seconds)
---
3 repeats, p-value = 0.324864

With 13,034,599,789 outputs to store and scan, it's obviously a bit slower. It gets all the data stored in less than 80 seconds, but then sorting grinds away for half an hour.

A larger run, looking for 20 duplicates, takes even longer:

unix% ./birthday.mt19937_64.sort 20
Testing std::mt19937_64
    0..18446744073709551615 is range of generator
    27163758263 outputs to scan for duplicates [6.32456*sqrt(range)]
    20 duplicates expected on average
    2.06115e-09 is probability of zero duplicates
    0xe8f137dd is seed for this run
---
- Testing for repeats started...
  - Allocating memory started...
    - allocated array of size 27163758263.
  - Allocating memory completed (3.709e-05 seconds)
  - Storing PRNG output started...
  - Storing PRNG output completed (175.529 seconds)
  - Sorting started...
  - Sorting completed (3924.11 seconds)
  - Scanning for duplicates started...
    - repeat of 20077808886971205 found
    - repeat of 637172222824756178 found
    - repeat of 1163359596874921636 found
    - repeat of 1557608038662986017 found
    - repeat of 2522575772404207458 found
    - repeat of 3525508532012308475 found
    - repeat of 3871057790955701391 found
    - repeat of 4726677753815860247 found
    - repeat of 5979125123285167945 found
    - repeat of 7899230134743204273 found
    - repeat of 7970432731437732158 found
    - repeat of 8488249810808916119 found
    - repeat of 9323300497538043642 found
    - repeat of 9750935714613327132 found
    - repeat of 10503536482358837444 found
    - repeat of 11038662378093814046 found
    - repeat of 11328089922193057506 found
    - repeat of 11824685809627306531 found
    - repeat of 14708280948401454462 found
    - repeat of 15326753147701364060 found
    - repeat of 15747525336113554261 found
    - repeat of 16636213418247152190 found
    - repeat of 17510404778311870795 found
    - repeat of 18092357997477209913 found
  - Scanning for duplicates completed (45.5912 seconds)
- Testing for repeats completed (4145.23 seconds)
---
24 repeats, p-value = 1 - 0.156773

These tests are rather memory hungry, however. The first test requires 97.1 GB, and the second requires 202.4 GB. That's a lot of RAM! I'm running these tests on a beefy server with 512 GB of RAM, but not everyone has such a powerful server at their disposal. It'd be good to have something faster and more memory efficient.

While we're here though, let's test a few others. Here are results for the 99% test (medians of three runs).

Scheme	Storing (seconds)	Sorting (seconds)	Scanning (seconds)	Total (seconds)	Repeats (count)	p-value
`std::mt19937_64`	82.9455	1925.06	15.9029	2024.34	2	0.16209
`gjrand64`	74.9326	1951.05	15.3217	2041.67	5	0.68487
`jsf64`	49.3962	1976.47	15.5237	2041.68	4	0.51226
`mcg128`	55.6478	1954.42	15.5654	2024.35	2	0.16209
`mcg128_fast`	52.3176	1952.29	15.4270	2020.43	5	0.68487
`pcg64`	67.0258	1943.01	15.7909	2026.46	4	0.51226
`pcg64_fast`	60.8966	1953.03	15.7946	2030.51	2	0.16209
`sfc64`	48.4997	1971.91	16.3415	2036.75	4	0.51226
`xoroshiro128+`	47.7538	1955.20	15.8975	2018.56	5	0.68487
`xoroshiro128**`	54.4471	1946.13	16.1368	2015.78	7	0.90450
`xorshift*128/64`	93.3112	2009.93	17.9337	2119.94	7	0.90450
`pcg64_once_insecure`	55.9306	1941.75	15.9162	2014.60	0	0.01
`xorshift64`	65.981	2031.29	17.3847	2113.96	0	0.01
`splitmix64`	51.3770	1947.12	15.2408	2014.70	0	0.01

And here is the 20-expected-repeats test (median of three runs).

Scheme	Storing (seconds)	Sorting (seconds)	Scanning (seconds)	Total (seconds)	Repeats (count)	p-value
`std::mt19937_64`	183.917	4198.77	40.0288	4441.60	21	0.64370
`gjrand64`	169.248	4270.51	37.0672	4481.24	23	0.78749
`jsf64`	105.242	4278.48	39.0786	4422.70	22	0.72061
`mcg128`	114.237	4290.56	38.0371	4443.02	23	0.78749
`mcg128_fast`	108.084	4243.50	39.1832	4402.13	24	0.84323
`pcg64`	140.412	4236.27	40.0860	4418.58	22	0.72061
`pcg64_fast`	128.951	4239.96	37.6571	4406.48	24	0.84323
`sfc64`	102.542	4296.64	41.2401	4440.63	26	0.92211
`xoroshiro128+`	100.587	4296.52	41.0189	4435.56	18	0.38142
`xoroshiro128**`	114.427	4257.38	43.1764	4414.22	20	0.55909
`xorshift*128/64`	204.078	4245.56	38.7307	4491.49	21	0.64370
`pcg64_once_insecure`	115.781	4223.30	37.2429	4376.37	0	2.06e-9
`xorshift64`	136.411	4257.19	37.9699	4442.13	0	2.06e-9
`splitmix64`	107.673	4266.90	40.2020	4414.78	0	2.06e-9

Again, we can use the time for storing the numbers as a microbenchmark. SplitMix is really fast, but so are mcg128_fast and Xoroshiro128+. The Mersenne Twister is notably slower than the others. Blackman's gjrand64 seems to lag behind sfc64 and jsf64 even though all three are quite similar designs. In this group of generally fast PRNGs, I'm content for the 64-bit-output PCG variants to fall somewhere in the middle, both because 128-bit-state PCG variants can never beat the mcg128 linear congruential generators (given that PCG applies a permuting output function to an underlying linear congruential generator); and because benchmark results can be so variable, as we'll see throughout this article.

But more importantly, like MINSTD, SplitMix never repeats itself, producing a low p-value and thus failing the test. As expected, pcg64_once_insecure fails too. But if you need a long computation on a beefy machine to determine that, maybe it's purely academic? A more efficient test would be helpful.

Counting Repeats #2: Store Values in a Hash Table

One way to avoid sorting is to store all the data in a hash table. If the number is already in the table when we go to insert it, we have a repeat.

Here's code for hash-table–based counting:

size_t count_repeats(RNG_TYPE& rng, size_t expanse)
{
    using rngval_t = RNG_TYPE::result_type;

    std::unordered_set<rngval_t> sample;

    // Store all the output from the generator.

    size_t duplicates = 0;
    for (size_t i = 0; i < expanse; ++i) {
        auto rval = rng();
        auto result = sample.insert(rval);
        if (!result.second) {
            std::cout << "    - repeat of " << rval << " found, step " << i
                      << "\n";
            ++duplicates;
        }
    }

    return duplicates;
}

Results for 32-Bit PRNGs

If we test the Mersenne Twister, we see results like

unix% ./birthday.mt19937.hashtable 0.01
Testing std::mt19937
    0..4294967295 is range of generator
    198893 outputs to scan for duplicates [3.03485*sqrt(range)]
    4.60511 duplicates expected on average
    0.0100006 is probability of zero duplicates
    0x882c2a09 is seed for this run
---
- Testing for repeats started...
  - Allocating memory started...
    - allocated hash table of reserved for 198893 elements
  - Allocating memory completed (0.00148097 seconds)
  - Checking PRNG output started...
    - repeat of 2326356096 found, step 102565
    - repeat of 764086403 found, step 104931
    - repeat of 4140509098 found, step 116132
    - repeat of 2835992082 found, step 139268
    - repeat of 2448493342 found, step 171267
    - repeat of 1230289205 found, step 198622
  - Checking PRNG output completed (0.0391443 seconds)
- Testing for repeats completed (0.0406811 seconds)
---
6 repeats, p-value = 1 - 0.182648

The nice thing about this approach is that it tells us exactly when each repeat occurs. The less-good thing is that it's a little slower and uses 4.7 MB of RAM for data and up to 3 MB in malloc bookkeeping overhead, more than ten times more than our previous approach. And that's with GCC's C++ standard library; Clang's library requires 6 MB.

Nevertheless, let's check out how some of our 32-bit PRNGs perform:

Scheme	Checking (seconds)	Total (seconds)	Repeats (count)	p-value
`std::mt19937`	0.11477	0.11647	19	0.47025
`gjrand32`	0.09952	0.10150	20	0.55909
`jsf32`	0.10681	0.10842	18	0.38141
`pcg32`	0.11272	0.11486	22	0.72060
`pcg32_fast`	0.09995	0.10137	21	0.64369
`sfc32`	0.09577	0.09740	19	0.47025
`xoroshiro64+`	0.09410	0.09662	19	0.47025
`xoroshiro64*`	0.09206	0.09376	20	0.55909
`xoroshiro64**`	0.09275	0.09543	19	0.47025
`xorshift*64/32`	0.09934	0.10130	19	0.47031
`pcg32_once_insecure`	0.10123	0.10453	0	2.06e-9
`xorshift32`	0.10358	0.10564	0	2.06e-9
`std::minstd_rand`	0.07011	0.07182	0	2.06e-9

It's mostly outside the scope of this article, but C++ allows us to specify our own memory allocator for its containers, and if we use a memory pool with a bump-pointer allocation, we can do much better in terms of both speed and memory use (the same amount of data is stored, but there is almost zero bookkeeping).

unix% ./birthday.mt19937.hashtable.bump 0.01
Testing std::mt19937
    0..4294967295 is range of generator
    198893 outputs to scan for duplicates [3.03485*sqrt(range)]
    4.60511 duplicates expected on average
    0.0100006 is probability of zero duplicates
    0x9bf209d is seed for this run
---
- Testing for repeats started...
  - Allocating memory started...
    - allocated hash table of reserved for 198893 elements
  - Allocating memory completed (0.000961311 seconds)
  - Checking PRNG output started...
    - repeat of 692467637 found, step 107618
    - repeat of 414775239 found, step 147312
    - repeat of 4252812390 found, step 188509
  - Checking PRNG output completed (0.014626 seconds)
- Testing for repeats completed (0.0156315 seconds)
---
3 repeats, p-value = 0.324873

Now our test is faster than the original version using sorting. We can confirm this speedup by running all our 32-bit PRNGs again:

Scheme	Checking (seconds)	Total (seconds)	Repeats (count)	p-value
`std::mt19937`	0.04386	0.04629	21	0.64369
`gjrand32`	0.03943	0.04137	19	0.47025
`jsf32`	0.03980	0.04188	18	0.38141
`pcg32`	0.03844	0.03987	20	0.55909
`pcg32_fast`	0.03882	0.04068	21	0.64369
`sfc32`	0.04007	0.04157	20	0.55909
`xoroshiro64+`	0.03748	0.03931	20	0.55909
`xoroshiro64*`	0.03780	0.04004	21	0.64369
`xoroshiro64**`	0.04071	0.04247	21	0.64369
`xorshift*64/32`	0.03796	0.03967	18	0.38147
`pcg32_once_insecure`	0.03394	0.03640	0	2.06e-9
`xorshift32`	0.03711	0.03961	0	2.06e-9
`std::minstd_rand`	0.02988	0.03133	0	2.06e-9

Results for 64-Bit PRNGs

It's a little too easy to run out of memory with the hash-table version. With the default choice of looking for a 99% chance of a repeat, GCC's libstdc++ implementation of std::unordered_set allocates a 128 GB base array to hold 17,179,869,143 chains (a prime just below 2³⁴) for a load factor of 75.9%, and GB 194.2 to hold 13,034,599,790 16-byte chain nodes. That's 323 GB total, before overhead. Unfortunately, malloc overheads mean that those 16-byte chain nodes actually consume 32 bytes, doubling memory usage for chaining to 388.5 GB and total memory usage to 516.4 GB, which is just a few GB too much.

In contrast, Clang's implementation just makes it through okay, all due to a slightly different choice for load factor of their hash tables and despite grabbing 24 bytes for chain nodes (they end up also rounding up to 32 bytes).

The bump-allocator version avoids the time and space overhead of individual malloc calls. Here are results for this version (running the 99% test):

Scheme	Checking	Total	Repeats	p-value
`std::mt19937_64`	4486.00	4529.97	5	0.68487
`gjrand64`	4522.78	4568.31	4	0.51226
`jsf64`	4597.81	4640.45	3	0.32486
`mcg128`	4471.28	4513.88	4	0.51226
`mcg128_fast`	4460.94	4505.08	5	0.68487
`pcg64`	4513.76	4556.75	2	0.16209
`pcg64_fast`	4496.36	4539.63	8	0.95467
`sfc64`	4430.88	4473.66	4	0.51226
`xoroshiro128+`	4530.84	4574.55	4	0.51226
`xoroshiro128**`	4409.88	4453.78	4	0.51226
`xorshift*128/64`	4460.04	4503.96	3	0.32486
`pcg64_once_insecure`	4605.06	4649.2	0	0.01
`xorshift64`	4435.31	4481.12	0	0.01
`splitmix64`	4293.21	4335.7	0	0.01

Because this test is slow and memory hungry, these results are from just a single run. The 20-repeats test is too large to run using a standard C++ hash table. The provided code lets you swap in other, more efficient hash table implementations instead of std::unordered_set but we'll move on to other approaches.

Counting Repeats #3: Store in Buckets, Sort the Buckets

The next approach we'll look at for dealing with the high overhead of sorting is to dump ranges of numbers into separate buckets, and then sort each bucket to look for duplicates.

We'll skip looking at the code (but you can find it in the provided downloadable code).

Results for 32-Bit PRNGs

Here's a run for the 32-bit Mersenne Twister. You can see that we're spreading the output across 256 buckets:

unix% ./birthday.mt19937.bucketsort 0.01
Testing std::mt19937
        0..4294967295 is range of generator
        198893 outputs to scan for duplicates [3.03485*sqrt(range)]
        4.60521 duplicates expected on average
        0.00999963 is probability of zero duplicates
        0xdb99f8af is seed for this run
---
- Testing for repeats started...
  - Allocating memory started...
    - allocated 256 buckets of capacity 971
    - allocated 256 bucket counters
  - Allocating memory completed (1.3238e-05 seconds)
  - Storing PRNG output started...
  - Storing PRNG output completed (0.0030664 seconds)
  - Sorting started...
  - Sorting completed (0.0173691 seconds)
  - Scanning for duplicates started...
    - repeat of 731276309 found
    - repeat of 4007810445 found
    - repeat of 1926145973 found
    - repeat of 2079408604 found
    - repeat of 1373598710 found
  - Scanning for duplicates completed (0.000318993 seconds)
- Testing for repeats completed (0.0207814 seconds)
---
5 repeats, p-value = 1 - 0.315139

And here are results for our cohort of 32-bit PRNGs:

Scheme	Storing (seconds)	Sorting (seconds)	Scanning (seconds)	Total (seconds)	Repeats (count)	p-value
`std::mt19937`	0.00688	0.03629	0.00067	0.04393	20	0.55909
`gjrand32`	0.00377	0.03620	0.00058	0.04088	19	0.47025
`jsf32`	0.00345	0.03724	0.00059	0.04129	21	0.64369
`pcg32`	0.00358	0.03700	0.00058	0.04115	20	0.55909
`pcg32_fast`	0.00311	0.03702	0.00057	0.04095	21	0.64369
`sfc32`	0.00328	0.03706	0.00057	0.04111	20	0.55909
`xoroshiro64+`	0.00341	0.03701	0.00058	0.04099	21	0.64369
`xoroshiro64*`	0.00337	0.03618	0.00057	0.04037	19	0.47025
`xoroshiro64**`	0.00349	0.03514	0.00057	0.03916	20	0.55909
`xorshift*64/32`	0.00362	0.03690	0.00059	0.04133	19	0.47031
`pcg32_once_insecure`	0.00359	0.03711	0.00057	0.04134	0	2.06e-9
`xorshift32`	0.00352	0.03669	0.00057	0.04091	0	2.06e-9
`std::minstd_rand`	0.00493	0.02533	0.00043	0.03093	0	2.06e-9

Storing the output is now about 50% slower, but sorting gets done in about two thirds of the time. Overall, it's a small net win. One interesting thing to note here is that sorting for the 32-bit Mersenne Twister lost its speed advantage from using uint_fast32_t.

Results for 64-Bit PRNGs

unix% ./birthday.mt19937_64.bucketsort 0.01
Testing std::mt19937_64
    0..18446744073709551615 is range of generator
    13034599790 outputs to scan for duplicates [3.03485*sqrt(range)]
    4.60517 duplicates expected on average
    0.01 is probability of zero duplicates
    0xd68d5f90 is seed for this run
---
- Testing for repeats started...
  - Allocating memory started...
    - allocated 65536 buckets of capacity 248615
    - allocated 65536 bucket counters
  - Allocating memory completed (0.000549318 seconds)
  - Storing PRNG output started...
  - Storing PRNG output completed (698.925 seconds)
  - Sorting started...
  - Sorting completed (966.247 seconds)
  - Scanning for duplicates started...
    - repeat of 6508886296057597688 found
    - repeat of 5924451221820332975 found
  - Scanning for duplicates completed (16.1705 seconds)
- Testing for repeats completed (1681.34 seconds)
---
2 repeats, p-value = 0.16209

What we can see here is that this version drastically reduces sorting time by 43%, from 1719.31 seconds to 966.247 seconds, saving 12.5 minutes. But storing the output is 8.9× slower, rising from 78.9 seconds for one array to 698.9 seconds for buckets; that's 10.3 minutes longer. Overall, it's a win, but only just.

Here are the results for our cohort of 64-bit generators running the 20-repeats test (only one run):

Scheme	Storing (seconds)	Sorting (seconds)	Scanning (seconds)	Total (seconds)	Repeats (count)	p-value
`std::mt19937_64`	1379.07	2340.24	40.5855	3759.90	17	0.29703
`gjrand64`	1963.33	2314.87	42.1575	4320.36	22	0.72061
`jsf64`	1499.80	2315.43	40.1675	3855.40	25	0.88782
`mcg128`	1378.42	2364.94	47.2118	3790.57	22	0.72061
`mcg128_fast`	1400.20	2294.61	36.6053	3731.41	18	0.38142
`pcg64`	1473.77	2291.80	39.3417	3804.91	22	0.72061
`pcg64_fast`	1460.90	2305.29	38.3696	3804.56	17	0.29703
`xoroshiro128+`	1561.03	2311.89	39.0475	3911.97	24	0.84323
`xoroshiro128**`	1607.38	2325.86	39.7094	3972.95	22	0.72061
`xorshift*128/64`	1502.62	2324.26	42.4239	3869.31	24	0.84323
`pcg64_once_insecure`	1991.08	2309.85	41.9223	4342.86	0	2.06e-9
`xorshift64`	1489.58	2331.94	42.0448	3863.56	0	2.06e-9
`splitmix64`	1346.53	2150.69	42.9768	3540.20	0	2.06e-9

This version is also more memory hungry than the original version—to allow buckets to be unevenly filled, it allocates 25% more memory. Overall, not a huge improvement.

Counting Repeats #4: Bloom Filter

Thus far, we've tried a few obvious approaches and they're slow and memory hungry. To do better, we'll use a more sophisticated approach.

In this version we deploy an advanced data structure, specifically, a variant of a Bloom filter. A Bloom filter is an incredibly space-efficient data structure for storing a set of values, but it is probabilistic—it will never forget anything added to the set, but it may erroneously say things are included in the set when they are not.

The steps for this algorithm are

Make a copy of the PRNG state.
Store the values in a Bloom filter. When we appear to see a duplicate, log this candidate repeat for later.
Throw away the Bloom filter.
Transfer the repeat candidates to a hash table, with a count of zero associated with each one.
Using the copy of the PRNG, replay the output sequence, ignoring all values except for the candidates stored in the hash table. If it's in the hash table, update the count whenever we see the value.
Candidates whose count reaches goes higher than 1 are true repeats.

Results for 32-Bit PRNGs

Here's the results running on the 32-bit Mersenne Twister:

unix% ./birthday.mt19937.bloomfilter 0.01
Testing std::mt19937
    0..4294967295 is range of generator
    198893 outputs to scan for duplicates [3.03485*sqrt(range)]
    4.60511 duplicates expected on average
    0.0100006 is probability of zero duplicates
    0x256128e4 is seed for this run
---
- Testing for repeats started...
  - Allocating memory started...
    - allocated bloomish filter with 16384 64-bit elements (3 bits/entry)
    - allocated possible-match log with 8192 64-bit elements
  - Allocating memory completed (6.7429e-05 seconds)
  - Phase 1, Gathering repeat candidates started...
    - 5611 repeat candidates found
  - Phase 1, Gathering repeat candidates completed (0.0061467 seconds)
  - Reallocating memory started...
    - Deallocated bloomish filter
    - Allocated hash table of the same size
  - Reallocating memory completed (0.000156567 seconds)
  - Phase 2, Storing candidates in hash table started...
  - Phase 2, Storing candidates in hash table completed (9.2692e-05 seconds)
  - Phase 3, Testing candidates started...
    - repeat of 183773048 found, step 92076
    - repeat of 2427098983 found, step 95253
    - repeat of 971438297 found, step 193226
  - Phase 3, Testing candidates completed (0.00687177 seconds)
- Testing for repeats completed (0.0134276 seconds)
---
3 repeats, p-value = 0.324873

The Bloom filter and match log require only (16384 + 8192) × 4 bytes of memory, or 96 KB, about an eighth of the 777 KB our original store-and-sort version would have need.

Here's summary results for our cohort of 32-bit PRNGs running the expect-20-repeats test. I actually accidentally ran the tests twice, so here are both sets of results (in each case, these are the median values from 31 runs). Here's the first result set:

Scheme	Candidates (seconds)	Rechecking (seconds)	Total (seconds)	Repeats (count)	p-value
`std::mt19937`	0.01217	0.01661	0.03024	19	0.47025
`gjrand32`	0.01017	0.01099	0.02113	18	0.38141
`jsf32`	0.01012	0.01217	0.02277	19	0.47025
`pcg32`	0.01044	0.01328	0.02421	19	0.47025
`pcg32_fast`	0.01085	0.01223	0.02357	17	0.29702
`sfc32`	0.00983	0.01170	0.02190	19	0.47025
`xoroshiro64+`	0.01060	0.01109	0.02139	19	0.47025
`xoroshiro64*`	0.01014	0.01187	0.02226	20	0.55909
`xoroshiro64**`	0.01024	0.01282	0.02379	20	0.55909
`xorshift*64/32`	0.00983	0.01307	0.02382	20	0.55915
`pcg32_once_insecure`	0.01008	0.01304	0.02360	0	2.06e-9
`xorshift32`	0.00956	0.01163	0.02157	0	2.06e-9
`std::minstd_rand`	0.00859	0.00644	0.01518	0	2.06e-9

And here is the second:

Scheme	Candidates (seconds)	Rechecking (seconds)	Total (seconds)	Repeats (count)	p-value
`std::mt19937`	0.01271	0.01763	0.03055	19	0.47025
`gjrand32`	0.01098	0.01279	0.02431	20	0.55909
`jsf32`	0.00945	0.01111	0.02131	20	0.55909
`pcg32`	0.01035	0.01192	0.02236	19	0.47025
`pcg32_fast`	0.01020	0.01166	0.02252	19	0.47025
`sfc32`	0.00991	0.0121	0.02281	19	0.47025
`xoroshiro64+`	0.01023	0.01156	0.02205	20	0.55909
`xoroshiro64*`	0.00998	0.01228	0.02235	21	0.64369
`xoroshiro64**`	0.01080	0.01316	0.02465	22	0.72060
`xorshift*64/32`	0.00904	0.00870	0.01964	19	0.47031
`pcg32_once_insecure`	0.01009	0.01339	0.02452	0	2.06e-9
`xorshift32`	0.00880	0.00870	0.01861	0	2.06e-9
`std::minstd_rand`	0.00914	0.00678	0.01658	0	2.06e-9

The times are so short that there is a fair amount of variation between runs. We can determine that the Mersenne Twister is slower than the others but not a lot else.

Results for 64-Bit PRNGs

unix% ./birthday.mt19937_64.bloomfilter 0.01
Testing std::mt19937_64
    0..18446744073709551615 is range of generator
    13034599790 outputs to scan for duplicates [3.03485*sqrt(range)]
    4.60517 duplicates expected on average
    0.01 is probability of zero duplicates
    0xe98207de is seed for this run
---
- Testing for repeats started...
  - Allocating memory started...
    - allocated bloomish filter with 1073741824 64-bit elements (3 bits/entry)
    - allocated possible-match log with 536870912 64-bit elements
  - Allocating memory completed (1.9574e-05 seconds)
  - Phase 1, Gathering repeat candidates started...
    - 362448004 repeat candidates found
  - Phase 1, Gathering repeat candidates completed (1003.56 seconds)
  - Reallocating memory started...
    - Deallocated bloomish filter
    - Allocated hash table of the same size
  - Reallocating memory completed (3.00296 seconds)
  - Phase 2, Storing candidates in hash table started...
  - Phase 2, Storing candidates in hash table completed (9.87777 seconds)
  - Phase 3, Testing candidates started...
    - repeat of 4670426177229221241 found, step 5580903394
    - repeat of 3043852644687666645 found, step 6413308578
    - repeat of 16961218199013113029 found, step 8424705865
    - repeat of 14428636557340975895 found, step 8809377507
    - repeat of 13428407026126473932 found, step 9594741563
    - repeat of 3452994221327353931 found, step 9718494086
    - repeat of 5809172504185009050 found, step 10454910042
    - repeat of 6895377101806331713 found, step 11314113109
    - repeat of 10487859573817159209 found, step 11655528088
  - Phase 3, Testing candidates completed (624.485 seconds)
- Testing for repeats completed (1640.93 seconds)
---
9 repeats, p-value = 1 - 0.0196595

Rather than using 97.1 GB for testing a 64-bit PRNG, this version uses 8 GB for the Bloom filter and 4 GB to log repeats. If we're happy to settle for a 72.5% chance of observing a repeat (and run the test a few times if we see none), we can perform the test in a total of 6 GB of RAM, well within the reach of most modern computers.

unix% ./birthday.mt19937_64.bloomfilter 0.275
Testing std::mt19937_64
    0..18446744073709551615 is range of generator
    6901370125 outputs to scan for duplicates [1.60685*sqrt(range)]
    1.29098 duplicates expected on average
    0.275 is probability of zero duplicates
    0x54ef7472 is seed for this run
---
- Testing for repeats started...
  - Allocating memory started...
    - allocated bloomish filter with 536870912 64-bit elements (3 bits/entry)
    - allocated possible-match log with 268435456 64-bit elements
  - Allocating memory completed (8.9979e-05 seconds)
  - Phase 1, Gathering repeat candidates started...
    - 217650333 repeat candidates found
  - Phase 1, Gathering repeat candidates completed (475.175 seconds)
  - Reallocating memory started...
    - Deallocated bloomish filter
    - Allocated hash table of the same size
  - Reallocating memory completed (1.43748 seconds)
  - Phase 2, Storing candidates in hash table started...
  - Phase 2, Storing candidates in hash table completed (5.42572 seconds)
  - Phase 3, Testing candidates started...
    - repeat of 6751060484066504991 found, step 6557968840
  - Phase 3, Testing candidates completed (353.038 seconds)
- Testing for repeats completed (835.077 seconds)
---
1 repeats, p-value = 1 - 0.369979

This test gets done in less than 14 minutes on the main test machine I've been using for this article, which is a 2.1 GHz Haswell-EP-based Xeon E7-4830v3. If we run it instead on a smaller machine with only 32 GB of RAM but a more recent CPU, such as a 3.2 GHz Skylake-W-based Xeon W-2104, the same test gets done in less than ten minutes. On my 2015 iMac which has a 4.0 GHz Skylake-DT-based i7-6700K CPU (and 40 GB of RAM), it finishes in 11.5 minutes.

In fact, we can now see how long it takes us to fail SplitMix at different p-values on the Skylake-W machine:

Time (seconds)	Memory (GB)	Expected Repeats	p-value
503.59	6	1.29	0.275
906.07	12	4.6	0.01
1276.56	12	7.5	0.000553
1517.43	24	10	4.54e-05
2218.33	24	20	2.06e-09

It's fair to say that with less than half an hour on a modern machine we can fail SplitMix's 64-bit output for its lack of repeats.

Returning to our usual test machine, let's look at results for our group of 64-bit PRNGs looking for 20 repeats (these results are fast enough that I was able give the median of five runs):

Scheme	Candidates	Rechecking	Total	Repeats	p-value
`std::mt19937_64`	2275.92	1553.84	3865.49	22	0.72061
`gjrand64`	1862.55	1267.72	3171.11	19	0.47026
`jsf64`	1805.18	1172.75	3010.42	25	0.88782
`mcg128`	1583.11	1132.47	2771.36	20	0.55909
`mcg128_fast`	1598.20	1123.13	2715.28	21	0.64370
`pcg64`	2077.65	1347.71	3455.36	20	0.55909
`pcg64_fast`	2151.92	1307.01	3505.75	19	0.47026
`sfc64`	1659.98	1115.88	2817.56	17	0.29703
`xoroshiro128+`	1607.51	1161.14	2797.97	19	0.47026
`xoroshiro128**`	2281.04	1381.30	3698.20	20	0.55909
`xorshift*128/64`	2717.67	1456.00	4208.69	19	0.47026
`pcg64_once_insecure`	2130.41	1200.13	3388.48	0	2.06e-9
`xorshift64`	2227.01	1170.83	3423.48	0	2.06e-9
`splitmix64`	2271.97	1337.80	3652.14	0	2.06e-9

Although all timing results should be taken with a pinch of salt, it's a little disappointing that in this instance Vigna's Xoroshiro128** PRNG is actually slower than the Mersenne Twister. In this case, mcg128 also beats mcg128_fast and pcg64 beats pcg64_fast. For me, the main takeaway here is how difficult it is to meaningfully benchmark different PRNGs.

Applying PRNG adapters

It might seem like we're done here, but it's worth looking at the effects of a few of PRNG adapters.

A `divided_rng` Adapter

Suppose, for example, that we desire positive signed 64-bit integers; in other words, 63-bit integers. If we take SplitMix's output and divide it by two, it goes from having no repeats to having each number repeated exactly once. Should it now pass our test?

In C++11's framework for PRNGs, it's very easy to make a PRNG adapter that can divide any PRNG by some number:

template <typename RNG, unsigned int DIVISOR,
          typename rtype = typename RNG::result_type>
struct divided_rng : public RNG {
    using RNG::RNG;
    using result_type = rtype;

    static result_type min() { return RNG::min() / DIVISOR; }
    static result_type max() { return RNG::max() / DIVISOR; }

    result_type operator()() { return RNG::operator()() / DIVISOR; }
};

(Note that although the code uses division, because DIVISOR is known at compile time, an optimizer will simplify the code to use fast bit shifts when DIVISOR is a power of two.)

Testing `std::minstd_rand`

Let's look first at std::minstd_rand and divide by 3, so that now rather than no repeats, each number appears three times. Initially, things look good:

unix% ./birthday.minstddiv3.sort 0.01
Testing divided_rng<regularized_rng<std::minstd_rand,uint32_t>,3>
        0..715827881 is range of generator
        81198 outputs to scan for duplicates [3.03485*sqrt(range)]
        4.60501 duplicates expected on average
        0.0100016 is probability of zero duplicates
        0xe9be5e15 is seed for this run
---
- Testing for repeats started...
  - Allocating memory started...
    - allocated array of size 81198.
  - Allocating memory completed (7e-06 seconds)
  - Storing PRNG output started...
  - Storing PRNG output completed (0.000403 seconds)
  - Sorting started...
  - Sorting completed (0.003575 seconds)
  - Scanning for duplicates started...
    - repeat of 29568970 found
    - repeat of 501715642 found
    - repeat of 533810085 found
  - Scanning for duplicates completed (4.1e-05 seconds)
- Testing for repeats completed (0.00403 seconds)
---
3 repeats, p-value = 0.324891

This result seems like a good pass.

If, however, we run a larger test we find the number of repeats found consistently falls below the expected value:

macos% ./birthday.minstddiv3.sort 200
Testing divided_rng<regularized_rng<std::minstd_rand,uint32_t>,3>
        0..715827881 is range of generator
        535100 outputs to scan for duplicates [20*sqrt(range)]
        199.95 duplicates expected on average
        0 is probability of zero duplicates
        0x51a1ff3d is seed for this run
---
- Testing for repeats started...
  - Allocating memory started...
    - allocated array of size 535100.
  - Allocating memory completed (7e-06 seconds)
  - Storing PRNG output started...
  - Storing PRNG output completed (0.00298 seconds)
  - Sorting started...
  - Sorting completed (0.028566 seconds)
  - Scanning for duplicates started...
    - repeat of 1990789 found
    - repeat of 13528078 found
    [... 126 repeats omitted ...]
    - repeat of 701171518 found
    - repeat of 704058012 found
  - Scanning for duplicates completed (0.000325 seconds)
- Testing for repeats completed (0.03189 seconds)
---
130 repeats, p-value = 8.26764e-08

In fact, we can still make this generator fail if we divide the output by 31 to get 30 repeats. We just have to generate more output to see the fail:

macos% ./birthday.minstddiv31.sort 20000
Testing divided_rng<regularized_rng<std::minstd_rand,uint32_t>,31>
        0..69273665 is range of generator
        1664617 outputs to scan for duplicates [200*sqrt(range)]
        19840.8 duplicates expected on average
        0 is probability of zero duplicates
        0x3587008a is seed for this run
---
- Testing for repeats started...
  - Allocating memory started...
    - allocated array of size 1664617.
  - Allocating memory completed (7e-06 seconds)
  - Storing PRNG output started...
  - Storing PRNG output completed (0.00837 seconds)
  - Sorting started...
  - Sorting completed (0.095529 seconds)
  - Scanning for duplicates started...
    - repeat of 9918 found
    - repeat of 15644 found
    [... 19226 repeats omitted ...]
    - repeat of 69269020 found
    - repeat of 69272254 found
  - Scanning for duplicates completed (0.007172 seconds)
- Testing for repeats completed (0.111095 seconds)
---
19230 repeats, p-value = 6.68804e-06

We can double-check the test by doing a similar test dividing the 32-bit Mersenne Twister by 32:

macos% ./birthday.mt19937div32.sort 20000
Testing divided_rng<std::mt19937,32>
        0..134217727 is range of generator
        2317048 outputs to scan for duplicates [200*sqrt(range)]
        19885.4 duplicates expected on average
        0 is probability of zero duplicates
        0x80df3adf is seed for this run
---
- Testing for repeats started...
  - Allocating memory started...
    - allocated array of size 2317048.
  - Allocating memory completed (7e-06 seconds)
  - Storing PRNG output started...
  - Storing PRNG output completed (0.013726 seconds)
  - Sorting started...
  - Sorting completed (0.13695 seconds)
  - Scanning for duplicates started...
    - repeat of 6369 found
    - repeat of 10240 found
    [... 19712 repeats omitted ...]
    - repeat of 134200182 found
    - repeat of 134205138 found
  - Scanning for duplicates completed (0.009592 seconds)
- Testing for repeats completed (0.160295 seconds)
---
19716 repeats, p-value = 0.1154

Here we get a much more plausible number of repeats.

Testing `splitmix64`

Again, with SplitMix and 63-bit output, we'll pass a small test:

unix% ./birthday.splitmix64div2.bloomfilter 0.01
Testing divided_rng<SplitMix64,2>
        0..9223372036854775807 is range of generator
        9216853902 outputs to scan for duplicates [3.03485*sqrt(range)]
        4.60517 duplicates expected on average
        0.01 is probability of zero duplicates
        0x24ceea9b is seed for this run
---
- Testing for repeats started...
  - Allocating memory started...
    - allocated bloomish filter with 1073741824 64-bit elements (4 bits/entry)
    - allocated possible-match log with 536870912 64-bit elements
  - Allocating memory completed (3.4661e-05 seconds)
  - Phase 1, Gathering repeat candidates started...
    - 95636376 repeat candidates found
  - Phase 1, Gathering repeat candidates completed (702.437 seconds)
  - Reallocating memory started...
    - Deallocated bloomish filter
    - Allocated hash table of the same size
  - Reallocating memory completed (3.00048 seconds)
  - Phase 2, Storing candidates in hash table started...
  - Phase 2, Storing candidates in hash table completed (2.1461 seconds)
  - Phase 3, Testing candidates started...
    - repeat of 1021193238591905732 found, step 7067299467
  - Phase 3, Testing candidates completed (203.722 seconds)
- Testing for repeats completed (911.306 seconds)
---
1 repeats, p-value = 0.0560517

But if we run a larger test (e.g., looking for about 100 repeats), about an hour and quarter (and 48 GB) later, it fails:

unix% ./birthday.splitmix64div2.bloomfilter 100
Testing divided_rng<SplitMix64,2>
        0..9223372036854775807 is range of generator
        42949672961 outputs to scan for duplicates [14.1421*sqrt(range)]
        100 duplicates expected on average
        3.78351e-44 is probability of zero duplicates
        0x6bc87e32 is seed for this run
---
- Testing for repeats started...
  - Allocating memory started...
    - allocated bloomish filter with 4294967296 64-bit elements (3 bits/entry)
    - allocated possible-match log with 2147483648 64-bit elements
  - Allocating memory completed (2.7746e-05 seconds)
  - Phase 1, Gathering repeat candidates started...
    - 773607258 repeat candidates found
  - Phase 1, Gathering repeat candidates completed (3268.16 seconds)
  - Reallocating memory started...
    - Deallocated bloomish filter
    - Allocated hash table of the same size
  - Reallocating memory completed (11.7403 seconds)
  - Phase 2, Storing candidates in hash table started...
  - Phase 2, Storing candidates in hash table completed (21.3582 seconds)
  - Phase 3, Testing candidates started...
    - repeat of 6497401084100231171 found, step 4349343329
    - repeat of 8982250182547917419 found, step 10013821426
    [... 38 repeats omitted ...]
    - repeat of 3352171370750978657 found, step 40445072164
    - repeat of 3504357146487545 found, step 40476413442
  - Phase 3, Testing candidates completed (1285.31 seconds)
- Testing for repeats completed (4586.57 seconds)
---
42 repeats, p-value = 4.51174e-11

These two tests of MINSTD and SplitMix show that our birthday-problem test not only detects a complete lack of repeats, it also determines whether the number of repeats is statistically correct, which it is not for this variant of SplitMix. (When used as a 32-bit PRNG, SplitMix will easily have enough repeats to pass this test.)

A `skipping_rng` Adapter

Thus far via the Bloom-filter based test, we've been able to reduce the testing overheads so that we can test 64-bit PRNGs on most modern machines, but the test is still quite resource hungry. Our final adapter will allow us to find problems with lack of repeats in as little memory as we like.

This adapter is like the divided_rng adapter, but only returns a value when the remainder after division is equal to some particular value (e.g., zero). In this case, every repeat from the adapted generator represents a real repeat from the original generator, but most outputs from the original generator are skipped (including lots of repeats).

Here's the code

template <typename RNG, typename RNG::result_type DIVISOR,
          typename RNG::result_type REMAINDER_MATCH = 0,
          typename rtype = typename RNG::result_type>
struct skipping_rng : public RNG {
    using RNG::RNG;
    using result_type = rtype;

    static_assert(RNG::min() == 0, "Requires sane PRNG");  

    static result_type min() { return RNG::min() / DIVISOR; }
    static result_type max() { return RNG::max() / DIVISOR; }

    result_type operator()() {
        for(;;) {
            result_type rval = RNG::operator()();
            if ((rval % result_type(DIVISOR)) == result_type(REMAINDER_MATCH)) {
                return rval / result_type(DIVISOR);
            }
        }
    }
};

Testing `std::minstd_rand`

Here's a test of MINSTD.

macos% ./birthday.minstdskip49981.sort 10
Testing skipping_rng<regularized_rng<std::minstd_rand,uint32_t>,49981>
    0..42965 is range of generator
    927 outputs to scan for duplicates [4.47214*sqrt(range)]
    9.91802 duplicates expected on average
    4.92789e-05 is probability of zero duplicates
    0xb621d0 is seed for this run
---
- Testing for repeats started...
  - Allocating memory started...
    - allocated array of size 927.
  - Allocating memory completed (1e-06 seconds)
  - Storing PRNG output started...
  - Storing PRNG output completed (0.163223 seconds)
  - Sorting started...
  - Sorting completed (3.1e-05 seconds)
  - Scanning for duplicates started...
  - Scanning for duplicates completed (0 seconds)
- Testing for repeats completed (0.163266 seconds)
---
0 repeats, p-value = 4.92789e-05

Now we fail the test in less than a second using less than 4 KB of RAM. In other words, had it been around then, I could have failed this generator on my 1982 16K ZX Spectrum home computer.

Testing `splitmix64`

It's the same story for SplitMix. In this test, we generate 4.5 TB of output, but only keep 1/65536 of it, reducing our storage requirements down to 0.5 GB to get a test small enough to run on an old iPhone 5s:

macos% ./birthday.spitmix64skip65536.sort 10
Testing skipping_rng<SplitMix64,65536>
    0..281474976710655 is range of generator
    75029991 outputs to scan for duplicates [4.47214*sqrt(range)]
    10 duplicates expected on average
    4.54e-05 is probability of zero duplicates
    0x3ad0e4f1 is seed for this run
---
- Testing for repeats started...
  - Allocating memory started...
    - allocated array of size 75029991.
  - Allocating memory completed (7e-06 seconds)
  - Storing PRNG output started...
  - Storing PRNG output completed (4680.73 seconds)
  - Sorting started...
  - Sorting completed (5.72453 seconds)
  - Scanning for duplicates started...
  - Scanning for duplicates completed (0.043199 seconds)
- Testing for repeats completed (4686.5 seconds)
---
0 repeats, p-value = 4.54e-05

This test took just under an hour and a twenty minutes to run, but ran in 2.3% of the memory we needed for our memory-efficient Bloom-filter version, and 0.39% of the memory we would use with our original store-and-sort version.

A `doubled_rng` Adapter

Our final adapter gathers together pairs of outputs, allowing us to test for repeats in two dimensional space rather than a single dimension. It does so by combining two output values of range \(r\) from the underling generator into a single output of range \(r^2\) (e.g., turning a 32-bit PRNG into a 64-bit PRNG).

template <typename RNG, typename rtype>
struct doubled_rng : public RNG {
public:
    using RNG::RNG;
    using result_type = rtype;

    static_assert(RNG::min() == 0, "Requires sane PRNG");

    static constexpr result_type min() { return combine2(RNG::min(), RNG::min()); }
    static constexpr result_type max() { return combine2(RNG::max(), RNG::max()); }

    result_type operator()()
    {
        auto first  = RNG::operator()();
        auto second = RNG::operator()();
        return combine2(first,second);
    }

protected:
    static constexpr result_type combine2(typename RNG::result_type first,
                         typename RNG::result_type second)
    {
        return (result_type(second) * (result_type(RNG::max())+1))
               + result_type(first);
    }
};

Testing Xoroshiro64+

We'll test this adapter by combining pairs of outputs from Xoroshiro64+ (a 32-bit PRNG) to make 64-bit outputs:

unix% ./birthday.xoroshiro64plus32doubled.bloomfilter 10
Testing doubled_rng<xoroshiro64plus32,uint64_t>
    0..18446744073709551615 is range of generator
    19207677670 outputs to scan for duplicates [4.47214*sqrt(range)]
    10 duplicates expected on average
    4.53999e-05 is probability of zero duplicates
    0x9d170147 is seed for this run
---
- Testing for repeats started...
  - Allocating memory started...
    - allocated bloomish filter with 2147483648 64-bit elements (4 bits/entry)
    - allocated possible-match log with 1073741824 64-bit elements
  - Allocating memory completed (1.7381e-05 seconds)
  - Phase 1, Gathering repeat candidates started...
    - 222925104 repeat candidates found
  - Phase 1, Gathering repeat candidates completed (1175.11 seconds)
  - Reallocating memory started...
    - Deallocated bloomish filter
    - Allocated hash table of the same size
  - Reallocating memory completed (2.94328 seconds)
  - Phase 2, Storing candidates in hash table started...
  - Phase 2, Storing candidates in hash table completed (3.82449 seconds)
  - Phase 3, Testing candidates started...
    - repeat of 1341226328110553406 found, step 12231469160
    - repeat of 16219277760915004482 found, step 14854862079
    - repeat of 166193113774345064 found, step 15970584051
    - repeat of 15772780968502958711 found, step 17360448815
    - repeat of 9015983296019028048 found, step 18470600211
  - Phase 3, Testing candidates completed (387.382 seconds)
- Testing for repeats completed (1569.26 seconds)
---
5 repeats, p-value = 0.067086

As you can see, this generator passes the test. We could argue, perhaps, that because this generator only has 64 bits of state, producing 64-bit output that passes this test must also be biased—if some 64-bit outputs are showing up twice, some must not be showing up at all.

Testing Xoroshiro64*

Vigna's Xoroshiro64* is similar to Xoroshiro64+, but has a different output function. Vigna claims it is an improvement because this variant is now maximally equidistributed. Let's see how it fares:

unix% ./birthday.xoroshiro64star32doubled.bloomfilter 10
Testing doubled_rng<xoroshiro64star32,uint64_t>
    0..18446744073709551615 is range of generator
    19207677670 outputs to scan for duplicates [4.47214*sqrt(range)]
    10 duplicates expected on average
    4.53999e-05 is probability of zero duplicates
    0x3cff5c16 is seed for this run
---
- Testing for repeats started...
  - Allocating memory started...
    - allocated bloomish filter with 2147483648 64-bit elements (4 bits/entry)
    - allocated possible-match log with 1073741824 64-bit elements
  - Allocating memory completed (1.8427e-05 seconds)
  - Phase 1, Gathering repeat candidates started...
    - 222957805 repeat candidates found
  - Phase 1, Gathering repeat candidates completed (1188.37 seconds)
  - Reallocating memory started...
    - Deallocated bloomish filter
    - Allocated hash table of the same size
  - Reallocating memory completed (4.81861 seconds)
  - Phase 2, Storing candidates in hash table started...
  - Phase 2, Storing candidates in hash table completed (4.20488 seconds)
  - Phase 3, Testing candidates started...
  - Phase 3, Testing candidates completed (425.412 seconds)
- Testing for repeats completed (1622.8 seconds)
---
0 repeats, p-value = 4.53999e-05

It fails the test. Xoroshiro64* generates every pair of 32-bit outputs exactly once (except for (0,0), which is never generated).

Testing Truncated XorShift

Finally, we'll test two truncated XorShift PRNGs. First, let's look at my recommended generator, XorShift* 64/32:

unix% ./birthday.xorshift64star32doubled.bloomfilter 10
Testing doubled_rng<xorshift64star32a,uint64_t>
    0..18446744073709551615 is range of generator
    19207677670 outputs to scan for duplicates [4.47214*sqrt(range)]
    10 duplicates expected on average
    4.53999e-05 is probability of zero duplicates
    0x174da134 is seed for this run
---
- Testing for repeats started...
  - Allocating memory started...
    - allocated bloomish filter with 2147483648 64-bit elements (4 bits/entry)
    - allocated possible-match log with 1073741824 64-bit elements
  - Allocating memory completed (2.6126e-05 seconds)
  - Phase 1, Gathering repeat candidates started...
    - 222922629 repeat candidates found
  - Phase 1, Gathering repeat candidates completed (1609.33 seconds)
  - Reallocating memory started...
    - Deallocated bloomish filter
    - Allocated hash table of the same size
  - Reallocating memory completed (6.1724 seconds)
  - Phase 2, Storing candidates in hash table started...
  - Phase 2, Storing candidates in hash table completed (5.9659 seconds)
  - Phase 3, Testing candidates started...
    - repeat of 16075751520106806424 found, step 4869737780
    - repeat of 8452591016538347469 found, step 6063298803
    - repeat of 8419835388433088013 found, step 6971801008
    - repeat of 9239875403387430140 found, step 7987074261
    - repeat of 11194445319005759147 found, step 9274052778
    - repeat of 12213942361713591065 found, step 15617173621
    - repeat of 3144078127879205090 found, step 17344117368
    - repeat of 9374697397498985266 found, step 17683711291
    - repeat of 4344357064314348963 found, step 18543158535
    - repeat of 13251189265117312659 found, step 18807841871
    - repeat of 10280694519002528251 found, step 19083159588
  - Phase 3, Testing candidates completed (657.915 seconds)
- Testing for repeats completed (2279.39 seconds)
---
11 repeats, p-value = 1 - 0.303224

Notice that this generator passes the test just fine. Now let's try XorShift 64/32 which throws away 32-bits but does not use multiplication as a scrambling function.

unix% ./birthday.xorshift64plain32doubled.bloomfilter 10
Testing doubled_rng<xorshift_detail::xorshift<uint64_t, uint32_t, 11, 31, 18>,uint64_t>
    0..18446744073709551615 is range of generator
    19207677670 outputs to scan for duplicates [4.47214*sqrt(range)]
    10 duplicates expected on average
    4.53999e-05 is probability of zero duplicates
    0x68389c89 is seed for this run
---
- Testing for repeats started...
  - Allocating memory started...
    - allocated bloomish filter with 2147483648 64-bit elements (4 bits/entry)
    - allocated possible-match log with 1073741824 64-bit elements
  - Allocating memory completed (1.9951e-05 seconds)
  - Phase 1, Gathering repeat candidates started...
    - 323099500 repeat candidates found
  - Phase 1, Gathering repeat candidates completed (1749.5 seconds)
  - Reallocating memory started...
    - Deallocated bloomish filter
    - Allocated hash table of the same size
  - Reallocating memory completed (6.59228 seconds)
  - Phase 2, Storing candidates in hash table started...
  - Phase 2, Storing candidates in hash table completed (11.1204 seconds)
  - Phase 3, Testing candidates started...
    - repeat of 8432241086602748180 found, step 512009280
    - repeat of 10991711275384902383 found, step 2984376329
    - repeat of 8715911071080357710 found, step 3076849680
    - repeat of 6307293652390151796 found, step 3699476747
    - repeat of 5918160424646835498 found, step 3757200762
    - repeat of 2102851281581636308 found, step 3793484182
    - repeat of 18004728710102585771 found, step 3856237150
    - repeat of 13643872512152535833 found, step 3904548937
    - repeat of 15813780386717731573 found, step 5120593723
    - repeat of 10606428155832328500 found, step 5246820173
    - repeat of 4124773483188809046 found, step 5689671380
    - repeat of 8881227705206282627 found, step 6314841372
    - repeat of 10347249380159185545 found, step 6549081914
    - repeat of 3259454996581558145 found, step 6789498453
    - repeat of 5465544356170662972 found, step 6895105232
    - repeat of 9598202579737928973 found, step 7151318395
    - repeat of 12711680550813551463 found, step 7547298209
    - repeat of 8243254123160113110 found, step 7680715882
    - repeat of 3105491326004167632 found, step 9584052322
    - repeat of 6389591889012097379 found, step 10209858987
    - repeat of 5998122806250358082 found, step 10400041560
    - repeat of 6485183761602045376 found, step 10482622702
    - repeat of 13409280636077211564 found, step 10789884330
    - repeat of 7440253727363109864 found, step 10795267164
    - repeat of 15774111206512302355 found, step 10911672510
    - repeat of 9542813944494139543 found, step 11467617560
    - repeat of 13106000190889745837 found, step 11564124980
    - repeat of 4539599262207590835 found, step 11564467874
    - repeat of 7226327696139710853 found, step 11698843712
    - repeat of 16662818355299619849 found, step 11777961088
    - repeat of 3514753089552548735 found, step 12145627618
    - repeat of 3686498505503683864 found, step 12536448056
    - repeat of 8754154103642100727 found, step 13018513381
    - repeat of 5922676618160286848 found, step 13616884749
    - repeat of 2771565926978918441 found, step 13775938319
    - repeat of 2673034166933327078 found, step 13927906955
    - repeat of 17286920818906038529 found, step 14140724501
    - repeat of 15351512866337221122 found, step 14301601872
    - repeat of 17275527660410671298 found, step 14464765839
    - repeat of 12666155941460351740 found, step 14534427945
    - repeat of 7259412895957455688 found, step 15101896626
    - repeat of 8718082224947945304 found, step 15269349237
    - repeat of 17673454400876571400 found, step 15899175752
    - repeat of 3093540400943525978 found, step 16102279002
    - repeat of 100775988639710351 found, step 16115089460
    - repeat of 3840328319316700675 found, step 16162103481
    - repeat of 12046651965032902115 found, step 16222837628
    - repeat of 13960047651775979002 found, step 16443235714
    - repeat of 10615736297527204965 found, step 16670639022
    - repeat of 8565924938554264941 found, step 16892419321
    - repeat of 3992549844265476149 found, step 16943129372
    - repeat of 7998729961698395805 found, step 17104768142
    - repeat of 13918425977287437518 found, step 17233884687
    - repeat of 1117798506300080490 found, step 17290345185
    - repeat of 859886623890470340 found, step 17376432879
    - repeat of 14006406166013631531 found, step 17495268199
    - repeat of 8400904031756885513 found, step 17623867884
    - repeat of 17429755626699761713 found, step 17688996534
    - repeat of 10855999513313140013 found, step 17873370607
    - repeat of 6477932417974324450 found, step 17877309427
    - repeat of 9759571848035785557 found, step 17891403453
    - repeat of 8214606349987392434 found, step 17975244179
    - repeat of 3915974417542383620 found, step 18070507208
    - repeat of 2810120710806743824 found, step 18220588645
    - repeat of 2820970032855298710 found, step 18295595540
    - repeat of 6952151018582886574 found, step 18629103333
    - repeat of 6636270080993323558 found, step 18691888391
    - repeat of 16367499301027804324 found, step 18968625522
    - repeat of 13861772320392267683 found, step 19205087616
  - Phase 3, Testing candidates completed (854.275 seconds)
- Testing for repeats completed (2621.49 seconds)
---
69 repeats, p-value = 1.89751e-15

Oh dear, that's vastly too many repeats. Thus although this generator passed the test when we looked at repeats in one dimension, in two dimensions it fails. (This test also reveals some rounding-error issues in the calculation of p-values when there are vastly too many repeats; the correct p-value is 1 - 4.4e-35.)

The existence of repeats in both of these truncated XorShift generators does, however, show that neither of these generators is maximally equidistributed (because maximal equidistribution would be two-dimensional equidistribution, and would result in no repeats occurring in the test). I mention this because in a comment on the website Reddit, Vigna wrote:

And of course Marsaglia's xorshift64 generators are equidistributed in the maximum dimension—it's a trivial mathematical fact; Melissa O'Neill writing "no" in a table on her website will not change mathematics.

Initially, I thought Vigna was talking about one of these truncated generators, because the table on the front page of the PCG website lists XorShift* 64/32. If he had been thinking that XorShift 64/32 or XorShift* 64/32 were two-dimensionally equidistributed, he would have been in error as these tests show. But I now realize he was talking about the untruncated versions we saw failing the test earlier. We can describe XorShift 32 and XorShift 64 as ”equidistributed in the maximum dimension” provided we understand that in this case, the “maximum dimension” is one; but to my eye k-dimensional equidistribution with k = 1 is a degenerate case that is mostly meaningless. Thus we can chalk this one up to a misunderstanding—I only think k-dimensional equidistribution is interesting as a property when k > 1. If I had used Vigna's definition, I would have to say that every uniform PRNG is k-dimensionally equidistributed (k = 1), which is almost all of the ones I list in the table.

Conclusions

We began with a test for repeats based on the birthday problem that was viable to run on 32-bit PRNGs, and also worked on 64-bit PRNGs but required significant amounts of memory to run. We looked at several approaches, refining our storage strategy until we had a test that could run in less than half an hour on a modern laptop. By throwing away some outputs, we finally ended up with a test that can run on on far more modest hardware (such as a 2008 MacBook Air or an iPhone 5s).

There are some “birthday” tests in PRNG test suites, such as Marsaglia's Birthday Spacings test (as discussed by Eucyer and Simard), but Marsaglia's test is about detecting too much regularity in the distance between outputs when sorted into order rather than the number of repeats. It appears that currently most PRNG test-suites don't check to make sure generators have the correct frequency of 64-bit repeats, even though, as we have seen, it's entirely practical to do so and doing so can detect flaws missed by other tests.

With this test, we are able to quickly fail MINSTD, XorShift 32, XorShift 64, SplitMix64, and Xoroshiro64* (when its 32-bit output is paired to make 64-bit output). (We can also fail the PCG _once_insecure variants as well, although the every-output-occurs-only-once property is telegraphed in the name and the PCG paper cautions against their use as general-purpose generators for precisely this reason.)

Finally, the fact that Xoroshiro64* fails the test and Xoroshiro64+ passes casts some doubt on the idea that maximal equidistribution is a desirable property, because a maximally equidistributed generator must fail this statistical test shortly after the square root of the period, whereas its less-than-maximally equidistributed sibling survives far longer. We'll return to that question in a future article.

Downloads

All of the code (and results) is available on GitHub.

Trivia

The tests involved in this article involved 138 distinct executables, run across three different machines and two different operating systems (4310 runs on the 512 GB Haswell Xeon Linux server, 14 runs on the Skylake Xeon Linux server, and 7 runs on the Skylake Xeon Mac desktop). A total of 4049 runs generated 155 billion 32-bit integers (0.56 TB), and 280 runs generated 20.2 trillion 64-bit integers (147 TB), requiring 9.2 days of compute time. About 15 trillion of these outputs comes from three runs (across all three platforms) of birthday.spitmix64skip65536.sort, which uses 5 trillion numbers per run. To allow comparison across platforms, the code was compiled with Clang/LLVM 6.0.0, using the GNU libstdc++ standard library (which has a significantly faster Mersenne Twister implementation than the one in libc++). Although the tests can be used as benchmarks to compare both different generators and different platforms, an overall takeaway from the timings collected is that a vast array of factors influence performance.

Thanks

Thanks to Daniel Lemire who provided access to the Skylake Xeon Linux server I used, and to the National Science Foundation (Grant #1219243), who funded the Haswell Xeon Linux server, and to Csilla & Walt Foley, whose endowment funded it having 512 GB of RAM. Thanks also to the authors of the various PRNGs tested here.

The Birthday Problem

Building a Simple Test

The Driver

Counting Repeats #1: Store and Sort

Results for 32-Bit PRNGs

Results for 64-Bit PRNGs

Counting Repeats #2: Store Values in a Hash Table

Results for 32-Bit PRNGs

Results for 64-Bit PRNGs

Counting Repeats #3: Store in Buckets, Sort the Buckets

Results for 32-Bit PRNGs

Results for 64-Bit PRNGs

Counting Repeats #4: Bloom Filter

Results for 32-Bit PRNGs

Results for 64-Bit PRNGs

Applying PRNG adapters

A divided_rng Adapter

Testing std::minstd_rand

Testing splitmix64

A skipping_rng Adapter

Testing std::minstd_rand

Testing splitmix64

A doubled_rng Adapter

Testing Xoroshiro64+

Testing Xoroshiro64*

Testing Truncated XorShift

Conclusions

Downloads

Trivia

Thanks

A `divided_rng` Adapter

Testing `std::minstd_rand`

Testing `splitmix64`

A `skipping_rng` Adapter

Testing `std::minstd_rand`

Testing `splitmix64`

A `doubled_rng` Adapter