# A Birthday Test: Quickly Failing Some Popular PRNGs

My two favorite statistical test suites for PRNGs are PractRand and TestU01, but while they detect statistical flaws of many kinds, detecting those flaws often takes a long time to run, and for various reasons some flaws take a very long time to detect (if they are detected at all). In this post, we'll put together a test based around a very classic example from the world of probability theory, the Birthday Problem.

In doing so, we'll develop a test that can detect deviations from random behavior in some widely used PRNGs. We'll eventually be able to detect flaws in std::minstd_rand and XorShift 32 in 0.01 seconds, and SplitMix, XorShift 64 and Xoroshiro64* in less than half an hour.

### The Birthday Problem

In the classic Birthday Problem, we ask how many people we need to have in a room before it is more likely than not that we have two people who share a birthday. Many people are surprised at how low the number is, only 23. And by the time we have 58 people, the chance is more than 99%.

The Birthday Problem generalizes to the formula, $p \approx 1-e^{\frac{-n^2}{2d}},$ which rearranges to $n \approx \sqrt{- 2 d \log(1-p)},$ where $p$ is the probability of success (e.g., 0.5 (or 0.99) in the preceding standard example above), $d$ is the number of possible values (e.g., 365 in the standard birthday problem), and $n$ is the number of samples observed.

Also, using a Poisson approximation, we can say that the expected number of repeats, $r$, will be $r \approx -\log(1-p) \approx \frac{n^2}{2d},$ but we instead use an estimate based on a binomial distribution (which is usually cited when discussing number of collisions in a hash table), $r \approx n - d \left(1-\left(1-\frac{1}{d}\right)^n\right).$ This second formula is more accurate when the expected number is large and the range of the generator is relatively small.

We can use the birthday problem as the basis of a test for random number generators. If our generator can produce $d$ different outputs (e.g., 232 or 264), how many outputs should we see before there is, say, a 99% chance that we ought to have observed a repeat? Plugging in the numbers, we find that for a generator with 232 (i.e., 4,294,967,296) possible outputs, after only 198,893 outputs, we'll expect to see on average 4.6 repeats with about a 1% chance of having observed none at all. If we scan 926,820 outputs, the expected number of repeats rises to 100 and the chance of observing none at all is vanishingly small, a hard to imagine 3.7 × 10-44.

### Building a Simple Test

Here's the outline of a test. Given either a desired p-value (e.g., 50%) or a desired expected number of repeats (e.g., 100), and knowledge of the output range of the generator, we

• Calculate how many outputs to look at and how many repeats we expect to see,
• Collect that many outputs, and,
• Count how many repeats we saw.

We can then determine how plausible that number is, given what we were expecting.

#### The Driver

Here's the driver program in C++, assuming that RNG_TYPE is a PRNG following the usual C++11 conventions.

int main(int argc, const char* argv[]) {
long double desired = 0.01l;
if (argc >= 2) {
desired = strtold(argv[1], nullptr);
assert(desired > 0.0l);
}

double factor = (desired < 1.0)
? sqrt(-2.0l*log(desired))        // p-value
: sqrt(2.0l*desired);             // expected repeats

long double rng_range = 1.0l + RNG_TYPE::max() - RNG_TYPE::min();
size_t sample_size = ceil(factor * sqrt(rng_range));
long double expected =
sample_size - rng_range * (- expm1(sample_size*log1p(-1/rng_range)));
float p_zero   = exp(-expected);

std::cout << "Testing " << STRINGIFY(RNG_TYPE) << "\n";
std::cout << "\t" << RNG_TYPE::min() << ".."
<< RNG_TYPE::max() << " is range of generator\n";
std::cout << "\t" << sample_size << " outputs to scan for duplicates ["
<< factor << "*sqrt(range)]\n";
std::cout << "\t" << expected << " duplicates expected on average\n";
std::cout << "\t" << p_zero << " is probability of zero duplicates\n";
uint32_t seed = std::random_device{}();
std::cout << "\t0x" << std::hex << seed << std::dec
<< " is seed for this run\n";

std::cout << "---\n";

RNG_TYPE rng(seed);
size_t repeats = count_repeats(rng, sample_size);

std::cout << "---\n";

std::cout << repeats << " repeats, p-value = ";

long double p_value = 0;
for (size_t k = 0; k <= repeats; ++k) {
long double pdf_value =
exp(log(expected)*k - expected - lgamma(1.0l+k));
p_value += pdf_value;
if (p_value > 0.5)
p_value = p_value - 1;
}
if (p_value < 0)
std::cout << "1 - " << -p_value << "\n";
else
std::cout << p_value << "\n";

return 0;
}


We can imagine getting results like

unix% ./birthday 0.01
Testing std::mt19937
0..4294967295 is range of generator
198893 outputs to scan for duplicates [3.03485*sqrt(range)]
4.60511 duplicates expected on average
0.0100006 is probability of zero duplicates
0xd533c680 is seed for this run
---
[... counting occurs ...]
---
3 repeats, p-value = 0.324873


Or

unix% ./birthday 0.01
Testing std::mt19937
0..4294967295 is range of generator
198893 outputs to scan for duplicates [3.03485*sqrt(range)]
4.60511 duplicates expected on average
0.0100006 is probability of zero duplicates
0x6365cd1f is seed for this run
---
[... counting occurs ...]
---
6 repeats, p-value = 1 - 0.182648


Notice that rerunning the algorithm with a different seed allows a different result. In these two cases, the p-values are reasonable (not very close to zero or 1).

We could also run more expansive tests. Below, we've run a test where we would expect 100 repeats.

unix% ./birthday 100
Testing std::mt19937
0..4294967295 is range of generator
926820 outputs to scan for duplicates [14.1421*sqrt(range)]
99.9929 duplicates expected on average
3.78351e-44 is probability of zero duplicates
0x474a271e is seed for this run
---
[... counting occurs ...]
---
116 repeats, p-value = 1 - 0.0521379


Again, the p-value above is reasonable.

What I've not shown yet is how to implement count_repeats.

#### Counting Repeats #1: Store and Sort

The simplest way to count repeats is to store all the output in an array, sort the array, and then scan linearly through it looking for repeats—because items are in order, identical items will be next to each other.

Here's the code:

size_t count_repeats(RNG_TYPE& rng, size_t expanse)
{
using rngval_t = RNG_TYPE::result_type;

// Allocate an array to hold all the output.
rngval_t* values = new rngval_t[expanse];

// Store all the output from the generator.
for (size_t i = 0; i < expanse; ++i)
values[i] = rng();

// Sort the array
std::sort(values, values+expanse);

// Scan the array for duplicates
rngval_t prev = values[0];
size_t duplicates = 0;
for (size_t i = 1; i < expanse; ++i) {
rngval_t curr = values[i];
if (prev == curr) {
std::cout << "    - repeat of " << curr << " found\n";
++duplicates;
}
prev = curr;
}

return duplicates;
}

##### Results for 32-Bit PRNGs

Here are the results, with a little extra instrumentation added to print what is going on and time how long each part takes.

unix% ./birthday.mt19937.sort 0.01
Testing std::mt19937
0..4294967295 is range of generator
198893 outputs to scan for duplicates [3.03485*sqrt(range)]
4.60511 duplicates expected on average
0.0100006 is probability of zero duplicates
0xb0856615 is seed for this run
---
- Testing for repeats started...
- Allocating memory started...
- allocated array of size 198893.
- Allocating memory completed (1.8799e-05 seconds)
- Storing PRNG output started...
- Storing PRNG output completed (0.00156919 seconds)
- Sorting started...
- Sorting completed (0.0153883 seconds)
- Scanning for duplicates started...
- repeat of 44686935 found
- repeat of 1119175552 found
- repeat of 2307769765 found
- repeat of 3839874543 found
- repeat of 4249473610 found
- Scanning for duplicates completed (0.000235218 seconds)
- Testing for repeats completed (0.0172872 seconds)
---
5 repeats, p-value = 1 - 0.315123


We see that the 32-bit Mersenne Twister passes the test in less than a thirtieth of a second. Or we could try out a longer test:

unix% ./birthday.mt19937.sort 20
Testing std::mt19937
0..4294967295 is range of generator
414487 outputs to scan for duplicates [6.32456*sqrt(range)]
19.9994 duplicates expected on average
2.06239e-09 is probability of zero duplicates
0x54c21e82 is seed for this run
---
- Testing for repeats started...
- Allocating memory started...
- allocated array of size 414487.
- Allocating memory completed (3.0691e-05 seconds)
- Storing PRNG output started...
- Storing PRNG output completed (0.00466165 seconds)
- Sorting started...
- Sorting completed (0.0540854 seconds)
- Scanning for duplicates started...
- repeat of 474221760 found
- repeat of 670096292 found
- repeat of 780413921 found
- repeat of 895436278 found
- repeat of 1042072169 found
- repeat of 1267229417 found
- repeat of 1457725784 found
- repeat of 1509219941 found
- repeat of 1740429513 found
- repeat of 2019745416 found
- repeat of 2373928640 found
- repeat of 2399448522 found
- repeat of 2676526217 found
- repeat of 3709805762 found
- repeat of 3754906861 found
- repeat of 3915683171 found
- repeat of 4010268093 found
- repeat of 4030683995 found
- Scanning for duplicates completed (0.000699004 seconds)
- Testing for repeats completed (0.059584 seconds)
---
18 repeats, p-value = 0.381473


Again, in less than a sixteenth of a second, we've confirmed that the 32-bit Mersenne twister produces the right number of repeats. Cool.

Let's test a few more PRNGs and summarize the results:

Scheme Storing (seconds) Sorting (seconds) Scanning (seconds) Total (seconds) Repeats (count) p-value
std::mt19937 0.00491 0.04936 0.00046 0.05444 19 0.47025
gjrand32 0.00406 0.05736 0.00049 0.06193 19 0.47025
jsf32 0.00211 0.05677 0.00050 0.05940 20 0.55909
pcg32 0.00258 0.05774 0.00050 0.06089 20 0.55909
pcg32_fast 0.00202 0.05635 0.00049 0.05892 19 0.47025
sfc32 0.00230 0.05520 0.00049 0.05815 20 0.55909
xoroshiro64+ 0.00235 0.05625 0.00049 0.05923 21 0.64369
xoroshiro64* 0.00256 0.05570 0.00049 0.05885 20 0.55909
xoroshiro64** 0.00281 0.05873 0.00049 0.06182 21 0.64369
xorshift*64/32 0.00320 0.05897 0.00050 0.06267 19 0.47031
pcg32_once_insecure 0.00262 0.05861 0.00049 0.06173 0 2.06e-9
xorshift32 0.00318 0.05884 0.00049 0.06260 0 2.06e-9
std::minstd_rand 0.00431 0.03838 0.00038 0.04357 0 2.06e-9

These results represent median values for 31 runs (because each run is very quick!). The timings for storing the PRNG output provide a way to compare all the generators except std::minstd_rand (which has a smaller output range and thus requires fewer outputs). We can see that std::mt19937 is rather slower than the others (although std::minstd_rand would be even slower if it were recording the same number of outputs). It may initially seem anomalous that the std::mt19937 implementation sorts faster than other generators with the same 32-bit output range, but that is because the C++ implementation returns a uint_fast32_t (which is actually a 64-bit integer) rather than a uint32_t—it turns out that uint_fast32_t really is faster (but also more wasteful of memory).

But most notable is the fact that std::minstd_rand, (untruncated) Xorshift 32 and pcg32_once_insecure fail this statistical test due to their wildly improbable p-values. This test quickly detects their lack of repeats. One of the reasons pcg32_once_insecure has the name it does is to telegraph to users the fact that it is not intended for general-purpose use but for specialized situations where lack of repeats might be desirable. In contrast, XorShift 32 and std::minstd_rand are usually presented as a general-purpose PRNGs. I would argue that their failure here shows that they do not meet the requirements of a general-purpose PRNG. (In contrast, XorShift* 64/32, which adds a multiplicative output function and returns only the high-order 32-bits, is fine.)

It's worth noting how quickly we can fail these generators. For MINSTD, the 99% test only needs 140,639 × 4 = 562,556 bytes and the more 20-repeats test only needs just over double that, 293,086 outputs. In 1987 when Steve Park and Keith Miller wrote the paper Random Number Generators: Good Ones Are Hard to Find, which introduced this generator, you could buy a Mac II with 4 MB of RAM which could have run either test in a reasonable time (as could any available Unix workstation, even a 1983 vintage Sun Workstation!). In other words, this generator could have been failed at launch in 1987.

It might seem strange to even bother to test MINSTD or XorShift 32 in this way, given that we can tell by simply inspecting their designs that they will never repeat any outputs. But these tests show that how few outputs we need to detect this design choice as a statistical flaw. After only 54,563 outputs from MINSTD (the 50% probability point for repeats), its output is increasingly statistically anomalous.

##### Results for 64-Bit PRNGs

We can also run the test on 64-bit PRNGs. Here it is running on the 64-bit Mersenne Twister:

unix% ./birthday.mt19937_64.sort 0.01
Testing std::mt19937_64
0..18446744073709551615 is range of generator
13034599790 outputs to scan for duplicates [3.03485*sqrt(range)]
4.60517 duplicates expected on average
0.01 is probability of zero duplicates
0x37a0a001 is seed for this run
---
- Testing for repeats started...
- Allocating memory started...
- allocated array of size 13034599790.
- Allocating memory completed (3.6719e-05 seconds)
- Storing PRNG output started...
- Storing PRNG output completed (78.9106 seconds)
- Sorting started...
- Sorting completed (1719.31 seconds)
- Scanning for duplicates started...
- repeat of 3097887744857163366 found
- repeat of 5135732003217760496 found
- repeat of 5958311621514904927 found
- Scanning for duplicates completed (14.3795 seconds)
- Testing for repeats completed (1812.6 seconds)
---
3 repeats, p-value = 0.324864


With 13,034,599,789 outputs to store and scan, it's obviously a bit slower. It gets all the data stored in less than 80 seconds, but then sorting grinds away for half an hour.

A larger run, looking for 20 duplicates, takes even longer:

unix% ./birthday.mt19937_64.sort 20
Testing std::mt19937_64
0..18446744073709551615 is range of generator
27163758263 outputs to scan for duplicates [6.32456*sqrt(range)]
20 duplicates expected on average
2.06115e-09 is probability of zero duplicates
0xe8f137dd is seed for this run
---
- Testing for repeats started...
- Allocating memory started...
- allocated array of size 27163758263.
- Allocating memory completed (3.709e-05 seconds)
- Storing PRNG output started...
- Storing PRNG output completed (175.529 seconds)
- Sorting started...
- Sorting completed (3924.11 seconds)
- Scanning for duplicates started...
- repeat of 20077808886971205 found
- repeat of 637172222824756178 found
- repeat of 1163359596874921636 found
- repeat of 1557608038662986017 found
- repeat of 2522575772404207458 found
- repeat of 3525508532012308475 found
- repeat of 3871057790955701391 found
- repeat of 4726677753815860247 found
- repeat of 5979125123285167945 found
- repeat of 7899230134743204273 found
- repeat of 7970432731437732158 found
- repeat of 8488249810808916119 found
- repeat of 9323300497538043642 found
- repeat of 9750935714613327132 found
- repeat of 10503536482358837444 found
- repeat of 11038662378093814046 found
- repeat of 11328089922193057506 found
- repeat of 11824685809627306531 found
- repeat of 14708280948401454462 found
- repeat of 15326753147701364060 found
- repeat of 15747525336113554261 found
- repeat of 16636213418247152190 found
- repeat of 17510404778311870795 found
- repeat of 18092357997477209913 found
- Scanning for duplicates completed (45.5912 seconds)
- Testing for repeats completed (4145.23 seconds)
---
24 repeats, p-value = 1 - 0.156773


These tests are rather memory hungry, however. The first test requires 97.1 GB, and the second requires 202.4 GB. That's a lot of RAM! I'm running these tests on a beefy server with 512 GB of RAM, but not everyone has such a powerful server at their disposal. It'd be good to have something faster and more memory efficient.

While we're here though, let's test a few others. Here are results for the 99% test (medians of three runs).

Scheme Storing (seconds) Sorting (seconds) Scanning (seconds) Total (seconds) Repeats (count) p-value
std::mt19937_64 82.9455 1925.06 15.9029 2024.34 2 0.16209
gjrand64 74.9326 1951.05 15.3217 2041.67 5 0.68487
jsf64 49.3962 1976.47 15.5237 2041.68 4 0.51226
mcg128 55.6478 1954.42 15.5654 2024.35 2 0.16209
mcg128_fast 52.3176 1952.29 15.4270 2020.43 5 0.68487
pcg64 67.0258 1943.01 15.7909 2026.46 4 0.51226
pcg64_fast 60.8966 1953.03 15.7946 2030.51 2 0.16209
sfc64 48.4997 1971.91 16.3415 2036.75 4 0.51226
xoroshiro128+ 47.7538 1955.20 15.8975 2018.56 5 0.68487
xoroshiro128** 54.4471 1946.13 16.1368 2015.78 7 0.90450
xorshift*128/64 93.3112 2009.93 17.9337 2119.94 7 0.90450
pcg64_once_insecure 55.9306 1941.75 15.9162 2014.60 0 0.01
xorshift64 65.981 2031.29 17.3847 2113.96 0 0.01
splitmix64 51.3770 1947.12 15.2408 2014.70 0 0.01

And here is the 20-expected-repeats test (median of three runs).

Scheme Storing (seconds) Sorting (seconds) Scanning (seconds) Total (seconds) Repeats (count) p-value
std::mt19937_64 183.917 4198.77 40.0288 4441.60 21 0.64370
gjrand64 169.248 4270.51 37.0672 4481.24 23 0.78749
jsf64 105.242 4278.48 39.0786 4422.70 22 0.72061
mcg128 114.237 4290.56 38.0371 4443.02 23 0.78749
mcg128_fast 108.084 4243.50 39.1832 4402.13 24 0.84323
pcg64 140.412 4236.27 40.0860 4418.58 22 0.72061
pcg64_fast 128.951 4239.96 37.6571 4406.48 24 0.84323
sfc64 102.542 4296.64 41.2401 4440.63 26 0.92211
xoroshiro128+ 100.587 4296.52 41.0189 4435.56 18 0.38142
xoroshiro128** 114.427 4257.38 43.1764 4414.22 20 0.55909
xorshift*128/64 204.078 4245.56 38.7307 4491.49 21 0.64370
pcg64_once_insecure 115.781 4223.30 37.2429 4376.37 0 2.06e-9
xorshift64 136.411 4257.19 37.9699 4442.13 0 2.06e-9
splitmix64 107.673 4266.90 40.2020 4414.78 0 2.06e-9

Again, we can use the time for storing the numbers as a microbenchmark. SplitMix is really fast, but so are mcg128_fast and Xoroshiro128+. The Mersenne Twister is notably slower than the others. Blackman's gjrand64 seems to lag behind sfc64 and jsf64 even though all three are quite similar designs. In this group of generally fast PRNGs, I'm content for the 64-bit-output PCG variants to fall somewhere in the middle, both because 128-bit-state PCG variants can never beat the mcg128 linear congruential generators (given that PCG applies a permuting output function to an underlying linear congruential generator); and because benchmark results can be so variable, as we'll see throughout this article.

But more importantly, like MINSTD, SplitMix never repeats itself, producing a low p-value and thus failing the test. As expected, pcg64_once_insecure fails too. But if you need a long computation on a beefy machine to determine that, maybe it's purely academic? A more efficient test would be helpful.

#### Counting Repeats #2: Store Values in a Hash Table

One way to avoid sorting is to store all the data in a hash table. If the number is already in the table when we go to insert it, we have a repeat.

Here's code for hash-table–based counting:

size_t count_repeats(RNG_TYPE& rng, size_t expanse)
{
using rngval_t = RNG_TYPE::result_type;

std::unordered_set<rngval_t> sample;

// Store all the output from the generator.

size_t duplicates = 0;
for (size_t i = 0; i < expanse; ++i) {
auto rval = rng();
auto result = sample.insert(rval);
if (!result.second) {
std::cout << "    - repeat of " << rval << " found, step " << i
<< "\n";
++duplicates;
}
}

return duplicates;
}

##### Results for 32-Bit PRNGs

If we test the Mersenne Twister, we see results like

unix% ./birthday.mt19937.hashtable 0.01
Testing std::mt19937
0..4294967295 is range of generator
198893 outputs to scan for duplicates [3.03485*sqrt(range)]
4.60511 duplicates expected on average
0.0100006 is probability of zero duplicates
0x882c2a09 is seed for this run
---
- Testing for repeats started...
- Allocating memory started...
- allocated hash table of reserved for 198893 elements
- Allocating memory completed (0.00148097 seconds)
- Checking PRNG output started...
- repeat of 2326356096 found, step 102565
- repeat of 764086403 found, step 104931
- repeat of 4140509098 found, step 116132
- repeat of 2835992082 found, step 139268
- repeat of 2448493342 found, step 171267
- repeat of 1230289205 found, step 198622
- Checking PRNG output completed (0.0391443 seconds)
- Testing for repeats completed (0.0406811 seconds)
---
6 repeats, p-value = 1 - 0.182648


The nice thing about this approach is that it tells us exactly when each repeat occurs. The less-good thing is that it's a little slower and uses 4.7 MB of RAM for data and up to 3 MB in malloc bookkeeping overhead, more than ten times more than our previous approach. And that's with GCC's C++ standard library; Clang's library requires 6 MB.

Nevertheless, let's check out how some of our 32-bit PRNGs perform:

Scheme Checking (seconds) Total (seconds) Repeats (count) p-value
std::mt19937 0.11477 0.11647 19 0.47025
gjrand32 0.09952 0.10150 20 0.55909
jsf32 0.10681 0.10842 18 0.38141
pcg32 0.11272 0.11486 22 0.72060
pcg32_fast 0.09995 0.10137 21 0.64369
sfc32 0.09577 0.09740 19 0.47025
xoroshiro64+ 0.09410 0.09662 19 0.47025
xoroshiro64* 0.09206 0.09376 20 0.55909
xoroshiro64** 0.09275 0.09543 19 0.47025
xorshift*64/32 0.09934 0.10130 19 0.47031
pcg32_once_insecure 0.10123 0.10453 0 2.06e-9
xorshift32 0.10358 0.10564 0 2.06e-9
std::minstd_rand 0.07011 0.07182 0 2.06e-9

It's mostly outside the scope of this article, but C++ allows us to specify our own memory allocator for its containers, and if we use a memory pool with a bump-pointer allocation, we can do much better in terms of both speed and memory use (the same amount of data is stored, but there is almost zero bookkeeping).

unix% ./birthday.mt19937.hashtable.bump 0.01
Testing std::mt19937
0..4294967295 is range of generator
198893 outputs to scan for duplicates [3.03485*sqrt(range)]
4.60511 duplicates expected on average
0.0100006 is probability of zero duplicates
0x9bf209d is seed for this run
---
- Testing for repeats started...
- Allocating memory started...
- allocated hash table of reserved for 198893 elements
- Allocating memory completed (0.000961311 seconds)
- Checking PRNG output started...
- repeat of 692467637 found, step 107618
- repeat of 414775239 found, step 147312
- repeat of 4252812390 found, step 188509
- Checking PRNG output completed (0.014626 seconds)
- Testing for repeats completed (0.0156315 seconds)
---
3 repeats, p-value = 0.324873


Now our test is faster than the original version using sorting. We can confirm this speedup by running all our 32-bit PRNGs again:

Scheme Checking (seconds) Total (seconds) Repeats (count) p-value
std::mt19937 0.04386 0.04629 21 0.64369
gjrand32 0.03943 0.04137 19 0.47025
jsf32 0.03980 0.04188 18 0.38141
pcg32 0.03844 0.03987 20 0.55909
pcg32_fast 0.03882 0.04068 21 0.64369
sfc32 0.04007 0.04157 20 0.55909
xoroshiro64+ 0.03748 0.03931 20 0.55909
xoroshiro64* 0.03780 0.04004 21 0.64369
xoroshiro64** 0.04071 0.04247 21 0.64369
xorshift*64/32 0.03796 0.03967 18 0.38147
pcg32_once_insecure 0.03394 0.03640 0 2.06e-9
xorshift32 0.03711 0.03961 0 2.06e-9
std::minstd_rand 0.02988 0.03133 0 2.06e-9
##### Results for 64-Bit PRNGs

It's a little too easy to run out of memory with the hash-table version. With the default choice of looking for a 99% chance of a repeat, GCC's libstdc++ implementation of std::unordered_set allocates a 128 GB base array to hold 17,179,869,143 chains (a prime just below 234) for a load factor of 75.9%, and GB 194.2 to hold 13,034,599,790 16-byte chain nodes. That's 323 GB total, before overhead. Unfortunately, malloc overheads mean that those 16-byte chain nodes actually consume 32 bytes, doubling memory usage for chaining to 388.5 GB and total memory usage to 516.4 GB, which is just a few GB too much.

In contrast, Clang's implementation just makes it through okay, all due to a slightly different choice for load factor of their hash tables and despite grabbing 24 bytes for chain nodes (they end up also rounding up to 32 bytes).

The bump-allocator version avoids the time and space overhead of individual malloc calls. Here are results for this version (running the 99% test):

Scheme Checking Total Repeats p-value
std::mt19937_64 4486.00 4529.97 5 0.68487
gjrand64 4522.78 4568.31 4 0.51226
jsf64 4597.81 4640.45 3 0.32486
mcg128 4471.28 4513.88 4 0.51226
mcg128_fast 4460.94 4505.08 5 0.68487
pcg64 4513.76 4556.75 2 0.16209
pcg64_fast 4496.36 4539.63 8 0.95467
sfc64 4430.88 4473.66 4 0.51226
xoroshiro128+ 4530.84 4574.55 4 0.51226
xoroshiro128** 4409.88 4453.78 4 0.51226
xorshift*128/64 4460.04 4503.96 3 0.32486
pcg64_once_insecure 4605.06 4649.2 0 0.01
xorshift64 4435.31 4481.12 0 0.01
splitmix64 4293.21 4335.7 0 0.01

Because this test is slow and memory hungry, these results are from just a single run. The 20-repeats test is too large to run using a standard C++ hash table. The provided code lets you swap in other, more efficient hash table implementations instead of std::unordered_set but we'll move on to other approaches.

#### Counting Repeats #3: Store in Buckets, Sort the Buckets

The next approach we'll look at for dealing with the high overhead of sorting is to dump ranges of numbers into separate buckets, and then sort each bucket to look for duplicates.

We'll skip looking at the code (but you can find it in the provided downloadable code).

##### Results for 32-Bit PRNGs

Here's a run for the 32-bit Mersenne Twister. You can see that we're spreading the output across 256 buckets:

unix% ./birthday.mt19937.bucketsort 0.01
Testing std::mt19937
0..4294967295 is range of generator
198893 outputs to scan for duplicates [3.03485*sqrt(range)]
4.60521 duplicates expected on average
0.00999963 is probability of zero duplicates
0xdb99f8af is seed for this run
---
- Testing for repeats started...
- Allocating memory started...
- allocated 256 buckets of capacity 971
- allocated 256 bucket counters
- Allocating memory completed (1.3238e-05 seconds)
- Storing PRNG output started...
- Storing PRNG output completed (0.0030664 seconds)
- Sorting started...
- Sorting completed (0.0173691 seconds)
- Scanning for duplicates started...
- repeat of 731276309 found
- repeat of 4007810445 found
- repeat of 1926145973 found
- repeat of 2079408604 found
- repeat of 1373598710 found
- Scanning for duplicates completed (0.000318993 seconds)
- Testing for repeats completed (0.0207814 seconds)
---
5 repeats, p-value = 1 - 0.315139


And here are results for our cohort of 32-bit PRNGs:

Scheme Storing (seconds) Sorting (seconds) Scanning (seconds) Total (seconds) Repeats (count) p-value
std::mt19937 0.00688 0.03629 0.00067 0.04393 20 0.55909
gjrand32 0.00377 0.03620 0.00058 0.04088 19 0.47025
jsf32 0.00345 0.03724 0.00059 0.04129 21 0.64369
pcg32 0.00358 0.03700 0.00058 0.04115 20 0.55909
pcg32_fast 0.00311 0.03702 0.00057 0.04095 21 0.64369
sfc32 0.00328 0.03706 0.00057 0.04111 20 0.55909
xoroshiro64+ 0.00341 0.03701 0.00058 0.04099 21 0.64369
xoroshiro64* 0.00337 0.03618 0.00057 0.04037 19 0.47025
xoroshiro64** 0.00349 0.03514 0.00057 0.03916 20 0.55909
xorshift*64/32 0.00362 0.03690 0.00059 0.04133 19 0.47031
pcg32_once_insecure 0.00359 0.03711 0.00057 0.04134 0 2.06e-9
xorshift32 0.00352 0.03669 0.00057 0.04091 0 2.06e-9
std::minstd_rand 0.00493 0.02533 0.00043 0.03093 0 2.06e-9

Storing the output is now about 50% slower, but sorting gets done in about two thirds of the time. Overall, it's a small net win. One interesting thing to note here is that sorting for the 32-bit Mersenne Twister lost its speed advantage from using uint_fast32_t.

##### Results for 64-Bit PRNGs
unix% ./birthday.mt19937_64.bucketsort 0.01
Testing std::mt19937_64
0..18446744073709551615 is range of generator
13034599790 outputs to scan for duplicates [3.03485*sqrt(range)]
4.60517 duplicates expected on average
0.01 is probability of zero duplicates
0xd68d5f90 is seed for this run
---
- Testing for repeats started...
- Allocating memory started...
- allocated 65536 buckets of capacity 248615
- allocated 65536 bucket counters
- Allocating memory completed (0.000549318 seconds)
- Storing PRNG output started...
- Storing PRNG output completed (698.925 seconds)
- Sorting started...
- Sorting completed (966.247 seconds)
- Scanning for duplicates started...
- repeat of 6508886296057597688 found
- repeat of 5924451221820332975 found
- Scanning for duplicates completed (16.1705 seconds)
- Testing for repeats completed (1681.34 seconds)
---
2 repeats, p-value = 0.16209


What we can see here is that this version drastically reduces sorting time by 43%, from 1719.31 seconds to 966.247 seconds, saving 12.5 minutes. But storing the output is 8.9× slower, rising from 78.9 seconds for one array to 698.9 seconds for buckets; that's 10.3 minutes longer. Overall, it's a win, but only just.

Here are the results for our cohort of 64-bit generators running the 20-repeats test (only one run):

Scheme Storing (seconds) Sorting (seconds) Scanning (seconds) Total (seconds) Repeats (count) p-value
std::mt19937_64 1379.07 2340.24 40.5855 3759.90 17 0.29703
gjrand64 1963.33 2314.87 42.1575 4320.36 22 0.72061
jsf64 1499.80 2315.43 40.1675 3855.40 25 0.88782
mcg128 1378.42 2364.94 47.2118 3790.57 22 0.72061
mcg128_fast 1400.20 2294.61 36.6053 3731.41 18 0.38142
pcg64 1473.77 2291.80 39.3417 3804.91 22 0.72061
pcg64_fast 1460.90 2305.29 38.3696 3804.56 17 0.29703
xoroshiro128+ 1561.03 2311.89 39.0475 3911.97 24 0.84323
xoroshiro128** 1607.38 2325.86 39.7094 3972.95 22 0.72061
xorshift*128/64 1502.62 2324.26 42.4239 3869.31 24 0.84323
pcg64_once_insecure 1991.08 2309.85 41.9223 4342.86 0 2.06e-9
xorshift64 1489.58 2331.94 42.0448 3863.56 0 2.06e-9
splitmix64 1346.53 2150.69 42.9768 3540.20 0 2.06e-9

This version is also more memory hungry than the original version—to allow buckets to be unevenly filled, it allocates 25% more memory. Overall, not a huge improvement.

#### Counting Repeats #4: Bloom Filter

Thus far, we've tried a few obvious approaches and they're slow and memory hungry. To do better, we'll use a more sophisticated approach.

In this version we deploy an advanced data structure, specifically, a variant of a Bloom filter. A Bloom filter is an incredibly space-efficient data structure for storing a set of values, but it is probabilistic—it will never forget anything added to the set, but it may erroneously say things are included in the set when they are not.

The steps for this algorithm are

1. Make a copy of the PRNG state.
2. Store the values in a Bloom filter. When we appear to see a duplicate, log this candidate repeat for later.
3. Throw away the Bloom filter.
4. Transfer the repeat candidates to a hash table, with a count of zero associated with each one.
5. Using the copy of the PRNG, replay the output sequence, ignoring all values except for the candidates stored in the hash table. If it's in the hash table, update the count whenever we see the value.
6. Candidates whose count reaches goes higher than 1 are true repeats.
##### Results for 32-Bit PRNGs

Here's the results running on the 32-bit Mersenne Twister:

unix% ./birthday.mt19937.bloomfilter 0.01
Testing std::mt19937
0..4294967295 is range of generator
198893 outputs to scan for duplicates [3.03485*sqrt(range)]
4.60511 duplicates expected on average
0.0100006 is probability of zero duplicates
0x256128e4 is seed for this run
---
- Testing for repeats started...
- Allocating memory started...
- allocated bloomish filter with 16384 64-bit elements (3 bits/entry)
- allocated possible-match log with 8192 64-bit elements
- Allocating memory completed (6.7429e-05 seconds)
- Phase 1, Gathering repeat candidates started...
- 5611 repeat candidates found
- Phase 1, Gathering repeat candidates completed (0.0061467 seconds)
- Reallocating memory started...
- Deallocated bloomish filter
- Allocated hash table of the same size
- Reallocating memory completed (0.000156567 seconds)
- Phase 2, Storing candidates in hash table started...
- Phase 2, Storing candidates in hash table completed (9.2692e-05 seconds)
- Phase 3, Testing candidates started...
- repeat of 183773048 found, step 92076
- repeat of 2427098983 found, step 95253
- repeat of 971438297 found, step 193226
- Phase 3, Testing candidates completed (0.00687177 seconds)
- Testing for repeats completed (0.0134276 seconds)
---
3 repeats, p-value = 0.324873


The Bloom filter and match log require only (16384 + 8192) × 4 bytes of memory, or 96 KB, about an eighth of the 777 KB our original store-and-sort version would have need.

Here's summary results for our cohort of 32-bit PRNGs running the expect-20-repeats test. I actually accidentally ran the tests twice, so here are both sets of results (in each case, these are the median values from 31 runs). Here's the first result set:

Scheme Candidates (seconds) Rechecking (seconds) Total (seconds) Repeats (count) p-value
std::mt19937 0.01217 0.01661 0.03024 19 0.47025
gjrand32 0.01017 0.01099 0.02113 18 0.38141
jsf32 0.01012 0.01217 0.02277 19 0.47025
pcg32 0.01044 0.01328 0.02421 19 0.47025
pcg32_fast 0.01085 0.01223 0.02357 17 0.29702
sfc32 0.00983 0.01170 0.02190 19 0.47025
xoroshiro64+ 0.01060 0.01109 0.02139 19 0.47025
xoroshiro64* 0.01014 0.01187 0.02226 20 0.55909
xoroshiro64** 0.01024 0.01282 0.02379 20 0.55909
xorshift*64/32 0.00983 0.01307 0.02382 20 0.55915
pcg32_once_insecure 0.01008 0.01304 0.02360 0 2.06e-9
xorshift32 0.00956 0.01163 0.02157 0 2.06e-9
std::minstd_rand 0.00859 0.00644 0.01518 0 2.06e-9

And here is the second:

Scheme Candidates (seconds) Rechecking (seconds) Total (seconds) Repeats (count) p-value
std::mt19937 0.01271 0.01763 0.03055 19 0.47025
gjrand32 0.01098 0.01279 0.02431 20 0.55909
jsf32 0.00945 0.01111 0.02131 20 0.55909
pcg32 0.01035 0.01192 0.02236 19 0.47025
pcg32_fast 0.01020 0.01166 0.02252 19 0.47025
sfc32 0.00991 0.0121 0.02281 19 0.47025
xoroshiro64+ 0.01023 0.01156 0.02205 20 0.55909
xoroshiro64* 0.00998 0.01228 0.02235 21 0.64369
xoroshiro64** 0.01080 0.01316 0.02465 22 0.72060
xorshift*64/32 0.00904 0.00870 0.01964 19 0.47031
pcg32_once_insecure 0.01009 0.01339 0.02452 0 2.06e-9
xorshift32 0.00880 0.00870 0.01861 0 2.06e-9
std::minstd_rand 0.00914 0.00678 0.01658 0 2.06e-9

The times are so short that there is a fair amount of variation between runs. We can determine that the Mersenne Twister is slower than the others but not a lot else.

##### Results for 64-Bit PRNGs
unix% ./birthday.mt19937_64.bloomfilter 0.01
Testing std::mt19937_64
0..18446744073709551615 is range of generator
13034599790 outputs to scan for duplicates [3.03485*sqrt(range)]
4.60517 duplicates expected on average
0.01 is probability of zero duplicates
0xe98207de is seed for this run
---
- Testing for repeats started...
- Allocating memory started...
- allocated bloomish filter with 1073741824 64-bit elements (3 bits/entry)
- allocated possible-match log with 536870912 64-bit elements
- Allocating memory completed (1.9574e-05 seconds)
- Phase 1, Gathering repeat candidates started...
- 362448004 repeat candidates found
- Phase 1, Gathering repeat candidates completed (1003.56 seconds)
- Reallocating memory started...
- Deallocated bloomish filter
- Allocated hash table of the same size
- Reallocating memory completed (3.00296 seconds)
- Phase 2, Storing candidates in hash table started...
- Phase 2, Storing candidates in hash table completed (9.87777 seconds)
- Phase 3, Testing candidates started...
- repeat of 4670426177229221241 found, step 5580903394
- repeat of 3043852644687666645 found, step 6413308578
- repeat of 16961218199013113029 found, step 8424705865
- repeat of 14428636557340975895 found, step 8809377507
- repeat of 13428407026126473932 found, step 9594741563
- repeat of 3452994221327353931 found, step 9718494086
- repeat of 5809172504185009050 found, step 10454910042
- repeat of 6895377101806331713 found, step 11314113109
- repeat of 10487859573817159209 found, step 11655528088
- Phase 3, Testing candidates completed (624.485 seconds)
- Testing for repeats completed (1640.93 seconds)
---
9 repeats, p-value = 1 - 0.0196595


Rather than using 97.1 GB for testing a 64-bit PRNG, this version uses 8 GB for the Bloom filter and 4 GB to log repeats. If we're happy to settle for a 72.5% chance of observing a repeat (and run the test a few times if we see none), we can perform the test in a total of 6 GB of RAM, well within the reach of most modern computers.

unix% ./birthday.mt19937_64.bloomfilter 0.275
Testing std::mt19937_64
0..18446744073709551615 is range of generator
6901370125 outputs to scan for duplicates [1.60685*sqrt(range)]
1.29098 duplicates expected on average
0.275 is probability of zero duplicates
0x54ef7472 is seed for this run
---
- Testing for repeats started...
- Allocating memory started...
- allocated bloomish filter with 536870912 64-bit elements (3 bits/entry)
- allocated possible-match log with 268435456 64-bit elements
- Allocating memory completed (8.9979e-05 seconds)
- Phase 1, Gathering repeat candidates started...
- 217650333 repeat candidates found
- Phase 1, Gathering repeat candidates completed (475.175 seconds)
- Reallocating memory started...
- Deallocated bloomish filter
- Allocated hash table of the same size
- Reallocating memory completed (1.43748 seconds)
- Phase 2, Storing candidates in hash table started...
- Phase 2, Storing candidates in hash table completed (5.42572 seconds)
- Phase 3, Testing candidates started...
- repeat of 6751060484066504991 found, step 6557968840
- Phase 3, Testing candidates completed (353.038 seconds)
- Testing for repeats completed (835.077 seconds)
---
1 repeats, p-value = 1 - 0.369979


This test gets done in less than 14 minutes on the main test machine I've been using for this article, which is a 2.1 GHz Haswell-EP-based Xeon E7-4830v3. If we run it instead on a smaller machine with only 32 GB of RAM but a more recent CPU, such as a 3.2 GHz Skylake-W-based Xeon W-2104, the same test gets done in less than ten minutes. On my 2015 iMac which has a 4.0 GHz Skylake-DT-based i7-6700K CPU (and 40 GB of RAM), it finishes in 11.5 minutes.

In fact, we can now see how long it takes us to fail SplitMix at different p-values on the Skylake-W machine:

Time (seconds) Memory (GB) Expected Repeats Actual Repeats p-value
503.59 6 1.29 0 0.275
906.07 12 4.6 0 0.01
1276.56 12 7.5 0 0.000553
1517.43 24 10 0 4.54e-05
2218.33 24 20 0 2.06e-09

It's fair to say that with less than half an hour on a modern machine we can fail SplitMix's 64-bit output for its lack of repeats.

Returning to our usual test machine, let's look at results for our group of 64-bit PRNGs looking for 20 repeats (these results are fast enough that I was able give the median of five runs):

Scheme Candidates Rechecking Total Repeats p-value
std::mt19937_64 2275.92 1553.84 3865.49 22 0.72061
gjrand64 1862.55 1267.72 3171.11 19 0.47026
jsf64 1805.18 1172.75 3010.42 25 0.88782
mcg128 1583.11 1132.47 2771.36 20 0.55909
mcg128_fast 1598.20 1123.13 2715.28 21 0.64370
pcg64 2077.65 1347.71 3455.36 20 0.55909
pcg64_fast 2151.92 1307.01 3505.75 19 0.47026
sfc64 1659.98 1115.88 2817.56 17 0.29703
xoroshiro128+ 1607.51 1161.14 2797.97 19 0.47026
xoroshiro128** 2281.04 1381.30 3698.20 20 0.55909
xorshift*128/64 2717.67 1456.00 4208.69 19 0.47026
pcg64_once_insecure 2130.41 1200.13 3388.48 0 2.06e-9
xorshift64 2227.01 1170.83 3423.48 0 2.06e-9
splitmix64 2271.97 1337.80 3652.14 0 2.06e-9

Although all timing results should be taken with a pinch of salt, it's a little disappointing that in this instance Vigna's Xoroshiro128** PRNG is actually slower than the Mersenne Twister. In this case, mcg128 also beats mcg128_fast and pcg64 beats pcg64_fast. For me, the main takeaway here is how difficult it is to meaningfully benchmark different PRNGs.

It might seem like we're done here, but it's worth looking at the effects of a few of PRNG adapters.

#### A divided_rng Adapter

Suppose, for example, that we desire positive signed 64-bit integers; in other words, 63-bit integers. If we take SplitMix's output and divide it by two, it goes from having no repeats to having each number repeated exactly once. Should it now pass our test?

In C++11's framework for PRNGs, it's very easy to make a PRNG adapter that can divide any PRNG by some number:

template <typename RNG, unsigned int DIVISOR,
typename rtype = typename RNG::result_type>
struct divided_rng : public RNG {
using RNG::RNG;
using result_type = rtype;

static result_type min() { return RNG::min() / DIVISOR; }
static result_type max() { return RNG::max() / DIVISOR; }

result_type operator()() { return RNG::operator()() / DIVISOR; }
};


(Note that although the code uses division, because DIVISOR is known at compile time, an optimizer will simplify the code to use fast bit shifts when DIVISOR is a power of two.)

##### Testing std::minstd_rand

Let's look first at std::minstd_rand and divide by 3, so that now rather than no repeats, each number appears three times. Initially, things look good:

unix% ./birthday.minstddiv3.sort 0.01
Testing divided_rng<regularized_rng<std::minstd_rand,uint32_t>,3>
0..715827881 is range of generator
81198 outputs to scan for duplicates [3.03485*sqrt(range)]
4.60501 duplicates expected on average
0.0100016 is probability of zero duplicates
0xe9be5e15 is seed for this run
---
- Testing for repeats started...
- Allocating memory started...
- allocated array of size 81198.
- Allocating memory completed (7e-06 seconds)
- Storing PRNG output started...
- Storing PRNG output completed (0.000403 seconds)
- Sorting started...
- Sorting completed (0.003575 seconds)
- Scanning for duplicates started...
- repeat of 29568970 found
- repeat of 501715642 found
- repeat of 533810085 found
- Scanning for duplicates completed (4.1e-05 seconds)
- Testing for repeats completed (0.00403 seconds)
---
3 repeats, p-value = 0.324891


This result seems like a good pass.

If, however, we run a larger test we find the number of repeats found consistently falls below the expected value:

macos% ./birthday.minstddiv3.sort 200
Testing divided_rng<regularized_rng<std::minstd_rand,uint32_t>,3>
0..715827881 is range of generator
535100 outputs to scan for duplicates [20*sqrt(range)]
199.95 duplicates expected on average
0 is probability of zero duplicates
0x51a1ff3d is seed for this run
---
- Testing for repeats started...
- Allocating memory started...
- allocated array of size 535100.
- Allocating memory completed (7e-06 seconds)
- Storing PRNG output started...
- Storing PRNG output completed (0.00298 seconds)
- Sorting started...
- Sorting completed (0.028566 seconds)
- Scanning for duplicates started...
- repeat of 1990789 found
- repeat of 13528078 found
[... 126 repeats omitted ...]
- repeat of 701171518 found
- repeat of 704058012 found
- Scanning for duplicates completed (0.000325 seconds)
- Testing for repeats completed (0.03189 seconds)
---
130 repeats, p-value = 8.26764e-08


In fact, we can still make this generator fail if we divide the output by 31 to get 30 repeats. We just have to generate more output to see the fail:

macos% ./birthday.minstddiv31.sort 20000
Testing divided_rng<regularized_rng<std::minstd_rand,uint32_t>,31>
0..69273665 is range of generator
1664617 outputs to scan for duplicates [200*sqrt(range)]
19840.8 duplicates expected on average
0 is probability of zero duplicates
0x3587008a is seed for this run
---
- Testing for repeats started...
- Allocating memory started...
- allocated array of size 1664617.
- Allocating memory completed (7e-06 seconds)
- Storing PRNG output started...
- Storing PRNG output completed (0.00837 seconds)
- Sorting started...
- Sorting completed (0.095529 seconds)
- Scanning for duplicates started...
- repeat of 9918 found
- repeat of 15644 found
[... 19226 repeats omitted ...]
- repeat of 69269020 found
- repeat of 69272254 found
- Scanning for duplicates completed (0.007172 seconds)
- Testing for repeats completed (0.111095 seconds)
---
19230 repeats, p-value = 6.68804e-06


We can double-check the test by doing a similar test dividing the 32-bit Mersenne Twister by 32:

macos% ./birthday.mt19937div32.sort 20000
Testing divided_rng<std::mt19937,32>
0..134217727 is range of generator
2317048 outputs to scan for duplicates [200*sqrt(range)]
19885.4 duplicates expected on average
0 is probability of zero duplicates
0x80df3adf is seed for this run
---
- Testing for repeats started...
- Allocating memory started...
- allocated array of size 2317048.
- Allocating memory completed (7e-06 seconds)
- Storing PRNG output started...
- Storing PRNG output completed (0.013726 seconds)
- Sorting started...
- Sorting completed (0.13695 seconds)
- Scanning for duplicates started...
- repeat of 6369 found
- repeat of 10240 found
[... 19712 repeats omitted ...]
- repeat of 134200182 found
- repeat of 134205138 found
- Scanning for duplicates completed (0.009592 seconds)
- Testing for repeats completed (0.160295 seconds)
---
19716 repeats, p-value = 0.1154


Here we get a much more plausible number of repeats.

##### Testing splitmix64

Again, with SplitMix and 63-bit output, we'll pass a small test:

unix% ./birthday.splitmix64div2.bloomfilter 0.01
Testing divided_rng<SplitMix64,2>
0..9223372036854775807 is range of generator
9216853902 outputs to scan for duplicates [3.03485*sqrt(range)]
4.60517 duplicates expected on average
0.01 is probability of zero duplicates
0x24ceea9b is seed for this run
---
- Testing for repeats started...
- Allocating memory started...
- allocated bloomish filter with 1073741824 64-bit elements (4 bits/entry)
- allocated possible-match log with 536870912 64-bit elements
- Allocating memory completed (3.4661e-05 seconds)
- Phase 1, Gathering repeat candidates started...
- 95636376 repeat candidates found
- Phase 1, Gathering repeat candidates completed (702.437 seconds)
- Reallocating memory started...
- Deallocated bloomish filter
- Allocated hash table of the same size
- Reallocating memory completed (3.00048 seconds)
- Phase 2, Storing candidates in hash table started...
- Phase 2, Storing candidates in hash table completed (2.1461 seconds)
- Phase 3, Testing candidates started...
- repeat of 1021193238591905732 found, step 7067299467
- Phase 3, Testing candidates completed (203.722 seconds)
- Testing for repeats completed (911.306 seconds)
---
1 repeats, p-value = 0.0560517


But if we run a larger test (e.g., looking for about 100 repeats), about an hour and quarter (and 48 GB) later, it fails:

unix% ./birthday.splitmix64div2.bloomfilter 100
Testing divided_rng<SplitMix64,2>
0..9223372036854775807 is range of generator
42949672961 outputs to scan for duplicates [14.1421*sqrt(range)]
100 duplicates expected on average
3.78351e-44 is probability of zero duplicates
0x6bc87e32 is seed for this run
---
- Testing for repeats started...
- Allocating memory started...
- allocated bloomish filter with 4294967296 64-bit elements (3 bits/entry)
- allocated possible-match log with 2147483648 64-bit elements
- Allocating memory completed (2.7746e-05 seconds)
- Phase 1, Gathering repeat candidates started...
- 773607258 repeat candidates found
- Phase 1, Gathering repeat candidates completed (3268.16 seconds)
- Reallocating memory started...
- Deallocated bloomish filter
- Allocated hash table of the same size
- Reallocating memory completed (11.7403 seconds)
- Phase 2, Storing candidates in hash table started...
- Phase 2, Storing candidates in hash table completed (21.3582 seconds)
- Phase 3, Testing candidates started...
- repeat of 6497401084100231171 found, step 4349343329
- repeat of 8982250182547917419 found, step 10013821426
[... 38 repeats omitted ...]
- repeat of 3352171370750978657 found, step 40445072164
- repeat of 3504357146487545 found, step 40476413442
- Phase 3, Testing candidates completed (1285.31 seconds)
- Testing for repeats completed (4586.57 seconds)
---
42 repeats, p-value = 4.51174e-11


These two tests of MINSTD and SplitMix show that our birthday-problem test not only detects a complete lack of repeats, it also determines whether the number of repeats is statistically correct, which it is not for this variant of SplitMix. (When used as a 32-bit PRNG, SplitMix will easily have enough repeats to pass this test.)

#### A skipping_rng Adapter

Thus far via the Bloom-filter based test, we've been able to reduce the testing overheads so that we can test 64-bit PRNGs on most modern machines, but the test is still quite resource hungry. Our final adapter will allow us to find problems with lack of repeats in as little memory as we like.

This adapter is like the divided_rng adapter, but only returns a value when the remainder after division is equal to some particular value (e.g., zero). In this case, every repeat from the adapted generator represents a real repeat from the original generator, but most outputs from the original generator are skipped (including lots of repeats).

Here's the code

template <typename RNG, typename RNG::result_type DIVISOR,
typename RNG::result_type REMAINDER_MATCH = 0,
typename rtype = typename RNG::result_type>
struct skipping_rng : public RNG {
using RNG::RNG;
using result_type = rtype;

static_assert(RNG::min() == 0, "Requires sane PRNG");

static result_type min() { return RNG::min() / DIVISOR; }
static result_type max() { return RNG::max() / DIVISOR; }

result_type operator()() {
for(;;) {
result_type rval = RNG::operator()();
if ((rval % result_type(DIVISOR)) == result_type(REMAINDER_MATCH)) {
return rval / result_type(DIVISOR);
}
}
}
};

##### Testing std::minstd_rand

Here's a test of MINSTD.

macos% ./birthday.minstdskip49981.sort 10
Testing skipping_rng<regularized_rng<std::minstd_rand,uint32_t>,49981>
0..42965 is range of generator
927 outputs to scan for duplicates [4.47214*sqrt(range)]
9.91802 duplicates expected on average
4.92789e-05 is probability of zero duplicates
0xb621d0 is seed for this run
---
- Testing for repeats started...
- Allocating memory started...
- allocated array of size 927.
- Allocating memory completed (1e-06 seconds)
- Storing PRNG output started...
- Storing PRNG output completed (0.163223 seconds)
- Sorting started...
- Sorting completed (3.1e-05 seconds)
- Scanning for duplicates started...
- Scanning for duplicates completed (0 seconds)
- Testing for repeats completed (0.163266 seconds)
---
0 repeats, p-value = 4.92789e-05


Now we fail the test in less than a second using less than 4 KB of RAM. In other words, had it been around then, I could have failed this generator on my 1982 16K ZX Spectrum home computer.

##### Testing splitmix64

It's the same story for SplitMix. In this test, we generate 4.5 TB of output, but only keep 1/65536 of it, reducing our storage requirements down to 0.5 GB to get a test small enough to run on an old iPhone 5s:

macos% ./birthday.spitmix64skip65536.sort 10
Testing skipping_rng<SplitMix64,65536>
0..281474976710655 is range of generator
75029991 outputs to scan for duplicates [4.47214*sqrt(range)]
10 duplicates expected on average
4.54e-05 is probability of zero duplicates
0x3ad0e4f1 is seed for this run
---
- Testing for repeats started...
- Allocating memory started...
- allocated array of size 75029991.
- Allocating memory completed (7e-06 seconds)
- Storing PRNG output started...
- Storing PRNG output completed (4680.73 seconds)
- Sorting started...
- Sorting completed (5.72453 seconds)
- Scanning for duplicates started...
- Scanning for duplicates completed (0.043199 seconds)
- Testing for repeats completed (4686.5 seconds)
---
0 repeats, p-value = 4.54e-05


This test took just under an hour and a twenty minutes to run, but ran in 2.3% of the memory we needed for our memory-efficient Bloom-filter version, and 0.39% of the memory we would use with our original store-and-sort version.

#### A doubled_rng Adapter

Our final adapter gathers together pairs of outputs, allowing us to test for repeats in two dimensional space rather than a single dimension. It does so by combining two output values of range $r$ from the underling generator into a single output of range $r^2$ (e.g., turning a 32-bit PRNG into a 64-bit PRNG).

template <typename RNG, typename rtype>
struct doubled_rng : public RNG {
public:
using RNG::RNG;
using result_type = rtype;

static_assert(RNG::min() == 0, "Requires sane PRNG");

static constexpr result_type min() { return combine2(RNG::min(), RNG::min()); }
static constexpr result_type max() { return combine2(RNG::max(), RNG::max()); }

result_type operator()()
{
auto first  = RNG::operator()();
auto second = RNG::operator()();
return combine2(first,second);
}

protected:
static constexpr result_type combine2(typename RNG::result_type first,
typename RNG::result_type second)
{
return (result_type(second) * (result_type(RNG::max())+1))
+ result_type(first);
}
};

##### Testing Xoroshiro64+

We'll test this adapter by combining pairs of outputs from Xoroshiro64+ (a 32-bit PRNG) to make 64-bit outputs:

unix% ./birthday.xoroshiro64plus32doubled.bloomfilter 10
Testing doubled_rng<xoroshiro64plus32,uint64_t>
0..18446744073709551615 is range of generator
19207677670 outputs to scan for duplicates [4.47214*sqrt(range)]
10 duplicates expected on average
4.53999e-05 is probability of zero duplicates
0x9d170147 is seed for this run
---
- Testing for repeats started...
- Allocating memory started...
- allocated bloomish filter with 2147483648 64-bit elements (4 bits/entry)
- allocated possible-match log with 1073741824 64-bit elements
- Allocating memory completed (1.7381e-05 seconds)
- Phase 1, Gathering repeat candidates started...
- 222925104 repeat candidates found
- Phase 1, Gathering repeat candidates completed (1175.11 seconds)
- Reallocating memory started...
- Deallocated bloomish filter
- Allocated hash table of the same size
- Reallocating memory completed (2.94328 seconds)
- Phase 2, Storing candidates in hash table started...
- Phase 2, Storing candidates in hash table completed (3.82449 seconds)
- Phase 3, Testing candidates started...
- repeat of 1341226328110553406 found, step 12231469160
- repeat of 16219277760915004482 found, step 14854862079
- repeat of 166193113774345064 found, step 15970584051
- repeat of 15772780968502958711 found, step 17360448815
- repeat of 9015983296019028048 found, step 18470600211
- Phase 3, Testing candidates completed (387.382 seconds)
- Testing for repeats completed (1569.26 seconds)
---
5 repeats, p-value = 0.067086


As you can see, this generator passes the test. We could argue, perhaps, that because this generator only has 64 bits of state, producing 64-bit output that passes this test must also be biased—if some 64-bit outputs are showing up twice, some must not be showing up at all.

##### Testing Xoroshiro64*

Vigna's Xoroshiro64* is similar to Xoroshiro64+, but has a different output function. Vigna claims it is an improvement because this variant is now maximally equidistributed. Let's see how it fares:

unix% ./birthday.xoroshiro64star32doubled.bloomfilter 10
Testing doubled_rng<xoroshiro64star32,uint64_t>
0..18446744073709551615 is range of generator
19207677670 outputs to scan for duplicates [4.47214*sqrt(range)]
10 duplicates expected on average
4.53999e-05 is probability of zero duplicates
0x3cff5c16 is seed for this run
---
- Testing for repeats started...
- Allocating memory started...
- allocated bloomish filter with 2147483648 64-bit elements (4 bits/entry)
- allocated possible-match log with 1073741824 64-bit elements
- Allocating memory completed (1.8427e-05 seconds)
- Phase 1, Gathering repeat candidates started...
- 222957805 repeat candidates found
- Phase 1, Gathering repeat candidates completed (1188.37 seconds)
- Reallocating memory started...
- Deallocated bloomish filter
- Allocated hash table of the same size
- Reallocating memory completed (4.81861 seconds)
- Phase 2, Storing candidates in hash table started...
- Phase 2, Storing candidates in hash table completed (4.20488 seconds)
- Phase 3, Testing candidates started...
- Phase 3, Testing candidates completed (425.412 seconds)
- Testing for repeats completed (1622.8 seconds)
---
0 repeats, p-value = 4.53999e-05


It fails the test. Xoroshiro64* generates every pair of 32-bit outputs exactly once (except for (0,0), which is never generated).

##### Testing Truncated XorShift

Finally, we'll test two truncated XorShift PRNGs. First, let's look at my recommended generator, XorShift* 64/32:

unix% ./birthday.xorshift64star32doubled.bloomfilter 10
Testing doubled_rng<xorshift64star32a,uint64_t>
0..18446744073709551615 is range of generator
19207677670 outputs to scan for duplicates [4.47214*sqrt(range)]
10 duplicates expected on average
4.53999e-05 is probability of zero duplicates
0x174da134 is seed for this run
---
- Testing for repeats started...
- Allocating memory started...
- allocated bloomish filter with 2147483648 64-bit elements (4 bits/entry)
- allocated possible-match log with 1073741824 64-bit elements
- Allocating memory completed (2.6126e-05 seconds)
- Phase 1, Gathering repeat candidates started...
- 222922629 repeat candidates found
- Phase 1, Gathering repeat candidates completed (1609.33 seconds)
- Reallocating memory started...
- Deallocated bloomish filter
- Allocated hash table of the same size
- Reallocating memory completed (6.1724 seconds)
- Phase 2, Storing candidates in hash table started...
- Phase 2, Storing candidates in hash table completed (5.9659 seconds)
- Phase 3, Testing candidates started...
- repeat of 16075751520106806424 found, step 4869737780
- repeat of 8452591016538347469 found, step 6063298803
- repeat of 8419835388433088013 found, step 6971801008
- repeat of 9239875403387430140 found, step 7987074261
- repeat of 11194445319005759147 found, step 9274052778
- repeat of 12213942361713591065 found, step 15617173621
- repeat of 3144078127879205090 found, step 17344117368
- repeat of 9374697397498985266 found, step 17683711291
- repeat of 4344357064314348963 found, step 18543158535
- repeat of 13251189265117312659 found, step 18807841871
- repeat of 10280694519002528251 found, step 19083159588
- Phase 3, Testing candidates completed (657.915 seconds)
- Testing for repeats completed (2279.39 seconds)
---
11 repeats, p-value = 1 - 0.303224


Notice that this generator passes the test just fine. Now let's try XorShift 64/32 which throws away 32-bits but does not use multiplication as a scrambling function.

unix% ./birthday.xorshift64plain32doubled.bloomfilter 10
Testing doubled_rng<xorshift_detail::xorshift<uint64_t, uint32_t, 11, 31, 18>,uint64_t>
0..18446744073709551615 is range of generator
19207677670 outputs to scan for duplicates [4.47214*sqrt(range)]
10 duplicates expected on average
4.53999e-05 is probability of zero duplicates
0x68389c89 is seed for this run
---
- Testing for repeats started...
- Allocating memory started...
- allocated bloomish filter with 2147483648 64-bit elements (4 bits/entry)
- allocated possible-match log with 1073741824 64-bit elements
- Allocating memory completed (1.9951e-05 seconds)
- Phase 1, Gathering repeat candidates started...
- 323099500 repeat candidates found
- Phase 1, Gathering repeat candidates completed (1749.5 seconds)
- Reallocating memory started...
- Deallocated bloomish filter
- Allocated hash table of the same size
- Reallocating memory completed (6.59228 seconds)
- Phase 2, Storing candidates in hash table started...
- Phase 2, Storing candidates in hash table completed (11.1204 seconds)
- Phase 3, Testing candidates started...
- repeat of 8432241086602748180 found, step 512009280
- repeat of 10991711275384902383 found, step 2984376329
- repeat of 8715911071080357710 found, step 3076849680
- repeat of 6307293652390151796 found, step 3699476747
- repeat of 5918160424646835498 found, step 3757200762
- repeat of 2102851281581636308 found, step 3793484182
- repeat of 18004728710102585771 found, step 3856237150
- repeat of 13643872512152535833 found, step 3904548937
- repeat of 15813780386717731573 found, step 5120593723
- repeat of 10606428155832328500 found, step 5246820173
- repeat of 4124773483188809046 found, step 5689671380
- repeat of 8881227705206282627 found, step 6314841372
- repeat of 10347249380159185545 found, step 6549081914
- repeat of 3259454996581558145 found, step 6789498453
- repeat of 5465544356170662972 found, step 6895105232
- repeat of 9598202579737928973 found, step 7151318395
- repeat of 12711680550813551463 found, step 7547298209
- repeat of 8243254123160113110 found, step 7680715882
- repeat of 3105491326004167632 found, step 9584052322
- repeat of 6389591889012097379 found, step 10209858987
- repeat of 5998122806250358082 found, step 10400041560
- repeat of 6485183761602045376 found, step 10482622702
- repeat of 13409280636077211564 found, step 10789884330
- repeat of 7440253727363109864 found, step 10795267164
- repeat of 15774111206512302355 found, step 10911672510
- repeat of 9542813944494139543 found, step 11467617560
- repeat of 13106000190889745837 found, step 11564124980
- repeat of 4539599262207590835 found, step 11564467874
- repeat of 7226327696139710853 found, step 11698843712
- repeat of 16662818355299619849 found, step 11777961088
- repeat of 3514753089552548735 found, step 12145627618
- repeat of 3686498505503683864 found, step 12536448056
- repeat of 8754154103642100727 found, step 13018513381
- repeat of 5922676618160286848 found, step 13616884749
- repeat of 2771565926978918441 found, step 13775938319
- repeat of 2673034166933327078 found, step 13927906955
- repeat of 17286920818906038529 found, step 14140724501
- repeat of 15351512866337221122 found, step 14301601872
- repeat of 17275527660410671298 found, step 14464765839
- repeat of 12666155941460351740 found, step 14534427945
- repeat of 7259412895957455688 found, step 15101896626
- repeat of 8718082224947945304 found, step 15269349237
- repeat of 17673454400876571400 found, step 15899175752
- repeat of 3093540400943525978 found, step 16102279002
- repeat of 100775988639710351 found, step 16115089460
- repeat of 3840328319316700675 found, step 16162103481
- repeat of 12046651965032902115 found, step 16222837628
- repeat of 13960047651775979002 found, step 16443235714
- repeat of 10615736297527204965 found, step 16670639022
- repeat of 8565924938554264941 found, step 16892419321
- repeat of 3992549844265476149 found, step 16943129372
- repeat of 7998729961698395805 found, step 17104768142
- repeat of 13918425977287437518 found, step 17233884687
- repeat of 1117798506300080490 found, step 17290345185
- repeat of 859886623890470340 found, step 17376432879
- repeat of 14006406166013631531 found, step 17495268199
- repeat of 8400904031756885513 found, step 17623867884
- repeat of 17429755626699761713 found, step 17688996534
- repeat of 10855999513313140013 found, step 17873370607
- repeat of 6477932417974324450 found, step 17877309427
- repeat of 9759571848035785557 found, step 17891403453
- repeat of 8214606349987392434 found, step 17975244179
- repeat of 3915974417542383620 found, step 18070507208
- repeat of 2810120710806743824 found, step 18220588645
- repeat of 2820970032855298710 found, step 18295595540
- repeat of 6952151018582886574 found, step 18629103333
- repeat of 6636270080993323558 found, step 18691888391
- repeat of 16367499301027804324 found, step 18968625522
- repeat of 13861772320392267683 found, step 19205087616
- Phase 3, Testing candidates completed (854.275 seconds)
- Testing for repeats completed (2621.49 seconds)
---
69 repeats, p-value = 1.89751e-15


Oh dear, that's vastly too many repeats. Thus although this generator passed the test when we looked at repeats in one dimension, in two dimensions it fails. (This test also reveals some rounding-error issues in the calculation of p-values when there are vastly too many repeats; the correct p-value is 1 - 4.4e-35.)

The existence of repeats in both of these truncated XorShift generators does, however, show that neither of these generators is maximally equidistributed (because maximal equidistribution would be two-dimensional equidistribution, and would result in no repeats occurring in the test). I mention this because in a comment on the website Reddit, Vigna wrote:

And of course Marsaglia's xorshift64 generators are equidistributed in the maximum dimension—it's a trivial mathematical fact; Melissa O'Neill writing "no" in a table on her website will not change mathematics.

Initially, I thought Vigna was talking about one of these truncated generators, because the table on the front page of the PCG website lists XorShift* 64/32. If he had been thinking that XorShift 64/32 or XorShift* 64/32 were two-dimensionally equidistributed, he would have been in error as these tests show. But I now realize he was talking about the untruncated versions we saw failing the test earlier. We can describe XorShift 32 and XorShift 64 as ”equidistributed in the maximum dimension” provided we understand that in this case, the “maximum dimension” is one; but to my eye k-dimensional equidistribution with k = 1 is a degenerate case that is mostly meaningless. Thus we can chalk this one up to a misunderstanding—I only think k-dimensional equidistribution is interesting as a property when k > 1. If I had used Vigna's definition, I would have to say that every uniform PRNG is k-dimensionally equidistributed (k = 1), which is almost all of the ones I list in the table.

### Conclusions

We began with a test for repeats based on the birthday problem that was viable to run on 32-bit PRNGs, and also worked on 64-bit PRNGs but required significant amounts of memory to run. We looked at several approaches, refining our storage strategy until we had a test that could run in less than half an hour on a modern laptop. By throwing away some outputs, we finally ended up with a test that can run on on far more modest hardware (such as a 2008 MacBook Air or an iPhone 5s).

There are some “birthday” tests in PRNG test suites, such as Marsaglia's Birthday Spacings test (as discussed by Eucyer and Simard), but Marsaglia's test is about detecting too much regularity in the distance between outputs when sorted into order rather than the number of repeats. It appears that currently most PRNG test-suites don't check to make sure generators have the correct frequency of 64-bit repeats, even though, as we have seen, it's entirely practical to do so and doing so can detect flaws missed by other tests.

With this test, we are able to quickly fail MINSTD, XorShift 32, XorShift 64, SplitMix64, and Xoroshiro64* (when its 32-bit output is paired to make 64-bit output). (We can also fail the PCG _once_insecure variants as well, although the every-output-occurs-only-once property is telegraphed in the name and the PCG paper cautions against their use as general-purpose generators for precisely this reason.)

Finally, the fact that Xoroshiro64* fails the test and Xoroshiro64+ passes casts some doubt on the idea that maximal equidistribution is a desirable property, because a maximally equidistributed generator must fail this statistical test shortly after the square root of the period, whereas its less-than-maximally equidistributed sibling survives far longer. We'll return to that question in a future article.

The tests involved in this article involved 138 distinct executables, run across three different machines and two different operating systems (4310 runs on the 512 GB Haswell Xeon Linux server, 14 runs on the Skylake Xeon Linux server, and 7 runs on the Skylake Xeon Mac desktop). A total of 4049 runs generated 155 billion 32-bit integers (0.56 TB), and 280 runs generated 20.2 trillion 64-bit integers (147 TB), requiring 9.2 days of compute time. About 15 trillion of these outputs comes from three runs (across all three platforms) of birthday.spitmix64skip65536.sort, which uses 5 trillion numbers per run. To allow comparison across platforms, the code was compiled with Clang/LLVM 6.0.0, using the GNU libstdc++ standard library (which has a significantly faster Mersenne Twister implementation than the one in libc++). Although the tests can be used as benchmarks to compare both different generators and different platforms, an overall takeaway from the timings collected is that a vast array of factors influence performance.