PCG Passes PractRand

There were various questions that people asked when PCG first started being discussed on the Internet; one those questions was whether I had also tested PCG with PractRand. I hadn't. I had done my testing using TestU01, which is considered by most to be the “gold standard”, with the broadest range of tests. Back in 2014, PractRand hadn't been on my radar, but looking into it I discovered that it is TestU01's closest competitor. I put testing with PractRand on my to-do list, but it wasn't a very high priority because I knew that all my PCG family members had been designed to pass TestU01's BigCrush test battery with plenty of headroom, which gave me good reason to believe they would do well in any set of statistical tests.

Finally, this summer I was discussing the best practices for testing random number generators with a colleague and I realized that it was high time for me to finally get around to testing PCG (and a few other recent PRNGs, too!). The short version is, it passes just fine. But for the details, read on…

How Random Number Generator Testing Works

PRNG test suites work by performing some statistically well-understood task using a candidate generator and then checking the plausibility of the results. For example, we could simulate the Birthday Problem, where we keep adding people to a room until two people have the same birthday, and then count how many people we needed. The mathematics of this problem is well understood, so we know exactly how probable every outcome is. A generator that repeated “42” forever would fail this test because repeats (“the same birthday”) occur far too often, as would a generator which never allowed any number to repeat (not enough repeats), as would a generator where repeats happened with the exact same gap between every repeat (incorrect distribution for repeats).

There are a large number of possible tests, some based on familiar ideas like the birthday problem or hands of poker, and others that are more abstruse. It is not uncommon for a generator to pass many tests in a test battery but be tripped up by just one family of tests.

Much like experiments in other fields of science, these tests typically produce a p-value, but whereas scientists usually desire results where the null hypothesis—that the observations were merely due to chance—is ruled out, we desire the opposite: a result confirming that the observed behavior is consistent with chance.

It's in the nature of random number generation that if you run enough tests enough times, you'll sometimes have test outcomes that amount to a “one in a thousand” or even “one in a million” event, but these kinds of failures are easily separated from genuine problems because real problems get worse and worse the more we test, whereas the occasional improbable result won't recur through extended or repeated testing.

Installation and Use; PractRand vs TestU01

It turns out that TestU01 and PractRand are quite complementary.

TestU01

TestU01 comes with an academic paper and two user manuals (a basic one and an advanced one). It compiles to a library and users must understand the TestU01 API in order to use it. In practice, to use the various test batteries TestU01 provides, we just need to understand one function, unif01_CreateExternGenBits. We pass it a function that generates a 32-bit unsigned int and the name of the generator, and then pass the unif01_Gen* object it returns on to the test battery we wish to use, such as bbattery_BigCrush.

TestU01 is not very hard to interface with in theory, but in practice there are a few gotchas. It is designed to test at most 32 bits and some tests use fewer. And it tends to be most sensitive to issues in the high/most-significant bits rather than the low/least-significant bits. (Some users have thought that because it also provides a function unif01_CreateExternGen01 for generators that return floating point double value in the range of 0–1 that this function would use all 48 bits of the floating-point mantissa but that is not the case.)

Thus even when testing 32-bit values, it is important to run two tests, a regular test and a test with a modified version of the PRNG that returns its bits flipped into reverse order.

When the Crush batteries finish, they write a summary. For example, a successful pair of Small Crush runs looks like this:


========= Summary results of SmallCrush =========

 Version:          TestU01 1.2.3
 Generator:        32-bit WELLRNG512a
 Number of statistics:  15
 Total CPU time:   00:00:07.82

 All tests were passed


========= Summary results of SmallCrush =========

 Version:          TestU01 1.2.3
 Generator:        32-bit WELLRNG512a [Reversed]
 Number of statistics:  15
 Total CPU time:   00:00:09.10

 All tests were passed

and a failure looks like this:


========= Summary results of SmallCrush =========

 Version:          TestU01 1.2.3
 Generator:        32-bit MINSTD LCG
 Number of statistics:  15
 Total CPU time:   00:00:10.05
 The following tests gave p-values outside [0.001, 0.9990]:
 (eps  means a value < 1.0e-300):
 (eps1 means a value < 1.0e-15):

       Test                          p-value
 ----------------------------------------------
  1  BirthdaySpacings                 eps  
  2  Collision                      1 - eps1
  6  MaxOft                           eps  
 ----------------------------------------------
 All other tests were passed

For generators that produce 64-bit values, we must generate at least four variations to return the high 32 bits, the low 32 bits, and each of these reversed. Even there, it is not clear whether there might be some unfortunate correlation between the top half and the bottom half of the 64-bit value that this “half at a time” approach examines, so perhaps we should also add two more variants, a version that alternates the high-32 and low-32 bits and its reversed version.

Because 64-bit generators are difficult to test with much certainty in TestU01, I strongly recommend that creators of 64-bit generators make sure that their generation scheme is scalable to smaller sizes, allowing them to test cut-down 32-bit versions that use half as much memory for generator state. That way TestU01 can have a more holistic view of a smaller-scale generator. If the smaller-scale generator passes TestU01's tests, we have much more reason to be confident about its larger sibling.

TestU01 is capable of far more than just running its crush batteries. There are more detailed interfaces to individual tests. For example, as I mentioned in the PCG paper, the linear complexity test (which a number of generators fail, including some of L'Ecuyer's generators) was inexplicably omitted from Small Crush, so my testing program has an option to directly call the linear complexity test, running it at four different sizes, instead of running one of the crush benchmarks. Here's are the results of the first of those calls:

***********************************************************
HOST = satsuki.local, Darwin

32-bit WELLRNG512a


scomp_LinearComp test:
-----------------------------------------------
   N =  1,  n = 5000,  r =  0,    s = 1



-----------------------------------------------
Number of degrees of freedom          :    5
Chi2 statistic for size of jumps      :    3.76
p-value of test                       :    0.58


-----------------------------------------------
Normal statistic for number of jumps  :  -40.14
p-value of test                       : 1 - eps1    *****



-----------------------------------------------
CPU time used                    :  00:00:00.00

Generator state:

As you can see, individual test results are more detailed than the Crush test battery results. The failure of the test is marked by the asterisks. The other three tests fail similarly. (Running all four tests took less than 0.1 of a second to run, so it would not have significantly slowed to Small Crush to have included some linear-complexity testing.)

In summary, TestU01 is well documented, powerful, and fairly easy to use, but it is also a little tricky to use well and a little sketchy when it comes to testing 64-bit generators.

PractRand

In contrast, PractRand is more sparsely documented and connecting a preexisting random number generator into its C++ code is more complex, but doing so is not actually necessary because it will happily read a binary stream of random numbers from standard input.

Thus, it is even easier to run tests than it is with TestU01. In fact, we can test macOS's /dev/random (a special “device file” that outputs nondeterministic cryptographically secure randomness using the Yarrow PRNG) by just running a simple command:

macOS$ ./RNG_test stdin < /dev/random
RNG_test using PractRand version 0.93
RNG = RNG_stdin, seed = 0x906c8753
test set = normal, folding = standard(unknown format)

rng=RNG_stdin, seed=0x906c8753
length= 32 megabytes (2^25 bytes), time= 3.9 seconds
  no anomalies in 130 test result(s)

rng=RNG_stdin, seed=0x906c8753
length= 64 megabytes (2^26 bytes), time= 8.3 seconds
  no anomalies in 139 test result(s)

rng=RNG_stdin, seed=0x906c8753
length= 128 megabytes (2^27 bytes), time= 16.4 seconds
  Test Name                         Raw       Processed     Evaluation
  Gap-16:A                          R=  +4.9  p =  1.1e-3   unusual          
  ...and 150 test result(s) without anomalies

rng=RNG_stdin, seed=0x906c8753
length= 256 megabytes (2^28 bytes), time= 32.5 seconds
  no anomalies in 162 test result(s)

rng=RNG_stdin, seed=0x906c8753
length= 512 megabytes (2^29 bytes), time= 63.0 seconds
  no anomalies in 171 test result(s)

rng=RNG_stdin, seed=0x906c8753
length= 1 gigabyte (2^30 bytes), time= 122 seconds
  no anomalies in 183 test result(s)

^C

Compared to TestU01, there are a few things to notice. TestU01's various Crush test batteries run silently for a period of time (a few hours for Big Crush) and then output their results. In contrast, PractRand starts telling you what it's seeing almost immediately and issues a report every time it has seen twice as much data as the last time it made a report.

Also, PractRand is oriented towards receiving a bytestream. It doesn't mind if the generators are 32-bit, 64-bit, 128-bit, or something more unusual like 72-bit or 96-bit. (Although it does benefit from knowing that the generator is producing 32- or 64-bit chunks.)

Finally, we can see that one test reported a result that was slightly unlikely from a statistical perspective but the blip vanished as testing continued.

I stopped the test interactively after 1 GB by pressing Control-C. By default PractRand will keep going until it has tested 32 TB of output. If 1 GB takes 122 seconds, then it'll take 46 days to finish the test! Might as well stop it now.

For fun let's try another Unix special file, /dev/zero, which outputs zeros forever:

macOS$ $ ./RNG_test stdin < /dev/zero
RNG_test using PractRand version 0.93
RNG = RNG_stdin, seed = 0xbbfee72
test set = normal, folding = standard(unknown format)

rng=RNG_stdin, seed=0xbbfee72
length= 128 megabytes (2^27 bytes), time= 2.4 seconds
  Test Name                         Raw       Processed     Evaluation
  BCFN(2+0,13-3,T)                  R=+17706724 p = 0           FAIL !!!!!!!!  
  BCFN(2+1,13-3,T)                  R=+8369402 p = 0           FAIL !!!!!!!!  
  BCFN(2+2,13-3,T)                  R=+4018669 p = 0           FAIL !!!!!!!!  
  BCFN(2+3,13-3,T)                  R=+1951923 p = 0           FAIL !!!!!!!!  
  BCFN(2+4,13-4,T)                  R=+1217378 p = 0           FAIL !!!!!!!!  
  BCFN(2+5,13-5,T)                  R=+754738 p = 0           FAIL !!!!!!!!  
  BCFN(2+6,13-5,T)                  R=+373439 p = 0           FAIL !!!!!!!!  
  BCFN(2+7,13-6,T)                  R=+229795 p = 0           FAIL !!!!!!!!  
  BCFN(2+8,13-6,T)                  R=+114286 p = 0           FAIL !!!!!!!!  
  BCFN(2+9,13-7,T)                  R=+69277  p = 0           FAIL !!!!!!!!  
  BCFN(2+10,13-8,T)                 R=+41033  p = 0           FAIL !!!!!!!!  
  BCFN(2+11,13-8,T)                 R=+20468  p =  1e-5195    FAIL !!!!!!!!  
  BCFN(2+12,13-9,T)                 R=+11745  p =  3e-2640    FAIL !!!!!!!!  
  BCFN(2+13,13-9,T)                 R= +5857  p =  4e-1317    FAIL !!!!!!!!  
  DC6-9x1Bytes-1                    R=+10806727 p = 0           FAIL !!!!!!!!  
  Gap-16:A                          R=+3583824 p = 0           FAIL !!!!!!!!  
  Gap-16:B                          R=+16048500 p = 0           FAIL !!!!!!!!  
  FPF-14+6/16:cross                 R=+294195018 p = 0           FAIL !!!!!!!!  
  BRank(12):128(4)                  R= +5300  p~=  1e-2819    FAIL !!!!!!!!  
  BRank(12):256(4)                  R=+10811  p~=  1e-5750    FAIL !!!!!!!!  
  BRank(12):384(1)                  R= +8161  p~=  1e-2457    FAIL !!!!!!!!  
  BRank(12):512(2)                  R=+15438  p~=  2e-4648    FAIL !!!!!!!!  
  BRank(12):768(1)                  R=+16427  p~=  3e-4946    FAIL !!!!!!!!  
  BRank(12):1K(2)                   R=+31025  p~=  1e-9340    FAIL !!!!!!!!  
  BRank(12):1536(1)                 R=+32960  p~=  4e-9923    FAIL !!!!!!!!  
  [Low1/8]BCFN(2+0,13-5,T)          R=+3546530 p = 0           FAIL !!!!!!!!  
  [Low1/8]BCFN(2+1,13-5,T)          R=+1676322 p = 0           FAIL !!!!!!!!  
  [Low1/8]BCFN(2+2,13-5,T)          R=+804897 p = 0           FAIL !!!!!!!!  
  [Low1/8]BCFN(2+3,13-5,T)          R=+390941 p = 0           FAIL !!!!!!!!  
  [Low1/8]BCFN(2+4,13-6,T)          R=+237394 p = 0           FAIL !!!!!!!!  
  [Low1/8]BCFN(2+5,13-6,T)          R=+116954 p = 0           FAIL !!!!!!!!  
  [Low1/8]BCFN(2+6,13-7,T)          R=+70419  p = 0           FAIL !!!!!!!!  
  [Low1/8]BCFN(2+7,13-8,T)          R=+41511  p = 0           FAIL !!!!!!!!  
  [Low1/8]BCFN(2+8,13-8,T)          R=+20637  p =  2e-5238    FAIL !!!!!!!!  
  [Low1/8]BCFN(2+9,13-9,T)          R=+11814  p =  1e-2655    FAIL !!!!!!!!  
  [Low1/8]BCFN(2+10,13-9,T)         R= +5881  p =  1e-1322    FAIL !!!!!!!!  
  [Low1/8]DC6-9x1Bytes-1            R=+1591281 p = 0           FAIL !!!!!!!!  
  [Low1/8]Gap-16:A                  R=+604230 p = 0           FAIL !!!!!!!!  
  [Low1/8]Gap-16:B                  R=+2770854 p = 0           FAIL !!!!!!!!  
  [Low1/8]FPF-14+6/16:cross         R=+33904284 p = 0           FAIL !!!!!!!!  
  [Low1/8]BRank(12):128(4)          R= +5300  p~=  1e-2819    FAIL !!!!!!!!  
  [Low1/8]BRank(12):256(2)          R= +7644  p~=  3e-2302    FAIL !!!!!!!!  
  [Low1/8]BRank(12):384(1)          R= +8161  p~=  1e-2457    FAIL !!!!!!!!  
  [Low1/8]BRank(12):512(2)          R=+15438  p~=  2e-4648    FAIL !!!!!!!!  
  [Low1/8]BRank(12):768(1)          R=+16427  p~=  3e-4946    FAIL !!!!!!!!  
  [Low4/32]BCFN(2+0,13-5,T)         R=+3546530 p = 0           FAIL !!!!!!!!  
  [Low4/32]BCFN(2+1,13-5,T)         R=+1676322 p = 0           FAIL !!!!!!!!  
  [Low4/32]BCFN(2+2,13-5,T)         R=+804897 p = 0           FAIL !!!!!!!!  
  [Low4/32]BCFN(2+3,13-5,T)         R=+390941 p = 0           FAIL !!!!!!!!  
  [Low4/32]BCFN(2+4,13-6,T)         R=+237394 p = 0           FAIL !!!!!!!!  
  [Low4/32]BCFN(2+5,13-6,T)         R=+116954 p = 0           FAIL !!!!!!!!  
  [Low4/32]BCFN(2+6,13-7,T)         R=+70419  p = 0           FAIL !!!!!!!!  
  [Low4/32]BCFN(2+7,13-8,T)         R=+41511  p = 0           FAIL !!!!!!!!  
  [Low4/32]BCFN(2+8,13-8,T)         R=+20637  p =  2e-5238    FAIL !!!!!!!!  
  [Low4/32]BCFN(2+9,13-9,T)         R=+11814  p =  1e-2655    FAIL !!!!!!!!  
  [Low4/32]BCFN(2+10,13-9,T)        R= +5881  p =  1e-1322    FAIL !!!!!!!!  
  [Low4/32]DC6-9x1Bytes-1           R=+1591281 p = 0           FAIL !!!!!!!!  
  [Low4/32]Gap-16:A                 R=+604230 p = 0           FAIL !!!!!!!!  
  [Low4/32]Gap-16:B                 R=+2770854 p = 0           FAIL !!!!!!!!  
  [Low4/32]FPF-14+6/16:cross        R=+33904284 p = 0           FAIL !!!!!!!!  
  [Low4/32]BRank(12):128(4)         R= +5300  p~=  1e-2819    FAIL !!!!!!!!  
  [Low4/32]BRank(12):256(2)         R= +7644  p~=  3e-2302    FAIL !!!!!!!!  
  [Low4/32]BRank(12):384(1)         R= +8161  p~=  1e-2457    FAIL !!!!!!!!  
  [Low4/32]BRank(12):512(2)         R=+15438  p~=  2e-4648    FAIL !!!!!!!!  
  [Low4/32]BRank(12):768(1)         R=+16427  p~=  3e-4946    FAIL !!!!!!!!  
  [Low1/32]BCFN(2+0,13-6,T)         R=+1099294 p = 0           FAIL !!!!!!!!  
  [Low1/32]BCFN(2+1,13-6,T)         R=+519590 p = 0           FAIL !!!!!!!!  
  [Low1/32]BCFN(2+2,13-6,T)         R=+249477 p = 0           FAIL !!!!!!!!  
  [Low1/32]BCFN(2+3,13-6,T)         R=+121164 p = 0           FAIL !!!!!!!!  
  [Low1/32]BCFN(2+4,13-7,T)         R=+72212  p = 0           FAIL !!!!!!!!  
  [Low1/32]BCFN(2+5,13-8,T)         R=+42258  p = 0           FAIL !!!!!!!!  
  [Low1/32]BCFN(2+6,13-8,T)         R=+20899  p =  5e-5305    FAIL !!!!!!!!  
  [Low1/32]BCFN(2+7,13-9,T)         R=+11920  p =  1e-2679    FAIL !!!!!!!!  
  [Low1/32]BCFN(2+8,13-9,T)         R= +5918  p =  6e-1331    FAIL !!!!!!!!  
  [Low1/32]DC6-9x1Bytes-1           R=+483643 p = 0           FAIL !!!!!!!!  
  [Low1/32]Gap-16:A                 R=+223041 p = 0           FAIL !!!!!!!!  
  [Low1/32]Gap-16:B                 R=+888483 p = 0           FAIL !!!!!!!!  
  [Low1/32]FPF-14+6/16:cross        R=+7961869 p = 0           FAIL !!!!!!!!  
  [Low1/32]BRank(12):128(2)         R= +3748  p~=  3e-1129    FAIL !!!!!!!!  
  [Low1/32]BRank(12):256(2)         R= +7644  p~=  3e-2302    FAIL !!!!!!!!  
  [Low1/32]BRank(12):384(1)         R= +8161  p~=  1e-2457    FAIL !!!!!!!!  
  [Low1/32]BRank(12):512(1)         R=+10916  p~=  3e-3287    FAIL !!!!!!!!  

We can see that PractRand is very unhappy with tons of failures. PractRand actually varies the number of exclamation marks based on the severity of the failure. Obviously all zeros all the time causes very severe failures. The names of the tests are a little cryptic, but they are described on the PractRand website.

That example was a pretty ridiculously extreme fail, so I'll also show you the results from a test of a couple of bad PRNGs to show what more typical fails might look like:

linux$ ./bad-64bit-rng1 | ./RNG_test stdin64
RNG_test using PractRand version 0.93
RNG = RNG_stdin64, seed = 0xfecb5683
test set = normal, folding = standard (64 bit)

rng=RNG_stdin64, seed=0xfecb5683
length= 128 megabytes (2^27 bytes), time= 2.1 seconds
  no anomalies in 148 test result(s)

rng=RNG_stdin64, seed=0xfecb5683
length= 256 megabytes (2^28 bytes), time= 4.8 seconds
  no anomalies in 159 test result(s)

rng=RNG_stdin64, seed=0xfecb5683
length= 512 megabytes (2^29 bytes), time= 10.0 seconds
  no anomalies in 169 test result(s)

rng=RNG_stdin64, seed=0xfecb5683
length= 1 gigabyte (2^30 bytes), time= 19.8 seconds
  no anomalies in 180 test result(s)

rng=RNG_stdin64, seed=0xfecb5683
length= 2 gigabytes (2^31 bytes), time= 36.5 seconds
  Test Name                         Raw       Processed     Evaluation
  [Low4/64]FPF-14+6/16:(0,14-0)     R=  +6.3  p =  1.9e-5   unusual          
  [Low4/64]FPF-14+6/16:(1,14-0)     R=  +6.2  p =  2.6e-5   unusual          
  [Low4/64]FPF-14+6/16:all          R=  +7.5  p =  1.5e-6   suspicious       
  [Low4/64]FPF-14+6/16:all2         R=  +9.6  p =  3.1e-5   mildly suspicious
  ...and 187 test result(s) without anomalies

rng=RNG_stdin64, seed=0xfecb5683
length= 4 gigabytes (2^32 bytes), time= 71.7 seconds
  Test Name                         Raw       Processed     Evaluation
  [Low16/64]FPF-14+6/16:all         R=  +7.8  p =  8.4e-7   very suspicious  
  [Low4/64]FPF-14+6/16:(0,14-0)     R= +10.4  p =  3.0e-9   very suspicious  
  [Low4/64]FPF-14+6/16:(1,14-0)     R=  +9.5  p =  2.0e-8   very suspicious  
  [Low4/64]FPF-14+6/16:all          R= +12.3  p =  4.9e-11   VERY SUSPICIOUS 
  [Low4/64]FPF-14+6/16:all2         R= +33.1  p =  1.7e-14    FAIL           
  [Low1/64]FPF-14+6/16:cross        R=  +9.3  p =  1.7e-8   very suspicious  
  ...and 195 test result(s) without anomalies
linux$ ./bad-64bit-rng2 | ./RNG_test stdin64
RNG_test using PractRand version 0.93
RNG = RNG_stdin64, seed = 0x72e95cf7
test set = normal, folding = standard (64 bit)

rng=RNG_stdin64, seed=0x72e95cf7
length= 128 megabytes (2^27 bytes), time= 2.1 seconds
  no anomalies in 148 test result(s)

rng=RNG_stdin64, seed=0x72e95cf7
length= 256 megabytes (2^28 bytes), time= 4.7 seconds
  no anomalies in 159 test result(s)

rng=RNG_stdin64, seed=0x72e95cf7
length= 512 megabytes (2^29 bytes), time= 9.3 seconds
  no anomalies in 169 test result(s)

rng=RNG_stdin64, seed=0x72e95cf7
length= 1 gigabyte (2^30 bytes), time= 18.3 seconds
  no anomalies in 180 test result(s)

rng=RNG_stdin64, seed=0x72e95cf7
length= 2 gigabytes (2^31 bytes), time= 36.3 seconds
  no anomalies in 191 test result(s)

rng=RNG_stdin64, seed=0x72e95cf7
length= 4 gigabytes (2^32 bytes), time= 71.4 seconds
  Test Name                         Raw       Processed     Evaluation
  [Low16/64]Gap-16:B                R=  +4.1  p =  2.0e-3   unusual          
  ...and 200 test result(s) without anomalies

rng=RNG_stdin64, seed=0x72e95cf7
length= 8 gigabytes (2^33 bytes), time= 141 seconds
  no anomalies in 212 test result(s)

rng=RNG_stdin64, seed=0x72e95cf7
length= 16 gigabytes (2^34 bytes), time= 292 seconds
  no anomalies in 223 test result(s)

rng=RNG_stdin64, seed=0x72e95cf7
length= 32 gigabytes (2^35 bytes), time= 578 seconds
  no anomalies in 233 test result(s)

rng=RNG_stdin64, seed=0x72e95cf7
length= 64 gigabytes (2^36 bytes), time= 1124 seconds
  no anomalies in 244 test result(s)

rng=RNG_stdin64, seed=0x72e95cf7
length= 128 gigabytes (2^37 bytes), time= 2193 seconds
  Test Name                         Raw       Processed     Evaluation
  Gap-16:B                          R=  +4.1  p =  1.9e-3   unusual          
  [Low16/64]FPF-14+6/16:all         R=  +5.3  p =  1.7e-4   unusual          
  ...and 253 test result(s) without anomalies

rng=RNG_stdin64, seed=0x72e95cf7
length= 256 gigabytes (2^38 bytes), time= 4239 seconds
  Test Name                         Raw       Processed     Evaluation
  [Low16/64]FPF-14+6/16:all         R=  +6.4  p =  1.6e-5   mildly suspicious
  [Low16/64]FPF-14+6/16:all2        R=  +8.9  p =  2.1e-5   mildly suspicious
  ...and 263 test result(s) without anomalies

rng=RNG_stdin64, seed=0x72e95cf7
length= 512 gigabytes (2^39 bytes), time= 8493 seconds
  Test Name                         Raw       Processed     Evaluation
  [Low16/64]FPF-14+6/16:(3,14-0)    R= +11.1  p =  7.2e-10  very suspicious  
  [Low16/64]FPF-14+6/16:(4,14-0)    R=  +9.0  p =  6.6e-8   suspicious       
  [Low16/64]FPF-14+6/16:(5,14-0)    R=  +6.9  p =  5.4e-6   unusual          
  [Low16/64]FPF-14+6/16:all         R= +10.8  p =  1.3e-9    VERY SUSPICIOUS 
  [Low16/64]FPF-14+6/16:all2        R= +38.1  p =  9.0e-19    FAIL !         
  ...and 271 test result(s) without anomalies

One thing you can see from these results is that the time it takes a generator to fail can vary significantly. The first failed under two minutes, whereas the second took 2 hours, 22 minutes to fail. You also get to see the way things fail, with problems building over time and becoming more obvious as we take more data. One nice thing about PractRand is that unless you tell it to keep going regardless, it stops the moment the generator shows serious failures. In contrast, TestU01 waits to complete the test battery.

One other nice thing about PractRand is that all the tests are being applied at once (sharing the same PRNG output). In contrast, TestU01 runs each test one after the other. In other words, PractRand makes better use of the (purportedly) random data it receives.

In summary, PractRand is a great complement to TestU01. It doesn't have as many statistical tests, but the ones it does have are good ones. It gives you results sooner, and it can test more of the output of a PRNG.

Testing PCG

Here are the results of testing pcg64.

linux$ ./pcg64 | ./RNG_test stdin64
RNG_test using PractRand version 0.93
RNG = RNG_stdin64, seed = 0x78041158
test set = normal, folding = standard (64 bit)

rng=RNG_stdin64, seed=0x78041158
length= 128 megabytes (2^27 bytes), time= 2.1 seconds
  no anomalies in 148 test result(s)

rng=RNG_stdin64, seed=0x78041158
length= 256 megabytes (2^28 bytes), time= 4.8 seconds
  no anomalies in 159 test result(s)

rng=RNG_stdin64, seed=0x78041158
length= 512 megabytes (2^29 bytes), time= 9.3 seconds
  no anomalies in 169 test result(s)

rng=RNG_stdin64, seed=0x78041158
length= 1 gigabyte (2^30 bytes), time= 17.8 seconds
  no anomalies in 180 test result(s)

rng=RNG_stdin64, seed=0x78041158
length= 2 gigabytes (2^31 bytes), time= 33.9 seconds
  no anomalies in 191 test result(s)

rng=RNG_stdin64, seed=0x78041158
length= 4 gigabytes (2^32 bytes), time= 64.5 seconds
  no anomalies in 201 test result(s)

rng=RNG_stdin64, seed=0x78041158
length= 8 gigabytes (2^33 bytes), time= 128 seconds
  Test Name                         Raw       Processed     Evaluation
  [Low4/64]BCFN(2+1,13-1,T)         R=  -7.6  p =1-5.7e-4   unusual
  ...and 211 test result(s) without anomalies

rng=RNG_stdin64, seed=0x78041158
length= 16 gigabytes (2^34 bytes), time= 254 seconds
  no anomalies in 223 test result(s)

rng=RNG_stdin64, seed=0x78041158
length= 32 gigabytes (2^35 bytes), time= 504 seconds
  no anomalies in 233 test result(s)

rng=RNG_stdin64, seed=0x78041158
length= 64 gigabytes (2^36 bytes), time= 998 seconds
  no anomalies in 244 test result(s)

rng=RNG_stdin64, seed=0x78041158
length= 128 gigabytes (2^37 bytes), time= 1989 seconds
  no anomalies in 255 test result(s)

rng=RNG_stdin64, seed=0x78041158
length= 256 gigabytes (2^38 bytes), time= 3946 seconds
  no anomalies in 265 test result(s)

rng=RNG_stdin64, seed=0x78041158
length= 512 gigabytes (2^39 bytes), time= 7932 seconds
  no anomalies in 276 test result(s)

rng=RNG_stdin64, seed=0x78041158
length= 1 terabyte (2^40 bytes), time= 16853 seconds
  no anomalies in 287 test result(s)

rng=RNG_stdin64, seed=0x78041158
length= 2 terabytes (2^41 bytes), time= 37027 seconds
  no anomalies in 297 test result(s)

rng=RNG_stdin64, seed=0x78041158
length= 4 terabytes (2^42 bytes), time= 75378 seconds
  no anomalies in 308 test result(s)

rng=RNG_stdin64, seed=0x78041158
length= 8 terabytes (2^43 bytes), time= 147024 seconds
  no anomalies in 319 test result(s)

rng=RNG_stdin64, seed=0x78041158
length= 16 terabytes (2^44 bytes), time= 292970 seconds
  no anomalies in 329 test result(s)

rng=RNG_stdin64, seed=0x78041158
length= 32 terabytes (2^45 bytes), time= 563979 seconds
  no anomalies in 339 test result(s)

This is exactly the kind of result we might hope for. It takes about a week to run, but in the end completes with no problems. As we would expect, during the test there was a brief moment one of the test results looked unusual but as testing continued it proved to be nothing.

It's a similar story for pcg64_fast. In the interest of not filling your screen with junk, I'll switch to abbreviating the rest of the results to show just the beginnings and the ends of the tests.

linux$ ./pcg64_fast | ./RNG_test stdin64
RNG_test using PractRand version 0.93
RNG = RNG_stdin64, seed = 0xf24baca7
test set = normal, folding = standard (64 bit)

rng=RNG_stdin64, seed=0xf24baca7
length= 128 megabytes (2^27 bytes), time= 2.2 seconds
  no anomalies in 148 test result(s)

[...]

rng=RNG_stdin64, seed=0xf24baca7
length= 8 terabytes (2^43 bytes), time= 132005 seconds
  no anomalies in 319 test result(s)

rng=RNG_stdin64, seed=0xf24baca7
length= 16 terabytes (2^44 bytes), time= 263825 seconds
  no anomalies in 329 test result(s)

rng=RNG_stdin64, seed=0xf24baca7
length= 32 terabytes (2^45 bytes), time= 526487 seconds
  no anomalies in 339 test result(s)

Of course, pcg64 and pcg64_fast have large periods (2128 and 2126, respectively) and large state spaces (2255 and 2127, respectively). 32 TB of output seems like a lot, but these are 64-bit generators outputting 8 bytes at a time, so we've only used 242 numbers from each generator. We haven't even reached the square root of the period; we're closer to the cube root. In my opinion, any generator with 128 bits of state (or more) that fails statistical tests must have some serious flaws. More on that another time.

A greater challenge is to test PCG's smaller generators that produce 32-bit output. Because they only produce four bytes of output, 32 TB represents 243 outputs. pcg32 and pcg32_fast have periods of 264 and 262 and state spaces of 2127 and 263, respectively. Some older papers on random number generation suggest that you should never use more numbers from a generator than the square root of its period, otherwise statistical problems may occur. Those comments date back to an earlier era of generation when generators were lower quality, but I will admit that I had a moment of doubt. Was I wrong to have waited so long to test with PractRand? Would it push these generators past their limits? In theory, based on my prior headroom-based pronouncements with TestU01, no, but what about in practice? (TheorRand vs PractRand!) Let's see...

linux$ ./pcg32 | ./RNG_test stdin32
RNG_test using PractRand version 0.93
RNG = RNG_stdin32, seed = 0x16692278
test set = normal, folding = standard (32 bit)

rng=RNG_stdin32, seed=0x16692278
length= 128 megabytes (2^27 bytes), time= 2.3 seconds
  no anomalies in 117 test result(s)

[...]

rng=RNG_stdin32, seed=0x16692278
length= 8 terabytes (2^43 bytes), time= 130996 seconds
  no anomalies in 244 test result(s)

rng=RNG_stdin32, seed=0x16692278
length= 16 terabytes (2^44 bytes), time= 260545 seconds
  no anomalies in 252 test result(s)

rng=RNG_stdin32, seed=0x16692278
length= 32 terabytes (2^45 bytes), time= 519966 seconds
  no anomalies in 260 test result(s)
linux$ ./pcg32_fast | ./RNG_test stdin32
RNG_test using PractRand version 0.93
RNG = RNG_stdin32, seed = 0x3d5f2c7e
test set = normal, folding = standard (32 bit)

rng=RNG_stdin32, seed=0x3d5f2c7e
length= 128 megabytes (2^27 bytes), time= 2.3 seconds
  no anomalies in 117 test result(s)

[...]

rng=RNG_stdin32, seed=0x3d5f2c7e
length= 8 terabytes (2^43 bytes), time= 131096 seconds
  no anomalies in 244 test result(s)

rng=RNG_stdin32, seed=0x3d5f2c7e
length= 16 terabytes (2^44 bytes), time= 261300 seconds
  no anomalies in 252 test result(s)

rng=RNG_stdin32, seed=0x3d5f2c7e
length= 32 terabytes (2^45 bytes), time= 518888 seconds
  no anomalies in 260 test result(s)

One interesting thing to note here is that pcg32_fast doesn't seem to be significantly faster than pcg32. Most of the time is due to PractRand itself, and the small inefficiency of sending numbers over a Unix pipe, not the random number generation process itself. In other contexts, pcg32_fast can be faster.

I could have stopped there, but PractRand provides the option to apply more variations of its tests and try much harder to find flaws. These extended tests make PractRand run more slowly and aren't enabled by default, but it seemed worthwhile to really hammer on pcg32_fast as the most vulnerable generator of the ones I recommend for general-purpose use.

linux$ ./pcg32_fast | time ./RNG_test stdin32 -te 1 -tf 2
RNG_test using PractRand version 0.93
RNG = RNG_stdin32, seed = 0xb5f0bbc4
test set = expanded, folding = extra

rng=RNG_stdin32, seed=0xb5f0bbc4
length= 32 megabytes (2^25 bytes), time= 2.3 seconds
  no anomalies in 796 test result(s)

[...]

rng=RNG_stdin32, seed=0xb5f0bbc4
length= 2 terabytes (2^41 bytes), time= 153323 seconds
  no anomalies in 1723 test result(s)

rng=RNG_stdin32, seed=0xb5f0bbc4
length= 4 terabytes (2^42 bytes), time= 282780 seconds
  no anomalies in 1770 test result(s)

rng=RNG_stdin32, seed=0xb5f0bbc4
length= 8 terabytes (2^43 bytes), time= 539058 seconds
  no anomalies in 1816 test result(s)

rng=RNG_stdin32, seed=0xb5f0bbc4
length= 16 terabytes (2^44 bytes), time= 1064442 seconds
  no anomalies in 1859 test result(s)

rng=RNG_stdin32, seed=0xb5f0bbc4
length= 32 terabytes (2^45 bytes), time= 2178649 seconds
  no anomalies in 1901 test result(s)

./RNG_test stdin32 -te 1 -tf 2  2170392.00s user 9792.10s system 100% cpu 605:11:07.57 total
linux$ 

These extended-test results, which represent more than twenty-five days of continuous testing, reveal no problems up to 32 TB (the standard stopping point for PractRand).

Making PCG Fail

One of my key arguments about testing is that it isn't enough to proudly say “my generator passes statistical tests” (especially if it has a large period or state space). Instead we should squeeze it down smaller and smaller until it fails.

So, this time we'll test pcg_engines::setseq_xsh_rr_32_16 and pcg_engines::mcg_xsh_rs_32_16, which are half-size versions of pcg32 and pcg32_fast. Their periods are 232 and 230, respectively, and they only produce 16-bit outputs (2 bytes), so it would be utterly unrealistic to expect them to generate 32 TB of data. They must fail. The question is, when? Let's find out!

We'll run PractRand so that it starts telling is about what it is doing a little sooner so we can enjoy some successes before we fail.

linux$ ./pcg_setseq_xsh_rr_32_16  | ./RNG_test stdin16 -tlmin 16
RNG_test using PractRand version 0.93
RNG = RNG_stdin16, seed = 0xe7dc4836
test set = normal, folding = standard (16 bit)

rng=RNG_stdin16, seed=0xe7dc4836
length= 64 kilobytes (2^16 bytes), time= 0.1 seconds
  no anomalies in 39 test result(s)

rng=RNG_stdin16, seed=0xe7dc4836
length= 128 kilobytes (2^17 bytes), time= 0.7 seconds
  no anomalies in 43 test result(s)

rng=RNG_stdin16, seed=0xe7dc4836
length= 256 kilobytes (2^18 bytes), time= 1.2 seconds
  no anomalies in 50 test result(s)

rng=RNG_stdin16, seed=0xe7dc4836
length= 512 kilobytes (2^19 bytes), time= 1.7 seconds
  no anomalies in 56 test result(s)

rng=RNG_stdin16, seed=0xe7dc4836
length= 1 megabyte (2^20 bytes), time= 2.3 seconds
  no anomalies in 64 test result(s)

rng=RNG_stdin16, seed=0xe7dc4836
length= 2 megabytes (2^21 bytes), time= 2.8 seconds
  Test Name                         Raw       Processed     Evaluation
  [Low4/16]DC6-9x1Bytes-1           R=  +5.7  p =  5.3e-3   unusual          
  ...and 71 test result(s) without anomalies

rng=RNG_stdin16, seed=0xe7dc4836
length= 4 megabytes (2^22 bytes), time= 3.4 seconds
  no anomalies in 79 test result(s)

rng=RNG_stdin16, seed=0xe7dc4836
length= 8 megabytes (2^23 bytes), time= 4.0 seconds
  no anomalies in 87 test result(s)

rng=RNG_stdin16, seed=0xe7dc4836
length= 16 megabytes (2^24 bytes), time= 4.6 seconds
  no anomalies in 95 test result(s)

rng=RNG_stdin16, seed=0xe7dc4836
length= 32 megabytes (2^25 bytes), time= 5.4 seconds
  no anomalies in 103 test result(s)

rng=RNG_stdin16, seed=0xe7dc4836
length= 64 megabytes (2^26 bytes), time= 6.6 seconds
  no anomalies in 111 test result(s)

rng=RNG_stdin16, seed=0xe7dc4836
length= 128 megabytes (2^27 bytes), time= 8.7 seconds
  Test Name                         Raw       Processed     Evaluation
  FPF-14+6/16:all                   R=  -5.3  p =1-1.0e-4   mildly suspicious
  ...and 118 test result(s) without anomalies

rng=RNG_stdin16, seed=0xe7dc4836
length= 256 megabytes (2^28 bytes), time= 12.4 seconds
  Test Name                         Raw       Processed     Evaluation
  FPF-14+6/16:all                   R=  -6.4  p =1-7.5e-6   suspicious       
  ...and 126 test result(s) without anomalies

rng=RNG_stdin16, seed=0xe7dc4836
length= 512 megabytes (2^29 bytes), time= 19.0 seconds
  Test Name                         Raw       Processed     Evaluation
  FPF-14+6/16:(0,14-0)              R=  -9.3  p =1-2.7e-8   very suspicious  
  FPF-14+6/16:(1,14-0)              R=  -8.5  p =1-1.6e-7   suspicious       
  FPF-14+6/16:all                   R= -11.5  p =1-7.5e-11   VERY SUSPICIOUS 
  FPF-14+6/16:all2                  R= +31.0  p =  7.9e-14    FAIL           
  ...and 131 test result(s) without anomalies

So, for pcg_engines::pcg_setseq_xsh_rr_32_16, we begin to see intimations of problems at 227 bytes (i.e., 226 16-bit outputs) and a clear fail at 229 bytes (i.e., 228 16-bit outputs). If we extrapolate from that result, pcg32 might start to show issues sometime around 252 outputs (i.e., 254 bytes; 4 petabytes), and would probably have failed by the time we reach 256 outputs (258 bytes; 256 petabytes). In other words, we can estimate that pcg32 will fail PractRand if increase the test that took a week to complete by a factor somewhere between 512 and 8192. In other words, we might see issues with pcg32 after 10 years of testing, and we can be pretty sure that if we test for 150 years, it'll have failed by then.

I should also point out that any extrapolation from behavior at small sizes is going to be imperfect. The only sure way to find out whether the full size pcg32 generator takes a thousand years to fail, or somehow fails unexpectedly after just a few of months, would be actually run tests that might take years to complete. What these small-scale tests provide is a back-of-the-envelope way to make an informed guess about whether testing beyond 32TB is worthwhile. (The headroom analysis in the PCG paper is a little more rigorous than the quite simple test we've done here, but that too is necessarily extrapolatory.)

Now let's move on to the tiny sibling of pcg32_fast. We should expect it to do worse, and it does.

linux$ ./pcg_mcg_xsh_rs_32_16 | ./RNG_test stdin16 -tlmin 16
seeding with 0xdd44ecf0
RNG_test using PractRand version 0.93
RNG = RNG_stdin16, seed = 0x1aac2a7b
test set = normal, folding = standard (16 bit)

rng=RNG_stdin16, seed=0x1aac2a7b
length= 64 kilobytes (2^16 bytes), time= 0.1 seconds
  no anomalies in 39 test result(s)

rng=RNG_stdin16, seed=0x1aac2a7b
length= 128 kilobytes (2^17 bytes), time= 0.7 seconds
  no anomalies in 43 test result(s)

rng=RNG_stdin16, seed=0x1aac2a7b
length= 256 kilobytes (2^18 bytes), time= 1.2 seconds
  no anomalies in 50 test result(s)

rng=RNG_stdin16, seed=0x1aac2a7b
length= 512 kilobytes (2^19 bytes), time= 1.8 seconds
  no anomalies in 56 test result(s)

rng=RNG_stdin16, seed=0x1aac2a7b
length= 1 megabyte (2^20 bytes), time= 2.3 seconds
  no anomalies in 64 test result(s)

rng=RNG_stdin16, seed=0x1aac2a7b
length= 2 megabytes (2^21 bytes), time= 2.8 seconds
  no anomalies in 72 test result(s)

rng=RNG_stdin16, seed=0x1aac2a7b
length= 4 megabytes (2^22 bytes), time= 3.4 seconds
  no anomalies in 79 test result(s)

rng=RNG_stdin16, seed=0x1aac2a7b
length= 8 megabytes (2^23 bytes), time= 4.0 seconds
  no anomalies in 87 test result(s)

rng=RNG_stdin16, seed=0x1aac2a7b
length= 16 megabytes (2^24 bytes), time= 4.6 seconds
  Test Name                         Raw       Processed     Evaluation
  Gap-16:A                          R=  +7.3  p =  2.5e-5   suspicious       
  ...and 94 test result(s) without anomalies

rng=RNG_stdin16, seed=0x1aac2a7b
length= 32 megabytes (2^25 bytes), time= 5.4 seconds
  Test Name                         Raw       Processed     Evaluation
  Gap-16:A                          R=  +9.1  p =  5.9e-7   very suspicious  
  ...and 102 test result(s) without anomalies

rng=RNG_stdin16, seed=0x1aac2a7b
length= 64 megabytes (2^26 bytes), time= 6.5 seconds
  Test Name                         Raw       Processed     Evaluation
  Gap-16:A                          R= +17.4  p =  1.3e-14    FAIL !         
  Gap-16:B                          R=  +9.5  p =  3.4e-8   very suspicious  
  FPF-14+6/16:all                   R=  -6.3  p =1-9.0e-6   suspicious       
  ...and 108 test result(s) without anomalies

Here we see PractRand becoming suspicious at 224 bytes (i.e., 223 outputs) and failing at 226 (i.e., 225 outputs). This generator has a period of 230, whereas pcg32_fast has a period of 262. Thus the equivalent points for pcg32_fast would be 2(23/30)×62 and 2(25/30)×62, or somewhere between 0.72 petabytes of output and 12.7 petabytes, or somewhere between 23 weeks and 7.8 years of testing.

Comparison with SplitMix

To set these results into context, it helps to compare them with another PRNG. Popular choices that Internet folks might want to see are comparisons with SplitMix and Xoroshiro128+. XoroShiro isn't a good choice for comparison because it is already known to fail some of PractRand's statistical tests, as acknowledged by its authors in its source code, although the extent of the issues are larger, so that comparison will have to wait for another day (coming soon!).

In some ways I quite like SplitMix because it has some similar ideas to PCG in that it avoids having a trivial output function, but it does have a trivial state transition function. If we test a full-sized version of SplitMix, we find it passes PractRand.

linux$ ./splitmix | ./RNG_test stdin64
RNG_test using PractRand version 0.93
RNG = RNG_stdin64, seed = 0xbb88093
test set = normal, folding = standard (64 bit)

rng=RNG_stdin64, seed=0xbb88093
length= 128 megabytes (2^27 bytes), time= 2.4 seconds
  no anomalies in 148 test result(s)

[...]

rng=RNG_stdin64, seed=0xbb88093
length= 8 terabytes (2^43 bytes), time= 132138 seconds
  no anomalies in 319 test result(s)

rng=RNG_stdin64, seed=0xbb88093
length= 16 terabytes (2^44 bytes), time= 265163 seconds
  Test Name                         Raw       Processed     Evaluation
  [Low1/64]FPF-14+6/16:all          R=  +5.4  p =  1.6e-4   unusual
  ...and 328 test result(s) without anomalies

rng=RNG_stdin64, seed=0xbb88093
length= 32 terabytes (2^45 bytes), time= 530227 seconds
  Test Name                         Raw       Processed     Evaluation
  [Low4/64]BCFN(2+2,13-0,T)         R=  +7.6  p =  1.4e-3   unusual
  ...and 338 test result(s) without anomalies

Notice that SplitMix is not (necessarily) starting to fail here. We already know that something statistically “unusual” must be expected to happen occasionally. The fact that different tests fail means that we're not (yet) seeing any systemic breakdown.

So we could say, SplitMix passes, PCG passes, all is good. But let's instead test a cut-down version of SplitMix equivalent to the cut-down versions of pcg32 we tested earlier. For a fair comparison, we'll test its 16-bit output function.

linux$ ./splitmix32 | ./RNG_test stdin16 -tlmin 16      
RNG_test using PractRand version 0.93
RNG = RNG_stdin16, seed = 0xdfafcc2c
test set = normal, folding = standard (16 bit)

rng=RNG_stdin16, seed=0xdfafcc2c
length= 64 kilobytes (2^16 bytes), time= 0.1 seconds
  no anomalies in 39 test result(s)

rng=RNG_stdin16, seed=0xdfafcc2c
length= 128 kilobytes (2^17 bytes), time= 0.7 seconds
  no anomalies in 43 test result(s)

rng=RNG_stdin16, seed=0xdfafcc2c
length= 256 kilobytes (2^18 bytes), time= 1.2 seconds
  no anomalies in 50 test result(s)

rng=RNG_stdin16, seed=0xdfafcc2c
length= 512 kilobytes (2^19 bytes), time= 1.8 seconds
  no anomalies in 56 test result(s)

rng=RNG_stdin16, seed=0xdfafcc2c
length= 1 megabyte (2^20 bytes), time= 2.3 seconds
  Test Name                         Raw       Processed     Evaluation
  [Low4/16]Gap-16:B                 R=  +4.1  p =  3.0e-3   unusual          
  ...and 63 test result(s) without anomalies

rng=RNG_stdin16, seed=0xdfafcc2c
length= 2 megabytes (2^21 bytes), time= 2.8 seconds
  Test Name                         Raw       Processed     Evaluation
  Gap-16:A                          R=  +5.4  p =  4.9e-4   unusual          
  ...and 71 test result(s) without anomalies

rng=RNG_stdin16, seed=0xdfafcc2c
length= 4 megabytes (2^22 bytes), time= 3.4 seconds
  Test Name                         Raw       Processed     Evaluation
  Gap-16:A                          R=  +6.6  p =  7.3e-5   mildly suspicious
  ...and 78 test result(s) without anomalies

rng=RNG_stdin16, seed=0xdfafcc2c
length= 8 megabytes (2^23 bytes), time= 4.0 seconds
  Test Name                         Raw       Processed     Evaluation
  Gap-16:A                          R=  +8.4  p =  1.6e-6   very suspicious  
  ...and 86 test result(s) without anomalies

rng=RNG_stdin16, seed=0xdfafcc2c
length= 16 megabytes (2^24 bytes), time= 4.6 seconds
  Test Name                         Raw       Processed     Evaluation
  Gap-16:A                          R= +16.8  p =  7.3e-13    FAIL           
  ...and 94 test result(s) without anomalies

Here, we can see SplitMix32 showing a clear tell by 222 bytes (221 outputs) and failing by 224 bytes (223 outputs). That result is clearly much worse than pcg32_fast, which is the weakest PCG generator I would recommend for general purpose use.

If performance at small sizes really is a proxy to performance at larger sizes, the full-size SplitMix is actually very close to failing a regular run of PractRand. I've set a longer test in motion and we'll see what happens. If it does fail, it validates the approach of testing at small sizes. On the other hand, if it doesn't fail, it shows that the small-size results may sometimes be too pessimistic.

Regardless, it seems that in a PractRand statistical quality contest, pcg32 and pcg32_fast beat SplitMix (and Xoroshiro128+, because it doesn't pass PractRand's tests to begin with).

Conclusion

Results from PractRand do indeed validate the statistical quality claims I've made about the PCG family. Yay!