Random Number Generation Basics

What is randomness? What does it mean for a generator to be pseudorandom? And what does it mean for a generator to have “multiple streams”? We'll answer all of those questions, at least a little.

“True” Randomness vs “Pseudo” Randomness

Almost all random-number generation on computers is done using algorithms to produce a stream of numbers that (hopefully) match the expectations statisticians would have about random numbers.

In contrast, we often consider acts like “tossing a coin” (a real physical coin) or “seeing where a roulette ball lands” as examples of “true” randomness. Some people consider this “natural” randomness to be superior to simulated randomness (sometimes quoting John von Neumann's statement, “Any one who considers arithmetical methods of producing random digits is, of course, in a state of sin,” out of context to bolster their arguments).

In fact, many kinds of “natural” real-world randomness aren't truly random either. For example, tools exist to predict the path of a roulette ball (using data gained after it has been released and before the croupier calls, “No more bets!”), at least according to the people who make money selling such gizmos. Likewise, for flips of a physical coin, we can suppose that with sufficient knowledge of the environment, we could accurately predict the outcome. Indeed, research has shown that people can inject bias into their own coin flips to skew the result toward a desired outcome.

Arguably then, from a truly omniscient perspective, nothing, not even the physical world, is truly random—it is in the nature of causality that everything that happens has a chain of prior events that caused it.[1] But in practice, for most real-world randomness, obtaining such an omniscient perspective is infeasible to the point of being impossible.

High-quality algorithmic randomness is similar. If we are denied an omniscient perspective—if we can't look “inside the box” to see what is going on—the random numbers produced will seem completely random. We won't be able to predict the output, not because it is theoretically impossible, but because it is infeasible.

Algorithmic randomness is most commonly used as a kind of randomness amplifier. The random number generator begins with seed data, an initial secret (sometimes chosen using “natural” randomness). From that small amount of initial data, a random number generator can generate vast amounts of random data (we'll come back to this idea in the codebook analogy in the next section). If we know the algorithm, and we know the secret (the seed), yes, the output is predictable. But if we don't know the seed, the output will be very hard to predict. Just how hard it is will depend on the details of the generator, but it can be really hard. Modern cryptography (the technology that secures online transactions) is founded on the intractability of making predictions about random number generator behavior.
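
As a minimal sketch of this amplifier idea (using the C++ standard library's Mersenne Twister purely for illustration; any seedable generator, including a PCG one, works the same way), note how a single 64-bit seed is the only secret, and anyone who knows it can replay the entire stream:

    #include <cstdint>
    #include <iostream>
    #include <random>

    int main() {
        std::uint64_t seed = 0x853c49e6748fea9bULL;  // the small initial secret

        std::mt19937_64 rng(seed);       // internal state expanded from the seed
        for (int i = 0; i < 4; ++i)      // vast amounts of output from tiny input
            std::cout << rng() << '\n';

        // Knowing the algorithm *and* the seed makes the stream predictable:
        std::mt19937_64 replay(seed);
        std::cout << "first output again: " << replay() << '\n';
    }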

Thus, a key difference between algorithmic randomness and randomness from natural phenomena is that for algorithmic randomness, we can choose who has the omniscient perspective and who doesn't. If you play a game of pure chance with a computer program that includes a random number generator, the program could be written to know with 100% certainty whether you will win or not (it could be written to have the omniscient perspective), whereas it should be infeasible for you to predict the outcome. In contrast, for many natural phenomena, it is infeasible for anyone to have an omniscient perspective.

Whether random numbers are produced algorithmically or from nature, we can be genuinely in the dark about what the next random number will be.

The Codebook Analogy

One way to think of a random number generator is as a codebook used by spies: a book of “random numbers” they use for secure communication. To send a message, they would turn to a particular page, and then a particular row, and find the numbers, which might look like this:

805 916 456 822 984 097 920 121 945 133 953 246 970 593 792 022
599 193 713 963 838 230 442 649 784 112 751 922 700 212 540 335
680 661 086 523 732 314 139 829 690 927 873 177 157 634 694 105
133 724 918 139 216 940 441 244 335 637 640 962 060 158 681 796
642 877 129 666 121 430 356 612 713 077 184 357 916 166 108 090
448 670 630 601 877 837 294 108 251 495 979 722 630 286 432 251
278 028 989 244 111 133 938 324 509 755 068 932 889 339 597 960
049 817 014 501 079 610 067 744 631 121 787 515 124 910 197 297
047 821 993 227 324 497 976 635 082 300 038 340 538 172 964 144
178 237 718 457 123 386 503 165 333 899 417 371 960 203 422 606
631 538 120 816 044 305 449 572 936 348 664 686 243 059 297 581
809 003 837 731 129 841 510 293 322 199 351 923 281 207 208 370
306 555 890 555 905 352 885 487 939 013 069 524 560 897 451 341
223 698 922 728 860 007 215 700 311 376 625 949 465 831 966 544
835 245 496 627 469 312 310 768 389 124 951 686 628 748 399 764
093 261 187 530 265 429 841 053 609 973 922 975 291 855 672 434

They would then use their spy-craft to combine their message with the random numbers (e.g., by numbering the letters of the alphabet and adding each successive letter of their message to each successive number from the table). They could then send that message knowing that if it was intercepted, it would be unreadable without a copy of their codebook, and very hard to decode even with the codebook without knowing which page was used to perform the encoding. (Do the numbers above already contain a message? You can't tell.)
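
To make the scheme concrete, here is a rough sketch in C++ (the encoding rule is invented for illustration, and the codebook “page” is just the first few numbers from the table above):

    #include <cstddef>
    #include <iostream>
    #include <string>
    #include <vector>

    // Shift each letter (A=0 ... Z=25) by the corresponding codebook number.
    std::string encode(const std::string& msg, const std::vector<int>& page) {
        std::string out;
        for (std::size_t i = 0; i < msg.size(); ++i)
            out += char('A' + (msg[i] - 'A' + page[i]) % 26);
        return out;
    }

    // Undo the shift; without the page numbers this is infeasible to reverse.
    std::string decode(const std::string& msg, const std::vector<int>& page) {
        std::string out;
        for (std::size_t i = 0; i < msg.size(); ++i)
            out += char('A' + (msg[i] - 'A' - page[i] % 26 + 26) % 26);
        return out;
    }

    int main() {
        std::vector<int> page = {805, 916, 456, 822, 984, 97, 920};
        std::string secret = encode("MEETNOW", page);
        std::cout << secret << " -> " << decode(secret, page) << '\n';
    }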

One question is how the spy would decide which page of their codebook to use. In random number generation, this concept is known as the seed (in cryptography, it is also known as the encryption key). If the codebook were truly secret, the page to use could even be disclosed publicly.

A pseudo-random number generator is like a big codebook. In fact, a 32-bit generator with a period of 2³² is like a codebook that is 16 GB in size, which is larger than the compressed download for the whole of Wikipedia. But a 32-bit generator with a period of 2⁶⁴ is like 64 EiB (exbibytes), or about six times a mid-1999 estimate of the information content of all human knowledge. How big would an actual book of 2⁶⁴ random numbers be? It would be so large, it would require the entire biomass of planet earth to make!
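
The arithmetic behind those figures is simple exponent-counting: each 32-bit output occupies 4 bytes, so a period of 2ᵖ corresponds to a book of 2ᵖ⁺² bytes. A tiny sketch (with the unit conversions hard-coded just for these two cases):

    #include <iostream>

    int main() {
        for (int p : {32, 64}) {
            int bytes_log2 = p + 2;          // 4 bytes per 32-bit output
            if (bytes_log2 >= 60)            // report in EiB (2^60 bytes)
                std::cout << "period 2^" << p << " -> "
                          << (1u << (bytes_log2 - 60)) << " EiB\n";
            else                             // report in GiB (2^30 bytes)
                std::cout << "period 2^" << p << " -> "
                          << (1u << (bytes_log2 - 30)) << " GiB\n";
        }
    }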

One difference from a spy's codebook is that with a typical random number generator, everyone has the book. Thus, whereas spies would keep their codebook secret and could allow anyone to know the page number, random number generators do the opposite—the sequence of the generator (the codebook in our analogy) is known, but if we want to avoid prediction, no one must know where in that sequence we are looking. If the sequence is large enough (and there is no discernible pattern to the numbers), it will be infeasible to figure out what numbers are coming next.

(Some purportedly random number generators do have obvious patterns in the numbers they produce, which means that we can tell, just by looking at the numbers, which page of their codebook they came from. But good random number generators don't have any clear pattern to their output, making it very difficult to work out which page of their codebook they correspond to.)

There is no limit to the size of the codebook that algorithmic random number generation can support. For example, with a bit more computing effort, we can have a 64-bit random number generator that has a period of 2¹²⁸, but the sky is the limit. The PCG generation scheme provides a generator with a period of 2⁵²⁴³⁵², but it could just as easily provide an even larger codebook; it would just use more memory.

There is a complementary alternative to bigger codebooks: more codebooks.

Multiple Streams (More Codebooks)

Returning to our spy analogy, there was a strategy eavesdroppers could use to defeat spies using a secret codebook—the known-plaintext attack. If all spies used the same codebook, and on a given day everyone used the same page of that codebook, you could use the following trick: get someone to send a message whose text you knew. From that message and its encoding, you could reverse engineer the codes in the codebook. Because everyone uses the same codebook, you could then decode everyone's messages.

One workaround would be to tell every spy to start at a different place in the book, but you'd have to make very sure that everyone really did use a distinct page. With 100 spies, a 100-page book would be worrisome—you'd probably want a book 100 times bigger.

Another alternative would be to give every spy their own unique codebook (paired with an identical one back at headquarters). This offers the same benefits as having a huge book and making every spy use a different page, but it's less cumbersome and doesn't cause problems if everyone uses the same page of their book.

If a random number generator is like a codebook, a generator with multiple streams is like multiple books. We could also see all those books as volumes in one single masterwork. For example, if we have a generator with a period of 2⁶⁴ that supports 2⁶³ streams, it's a lot like having a generator with a period of 2¹²⁷, but it may be faster and easier to use.
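
One cheap way to get multiple streams is to parameterize the generator so that each stream uses a different additive constant in an underlying linear congruential step; this is, roughly speaking, how the PCG family selects its streams. The toy sketch below illustrates the idea, deliberately omitting the output permutation a real PCG generator would apply:

    #include <cstdint>
    #include <iostream>

    struct toy_stream_rng {
        std::uint64_t state;
        std::uint64_t inc;   // must be odd; each odd value is a distinct stream

        toy_stream_rng(std::uint64_t seed, std::uint64_t stream)
            : state(seed), inc((stream << 1u) | 1u) {}  // 2^63 possible streams

        std::uint64_t next() {
            // The multiplier is Knuth's MMIX LCG constant (also used by PCG).
            state = state * 6364136223846793005ULL + inc;
            return state;    // a real generator would permute this before output
        }
    };

    int main() {
        toy_stream_rng a(42, 0), b(42, 1);  // same seed, two different "books"
        std::cout << a.next() << " vs " << b.next() << '\n';
    }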

PCG Generators

The baseline PCG generator is a 32-bit generator with 64 bits of state and 63 bits used to select the stream. In other words, it's like having 9,223,372,036,854,775,808 codebooks, each filled with 18,446,744,073,709,551,616 numbers (32-bit integers). In one sense, the numbers it produces are “preordained”—if only we knew which book it was reading from and where it started, we could predict it. But if you don't know that information, the numbers are, in a very real sense, random. It's probably easier to predict dice rolls or roulette.
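
In code, that looks something like the sketch below, using pcg_random.hpp from the PCG C++ library (the seed and stream values here are arbitrary):

    #include <iostream>
    #include "pcg_random.hpp"

    int main() {
        // The first argument seeds the 64 bits of state (the page to start on);
        // the second selects one of the 2^63 streams (which codebook to use).
        pcg32 rng(42u, 54u);

        for (int i = 0; i < 4; ++i)
            std::cout << rng() << '\n';   // 32-bit outputs
    }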

But if you want more, you can use a generator with 128 bits of state and 127 bits for selecting the stream. That's like having 170,141,183,460,469,231,731,687,303,715,884,105,728 codebooks, each with 340,282,366,920,938,463,463,374,607,431,768,211,456 numbers.

And if you really want, you can have a generator with a period of 2⁵²⁴³⁵² (or even more; the sky's the limit). Of course, just saying “what page of the codebook” you want to start at requires 64 KB of data for a generator this huge.

Comparison to “True Randomness” from random.org

The website random.org provides users with access to “true, naturally derived, randomness”. It's pretty cool, but it's also illustrative of the differences between “natural” randomness and algorithmic randomness.

At the time I write this, random.org has been running for about 15 years and has generated 215 GB of random data. In contrast, I can produce 215 GB of statistically high-quality random data in about one minute on a two-year-old laptop (using one of the PCG generators). Although it's not a very meaningful comparison, it does illustrate a fundamental difference. For good algorithmic generation, the limiting factor is how fast you can consume the random data; saving it to disk or sending it over the network becomes the bottleneck (the PCG generation scheme can easily saturate a 10 gigabit Ethernet connection). For “natural” randomness, the limiting factor is how quickly you can collect data from your source and process it.
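
If you want a rough sense of that speed on your own machine, a timing sketch like the one below works (std::mt19937_64 stands in for whichever generator you care about; the figures will vary by hardware):

    #include <chrono>
    #include <cstdint>
    #include <iostream>
    #include <random>

    int main() {
        std::mt19937_64 rng(12345);
        constexpr std::size_t n = std::size_t(1) << 27;  // 2^27 draws = 1 GiB

        std::uint64_t sink = 0;
        auto start = std::chrono::steady_clock::now();
        for (std::size_t i = 0; i < n; ++i)
            sink ^= rng();   // XOR into a sink so the loop isn't optimized away
        double secs = std::chrono::duration<double>(
            std::chrono::steady_clock::now() - start).count();

        std::cout << (n * 8.0 / 1e9) / secs << " GB/s (sink=" << sink << ")\n";
    }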

And “natural” randomness needs processing. Natural sources of randomness often have bias of one kind or another that needs to be eliminated. For example, random.org provides statistics that show how much bias is present in its input. It uses a bias-removal algorithm due to John von Neumann, but that algorithm requires it to discard 75% of its input.
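
Von Neumann's trick is simple enough to sketch directly: read the input bits in pairs, emit 0 for “01” and 1 for “10”, and throw away “00” and “11”. Even with perfectly unbiased, independent input, only one output bit survives per four input bits on average, which is where the 75% figure comes from. A minimal illustration (the input bits are made up):

    #include <cstddef>
    #include <iostream>
    #include <vector>

    std::vector<int> von_neumann_extract(const std::vector<int>& bits) {
        std::vector<int> out;
        for (std::size_t i = 0; i + 1 < bits.size(); i += 2) {
            if (bits[i] != bits[i + 1])   // "01" or "10": keep one unbiased bit
                out.push_back(bits[i]);   // "10" -> 1, "01" -> 0
            // equal pairs ("00" and "11") carry the bias and are discarded
        }
        return out;
    }

    int main() {
        std::vector<int> biased = {1,1, 1,0, 1,1, 0,1, 1,0, 1,1, 1,1, 0,0};
        for (int b : von_neumann_extract(biased)) std::cout << b;
        std::cout << '\n';   // prints "101"
    }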

Footnotes

[1] Our discussion of causation assumes a Newtonian view of physics, which largely holds for macroscopic events in our world, including examples like tossing a coin or the path of a roulette ball. Effects at the quantum level have been shown to be strongly random beyond what can be explained by mere “hidden state”. The website random.org uses chaotic randomness (which can be explained using only Newtonian physics) for its “true” randomness, rather than quantum randomness.