Predictability

Ideally, random number generators shouldn't be trivial to predict! Although you probably shouldn't use PCG-family generators for tasks with strong cryptographic security needs, they're much less predictable than most popular general-purpose RNGs.

What is Predictability?

A random number generator is predictable if, after observing some of its “random” output, we can make accurate predictions about what “random values” are coming up next.

To most people, predictability seems like the antithesis of randomness, yet it is in part a matter of perspective. For example, here are some random numbers generated by /dev/random on my computer:

0x2089b25b, 0xa904bd46  0x17de4ab5, 0x3fbae22f
0x93fa9859, 0x58b8aa89  0x2911cfb7, 0xcd2e9be0
0x6faf0d43, 0xcfc13aec  0xeebd2f36, 0x4c19793d
0x89ff6083, 0x9666c577  0x2b387d0c, 0x17c76c76
0x175a6408, 0xe3b39734  0xa8168511, 0xf798a21c

If you've never seen this page before, they ought to look pretty random. But if you come back and read this page tomorrow, they'll be the same, and they won't seem quite as random. You could predict that if you came back next week and read this page, the exact same numbers would be here, and if someone asked you “What comes after 0x17de4ab5?”, you could be pretty sure the answer is 0x3fbae22f.

Thus, the above numbers both “look random” and are “totally predictable”. In that sense, it is possible for an entirely predictable random number generator to pass a battery of statistical tests for randomness.

Whether predictability matters depends on the application. If you're performing a simulation, it may not matter at all. But if you were running a lottery, it might matter a lot.

Surprisingly, the general-purpose random number generators that are in most widespread use are easily predicted. (In contrast, RNGs used to construct stream ciphers for secure communication are believed to be infeasible to predict; such generators are known as cryptographically secure.)

How Can Predictability Arise?

In the basics of pseudorandom number generation, we introduced an analogy for algorithmic random number generation—the idea that a random-number generation scheme is like a book of “random numbers”, where the seed for the generator tells us which page to begin reading from.

The neat thing about algorithmic generation is that the contents of this monstrously huge book are not explicitly stored; they are computed as needed (using our position in the book). Nevertheless, the contents of the book itself never change, only our reading position (which is what is stored in the internal state of the generator).

Although in some sense the sequence for a given generator is fixed, the book is so huge that a brute-force strategy of simply looking through the entire book to figure out where we are reading from is impractical.

But that's just a brute-force strategy. What if there were some cleverer strategy to figure out where in the book we are? Because the content of the book is calculated algorithmically, perhaps it is possible to somehow do the opposite calculation…?

At present, there is no actual proof that any algorithmic random number generator is impossible to predict given some of its output (not even the “cryptographically secure” ones!). Proving that a generator is impossible to predict amounts to proving the existence of one-way functions, and such a proof would show that P ≠ NP (see Wikipedia for more details). Nevertheless, in practice, there are random number generators that no one knows how to predict (and most computer scientists believe P ≠ NP).

But there are also generators that are trivial to predict.

Predicting LCGs

As an example, let's see how easy it is to predict a linear congruential generator. These are defined according to the formula

\begin{equation*} r_{n+1} = a \, r_{n} + c \mod m \end{equation*}

for some constants \(a\), \(c\), and \(m\). From that, we can say that

\begin{align*} r_1 = a \, r_0 + c \mod m \\ r_2 = a \, r_1 + c \mod m \end{align*}

and, rearranging,

\begin{align*} a = \frac{r_1 - r_2}{r_0 - r_1} \mod m \\ c = \frac{r_0 r_2 - r_1^2}{r_0 - r_1} \mod m. \end{align*}

And that's it. If we know the modulus, given just three values from the LCG, we can calculate the multiplicative and additive constants (if you'd like to try, note that we can perform modular division using the modular multiplicative inverse, which is available in Mathematica via the PowerMod function, for example).

In fact, sometimes almost no mathematics is required. Consider this C++ program:

#include <random>
#include <iostream>

int main()
{
    std::default_random_engine rng;

    std::cout << "Minimum: " << rng.min() << "\n";
    std::cout << "Maximum: " << rng.max() << "\n";

    for (int i = 0; i < 3; ++i) {
        std::cout << rng() << "\n";
    }
}

With the GNU C++ compiler and library, it prints out

Minimum: 1
Maximum: 2147483646
16807
282475249
1622650073

We might immediately notice that \(16807^2 = 282475249\), and suspect that \(a = 16807, c = 0\). We might also realize that \(m = \mathrm{maximum}+1\) \(= 2147483647\). The third number confirms our suspicion.

With the LLVM C++ library, it prints out

Minimum: 1
Maximum: 2147483646
48271
182605794
1291394886

Based on the above experience with the GNU library, you might suspect that \(a = 48271, c = 0\) purely by inspecting the first output, and you'd be right.

In fact, that's because these implementations define std::default_random_engine to be std::minstd_rand0 and std::minstd_rand; these constants are well known as part of their specification. If you know it's std::minstd_rand, you only need to see one number to predict all its future output.

Other Trivially Predictable Generators

In fact, any generator that outputs its entire internal state is trivial to predict. The page discussing other random number generators gives several examples, but one notable one is the Mersenne Twister. If you look online, you can find several write-ups where people figure out the state of this generator from its output.

A Failed Attempt at Preventing Prediction

In 1980, the highly respected computer scientist Donald Knuth wrote a technical report, Deciphering a Linear Congruential Encryption, claiming that if we drop the low-order bits of an LCG and keep the constants secret, it would be hard for anyone to reconstruct the internal state of the generator from the output. (The paper, Stanford Tech Report 024800, is no longer available, but a revised version became a 1985 journal paper.)

Knuth was mistaken. Even as he prepared the paper, efficient algorithms were being developed that could determine the constants and the internal state of a truncated LCG.

Predicting Truncated LCGs

Knuth's paper included an algorithm to predict truncated LCGs, but it was an exponential-time algorithm (exponential in the number of bits). Thus, if a large number of bits is discarded, Knuth's algorithm becomes infeasible.

In 1989, Joan Boyar adapted her 1982 work to create an algorithm for predicting truncated LCGs. She claimed it was a polynomial-time algorithm, but she achieved this goal only through mathematical sleight of hand: she restricted the number of dropped bits to the log of the total number of bits, which is fairly unrealistic. (After all, who generates 64-bit random numbers and then drops only six bits?)

Meanwhile, various authors (Frieze et al 1984, Hastad & Shamir 1985, Stern 1987, Frieze et al 1988) developed true polynomial-time algorithms for recovering the state of a truncated generator. All these techniques use ideas from the Lenstra–Lenstra–Lovász lattice basis reduction algorithm.

All of these papers are heavily mathematical and can be somewhat challenging to read, and none of the authors provide code to implement their ideas. In 1998, Joux & Stern attempted to remedy the first issue (but not the second) in Lattice Reduction: A Toolbox for the Cryptanalyst, which discusses these algorithms from a more practical perspective.

Nevertheless, the academic literature for predicting truncated LCGs is somewhat frustrating to peruse. It isn't always easy to determine

  • How many outputs from the LCG the prediction algorithm needs to have a reasonable chance of success.
  • The exact computational complexity of the algorithm.
  • The point at which the technique breaks down (some techniques place restrictions on the number of bits dropped or the modulus, such as requiring that it be square-free).

Some algorithms (e.g., Stern 1987) use about \(\sqrt{3 \log_2 M}\) outputs, where \(M\) is the modulus of the generator (e.g., for a modulus of \(2^{64}\), up to 14 values may be needed). In contrast, Frieze et al 1984 requires only four values, but requires us to know all the constants, including both the multiplier and the additive constant.

Limitations

In general, it isn't clear how robust these techniques are to slight variations in RNG technique. It would appear that introducing a small amount of non-linear behavior might stop them from working. Myself, I'm curious about whether something as simple as xoring with an unknown constant would derail them.

Knuth 1985 makes a similar claim:

Although we have seen that linear congruential sequences do not lead to secure encryptions by themselves, slight modifications would defeat the methods presented here. In particular, there appears to be no way to decipher the sequence of high-order bits generated by linear congruential sequences that have been shuffled [...]

Although various authors have refuted the central claim of Knuth 1985, this closing claim seems to have gone unrebutted thus far.

Observations

We've seen that some very smart people have made claims that particular random number generators weren't predictable, only to be proven wrong. In addition, predicting a generator in practice requires two things: knowing that algorithms exist for predicting it, and knowing how to apply those algorithms to the task. For example, in exploring the literature, I can find several random number generators from 1985 that are claimed to be hard to predict. Those claims could be wrong, but I didn't find it easy to know for sure one way or the other.

Cryptographically Secure Generators

Cryptographically secure generators are generators that are strongly believed to be very difficult to predict, in part because they have received considerable scrutiny from the cryptographic community and no one has found a good way to break them. One such example is arc4random on OS X, iOS, and various BSD Unix variants, but there are many others. In an effort to make it very unlikely that they can be predicted, most cryptographically secure generators are fairly expensive to compute.

Generators that May Be Hard to Predict

Some generators do not make any claims about their cryptographic security, but nevertheless aren't trivial to predict. For example, some general-purpose generators use several generation schemes and then combine them to create their final output (e.g., Ran from the most recent edition of Numerical Recipes in C). It isn't obvious to me how someone could predict these generators, but that doesn't mean it is impossible. For example, some schemes for combining multiple linear-feedback shift-register RNGs are prone to correlation attacks.

I would class these generators as more secure than trivially predictable generators. A casual user may not know how to predict them, and may not be able to find any work showing them to be predictable, but without detailed analysis from the cryptographic community, it's hard to be confident that they actually cannot be predicted in reasonable time.

Predictability of the PCG Family

The PCG family is designed with difficulty of prediction in mind, and the default generators are not trivially predictable. But the primary design goal for most members of the PCG family is to be a fast, statistically good, general-purpose generator, and so by design they do not work quite as hard as most cryptographically secure generators.

Most of the PCG output functions involve nonlinear operations and only reveal partial state, but as we saw with Knuth's truncated LCGs, that's no guarantee that PCG generators can't be cracked.

Let's consider pcg32, which has a state-space size of \(2^{127}\) (\(2^{64}\) period × \(2^{63}\) streams) and produces 32-bit outputs. The output function for this generator includes a random rotation, which should make it harder to predict than a simple truncated LCG. One way to crack this generator would be to take \(K\) outputs, apply all possible rotations to those outputs, and then pass each of these candidate rotations to a truncated-LCG-reconstruction algorithm (we would choose \(K\) to be the smallest value that would allow the reconstruction algorithm to work). Because there are 32 possible rotations for a 32-bit number, this would increase the work by a factor of \(32^K\).

The pcg64 generator doubles the state-space size and adds the nonlinearity of xor-folding to the mix, whereas pcg32_c64 adds even more state and more xor-based nonlinearity. I would expect these to be harder still.

I know that if I were trying to predict a random number generator, I'd want something easier than the PCG family. But if I wanted actual cryptographic security for secure communication, I'd probably want to use something that has been around longer and seen more scrutiny.

Hopefully, as time passes, the PCG generation scheme will receive scrutiny from people with far more expertise in cryptographic security than me, and we will have a clearer picture of how easily it can be predicted. With that in mind, I hope to offer some cryptographic-security challenges in the future to encourage people to try to break it.