C++ Seeding Surprises

M.E. O'Neill

2015-04-16 01:22

Properly seeding random number generators doesn't always get the attention it deserves. Quite often, people do a terrible job, supplying low-quality seed data (such as the system time or process id) or combining multiple poor sources in a half-baked way. C++11 provides std::seed_seq as a way to encourage the use of better seeds, but if you haven't thought about what's really going on when you use it, you may be in for a few surprises.

In contrast to C++11, some languages, such as popular scripting languages like JavaScript, Python, or Perl, take care of good seeding for you (provided you're using their built-in RNGs). Today's operating systems have a built-in source of high-quality randomness (typically derived from the sequencing of unpredictable external and internal events), and so the implementations of these languages simply lean on the operating system to produce seed data.

C++11 provides access to operating-system–provided randomness via std::random_device, but, strangely, it isn't easy to use it directly to initialize C++'s random number generators. C++'s supplied generators only allow seeding with a std::seed_seq or a single integer, nothing else. This interface is, in many respects, a mistake, because it means that we are forced to use seed_seq (the “poor seed fixer”) even when it's not actually necessary.

In this post, we'll see two surprising things:

Low-quality seeding is harder to “fix” than you might think.
When std::seed_seq tries to “fix” high-quality seed data, it actually makes it worse.

Perils of Using a Single 32-Bit Integer

If you look online for how to initialize a C++11 random number generator, you'll see code that looks like

std::mt19937 my_rng(std::random_device{}());

or possibly the more explicit (but equivalent) version,

std::random_device rdev;
uint32_t random_seed = rdev();
std::seed_seq seeder{random_seed};
std::mt19937 my_rng(seeder);

Footnote: This example is specific to the Mersenne Twister. For some other generators, especially the simpler ones, if you just specify a single integer for the seed, as we did in the first piece of code, it uses it directly without involving seed_seq.

We're about to discover why this code is actually problematic, but first, let's be clear about what it's doing. It is

Creating a random-device object, which (hopefully) will ask the operating system for system-provided randomness.
Getting a single 32-bit value from that device.
Creating a seed_seq object.
Using that seed_seq object to initialize the Mersenne Twister.

It might seem odd to use just one 32-bit value to perform our initialization (the Mersenne Twister uses 624 32-bit integers to represent its internal state, plus a few more for housekeeping). If doing so seems wrong to you, you have good instincts! But as wrong as it is, it's nevertheless widely seen.

So if it is a bad idea, why would anyone write code like this? In fact, people do so largely because with the way that things are currently specified in C++, doing any better is quite tiresome. Actually, I asked the C++ community on Reddit to have a go at doing it right, and people found it surprisingly hard. (In fact, as we'll eventually see, no one really succeeded, because even if you try to use 624 integers to initialize the seed_seq, it's not going to do what you'd like!)

Given that this kind of code is widely seen, let's explore why it's problematic…

Predictability

If we seed with a single 32-bit integer, that only allows us to have about four billion (i.e., 2³²) possible initial states for our RNG. Unfortunately, in today's world, that's not very many at all. Suppose you generate 357913944 as your first number, I can just search all possible seeds and determine what your initial seed was (I won't spoil it by telling you ). If you do want to try it yourself, the Mersenne Twister is fairly slow so it'll take a while to slog through it (using all cores of a 3.40 GHz Haswell i7-4770, it takes nearly two hours).

Of course, speed problems can often be solved by trading some space for more speed. If we have a little over 16 GB of spare space on disk (or in memory), we can just precompute a lookup table that maps RNG outputs to the seeds that produced them.

However we do it, once we know the seed, we can predict all future output from the RNG. You can imagine how bad that would be in some contexts (e.g., anything involving gambling), but even for ordinary programs, predictability can allow algorithmic complexity attacks.

Bias

You might have thought predictability was our only worry (and maybe one you could dismiss if no one will see the numbers you generate), but it isn't the only one. To see an even worse problem, let's imagine a scenario.

Scenario: The Twitter App Mystery

Imagine you're working for Twitter. With more than 250 million active Twitter users, and more than a billion smartphones in active use, it's easy to imagine that the Twitter app is on a lot of phones; let's suppose it's used by about 125 million people. You'd like to get some very detailed usage information from your users, but you worry about the deluge of data you'll get if you got information from everyone every time the open the app. So, you decide to take a sampling approach. Each time the app starts up, it'll randomly decide whether to send in the usage information. Looking at the hugeness of the numbers (you guess the twitter app (re)starts about 8 times per day per user, for a total of 2 billion launches per day), it looks like a good (and convenient!) probability for whether to send in a report is 1 in 2³² to get a new report to look at every couple of days.

So, copying the code you saw above, you write

std::mt19937 my_rng(std::random_device{}());
if (my_rng() == 7) /* lucky seven!  you get to send in a report */
    send_detailed_tracking_info_secretly();

And then you push out the updated app and wait. Soon the reports will come rolling in and you can analyze them.

Except that nothing happens! Days pass and no reports ever come. In the next update, you change the lucky number 7 to the unlucky number 13, and still nothing happens. This behavior would only be possible if your random number generator has a severe aversion to producing seven or thirteen.

Walking through the code with a co-worker, you're told to stop being so cute and use 0 as the “lucky” number. With the new update, suddenly tracking reports are rolling in. In fact, you're now getting twice as many as your back-of-the-envelope calculations had lead you to expect. Does the Mersenne Twister like zero twice as much as it should?

The stats you're getting look interesting, but you decide you'd like to increase the number of reports twenty-fold, so you change the == 0 line to < 20. But the change only gives you 7.5× more reports!?!

Perplexed, you review the code one more time. You realize that because you were just using one number, you never needed a RNG at all. You change the code to just use the random device directly:

std::random_device rdev;
if (rdev() < 20) /* 20 in 2**32 chance, you get to send in a report */
    send_detailed_tracking_info_secretly();

and it all suddenly works the way your back-of-the-envelope calculations said it should. And when you set your code back to checking for “lucky 7”, or 13, or 0, they all give the same number of reports.

Sure, maybe you should have just used the random device in the first place, but why did your code fail so badly when you used seed_seq and the Mersenne Twister?

What's Going On...

Strangely enough, when you initialize the Mersenne Twister with a 32-bit seed (via seed_seq), it can't ever generate 7, or 13 as its first output. And two different seeds produce 0. Even more crazy, there are twelve different 32-bit seeds that can produce the “random” numbers 1226181350 and 1563636090, so those numbers show up twelve times more often than we'd expect.

In fact, if we look at the distribution of outputs, it very closely matches a binomial or Poisson distribution. In other words, these counts exactly match what you'd expect to see if you asked for 2³² random 32-bit numbers from a good source of randomness. It isn't just 7 or 13 that are missing, they're just two of 1,580,024,992 32-bit outputs that don't get output (36.7% of 2³², and almost exactly the 2³²/e we'd expect from a Poisson distribution).

Normally, when taking random samples, which numbers are missing and which show up multiple times are different each time so there is no systemic bias. But in our example the missing and duplicated numbers are always the same ones, baked in, forever. Because they never change, we have to consider the pattern to be a kind of bias, and, as we saw in our story, this kind of bias can have real and surprising consequences.

Don't Blame `std::seed_seq`

You might be thinking that the problem here is seed_seq or the Mersenne Twister, but the real problem is that we're asking the impossible. The Mersenne Twister has 19937 bits of internal state, and we can't initialize it in a bias-free way using fewer than 19937 bits. (If our seeding algorithm is deterministic, there have to be some states that can never appear.)

So, if you take away one lesson, it's that

You really shouldn't try to initialize a random number generator whose state is 624 32-bit ints with a single 32-bit int.
If you ignore that advice, std::seed_seq tries to come up with a sane initialization, but it can't work miracles.

Incidentally, similar problems also affect std::minstd_rand, which uses a single 32-bit word to store its internal state but whose number of states isn't an exact power of two. It's clear (via the pigeonhole principle) that there is no bias-free way to use a single full-range 32-bit value to initialize this RNG.

Why does seed_seq exist, if initializing 624 ints using a single int isn't a good idea in the first place? Well, just because something is a bad idea doesn't mean people won't want to do it! And without seed_seq, people might employ far more pathological initializations, such as repeating the same integer 624 times. The Mersenne Twister is prone to “bad states” (all zeros causes it to not work at all, whereas lots of zero bits are merely bad)—taking 0, or 1, or 2 and repeating it 624 times could be disastrous. The seed_seq algorithm avoids this issue (when given a single integer), resulting in an initialization that's relatively sane.

But a good rule of thumb is that if you're so cheap that you want to use a single 32-bit integer to initialize your random number generator, you should probably use an RNG with a 32-bit state. Otherwise, you should use the amount of seed randomness your generator actually needs—don't try to give it less.

Blame `std::seed_seq` After All

In the previous section, we saw that seed_seq was just trying to do something sane when faced with an impossible task (i.e., not being given enough seed data to work with). But what about when we do provide it with the right amount of seed data? Are we free of problems now? Alas, the answer is (at least sometimes), no.

To explore this issue, we'll look at something much smaller than the Mersenne Twister, a 64-bit RNG that thus needs 64 bits of seed data. C++ doesn't include such a generator, but it does provide the means to make one:

typedef std::linear_congruential_engine<
    uint64_t, 6364136223846793005U, 1442695040888963407U, 0U>
        knuth_lcg;  /* Knuth's preferred 64-bit LCG */

(Incidentally, I sometimes use this generator in preference to the PCG generators when all I care about is raw speed. Sometimes you just don't need anything statistically better.)

A reasonable way to initialize this generator would be

std::random_device rdev;
uint64_t       seed = (uint64_t(rdev()) << 32) | rdev();
knuth_lcg      rng{seed}

Here we just make a single 64-bit integer from two calls to the system's random device and use that as our seed. This approach works really well, but we're going to see what happens if we try to use seed_seq instead.

std::random_device rdev;
std::seed_seq  seeder{rdev(),rdev()};
knuth_lcg      rng{seeder}

This code seems like it ought to be broadly equivalent. We're not making the mistake of using just one 32-bit value, we're using two, and inside knuth_lcg it'll request two integers from the seed sequence. So we should be all good, right?

But no, it's not good at all. Once again, seed_seq introduces bias, and this time it really has no excuse.

`std::seed_seq` Is Not a Bijection

If we think of seed_seq in this context as taking in 64 bits (2 × 32-bit ints) and outputting 64 bits (again as 2 × 32-bit ints), we can see it as a simple function from 64-bit ints to 64-bit ints. The question is, what kind of function is it? The answer is shown in the following graph:

What we really want is for seed_seq to be a bijection—in other words, we want every possible output to have the same chance of being produced (1 in 2⁶⁴). But that's not what seed_seq does. Some output values are produced twice, and some are never produced at all. (Interestingly, it also doesn't look like a random distribution—it's its own thing.)

You might wonder how I got the histogram above. After all, did I really generate all 2⁶⁴ (18 quintillion!) possible inputs and see what it outputs? No. Did I do crazy mathematics to figure it out? No. In fact, I cheated. I made a 16-bit version of seed_seq and tested how it behaved when generating two 16-bit numbers (which gives us only 2³² possibilities to consider, a mere 4 billion).

“Ha!” you cry. “Maybe that's not a valid approach!”

But I went on to check my work. With good evidence to suspect that the real 32-bit seed_seq isn't a bijection, I looked for collisions. And I found them. For example, if you generate two ints from seq1 or seq2 below, you get the same output, {0x3bf5972a, 0xe304114d}:

seed_seq seq1{0xf5e5b5c0, 0xdcb8e4b1}
seed_seq seq2{0xd34295df, 0xba15c4d0}

This single example is enough to show that seed_seq is not acting as a bijection. If you'd like a few more examples, this program will print out about a hundred more.

Obviously I couldn't search the entire 64-bit space looking for duplicates, but I didn't have to. Instead, I sampled a mere 2³⁴ (about 17 billion) seedings, and, thanks to the birthday problem, I found about eight duplicates per run. (Done naïvely, this process would require a 256 GB hash table, but you can get by with a somewhat smaller table; I might write about exactly how I did that in a separate post.)

Of course, thus far I've only used generate to make two ints. Maybe that's a weird case and it behaves differently if we need a different amount of seed data? Actually, that's true, but not in a positive way, as we can see in the graph below:

Here, we find that if we're generating a single integer (from a single seed int), it is a bijection, but as we generate more and more integers (while properly giving the RNG a matching number of input seeds), the results actually begin to look more and more like the Poisson distribution we saw above. In fact, in the last one, some outputs show up six times (albeit a mere 0.0006% of them).

(Incidentally, these graphs were drawn using an 8-bit incarnation of seed_seq; you'll notice that the second histogram looks identical to its 16-bit counterpart, which lends credence to the idea that we can extrapolate about performance from smaller bit sizes.)

Scary All-Zero States

Since seed_seq can never produce some outputs, you might hope that among those never-produced outputs is the all-zero state. A reason to want this property is that the Mersenne Twister and other generators based on linear-feedback shift-register techniques (including XorShift) can't handle an all-zeros state—if you initialize them that way, they'll just output zero forever.

Unfortunately, seed_seq can produce the all-zeros state. In fact, in my tests with the 8-bit and 16-bit versions of seed_seq, I found that the all-zeros state occurred twice when generating two ints (but only once when generating one, three, and four ints). So putting good random data though seed_seq may actually be increasing the chance of a pathological state rather than decreasing it.

That said, even if we double the odds, it's still really unlikely and you probably shouldn't worry about it.

More `std::seed_seq` Flaws

Besides its somewhat surprising mathematical properties, there are a few other gotchas for C++ programmers when using seed_seq. One is that it is required to store its seed data so that every time generate is called in the same way, it produces the same output. Because the amount of seed data can vary, it stores it on the heap, but doesn't have a std::allocator as a parameter. Thus, if you're working on an embedded system that prohibits heap allocation, you can't (easily) use seed_seq at all.

Fixing Seed Sequences

In my opinion, the C++ standard should do two things for C++17 to help with these issues.

It should relax the requirements for the Seed Sequence concept so that it is not required to be almost identical to std::seed_seq in its implementation. The idea that repeated calls to generate must always generate the same values needs to go.
The std::random_device class should have a generate method so that it can be used as a Seed Sequence. This addition also aids efficiency because the library can then read multiple values from the operating system at once rather than requesting them one by one.

With those changes, we can let std::seed_seq take a reduced role in our code.

Another interesting question is whether we can write a better seed_seq, one that is both a bijection and does a good job when given low-quality seed data. I believe the answer is yes, but this post is long enough, so I'll leave that for another time.