Bugs in SplitMix(es)

M.E. O'Neill

2017-08-10 14:54

In my previous post about testing PCG with the PractRand statistical tests, I also ran PractRand on SplitMix. When I started the testing process, I had a bug in my implementation of SplitMix, and the same bug exists in many SplitMix implementations (e.g., at least four Haskell implementations and an OCaml implementation) because of a small but crucial error in the code for SplitMix given in the SplitMix paper. The specific bug does not exist in Oracle's JDK8 implementation of SplitMix (because that code was written before the paper, and not based on the code in paper).

I'm not the first person to discover this bug (e.g, it's documented in an implementation here), but it caught me out.

I had hoped that I could just correct the code to the JDK version and we'd be all good, but as I thought a little more deeply about what the code is doing, I discovered that even the code in JDK version doesn't successfully address the underlying issues. The flaw runs deeper.

Let's take a look...

Bad Seeds?

Take a look at these possible 64-bit seed values for a PRNG. Do they seem problematic?

0x61c8864680b583eb  0x7957d809e827ff4c  0x305cb877109d0686  0xefee3e7b93db3075
0xf8364607e9c949bd  0xf8d059aee4c53639  0x359e58eeafebd527  0x79629ee76aa83059
0x88e48f4fcc823718  0x9cd9f015db4e58b7  0xbeb721c511b0da6d  0x05d507d05e785673
0x7f83ab8da2e71dd1  0xf4077b0dbebc73c0  0x86466fd0fcc363a6  0x76442b62dddf926c

Hopefully they look pretty random and like quite plausible values to use to seed a PRNG. But it turns out that for many implementations of SplitMix, seeding with these seeds will lead to failing statistical tests; for example,

linux$ ./splitmix-buggy.64 0x61c8864680b583eb | ./RNG_test stdin64
SplitMix64(0x61c8864680b583eb) initialized.
Using next_int64() output function
RNG_test using PractRand version 0.93
RNG = RNG_stdin64, seed = 0xa7407f7a
test set = normal, folding = standard (64 bit)

rng=RNG_stdin64, seed=0xa7407f7a
length= 128 megabytes (2^27 bytes), time= 2.3 seconds
  no anomalies in 148 test result(s)

rng=RNG_stdin64, seed=0xa7407f7a
length= 256 megabytes (2^28 bytes), time= 5.3 seconds
  Test Name                         Raw       Processed     Evaluation
  BRank(12):2K(1)                   R= +88.2  p~=  1.4e-27    FAIL !!
  ...and 158 test result(s) without anomalies

Failing after five seconds of testing and only a very tiny fraction of the period of the generator is clearly bad!

How can a bug like this exist? It actually reveals one of the perils of PRNG implementation. If a PRNG implementation is buggy, it might not be obviously buggy. The seeds I listed above were carefully contrived to be terrible, but although there are many thousands of bad seeds, the vast majority of seeds won't show an issue, so testing with an arbitrarily chosen seed won't typically reveal any problem.

So what is going on?

How SplitMix Works

SplitMix has a lot in common with PCG. SplitMix's authors come from the same academic programming-languages community that I am part of, and there's actually a lot of commonality between their past work and my own (the disparate areas of memory management and parallel determinacy checking!). SplitMix was designed at around the same time as PCG. Like PCG, SplitMix has a state transition function based on simple arithmetic operations. Its state transition function is just a counter, rather than a linear congruential generator (although we can stretch the definition of an LCG to include counters if you bend the rules to allow the multiplier to be 1!). There are other counter-based PRNGs, such as ChaCha20 and Random123, but they make different design decisions. Let's contrast their state transition functions with SplitMix's.

Here's a sequence of internal states for Random123 or ChaCha20 (slightly simplified):

0x0000000000000001  0x0000000000000002  0x0000000000000003  0x0000000000000004
0x0000000000000005  0x0000000000000006  0x0000000000000007  0x0000000000000008
0x0000000000000009  0x000000000000000a  0x000000000000000b  0x000000000000000c
0x000000000000000d  0x000000000000000e  0x000000000000000f  0x0000000000000010

These counter-based PRNGs then apply a powerful hash function (e.g., a cryptographic hash) as the output function to turn the really similar internal states into drastically different final numbers.

Here's a sequence of internal states for SplitMix:

0x9e3779b97f4a7c15  0x3c6ef372fe94f82a  0xdaa66d2c7ddf743f  0x78dde6e5fd29f054
0x1715609f7c746c69  0xb54cda58fbbee87e  0x538454127b096493  0xf1bbcdcbfa53e0a8
0x8ff34785799e5cbd  0x2e2ac13ef8e8d8d2  0xcc623af8783354e7  0x6a99b4b1f77dd0fc
0x08d12e6b76c84d11  0xa708a824f612c926  0x454021de755d453b  0xe3779b97f4a7c150

Hopefully these internal state values look more random. But although these values might look random at a glance, they're really not very random at all. Much as Random123 is adding one each time, here SplitMix's internal state advances by adding a random-looking constant, 0x9e3779b97f4a7c15. The idea of making somewhat a sequence of random-seeming numbers merely by adding a magic constant actually has a name: it's a Weyl sequence. If you take that constant and divide by 2⁶⁴, you get 0.6180339887498948482, which is the first 19 digits of the fractional part of the golden ratio. Technically, to be a Weyl sequence you need to use the fractional part of an irrational number as the basis for your constant, but in reality any random looking number will work, so long as it's odd.

A Weyl sequence certainly isn't a very good random number generator, but it's better than just adding 1. Thus SplitMix doesn't need as strong a hash function for its output function, and a simpler hash function can be computed more quickly.

Thus, like PCG, SplitMix strives for a better balance between the state transition function and the output function. (Of course, I like PCG's balance even better!)

Bad Gamma Values

SplitMix achieves a larger state space and multiple streams by allowing different additive constants besides the golden ratio. It calls the additive constant gamma.

As I mentioned, almost any random-looking value will work, so long as it is odd. These random-looking numbers would be just as good: 0x243f6a8885a308d3, 0xb7e151628aed2a6b, 0x6a09e667f3bcc909, 0x93c467e37db0c7a5.

But if we picked 0x0000000000000001, that would not be a very good gamma value. SplitMix's output function is specifically designed on the basis that the internal state changes are at least a bit random looking. Likewise, something like 0x0101010101010101 is not going to be good enough, either.

The SplitMix algorithm tries to detect bad gamma values to make sure that if the gamma value seems bad, it corrects it to one that will hopefully be better.

The SplitMix paper says that “We believe that more investigation [of weak gamma values] is required.” (page 467) and, in the conclusion, questions the approach of trying to weed out bad gamma values, writing:

Finally, guided by intuition and informal argument, we made a somewhat arbitrary decision to suppress certain γ values in the method mixGamma—and yet we are haunted by the memory of the Enigma cipher machines, which were carefully designed so that “bad” encryptions could never occur by guaranteeing that no letter would ever map to itself, and so the word “LONDON” would never be encrypted as the word “LONDON”, yet that avoidance of “bad” encryptions was the Achilles’ heel that allowed the codebreaking machinery (the “cryptologic bombes” designed by Alan Turing and Gordon Welchman) to be so effective. By analogy, we worry that our effort to avoid “bad” random sequences that we believe would occur only rarely has in fact just introduced an unnecessary and undesirable bias. True randomness requires not that “miracles” or “disasters” be absent, but that they be present in due proportion—and human intuition is notoriously poor at judging this proportion. In the future, all of the choices outlined in this paragraph should be questioned and tested against better theory than we have so far mustered.

The Bug

Thus far, we know how SplitMix works and that it tries to avoid pathological gamma values. But let's check out Haskell's implementation of SplitMix:

macOS$ ghci
GHCi, version 8.0.2: http://www.haskell.org/ghc/  :? for help
Prelude> import System.Random.SplitMix
Prelude System.Random.SplitMix> mkSMGen 0x61c8864680b583eb
SMGen {_seed = 16269636050940713694, _gamma = 1}
Prelude System.Random.SplitMix>

I hope you see that that's not good. It's using a terrible gamma value! Let's try all the ones I listed at the start:

Prelude System.Random.SplitMix> map mkSMGen [0x61c8864680b583eb, 0xf8364607e9c949bd, 0x88e48f4fcc823718, 0x7f83ab8da2e71dd1, 0x7957d809e827ff4c, 0xf8d059aee4c53639, 0x9cd9f015db4e58b7, 0xf4077b0dbebc73c0, 0x305cb877109d0686, 0x359e58eeafebd527, 0xbeb721c511b0da6d, 0x86466fd0fcc363a6, 0xefee3e7b93db3075, 0x79629ee76aa83059, 0x05d507d05e785673, 0x76442b62dddf926c]
[SMGen {_seed = 16269636050940713694, _gamma = 1},SMGen {_seed = 11331281847540010644, _gamma = 1},SMGen {_seed = 5619286150931196027, _gamma = 1},SMGen {_seed = 10281499533683189089, _gamma = 1},SMGen {_seed = 17485363145453140727, _gamma = 7},SMGen {_seed = 17937441790592959953, _gamma = 7},SMGen {_seed = 15529190628949695919, _gamma = 7},SMGen {_seed = 6957216782316743206, _gamma = 7},SMGen {_seed = 13022414457527836317, _gamma = 15},SMGen {_seed = 2127073997383595602, _gamma = 15},SMGen {_seed = 7685868475082380828, _gamma = 15},SMGen {_seed = 6396586367974718702, _gamma = 15},SMGen {_seed = 13838712901582905874, _gamma = 31},SMGen {_seed = 15464227283579875280, _gamma = 31},SMGen {_seed = 17142043965827948305, _gamma = 31},SMGen {_seed = 10266498716653543189, _gamma = 31}]

Here we can see that four seeds map to the weak gamma value of 1. Four? Why four?

It comes from faithfully implementing this code from the SplitMix paper:

 private static long mixGamma(long z) {
   z = mix64variant13(z) | 1L;
   int n = Long.bitCount(z ^ (z >>> 1));
   if (n >= 24) z ^= 0xaaaaaaaaaaaaaaaaL;
   return z; }   // This result is always odd.

Let's contrast that code to the code Oracle has in their JDK8 implementation of SplitMix:

private static long mixGamma(long z) {
    z = (z ^ (z >>> 33)) * 0xff51afd7ed558ccdL; // MurmurHash3 mix constants
    z = (z ^ (z >>> 33)) * 0xc4ceb9fe1a85ec53L;
    z = (z ^ (z >>> 33)) | 1L;                  // force to be odd
    int n = Long.bitCount(z ^ (z >>> 1));       // ensure enough transitions
    return (n < 24) ? z ^ 0xaaaaaaaaaaaaaaaaL : z;
}

Do you see it? In converting the code for the paper, someone decided to clean it up and avoid the use of the ternary operator in favor of a more conventional if statement. Unfortunately, in doing so they inverted the logic.

The paper's implementation checks the gamma value, if it's bad it leaves it unchanged, and if it seems to be good, it tries to “repair” it, which can actually introduce problems when there were none before.

That's why there were four bad initializations. Two of them were already bad, and two of them were good ones turned into bad ones.

Problems with Split

I've focused on implementations that have an initialization that operates analogously to this Java constructor:

public SplittableRandom(long s) {
    this.seed = mix64(s);
    this.gamma = mixGamma(s + GOLDEN_GAMMA);
}

which is how the Haskell implementation is coded. But some implementations have a constructor that is more faithful to the paper's constructor, which behaves like this:

public SplittableRandom(long seed) {
    this.seed = seed;
    this.gamma = GOLDEN_GAMMA;
}

For this version, no freshly constructed generator will have an obviously problematic gamma (because they're all using the reasonable-seeming 0x9e3779b97f4a7c15 value), but if we call SplitMix's split operation we will execute the same the problematic gamma-fixing code. Here's an example in our Haskell session:

Prelude System.Random.SplitMix> let goldenGamma = 0x9e3779b97f4a7c15
Prelude System.Random.SplitMix> let mkSMGen1 s = seedSMGen s goldenGamma
Prelude System.Random.SplitMix> snd (splitSMGen (mkSMGen1 0xc3910c8d016b07d6))
SMGen {_seed = 16269636050940713694, _gamma = 1}

Takeaways So Far, Buggy Code in the Paper

It's interesting that this bug seems to have gone unnoticed for so long. Multiple programmers copied and pasted the code from the SplitMix paper and never noticed that it was wrong. Statistical tests were run, but none of them triggered enough pathological behavior to raise any red flags.

The original bug is understandable. People accidentally flip things around all the time. It's also quite easy to change the code to match the JDK version.

But there's a downside to fixing this bug. Results obtained from the buggy PRNG may not be reproducible on the new fixed PRNG, because its output values for a given seed (or split) will almost certainly be different after the bug—mixGamma will return completely different results and thus create completely different gammas. You might not think that reproducibility is important, but sometimes people depend on a particular seed generating a particular output. If you check out Andrej Bauer's random art website, you'll find that the seed defines the picture that is drawn.

Ordinarily, I would argue that the right thing to do is for people providing implementations of SplitMix based on the code from the paper to fix the bug, even with the disruption that fixing it might cause. They would then, alas, also announce to users that any results obtained using their implementations prior to the fix could be suspect, although that is fairly unlikely. Unfortunately, in the next section we'll see that perhaps fixing the bug doesn't make sense after all, so deciding on a right course of action is a bit more difficult.

For now, let's imagine that the code typo was the only issue.

It's a fact of life that papers sometimes contain errors. The peer-review process can catch many errors, but it is never guaranteed to catch them all. If the mistake is discovered, sometimes authors provide errata for their work. But that approach may not help much because the original paper with the error may be far easier to find than the errata document with the corrections. For example, Pierre L'Ecuyer provides an errata document for one of his papers that I've used, but I remained unaware of that document for a long time.

So what would have helped? I see two things that might have prevented people copying the buggy code out of the paper.

The paper could have included a download link for the code it contained. Code in a paper is often abbreviated to save space. Had the authors provided a download link, they would probably have made their unmodified code available rather than the version tweaked for presentation in the paper.
There should be test cases. That way, if someone writes a version of SplitMix, they can check that it works.

Even though the paper doesn't suggest where people can find the source, it does say that the code is in JDK8, and those with the appropriate savvy can seek it out in OpenJDK, but clearly people writing their own versions of SplitMix have not gone sleuthing for it.

I would argue that it might have been best to set up a repository on GitHub or a similar site to hold code, test cases, and allow bug reports.

We're Not Done Yet

I thought I was done at this point, little post about buggy code in a paper with correct code elsewhere. But thinking about it further, I realized that even the JDK version of the gamma test and gamma fix is flawed.

First, the repair doesn't always work. Let's go back to our earlier example where we had an increment of 1. In the fixed code, it'll repair it to 1 ^ 0xaaaaaaaaaaaaaaaa == 0xaaaaaaaaaaaaaaab. Let's test that “repaired” constant:

linux$ ./splitmix.64 0x5a383856a683771 0xaaaaaaaaaaaaaaab | ./RNG_test stdin64
SplitMix64(0x05a383856a683771,0xaaaaaaaaaaaaaaab) initialized.
Using next_int64() output function
RNG_test using PractRand version 0.93
RNG = RNG_stdin64, seed = 0xf217d6bf
test set = normal, folding = standard (64 bit)

rng=RNG_stdin64, seed=0xf217d6bf
length= 128 megabytes (2^27 bytes), time= 2.2 seconds
  Test Name                         Raw       Processed     Evaluation
  [Low16/64]BCFN(2+2,13-5,T)        R=  -6.2  p =1-1.0e-3   unusual
  ...and 147 test result(s) without anomalies

rng=RNG_stdin64, seed=0xf217d6bf
length= 256 megabytes (2^28 bytes), time= 5.7 seconds
  Test Name                         Raw       Processed     Evaluation
  BRank(12):2K(1)                   R= +51.9  p~=  1.1e-16    FAIL !
  ...and 158 test result(s) without anomalies

As you can see, the repair didn't work—this “fixed” generator also quickly fails statistical tests. If you're curious as to why, you might want to try calculating 0xaaaaaaaaaaaaaaab * 3 to see what happens after three additions of this constant.

So SplitMix's repair strategy isn't guaranteed. That's unfortunate.

And the test doesn't always work either. It wouldn't consider the constant 0xaaaaaaaaaaaaaaab problematic, for example. There are, unfortunately, other bad constants too.

The theory for what makes a good constant for the Weyl sequence was too naïve. Alas, that flaw in the paper is not a straightforward bug fix.

Conclusion

In his 1999 book, The Cathedral and the Bazaar, Eric S. Raymond laid out something he called Linus's Law, which states that “given enough eyeballs, all bugs are shallow”. Given the number of people who have read the SplitMix paper, and implemented its code, the fact that these flaws in SplitMix have remained undiscovered for so long seems to be something of a refutation of this principle.

You could claim that, contrariwise, my discovery of these issues supports Linus's Law, but I read the paper when it first came out and didn't see these problems back then. I wrote my own implementation and still didn't see the flaw. I did notice other things, like its predictability (which the paper helpfully calls out), and the way it would fail the 64-bit birthday test (because it never repeats a 64-bit value). I noticed that with the split() operation it becomes a random mapping with similar properties to John von Neumann's much derided middle-square method. But it was only when I started to lay the groundwork to talk about those issues (posts yet to be written!), that I finally noticed this bug.

Finally, we shouldn't over dramatize the seriousness of this bug. Bad constants exist and can cause problems for SplitMix's algorithm, but the vast majority of constants are fine. And the bad constants vary as to how much output you need to have generated before see any noticeable harm—even statistically flawed generators work well enough for many purposes. Currently multiple implementations do erroneously try to repair perfectly good constants, but much of the time the unnecessary “repair” won't have caused any noticeable ill effects.

(I have been in correspondence with Guy Steele about issues in SplitMix, including providing him with access to drafts of this post. It's been a good conversation. He has actually given public talks about possible successors to SplitMix, so you might reasonably infer from public information sources that he might be aware of some issues, too. This post represents my own independent discoveries.)