classification
Title: random.Random should not read 2500 bytes from urandom
Type: Stage:
Components: Versions: Python 3.6, Python 3.5, Python 2.7
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: dstufft, martin.panter, rhettinger, tim.peters, vstinner
Priority: normal Keywords:

Created on 2016-06-08 21:58 by vstinner, last changed 2016-06-13 14:47 by rhettinger. This issue is now closed.

Messages (25)
msg267905 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-06-08 21:58
Reading more than 256 bytes from os.urandom() is different than reading 256 bytes or less. For example, see getrandom() and getentropy() manual pages for more details.

random.Random doesn't need a very high quality entropy. The glib library only reads 128 bits from /dev/urandom to initialize the Mersenne Twister RNG for example.
msg267906 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-06-08 21:58
Related issue #26839. Don't read this very long issue! You can read my summary :-D https://haypo-notes.readthedocs.io/pep_random.html#bug-reports
msg267909 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-06-08 22:13
PHP uses GENERATE_SEED() to initialize mt_rand():
https://github.com/php/php-src/blob/1c295d4a9ac78fcc2f77d6695987598bb7abcb83/ext/standard/php_rand.h#L50

This macro uses: time() (resolution of 1 second), process identifier, php_combined_lcg()

php_combined_lcg() is seeded using: gettimeofday(), and getpid() or thread id in ZTS mode, gettimeofday() (called again after getpid()):
https://github.com/php/php-src/blob/1c295d4a9ac78fcc2f77d6695987598bb7abcb83/ext/standard/lcg.c#L55

mt_rand():
https://github.com/php/php-src/blob/1c295d4a9ac78fcc2f77d6695987598bb7abcb83/ext/standard/rand.c#L308
msg267910 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-06-08 22:16
Donald Stufft proposed the patch no-urandom-by-default.diff in the issue #26839 which replaces os.urandom() with int(time.time()*256) in random.Random constructor.

I dislike this change because it becomes more likely that two Python processes produce the same random number sequence:
https://bugs.python.org/issue26839#msg267819
msg267912 - (view) Author: Donald Stufft (dstufft) * (Python committer) Date: 2016-06-08 22:18
To be clear, that was a minimal patch using a method that already existing. It could be made a lot better by mixing in other sources of entropy like PID#, id(self), etc.
msg267915 - (view) Author: Donald Stufft (dstufft) * (Python committer) Date: 2016-06-08 22:23
One of the key take aways though is that MT doesn't really *need* to be initialized with a CSPRNG, all it needs it some moderately random data. So we can eliminate the case all together by just not using it. Though a sticking point is that it's documented to seed itself from os.urandom if seed(None) is called.
msg267950 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-06-09 02:49
If the seed is to be based on time.time(), why not use something based on hash(time.time()) to avoid the 1/256 s resolution?

If there is a practical way to get pseudo-random data from the platform, it would be better to use that, rather than cooking up our own entropy from time.time(), PID, etc. But I guess that depends on the future of os.urandom() and friends.

In the meantime, limiting to os.urandom(256) seems reasonable, though I’m not a random number or cryptography expert.
msg267952 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2016-06-09 03:20
Didn't anyone here follow the discussion about the `secrets` module?  PHP was crucified by security wonks for its horridly naive ways of initializing its PRNGs:

https://media.blackhat.com/bh-us-12/Briefings/Argyros/BH_US_12_Argyros_PRNG_WP.pdf

Please don't even think about making Python a target of similar ridicule ;-)

The only sane approach is to use an _excellent_ source of randomness for initialization, and `urandom()` is the only obvious such source.  While the more the merrier, I agree 2500 utterly unpredictable bytes isn't necessary.

If this has to change, use the most possible without creating other problems on a major platform, but certainly no less than 128 crypto-strength bytes.

-1 on any poke-and-hope gibberish trying to brew our own out of time.time(), PID, id(), etc.  That stuff is easy for a malicious program to attack.  That's why Python switched to using `urandom()` to begin with, before security wonks noticed how poorly most libraries handle this.

It's not about supplying "enough randomness" for applications, it's about making it computationally intractable for well-funded expert attackers to out-guess.  That's why urandom() is used.
msg267955 - (view) Author: Donald Stufft (dstufft) * (Python committer) Date: 2016-06-09 04:05
Tim,

If MT is used in any of the security sensitive contexts that paper mentions, then it doesn't matter if you seed it with a static zero or a billion bytes read from the purest of purestrain randomness, your application is broken. In other words, it doesn't matter what we seed it with, random.py (outside of SystemRandom) is vulnerable to everything said in that paper.

It took me 5 minutes of googling to find https://github.com/fx5/not_random, which took 22 minutes on my iMac to generate my own copy of it's `magic_data` file, and then 15 seconds to clone the state of the MT using nothing but the output of it. This was on CPython 2.7.11 where MT is seeded with 2500 bytes off urandom.

Surely we're not making engineering trade offs based off whether or not someone who doesn't understand the difference between a CSPRNG and a PRNG would make fun of us for not using a CSPRNG where it's not needed.
msg267958 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2016-06-09 04:47
Donald, it does matter.  The code you found must be using some older version of Python, because the Python 3 version of randint() uses _randbelow(), which is an accept/reject method that consumes an _unpredictable_ number of 32-bit Twister outputs.  That utterly destroys the theoretical framework the code you found relies on.

I know all about this, because I did extensive research during the `secrets` module discussion - Python 3 isn't systematically vulnerable to any of the attacks in the paper.  Deduction of the Twister's internal state is blocked in "almost all" cases because of the accept/reject logic in _randbelow() (BTW, .choice() is of far more practical concern here than .randint(), but in Python 3 they both build on _randbelow()).

Code attempting to deduce the state can't know how many Twister outputs were consumed, so can't know which bit variables in its equations are involved.  It can make a relatively high-probability _guess_, but that's not good enough either:  it has to get many consecutive high-probability guesses right to deduce all the bits of the Twister's very large state.

The usual outcome:  at least one (usually more than one) guess is wrong, and if the equation solver is careful it notices that its equations have become mutually inconsistent.  Dead end.  More rarely, the equations remain consistent, but the "deduced" state is pure fantasy due to a wrong guess along the way.

There's nothing about this that's a mystery to me.  I wrote my own solver more capable than the one you found:  it can deduce full MT states quickly from partial outputs of any kind whatsoever (e.g., it only sees the bits of the form 2**i for prime i in every 7th Twister output).  However, it too needs to know exactly _which_ MT partial outputs it's seeing.  If it believes it's seeing every 7th, but actually sees the 8th in one case, all bets are off:  the equations may become inconsistent, or they remain consistent but deliver a nonsense state.

So I remain strongly -1 on any attempt to make Python's seeding stupid again.  Stupid seeding makes Python vulnerable to attacks by script kiddies; no relatively sophisticated equation solvers are needed if the seeding is lame.

Yes, the Twister remains unsuitable for crypto purposes.  That's why the `secrets` module is being introduced.  But that's no excuse for deliberately making existing naive code laughably easy to attack.

And, also yes, we have made some decisions based on worst-case scenarios about naivety.  That's why random switched to using urandom() to begin with.  Get over it ;-)
msg267967 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2016-06-09 07:32
Tim,

you are saying that some methods (e.g. randint) of the MT don't fail security properties as bad as other. I'm sorry, that argument is not good enough for me. The seed() method of the MT is causing real-world problems, e.g. #25420 where 'import random' in a VM blocks forever because the Kernel's RNG state hasn't been seeded yet.

I'm in favor of changing the default seed for the default random instance.
msg268012 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2016-06-09 09:53
New plan:

* Add a new uint64 _Py_HashSecret.random_seed
* On 'import random' seed random.Random() from _Py_HashSecret.random_seed + gettimeofday().tv_sec + gettimeofday().tv_usec + id(self). That way subinterpreters get a different init state.

On systems with a good entropy source, _Py_HashSecret will be filled with cryptographically strong material. On early boot / VM scenarios the situation is a bit degraded but still properly good enough. It's better than to seed from a bad RNG or just from time.time().
msg268013 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-06-09 10:06
> On 'import random' seed random.Random() from _Py_HashSecret.random_seed + gettimeofday().tv_sec + gettimeofday().tv_usec + id(self). That way subinterpreters get a different init state.

Can we use os.urandom() was random.Random is instanciated manually,
but use the "secret random seed" when the random module is imported?

You can easily restrict the feature to workaround blocking "import
random" on VM: consume the secret in the C code, and ensure that it's
only used once.

For example, reimplement random.Random.seed() in C (remove the Python
implementation) and use the secret in C, but only the first user of
seed(None) will get it. WDYT?
msg268016 - (view) Author: Donald Stufft (dstufft) * (Python committer) Date: 2016-06-09 11:01
If seeding from urandom was causing no problems, then I would not care if random.Random() wanted to seed from urandom, even though it doesn't need to. However it is causing problems, and thus it shouldn't.

Here's another script, this one runs on Python 3.5.1 that can clone MT using nothing but it's output: https://gist.github.com/dstufft/394ea7dd8f64159df10e25a75865c03c.
msg268036 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2016-06-09 16:44
Donald, your script appears to recreate the state from some hundreds of consecutive outputs of getrandbits(64).  Well, sure - but what of it?  That just requires inverting the MT's tempering permutation.  You may as well note that the state can be recreated from the output of random.getstate() - in which case, your script could be a lot shorter too ;-)

But that's not what real-life programs expose.  All the flaws in the PHP paper related to deducing PRNG state were found in real-life code using idiomatic PHP ways of spelling choice() or randrange(), with relatively few possible outputs.

Produce a program that can deduce the state, in Python 3, from - say - a million consecutive outputs of randrange(256), and _that_ would be interesting, because that would be relevant.  It's easy in Python 2.  But in Python 3, you can't tell from the outputs how many times MT was invoked under the covers (but, of course, you can from your contrived getrandbits(64) outputs - the C-level MT is called exactly twice for each of those outputs).

In any case, the vast bulk of the PHP flaws were found by out-brute-forcing dumb PRNG initialization, which requires nothing in the way of reproducing state via reverse-engineering outputs (see Figure 13).  Noting that idiomatic use of Python 3's choice() (etc) is resistant to the paper's equation-solving state-deducing approach is really just a footnote - the _point_ is that lame seeding is, all by itself, a Bad Idea.  That alone was enough to crack most of the PHP programs the paper covered.

As to the rest, there are already too many massive bug reports arguing about urandom()-in-general.  The title of _this_ bug report suggested it may be good enough to reduce the _number_ of urandom() bytes MT initialization uses.  But, so far, Victor & I appear to be the only ones who made an on-topic comment about that ;-)

What do you believe?  For example, do you believe it would remain a disaster if MT initialization wanted only 1 byte from urandom()?  Or is 0 the only value you can live with?
msg268040 - (view) Author: Donald Stufft (dstufft) * (Python committer) Date: 2016-06-09 17:08
> But that's not what real-life programs expose.

Are you sure? Searching github pulls up a number of results of people calling it, but I haven't looked through them to see how/why they're calling it.

> What do you believe?  For example, do you believe it would remain a disaster if MT initialization wanted only 1 byte from urandom()?  Or is 0 the only value you can live with?

I don't really care that much what random.Random initialized with except as it related to what os.urandom does by default. Here's a copy/paste from my email to python-dev about it:

* Use getrandom(GRND_NONBLOCK) for random.Random since it doesn't matter if we
  get cryptographically secure random numbers or not.
* Switch it to use something other than a CSPRNG by default since it doesn't
  need that.
* Instead of seeding itself from os.urandom on import, have it lazily do that
  the first time one of the random.rand* functions are called.
* Do nothing, and say that ``import random`` relies on having the kernel's
  urandom pool initialized.

Between these options, I have a slight preference for switching it to use a non CSPRNG, but I really don't care that much which of these options we pick. Using random.Random is not secure and none of the above options meaningfully change the security posture of something that accidently uses it.
msg268046 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2016-06-09 18:04
> Searching github pulls up a number of results of people
> calling it, but I haven't looked through them to see
> how/why they're calling it.

Sorry, I don't know what "it" refers to.  Surely not to a program exposing the output of .getstate()?!

Regardless, there was a long discussion about the `secrets` module at the time, and nobody found any real code vulnerable to the approaches in the PHP paper under Python 3 (contrived code, certainly - that's easy).  Again, exploiting lame seeding alone sufficed to crack most of their examples, and Python's use of urandom() for seeding eliminated that approach (in Python 2 too).  

Examples potentially vulnerable to state equation-solving were "just like" what the PHP coders rolled by hand:  uses of things like .choice() and .randrange() to build "random" strings (passwords, session tokens, ...), from relatively small alphabets.  The smaller the alphabet, the more resistant Python 3 is to this approach, because the more likely ._randbelow() will invisibly skip over MT outputs.

For a while an incorrect claim was mistakenly accepted:  that when len(alphabet) was a power of 2, choice(alphabet) made an always-known number of MT calls.  If that were true, the equation solver could deduce the state quickly in such cases, which are relatively common.  But it's false - ._randbelow() is actually _most_ likely to skip over MT outputs when it's making a choice from a power-of-2 number of possibilities.  That's not obvious from a glance at the code.

I remain -1 on making seeding "dumb" again.  But I don't care whether urandom() returns low-quality bytes in the boot-time edge cases people are upset about.  They're still likely to be "better" than anything spun out of stuff like time.time().
msg268047 - (view) Author: Donald Stufft (dstufft) * (Python committer) Date: 2016-06-09 18:05
> Sorry, I don't know what "it" refers to.  Surely not to a program exposing the output of .getstate()?!

random.getrandbits()
msg268051 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2016-06-09 18:43
Ah!  Yes, .getrandbits(N) outputs remain vulnerable to equation-solving in Python 3, for any value of N.  I haven't seen any code where that matters (may be "a security hole"), but would bet some _could_ be found.

There's no claim of absolute security here.  To the contrary.  What I'm opposed to is making _all_ naive code vulnerable to easy script-kiddie brute force attacks against lame seeding.

The kinds of things people _were_ jumping up & down about were the many instances of stuff like this on the web:

https://stackoverflow.com/questions/3854692/generate-password-in-python

Again, I'd be impressed if you could write code under Python 3 to deduce the MT state from any number of outputs from his naive approach in reasonable time.  Of course he should be using urandom() instead (as an unaccepted answer urges) - but much code plain doesn't, and in Python 3 it's resistant to any attack the PHP paper exposed.

Make seeding lame again, and the easiest attacks can succeed again (the equation-solving stuff remains a footnote to me).
msg268129 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2016-06-10 17:01
With anything less than than full seeding of the 19937-bit state space, we lose all the guarantees (proved properties) documented at http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/ARTICLES/mt.pdf .  It is very nice to be able to rely on those properties and I don't think we should abandon them because glibc currently has lower standards (the trend is towards making seeding stronger rather than weaker).

"Good enough" for random number generators is in the eye of the beholder.  If your agenda is to reduce the number of bytes extracted from urandom(), it is easy to not care how well the MT is seeded, but there are people who do care (or should care).

Note that sampling and shuffling both eat entropy like crazy.  Currently, we have enough to shuffle 2000 elements.   With 128 bits, we run out at 34 elements, not even enough to shuffle a deck of cards and have each permutation as a possible outcome.

Over time, the random module has evolved away from "good enough" and traded-away speed in return for equi-distribution (i.e. we use _randbelow to get a balanced choice over ranges that aren't an exact power-of-two).

We should not allow a regression in quality.  In particular, I object to the cavalier assertion, "random.Random doesn't need a very high quality entropy."  If you're running multiple simulations, this is something you should care about.  And to defend against criticism of random samples, it is nice to be able to say you were properly seeded (see RFC 1750) and that you have some basis for believing that every outcome was equally likely.
msg268137 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2016-06-10 18:15
Raymond, while I'm in general agreement with you, note that urandom() doesn't deliver "random" bytes to begin with.  A CSPRNG is still a PRNG.

For example, if the underlying urandom() generator is ChaCha20, _it_ has "only" 512 bits of state.  Seeding the Twister with 2500 bytes from urandom() far exceeds the maximum possible entropy that ChaCha20's comparatively tiny 64 bytes of state can deliver (but, yes, I'm ignoring the possibility that a good urandom() implementation may fold in fresh entropy during the time MT sucks out those 2500 bytes - nevertheless, MT's state is far larger).

That's why I said earlier I could live with seeding from 128 bytes - twice the size of a currently trendy urandom() implementation's state.

But I'll be happiest if nothing changes here (given that Guido ruled yesterday that Python's current urandom() implementation has to be reverted to once again match Linux's non-blocking urandom() behavior).
msg268163 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2016-06-10 22:07
> But I'll be happiest if nothing changes here (given that Guido ruled
> yesterday that Python's current urandom() implementation has to be 
> reverted to once again match Linux's non-blocking urandom() behavior).

With urandom() behavior restored, can we close this?
msg268192 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-06-11 06:25
Victor originally opened this mentioning a 256-byte limit. I guess that comes from Open BSD’s getentropy() function: <http://man.openbsd.org/OpenBSD-current/getentropy.2>. Solaris’s getrandom() function <https://docs.oracle.com/cd/E53394_01/html/E54765/getrandom-2.html> has a similar limit, but of 1024 bytes. But Python already works around these limits by doing multiple calls.

So if it really is valid to get 2500 bytes with as much entropy as possible, maybe there is no problem and we leave things as they are.
msg268199 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2016-06-11 08:05
On 2016-06-11 00:07, Raymond Hettinger wrote:
> 
> Raymond Hettinger added the comment:
> 
>> But I'll be happiest if nothing changes here (given that Guido ruled
>> yesterday that Python's current urandom() implementation has to be 
>> reverted to once again match Linux's non-blocking urandom() behavior).
> 
> With urandom() behavior restored, can we close this?

No, the import of random is still broken on all BSD platforms during
early boot.
msg268226 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2016-06-11 16:30
Christian, you should really be the first to vote to close this.  The title of this bug report is about whether it would be good to reduce the _number_ of bytes Random initialization consumes from os.urandom(), not whether to stop using os.urandom() entirely.

But if I grasp your position, anyone who thinks Random initialization should continue using os.urandom() at all is, ipso facto, so insufferably ignorant they shouldn't even be allowed to express an opinion ;-)

So, to me, your position on _this_ bug report is clear:  no, merely reducing the number of os.urandom() bytes consumed is of no value whatsoever, so this report should be rejected.

If you want random initialization to stop using os.urandom() entirely, that argument belongs in one of the other bug reports - trying to hijack this _different_ report just adds to the noise.

I haven't seen an actual reason advanced to believe that reducing the number of bytes would help anything either, so I'd also like to see this closed.

Note that, today, Guido suggested that Python's os.urandom() should grow a flag to specify the desired behavior (block; raise an exception; just keep going and maybe get lower-quality bytes).  In that case, the default random initialization would surely pass the "just keep going" flag.  In which case, I've still seen no reason to expect that reducing the number of bytes requested would help anything.
History
Date User Action Args
2016-06-13 14:47:44rhettingersetstatus: open -> closed
resolution: not a bug
2016-06-12 11:34:13christian.heimessetnosy: - christian.heimes
2016-06-11 16:30:16tim.peterssetmessages: + msg268226
2016-06-11 08:05:23christian.heimessetmessages: + msg268199
2016-06-11 06:25:01martin.pantersetmessages: + msg268192
2016-06-10 22:07:40rhettingersetmessages: + msg268163
2016-06-10 18:15:45tim.peterssetmessages: + msg268137
2016-06-10 17:02:00rhettingersetassignee: rhettinger

messages: + msg268129
nosy: + rhettinger
2016-06-09 18:43:47tim.peterssetmessages: + msg268051
2016-06-09 18:05:09dstufftsetmessages: + msg268047
2016-06-09 18:04:12tim.peterssetmessages: + msg268046
2016-06-09 17:08:56dstufftsetmessages: + msg268040
2016-06-09 16:44:47tim.peterssetmessages: + msg268036
2016-06-09 11:01:48dstufftsetmessages: + msg268016
2016-06-09 10:06:38vstinnersetmessages: + msg268013
2016-06-09 09:53:27christian.heimessetmessages: + msg268012
2016-06-09 07:32:45christian.heimessetmessages: + msg267967
2016-06-09 04:47:32tim.peterssetmessages: + msg267958
2016-06-09 04:05:03dstufftsetmessages: + msg267955
2016-06-09 03:20:48tim.peterssetnosy: + tim.peters
messages: + msg267952
2016-06-09 02:49:03martin.pantersetnosy: + martin.panter
messages: + msg267950
2016-06-08 22:23:44dstufftsetmessages: + msg267915
2016-06-08 22:18:12dstufftsetmessages: + msg267912
2016-06-08 22:16:57vstinnersetnosy: + dstufft
messages: + msg267910
2016-06-08 22:13:09vstinnersetmessages: + msg267909
2016-06-08 21:58:59vstinnersetmessages: + msg267906
2016-06-08 21:58:16vstinnercreate