Python 3.5 running on Linux kernel 3.17+ can block at startup or on importing the random module on getrandom() #71026

doko42 · 2016-04-24T19:04:14Z

BPO	26839
Nosy	@malemburg, @rhettinger, @doko42, @ncoghlan, @vstinner, @larryhastings, @matejcik, @ned-deily, @alex, @skrah, @vadmium, @ztane, @dstufft, @Lukasa, @tpetazzoni
Files
- nonblocking-getrandom.diff: Patch to py_getrandom to use nonblocking system call, and associated plumbing.
- getrandom-nonblocking-v2.patch: Patch random.c to use nonblocking getrandom()
- getrandom-nonblocking-v3.patch: Patch random.c to use nonblocking getrandom() (cleaned-up version).
- getrandom_nonblocking_v4.patch
- nonblocking_urandom_noraise.patch
- no-urandom-by-default.diff

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = <Date 2016-06-11.08:46:07.323>
created_at = <Date 2016-04-24.19:04:13.945>
labels = ['interpreter-core', 'type-bug', 'release-blocker']
title = 'Python 3.5 running on Linux kernel 3.17+ can block at startup or on importing the random module on getrandom()'
updated_at = <Date 2016-06-15.20:00:44.118>
user = 'https://github.com/doko42'

bugs.python.org fields:

activity = <Date 2016-06-15.20:00:44.118>
actor = 'vstinner'
assignee = 'none'
closed = True
closed_date = <Date 2016-06-11.08:46:07.323>
closer = 'larry'
components = ['Interpreter Core']
creation = <Date 2016-04-24.19:04:13.945>
creator = 'doko'
dependencies = []
files = ['42837', '42842', '43265', '43267', '43278', '43282']
hgrepos = []
issue_num = 26839
keywords = ['patch']
message_count = 172.0
messages = ['264121', '264122', '264126', '264258', '264265', '264267', '264270', '264271', '264284', '264289', '264292', '264303', '265427', '265430', '265452', '265477', '265481', '265485', '265496', '265500', '265549', '265555', '266216', '267455', '267504', '267511', '267537', '267539', '267546', '267550', '267554', '267571', '267608', '267609', '267610', '267611', '267612', '267614', '267616', '267617', '267621', '267623', '267624', '267625', '267626', '267627', '267628', '267629', '267630', '267631', '267632', '267633', '267634', '267635', '267636', '267637', '267638', '267640', '267642', '267643', '267644', '267645', '267648', '267650', '267654', '267656', '267660', '267661', '267663', '267664', '267665', '267666', '267667', '267668', '267669', '267670', '267671', '267672', '267673', '267674', '267675', '267676', '267677', '267678', '267679', '267680', '267681', '267682', '267684', '267685', '267686', '267687', '267688', '267689', '267690', '267693', '267694', '267695', '267696', '267699', '267705', '267707', '267709', '267710', '267711', '267712', '267715', '267716', '267718', '267720', '267721', '267723', '267725', '267726', '267728', '267729', '267730', '267731', '267735', '267737', '267739', '267740', '267741', '267745', '267746', '267749', '267750', '267751', '267752', '267803', '267804', '267805', '267806', '267807', '267808', '267809', '267810', '267811', '267812', '267813', '267815', '267816', '267817', '267818', '267819', '267823', '267825', '267831', '267836', '267837', '267846', '267850', '267853', '267855', '267856', '267857', '267863', '267873', '267887', '267890', '267893', '267897', '267898', '267913', '267914', '267939', '268018', '268201', '268591', '268593', '268627', '268629']
nosy_count = 19.0
nosy_names = ['lemburg', 'rhettinger', 'doko', 'ncoghlan', 'vstinner', 'larry', 'matejcik', 'ned.deily', 'alex', 'skrah', 'python-dev', 'martin.panter', 'ztane', 'dstufft', 'pitti', 'Lukasa', 'thomas-petazzoni', 'Colm Buckley', 'Theodore Tso']
pr_nums = []
priority = 'release blocker'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue26839'
versions = ['Python 3.5', 'Python 3.6']

doko42 · 2016-04-24T19:04:13Z

[forwarded from https://bugs.debian.org/822431]

This regression / change of behaviour was seen between 20160330 and 20160417 on the 3.5 branch. The only check-in which could affect this is the fix for issue bpo-26735.

3.5.1-11 = 20160330
3.5.1-12 = 20160417

Martin writes:
"""
I just debugged the adt-virt-qemu failure with python 3.5.1-11 and
tracked it down to python3.5 hanging for a long time when it gets
called before the kernel initializes its RNG (which can take a minute
in VMs which have low entropy sources).

With 3.5.1-10:

  $ strace -e getrandom python3 -c 'True'
  +++ exited with 0 +++

With -11:
  $ strace -e getrandom python3 -c 'True'
  getrandom("\300\0209\26&v\232\264\325\217\322\303:]\30\212Q\314\244\257t%\206\"", 24, 0) = 24
  +++ exited with 0 +++

When you do this with -11 right after booting a VM, the getrandom()
can block for a long time, until the kernel initializes its random
pool:

11:21:36.118034 getrandom("/V#\200^O*HD+D_\32\345\223M\205a\336/\36x\335\246", 24, 0) = 24
11:21:57.939999 ioctl(0, TCGETS, 0x7ffde1d152a0) = -1 ENOTTY (Inappropriate ioctl for device)

[ 1.549882] [TTM] Initializing DMA pool allocator
[ 39.586483] random: nonblocking pool is initialized

(Note the time stamps in the strace in the first paragraph)

This is really unfriendly -- it essentially means that you stop being
able to use python3 early in the boot process or even early after
booting. It would be better to initialize that random stuff lazily,
only when/if things actually need it.

In the diff between -10 and -11 I do see some getrandom() fixes to
supply the correct buffer size (but that should be irrelevant as in
-10 getrandom() wasn't called in the first place), and a new call
which should apply to Solaris only (#ifdef sun), so it's not entirely
clear where that comes from or how to work around it.

It's very likely that this is the same cause as for Debian bug #821877, but the
description of that is both completely different and also very vague,
so I file this separately for now.
"""

doko42 · 2016-04-24T19:06:20Z

other issues fixed between these dates:

- Issue bpo-26659: Make the builtin slice type support cycle collection.
- Issue bpo-26718: super.__init__ no longer leaks memory if called multiple
  times.  NOTE: A direct call of super.__init__ is not endorsed!
- Issue bpo-25339: PYTHONIOENCODING now has priority over locale in setting
  the error handler for stdin and stdout.
- Issue bpo-26717: Stop encoding Latin-1-ized WSGI paths with UTF-8.
- Issue bpo-26735: Fix :func:`os.urandom` on Solaris 11.3 and newer when
  reading more than 1,024 bytes: call ``getrandom()`` multiple times with
  a limit of 1024 bytes per call.
- Issue bpo-16329: Add .webm to mimetypes.types_map.
- Issue bpo-13952: Add .csv to mimetypes.types_map.
- Issue bpo-26709: Fixed Y2038 problem in loading binary PLists.
- Issue bpo-23735: Handle terminal resizing with Readline 6.3+ by installing
  our own SIGWINCH handler.
- Issue bpo-26586: In http.server, respond with "413 Request header fields too
  large" if there are too many header fields to parse, rather than killing
  the connection and raising an unhandled exception.
- Issue bpo-22854: Change BufferedReader.writable() and
  BufferedWriter.readable() to always return False.
- Issue bpo-6953: Rework the Readline module documentation to group related
  functions together, and add more details such as what underlying Readline
  functions and variables are accessed.

vstinner · 2016-04-24T19:37:04Z

Python 3 uses os.urandom() at startup to randomize the hash function. os.urandom() now uses the new Linux getrandom() function, which blocks until the Linux kernel has been fed enough entropy. It's a deliberate choice.

The workaround is simple: set the PYTHONHASHSEED environment variable to use a fixed seed. For example, PYTHONHASHSEED=0 disables hash randomization.
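The workaround can be checked from a script. The following is a minimal sketch, not part of any patch on this issue: the helper name `hash_in_child` is hypothetical, and it only demonstrates that a child interpreter started with PYTHONHASHSEED set produces a reproducible string hash.

```python
import os
import subprocess
import sys


def hash_in_child(text, seed):
    """Return hash(text) computed by a fresh interpreter run with PYTHONHASHSEED=seed.

    hash_in_child is a hypothetical helper used only for this illustration.
    """
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
        universal_newlines=True,
    )
    return int(result.stdout)


if __name__ == "__main__":
    # With a fixed seed, two separate interpreter runs agree on the hash.
    print(hash_in_child("spam", 0) == hash_in_child("spam", 0))
```

With hash randomization left enabled, two such child processes would normally report different values for the same string.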

If you use virtualization and Linux is not fed enough entropy, you have security issues.

> I just debugged the adt-virt-qemu failure (...)

If you use qemu, you can use virtio-rng to provide good entropy to the VM from the host kernel.

vstinner · 2016-04-26T11:47:19Z

See also the issue bpo-25420 which is similar but specific to "import random".

vstinner · 2016-04-26T12:11:48Z

The issue bpo-25420 has been closed as a duplicate of this issue.

Copy of the latest message:

msg264262 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2016-04-26 12:05

I still believe the underlying system API use should be fixed rather than all the different instances where it gets used.

getrandom() should not block. If it does on a platform, that's a bug on that platform and Python should revert to the alternative of using /dev/urandom directly (or whatever other source of randomness is available).

Disabling hash randomization is not a good workaround for the issue, since it will definitely pop up in other areas as well.

skrah · 2016-04-26T12:22:23Z

Hmm. Why does os.urandom(), which should explicitly not block, use a blocking getrandom() function?

This is quite unexpected on Linux.

skrah · 2016-04-26T12:35:11Z

Wow, it's by design:

" os.urandom(n)

Return a string of n random bytes suitable for cryptographic use."

``man urandom'':

"A read from the /dev/urandom device will not block waiting for more entropy. As a result, if there is
not sufficient entropy in the entropy pool, the returned values are theoretically vulnerable to a
cryptographic attack on the algorithms used by the driver."

vstinner · 2016-04-26T12:37:32Z

"Hmm. Why does os.urandom(), which should explicitly not block, use a blocking getrandom() function? This is quite unexpected on Linux."

I modified os.urandom() in the issue bpo-22181 to use the new getrandom() syscall of Linux 3.17. The syscall blocks until the Linux kernel entropy pool is *initialized* with enough entropy. On a healthy system, this should never happen.

To be clear: you can read 10 MB (or 1 GB or more) of random data using os.urandom() even if the entropy pool is empty. You can test:

In a terminal 1, run "dd if=/dev/random of=random" to ensure that the entropy pool is empty
In a terminal 2, run "while true; do cat /proc/sys/kernel/random/entropy_avail; sleep 1; done" to see that the entropy pool is empty (or very low, like less than 100 bytes)
In a terminal 3, get a lot of random data using os.urandom(): ./python -c 'import os; x=os.urandom(1024*1024*10)'

=> it works, you *can* get 10 MB of random data even if the kernel entropy pool is empty.
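The same check can be scripted. This is a rough sketch, assuming a Linux system that exposes the entropy_avail file; the helper names `read_entropy_avail` and `read_urandom` are made up for the example, and on other systems the estimate is simply reported as None.

```python
import os

# Linux-only path, the same file polled in terminal 2 above.
ENTROPY_AVAIL = "/proc/sys/kernel/random/entropy_avail"


def read_entropy_avail():
    """Return the kernel's current entropy estimate, or None if unavailable."""
    try:
        with open(ENTROPY_AVAIL) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def read_urandom(size=10 * 1024 * 1024):
    """Read `size` bytes from os.urandom() and report the estimate before and after."""
    before = read_entropy_avail()
    data = os.urandom(size)
    after = read_entropy_avail()
    return before, len(data), after


if __name__ == "__main__":
    before, length, after = read_urandom()
    print("entropy_avail before:", before)
    print("bytes read:", length)
    print("entropy_avail after:", after)
```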

Reminder: getrandom() is used to avoid a file descriptor which caused various issues (see issue bpo-22181 for more information).

Ok, now this issue. The lack of entropy is a known problem in virtual machines. It's common that SSH, HTTPS, or other operations block because of the lack of entropy. On bare metal, the Linux entropy pool is fed by physical events like interrupts, keyboard strokes, mouse moves, etc. On a virtual machine, there is *no* source of entropy.

The problem is not only known but also solved, at least for qemu: you must attach a virtio-rng device to your virtual machine. See for example https://fedoraproject.org/wiki/Features/Virtio_RNG The VM can now read fresh and good quality entropy from the host.

To come back to Python: getrandom() syscall only blocks until the entropy pool is *initialized* with enough entropy.

The getrandom() syscall has a GRND_NONBLOCK flag to fail with EAGAIN if reading from /dev/random (not /dev/urandom) would block because the entropy pool does not have enough entropy:
http://man7.org/linux/man-pages/man2/getrandom.2.html

IMHO it's a deliberate choice to block in getrandom() when reading /dev/urandom while the entropy pool is not initialized with enough entropy yet.

Ok, now the question is: should Python do nothing to support badly configured VMs (with no real source of entropy)?

It looks like the obvious change is to not use getrandom() but revert code to use a file descriptor and read from /dev/urandom. We will get bad entropy, but Python will be able to start.

I am not excited by this idea. The os.urandom() private file descriptor caused other kinds of issues and bad quality entropy is also an issue.

skrah · 2016-04-26T13:44:11Z

It is clear how /dev/urandom works. I just think that securing enough
entropy on startup should be done by the init scripts (if systemd still
allows that :) and not by an application.

[Unless the application is gpg or similar.]

vstinner · 2016-04-26T14:14:15Z

For many years, Linux systems have stored entropy on disk to quickly feed the
entropy pool at startup.

It doesn't magically create entropy on a VM where you start with zero entropy
at the first boot.

skrah · 2016-04-26T14:31:57Z

I did not claim that it magically creates entropy. -- Many VMs are throwaway test beds. It would be annoying to set up some entropy
gathering mechanism just so that Python starts.

malemburg · 2016-04-26T15:11:15Z

As mentioned on the other issue bpo-25420, this is a regression and a change in documented behavior of os.urandom(), which is expected to be non-blocking, regardless of whether entropy is available or not.

The fix should be easy (from reading the man page http://man7.org/linux/man-pages/man2/getrandom.2.html): set the GRND_NONBLOCK flag on getrandom(); then, if the function returns -1 and sets EAGAIN, fall back to reading from /dev/urandom directly.
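The proposed logic can be sketched at the Python level. This is only an illustration of the idea, not the C change to Python/random.c discussed on this issue: `urandom_nonblocking` is a hypothetical name, and it relies on `os.getrandom()` and `os.GRND_NONBLOCK`, which the `os` module exposes on Linux from Python 3.6 onward. Where they are missing, the sketch goes straight to /dev/urandom.

```python
import errno
import os


def urandom_nonblocking(n):
    """Return n random bytes without waiting for the kernel RNG to initialize.

    Hypothetical helper: try getrandom() with GRND_NONBLOCK first, and fall
    back to reading /dev/urandom if the call would block or is unsupported.
    """
    getrandom = getattr(os, "getrandom", None)
    nonblock = getattr(os, "GRND_NONBLOCK", None)
    if getrandom is not None and nonblock is not None:
        chunks = []
        remaining = n
        try:
            while remaining > 0:
                chunk = getrandom(remaining, nonblock)
                chunks.append(chunk)
                remaining -= len(chunk)
            return b"".join(chunks)
        except OSError as exc:
            # EAGAIN: entropy pool not initialized yet. ENOSYS: old kernel.
            if exc.errno not in (errno.EAGAIN, errno.ENOSYS):
                raise
    # Fallback: /dev/urandom never blocks on Linux, even before initialization.
    with open("/dev/urandom", "rb", buffering=0) as f:
        data = b""
        while len(data) < n:
            data += f.read(n - len(data))
        return data


if __name__ == "__main__":
    print(len(urandom_nonblocking(24)))
```

The trade-off is the one debated in this thread: the fallback path keeps the caller from blocking, at the cost of possibly returning low-quality bytes very early in boot.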

ColmBuckley · 2016-05-12T21:18:22Z

It's worth noting that this issue now affects every installation of Debian testing track with systemd and systemd-cron installed; the python program /lib/systemd/system-generators/systemd-crontab-generator is called very early in the boot process; it imports hashlib (although only .md5() is used) and blocks on getrandom(), delaying boot time until a 90s timeout has occurred.

Suggestions: modify hashlib to avoid calling getrandom() until entropy is actually required, rather than on import; change the logic to use /dev/urandom (or an in-process PRNG) when getrandom() blocks; or both.

ColmBuckley · 2016-05-12T21:48:02Z

Oh; it's not actually hashlib which is calling getrandom(), it's the main runtime - the initialization of the per-process secret hash seed in _PyRandom_Init

Don't know enough about the internal logic here to comment on what the Right Thing is; but I second the suggestion of msg264303. This might just require setting "flags" to GRND_NONBLOCK in py_getrandom() assuming that's portable to other OS.

ColmBuckley · 2016-05-13T08:18:43Z

The attached patch (against 20160330) addresses the issue for me on Linux; it has not been tested on other platforms. It adds the GRND_NONBLOCK flag to the getrandom() call and sends the appropriate failure return if it returns due to lack of entropy. The enclosing functions fall back to reading from /dev/urandom in this case.

Affected files:

Python/random.c - changes to py_getrandom()
configure.ac and pyconfig.h.in - look for linux/random.h for inclusion

Can this, or something similar, be considered for integration with mainline?

ColmBuckley · 2016-05-13T14:39:59Z

A couple of things to note:

Despite the earlier title; this does not just apply to VMs; any system with a potentially-blocking getrandom() (including all Linux 3.17+ and Solaris 11+) is affected.
It's true that getrandom() only blocks on Linux when called before the RNG entropy pool is initialized. However, Python should not be limited to only being called after this initialization.
In particular, systemd-cron relies on a Python script being called very early in the boot process (before the urandom pool is initialized), this is now prevalent on the Debian testing track; causing a 90-second boot delay.
The patch I supplied causes getrandom() to be only called in nonblocking mode; this seems consistent with the desired semantics of os.urandom and _PyRandom_Init.

Hope this helps.

Colm

ColmBuckley · 2016-05-13T16:31:41Z

@Haypo - yes, it's strange that Linux's getrandom() might block even when reading the urandom pool. However, I think we need to just cope with this and add the GRND_NONBLOCK flag rather than attempting to force a change in the Linux kernel.

ColmBuckley · 2016-05-13T19:35:13Z

See https://lwn.net/Articles/606141/ for an explanation of the blocking behavior of getrandom(). This makes sense to me - before the pool has initialized, /dev/urandom will be readable but will return highly predictable data, i.e. it should not be considered safe. In other words, I think that getrandom() offers a sensible API.

The only circumstances where we hit the EAGAIN in getrandom() should be when it's called extremely early in the boot process (as is the case for the systemd-cron generator script I mentioned earlier). I think this is safe enough; a more thorough approach would be to flag that the per-process hash seed (_Py_HashSecret) is predictable and shouldn't be used.

vstinner · 2016-05-13T23:24:10Z

Please elaborate the comment in the patch:

- explain that the RNG is not initialized yet with enough entropy
- add a reference to this issue
- explain that it's a deliberate choice to use weak (non initialized) RNG
  for practical reasons

ColmBuckley · 2016-05-14T01:51:26Z

@Haypo - new version of patch attached with comments and references per your request.

vstinner · 2016-05-14T22:49:20Z

getrandom-nonblocking-v2.patch:

+ /* Alternative might be to return all-zeroes as a strong
+ * signal that these are not random data. */

I don't understand why you propose that in a comment of your change. I don't recall that this idea was proposed or discussed here.

IMHO it's a very bad idea to fill the buffer with zeros, the caller simply has no idea how to check the quality of the entropy. A buffer filled with zeros is "possible" even with high quality RNG, but it's really very very rare :-)

If you consider that a strong signal is required, you must raise an exception. But it looks like users don't care about the quality of the RNG, they request that Python "just works".

ColmBuckley · 2016-05-14T23:09:29Z

@Haypo - yes, I think you're right. Can you delete those two lines (or I can upload another version if you prefer).

I think the pragmatic thing here is to proceed by reading /dev/urandom (as we've discussed). It's not safe to raise an exception in py_getrandom from what I can see; a thorough effort to signal the lack of randomness to outer functions needs more code examination than I have time to carry out at the moment.

From looking at when _PyRandom_Init is called and how the hash secret is used, I think it is safe to proceed with /dev/urandom. The general understanding is that urandom has a lower entropy quotient than random, so it's hopefully not going to be used in strong crypto contexts.

ColmBuckley · 2016-05-24T02:04:21Z

@Haypo - just wondering where things stand with this? Is this patch going to get pushed to the mainline?

ned-deily · 2016-06-05T18:43:08Z

Since 3.5.2 is almost upon us, I'm setting this to "release blocker" status so we can make a decision about whether this should be changed for 3.5.2 or not. @Haypo, do you have an opinion about the patch?

tiran · 2016-06-08T09:46:41Z

On 2016-06-08 11:39, STINNER Victor wrote:

> no-urandom-by-default.diff uses a very weak source of entropy for random.Random :-( I've been fighting against weak sources of entropy for many years...

It is totally fine to init the MT of random.random() from a weak entropy
source. Just keep in mind that a Mersenne Twister is not a CPRNG. There
is simply no reason why you would want to init a MT from a CPRNG.
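To make the point concrete, here is a small sketch of seeding a Mersenne Twister instance from cheap, non-cryptographic data. The function name `weakly_seeded_rng` and the choice of clock plus process ID are assumptions made for illustration, not what no-urandom-by-default.diff does. Note also that it only covers the explicitly seeded instance; the hidden module-level generator that `import random` creates is seeded separately.

```python
import os
import random
import time


def weakly_seeded_rng():
    """Build a random.Random (Mersenne Twister) from a weak, non-blocking seed.

    Hypothetical helper: mixes the current time with the process ID. Suitable
    for simulations and shuffling, never for keys or tokens.
    """
    if hasattr(time, "time_ns"):
        now_ns = time.time_ns()
    else:
        now_ns = int(time.time() * 1e9)
    seed = now_ns ^ (os.getpid() << 32)
    # Passing an explicit seed means this instance does not ask the OS for entropy.
    return random.Random(seed)


if __name__ == "__main__":
    rng = weakly_seeded_rng()
    print(rng.random())
```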

dstufft · 2016-06-08T10:23:58Z

Larry,

I would greatly prefer it if we would allow os.urandom to block on Linux, at least by default. This will make os.urandom behave similarly on most modern platforms. The cases where this is going to matter are extreme edge cases, for most users they'll just silently be a bit more secure-- important for a number of use cases of Python (think for instance, if someone has a SSH server written in Twisted that generates its own host keys, a perfectly reasonable use of os.urandom). We've been telling people that os.urandom is the right source for generating randomness for cryptographic use for ages, and I think it is important to use the tools provided to us by the platform to best satisfy that use case by default-- in this case, getrandom() in blocking mode is the best tool provided by the Linux platform.

People writing Python code cannot expect that os.urandom will not block, because on most platforms it *will* block on initialization. However, the cases where it will block are a fairly small window, so by allowing it to block we're giving a better guarantee for very little downside-- essentially that something early on in the boot process shouldn't call os.urandom(), which is the right behavior on Linux (and any other OS) anyways.

The problem is that the Python interpreter itself (essentially) calls os.urandom() as part of its start up sequence which makes it unsuitable for use in very early stage boot programs. In the abstract, it's not possible to fix this on every single platform without removing all use of os.urandom from Python start up (which I think would be a bad idea). I think Colm's nonblocking_urandom_noraise.patch is a reasonable trade off (perhaps not the one I would personally make, but I think it's reasonable). If we wish to ensure that Python interpreter start up never blocks on Linux without needing to supply any command line flags or environment variables, then I would strongly urge us to adopt his patch, but allow os.urandom to still block.

In other words, please let's not let systemd's design weaken the security guarantees of os.urandom (generate cryptographically secure random bytes using the best tools provided by the platform). Let's make a targeted fix.

malemburg · 2016-06-08T11:34:41Z

Even though it may sound like a minor problem that os.urandom()
blocks on startup, I think the problem is getting more and more
something to consider given how systems are used nowadays.

Today, we no longer have the case where you keep a system up
and running for multiple years as we had in the past. VM,
containers and other virtualizations are spun up and down at
a high rate, so the boot cycle becomes more and more important.

FreeBSD, for example, is also concerned about the blocking issue
they have in their implementation:

https://wiki.freebsd.org/201308DevSummit/Security/DevRandom

and they are trying to resolve this by making sure to add as
much entropy they can find very early on in the process.

Now, most applications you run early on in the boot process
are not going to be applications that need crypto random
numbers and this is where I think the problem originates.

We've been telling everybody to use os.urandom() for seeding,
and so everyone uses it, including many many applications that
don't even require crypto random seeding.

The random module is the perfect example.

Essentially, we'd need to educate people that there's a difference
in requesting crypto random data and pseudo random data.

While we can fix the cases in the stdlib and
the interpreter that don't need crypto random data to use
other means of seeding (e.g. reading straight from /dev/urandom
on Linux or gathering other data to mix into a seed), existing
applications out there will continue to use os.urandom() for
things that don't need crypto random numbers - after all, we told
them to use it.

Some of these will eventually be hit by the blocking problem,
even for applications such as Monte Carlo simulations that
don't need crypto random and should thus not have to wait for
some entropy pool to get initialized.

Now, applications that do need crypto random data should be
able to request this from Python via the stdlib and os.urandom()
may sound like a good basis, but since this is designed as
interface to /dev/urandom, it doesn't block on Linux, so
not such a good choice.

Using /dev/random probably doesn't work either, because this can
block unexpectedly even after initialization.

IMO, the best way forward and to educate application writers
about the problems is to introduce two new APIs in 3.6:

os.cryptorandom() for getting OS crypto random data
os.pseudorandom() for getting OS pseudo random data

Crypto applications will then clearly know that
os.cryptorandom() is the right choice for them and
everyone else can use os.pseudorandom().

The APIs would on Linux and other platforms then use getrandom()
with appropriate default settings, i.e. blocking or raising
for os.cryptorandom() and non-blocking, non-raising for
os.pseudorandom().
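The split proposed here might look roughly like the sketch below. Neither function exists in the `os` module; both names come from this proposal and are written as plain module-level helpers. The sketch assumes `os.getrandom()` is available (Linux, Python 3.6+) and otherwise falls back to `os.urandom()` or /dev/urandom. It is meant for small requests: the getrandom(2) man page only guarantees complete results for reads of up to 256 bytes.

```python
import os


def cryptorandom(n, block=True):
    """Sketch of the proposed os.cryptorandom(): block, or raise if block=False.

    Hypothetical function. With block=False, BlockingIOError is raised while
    the kernel RNG is not yet initialized.
    """
    if hasattr(os, "getrandom"):
        flags = 0 if block else os.GRND_NONBLOCK
        return os.getrandom(n, flags)
    # Assumption: on platforms without getrandom(), os.urandom() is the best source.
    return os.urandom(n)


def pseudorandom(n):
    """Sketch of the proposed os.pseudorandom(): never block for lack of entropy.

    Hypothetical function. Falls back to /dev/urandom, which may return
    predictable bytes very early in boot.
    """
    if hasattr(os, "getrandom"):
        try:
            return os.getrandom(n, os.GRND_NONBLOCK)
        except OSError:
            pass  # not initialized yet (EAGAIN) or syscall unavailable
    with open("/dev/urandom", "rb") as f:
        return f.read(n)


if __name__ == "__main__":
    print(len(cryptorandom(16)), len(pseudorandom(16)))
```

The design point is that the caller states its requirement in the function name, so the library can pick blocking or non-blocking behavior without extra flags.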

As for solving the current issue, we will have to
give people some way to get at non-blocking pseudo random data,
if they need it early in the boot process. With the
proposed change, this is still possible via reading
/dev/urandom directly on Linux, so not everything is
lost.

BTW: Wikipedia has a good overview of how the different
implementations of /dev/random work across platforms:

https://en.wikipedia.org/wiki//dev/random

tiran · 2016-06-08T11:49:56Z

I'm unsubscribing from this ticket for the second time. This form of discussion is becoming toxic for me because I strongly believe that it is not the correct way to handle a security-related problem.

vstinner · 2016-06-08T13:35:53Z

Donald: "Cory wasn't speaking about (non)blocking in general, but the case where (apparently) it's desired to not block even if that means you don't get cryptographically secure random in the CPython interpreter start up. (...)"

Oh sorry, I misunderstood his message.

vstinner · 2016-06-08T13:44:38Z

Cory Benfield (msg267637): "if the purpose of this patch was to prevent long startup delays, *it failed*. On all the systems above os.urandom may continue to block system startup."

I don't claim to fix the issue on all operating systems. As stated in the issue title, this issue is specific to Linux. I understand that the bug still exists on other platforms and is likely to require a specific fix for each platform.

TheodoreTso · 2016-06-08T13:58:16Z

One of the reasons why trying to deal with randomness is hard is because a lot of it is about trust. Did Intel backdoor RDRAND to help out the NSA? You might have one answer if you work for the NSA, and perhaps if you are willing to assume the worst about the NSA balancing its equities between its signals intelligence mission and providing a secure infrastructure for its customers and keeping the US computing industry strong. Etc., etc.

It is true that OS developers are trying to make their random number generators be initialized more quickly at boot time. Part of this is because of the dynamic which we can all see at work on the discussion of this bug. Some people care very much about not blocking; some people want Python to be useful during the boot sequences; some people care very much about security above all else; some people don't trust application programmers. (And if you fit in that camp; congratulations, now you know how I often feel when I worry about user space programmers doing potentially crazy things and I have no way of even knowing about them until the security researchers publish a web site such as http://www.factorable.net)

From the OS's perspective, one of the problems is that it's very hard to know when you have actually achieved a securely initialized random number generator. Sure, we can say we've done this once we have accumulated at least 128 bits of entropy, but that begs the question of when you've collected a bit of entropy. There's no way to know for sure. On current systems, we assume that each interrupt gathers 1/64th of a bit of entropy on average. This is an incredibly conservative number, and on real hardware, assuming the normal bootup activity, we achieve that within about 5 seconds (plus/minus 2 seconds) after boot. On Intel, on real hardware, I'm comfortable cutting this to 1 bit of entropy per interrupt, which will speed up things considerably. In an ARM SOC, or if you are on a VM and you don't trust the hypervisor so you don't use virtio-rng, is one bit of entropy per interrupt going to be good enough? It's hard to say.
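The arithmetic behind those figures is easy to work out. The snippet below is only a back-of-the-envelope calculation using the numbers quoted in the paragraph above (a 128-bit threshold and a credit of either 1/64 bit or 1 bit per interrupt); the function name is made up, and nothing here is taken from kernel source.

```python
import math
from fractions import Fraction

TARGET_BITS = 128  # threshold quoted in the comment above


def interrupts_needed(bits_per_interrupt, target_bits=TARGET_BITS):
    """Number of interrupts needed to credit target_bits of entropy."""
    return math.ceil(Fraction(target_bits) / Fraction(bits_per_interrupt))


if __name__ == "__main__":
    print(interrupts_needed(Fraction(1, 64)))  # conservative credit: 8192 interrupts
    print(interrupts_needed(1))                # 1 bit per interrupt: 128 interrupts
```

So moving from 1/64 bit to 1 bit per interrupt cuts the number of interrupts required by a factor of 64.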

On the other hand, if we use too conservative a number, there is a risk that userspace programmers (as some have advocated in the discussion on this bug) will simply always use GRND_NONBLOCK, or fall back to /dev/urandom, and then if there's a security exposure, they'll cast the blame on the OS developers. The reality is that we really need to work together, because the real problem is the clueless people writing python scripts at boot time to create long-term RSA private keys for IOT devices[1]. :-)

So when people assert that developers at FreeBSD are at work trying to speed up /dev/random initialization, folks need to understand that there's no magic here. What's really happening is that we're all trying to figure out which defaults work the best. In some ways the FreeBSD folks have it easier, because they support a much narrower range of platforms. It's a lot easier to get things right on x86, where we have instructions like RDTSC and RDRAND to help us out. It's a lot harder to be sure you have things right for ARM SOC's. There are other techniques such as trying to carry entropy over from previous boot sessions, but (a) this requires support from the boot loaders, and on an OS with a large number of architectures, that means adding support to a large number of different ways of booting the kernel --- and (b) it doesn't solve the "consumer device generating keys after a cold start when the device is freshly removed from the packaging" problem.

As far as adding knobs, such as "blocking vs non-blocking", etc., keep in mind that as you add knobs, you increase the knowledge of the system that you force onto the next layer of the stack. So this goes to the question of whether you trust application programmers will be able to get things right.

So Ted, why does Linux expose /dev/random vs /dev/urandom? Historical reasons; some people don't believe that relying on cryptographic random number generators is sufficient, they *want* to use entropy which has minimal reliance on the belief that NSA ***probably*** didn't leave a back door into SHA-1, for example. It is for that reason that /dev/random exists. These days, the number of people who believe that to be true is very small, but I didn't want to make changes in existing interfaces. For similar reasons I didn't want to suddenly make /dev/urandom block. The fact that getrandom(2) blocks only until the cryptographic RNG has been initialized, and that it depends on a cryptographic RNG, is the consensus that *most* people have come to, and it reflects my recommendations that unless you ***really*** know what you are doing, the right thing to do is to call getrandom(2) with the flags field set to zero, and to be happy. Of course, many more people are sure they know what they need to do than there are people who really *do* know what they are doing, which is why in BSD, they simply don't give people a choice with their getentropy(2) system call. If you assume that application/user-space programmers should never be trusted, and API's should come with a strong point of view, that's a reasonable design choice. At some level this is the same choice which is before the Python developer community. I'm not going to presume to tell you what the right thing to do is here, because it's filled with engineering and design tradeoffs. Hopefully this additional perspective is useful, though.

[1] This is a joke, folks. We need to all work together, even the application programmers. Some may say that means we're doomed from a security perspective, but security really has to be a collective responsibility if we don't want our "home of the future" to be completely pwned by the bad guys.....

TheodoreTso · 2016-06-08T14:21:34Z

Oh --- and about people wondering whether os.urandom is being used for cryptographic purposes "most of the time" or not --- again, welcome to my world. I get complaints all the time from people who try to do "dd if=/dev/urandom of=/dev/hdX bs=4k" and then complain this is too slow.

Creating an os.cryptorandom and os.pseudorandom might be a useful way to go here. I've often considered whether I should create a /dev/frandom for the crazies who want to use dd as a way to wipe a disk, but to date I haven't thought it was worth the effort, and I didn't want to encourage them. Besides, isn't the obviously right answer to create a quickie python script? :-)

Splitting os.random does beg the question of what os.random should do, however. If you go down that path, I'd suggest defaulting to the secure-but-slow choice.

I'd also suggest assuming it's OK to put the onus on the people who are trying to run python scripts during early boot to have to either add some command flags to the python interpreter, or to otherwise make adjustments, as being completely fair. But again, that's my bias, and if people don't want to deal with trying to ask the systemd folks to make a change in their code, I'd _completely_ understand.

My design preference is that outside of boot scripts, having os.random block in the name of security is completely fair, since in that case you won't deadlock the system. People of good will may disagree, of course, and I'm not on the Python development team, so take that with whatever grain of salt you wish. At the end of the day, this is all about tradeoffs, and you know your customer/developer base better than I do.

Cheers!

vstinner · 2016-06-08T14:27:06Z

> The current behavior is that Python *will not start at all* if getrandom() blocks (because the hash secret initialization fails).
> It starts just fine, it just can possibly take a while.

In my experience, connecting to a VM using SSH with low entropy can take longer than 1 minute. As a user, I considered that the host was down. Longer than 1 minute is simply too long.

It's unclear to me if getrandom() can succeed (return random bytes) on embedded devices without hardware RNG. Can it take longer than 1 minute?

Is it possible that getrandom() simply blocks forever?

ColmBuckley · 2016-06-08T14:30:46Z

Victor -

Yes, it is possible for it to block forever - see the test I proposed for Ted upthread. The triggering case (systemd-crontab-generator) delays for 90 seconds, but is *killed* by systemd after that time; it doesn't itself time out.

Colm

ColmBuckley · 2016-06-08T15:23:32Z

Just to re-state; I think we have three problems:

1. _Py_HashSecret initialization blocking. Affects all Python invocations; already a substantial issue on Debian testing track (90s startup delay).

   - There seems to be general agreement that this does not need a 'strong' secret in a script called at/near startup.
   - On Linux, getrandom(GRND_NONBLOCK) *or* /dev/urandom are sufficient for this initialization.
   - On other OS, we don't have a non-blocking kernel PRNG; this is probably not an issue for Solaris or OS X, and only a possible issue for OpenBSD.
   - Is it acceptable to fall back to an in-process seed generation for the cases where initialization via /dev/urandom fails? (NB: there have been no reports of this type of failure in the wild.)
   - Existing tip, with or without nonblocking_urandom_noraise.patch, addresses this for Linux. Solution for other OS remains to be written.
   - Possibly can be considered a non-blocker for other OS, as there has been no recent regression in behavior.

2. Blocking on 'import random' and/or os.urandom. I don't see a clear consensus on the Right Thing for this case. Existing tip (without nonblocking_urandom_noraise.patch) addresses it for Linux, but the solution is not universally accepted. Unclear whether this is a 3.5.2 blocker.

3. Design of future APIs for >= 3.6. The most frequent suggestion is something like os.pseudorandom() (guaranteed nonblocking) and os.cryptorandom() (guaranteed entropy); I guess this needs to go to the dev list for full discussion - is it safely out of scope for this bug?

My suggestion (for what it's worth): accept Victor's changeset plus nonblocking_urandom_noraise for 3.5.2 (I'll submit a proper patch shortly), recommend userspace workarounds for the blocking urandom issue, propose new APIs for 3.6 on the dev list.
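
The fallback question raised under the first problem can be made concrete with a small sketch. This is not CPython's implementation (the real logic is C code behind _PyRandom_Init); the names `weak_inprocess_seed` and `init_hash_secret`, the 24-byte size, and the inputs mixed into the in-process seed are all assumptions chosen for illustration.

```python
import hashlib
import os
import time


def weak_inprocess_seed(size):
    """Derive a NON-cryptographic seed from process-local values (hypothetical)."""
    material = "|".join(
        str(value)
        for value in (time.time_ns(), time.perf_counter_ns(), os.getpid(), id(object()))
    ).encode("ascii")
    # shake_256 only stretches the material to the requested length;
    # hashing adds no entropy, so the result remains guessable.
    return hashlib.shake_256(material).digest(size)


def init_hash_secret(size=24, device="/dev/urandom"):
    """Return (secret, source); fall back in-process if the device is unusable."""
    try:
        with open(device, "rb", buffering=0) as dev:
            data = dev.read(size)
        if len(data) == size:
            return data, "kernel"
    except OSError:
        # Device missing or unreadable: fall through to the weak seed.
        pass
    return weak_inprocess_seed(size), "in-process"
```

A secret produced by the in-process path would only be suitable for something like hash randomization in an early-boot script, which is the case the thread agrees does not need a 'strong' secret.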

vstinner · 2016-06-08T16:52:11Z

I spent almost my whole day reading this issue, some related issues, and some more related links. WOW! Amazing discussion. Sorry that Christian decided to quit the discussion (twice) :-(

Here is my summary: http://haypo-notes.readthedocs.io/pep_random.html

tl;dr "The issue is to find a solution to not block Python startup on such case, and keep getrandom() enhancement for os.urandom()."

--

Status of Python 3.5.2: http://haypo-notes.readthedocs.io/pep_random.html#status-of-python-3-5-2

My summary: "With the changeset 9de508dc4837: Python doesn’t block at startup anymore (issues bpo-25420 and bpo-26839 are fixed) and os.urandom() is as secure as Python 2.7, Python 3.4 and any application reading /dev/urandom."

=> STOP! don't touch anything, it's now fine ;-) (but maybe follow my link for more information)

--

To *enhance* os.urandom() so that it always uses the getrandom() syscall on Linux, I opened the issue bpo-27266. I changed the title to "Always use getrandom() in os.random() on Linux and add block=False parameter to os.urandom()" to make my intent more explicit.

As some of you have already noticed, it's not easy to implement this issue! There are technical difficulties in implementing os.urandom(block=False).

In fact, this issue tries to fix two different but close issues:

(a) Always use getrandom() for os.urandom() on Linux
(b) Implement os.urandom(block=False) on *all* platforms

The requirement for (a) is to not reopen the bug bpo-25420 (block on "import random"). dstufft proposed no-urandom-by-default.diff (attached to this issue), but IMHO it makes the random module worse than before. I proposed (b) as the correct fix. It's a work-in-progress, please join issue bpo-27266 to help me!

--

Please contact me if you want to fix/enhance my doc http://haypo-notes.readthedocs.io/pep_random.html

Right now, I'm not interested in converting this summary to a real PEP. It looks like you agree on solutions. We should now invest our time in solutions rather than listing all the issues again ;-)

I know that it's really hard, but I suggest abandoning this issue (since, again, it's closed!), focusing on more specific issues, and working on fixing them. No? What do you think?

--

IMHO the problem in this discussion is that it started with a very well defined issue (Python blocks at startup on Debian Testing in a script started by systemd when running in a VM) and grew into a wide discussion about all RNGs, all kinds of issues related to RNGs, and a little bit about security in general.

pitti · 2016-06-08T20:35:24Z

> you could give some kind of command-line flag

That already exists -- set PYTHONHASHSEED=0.
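
A rough sketch of that workaround, written as a launcher that starts a child interpreter with PYTHONHASHSEED=0. The helper name `string_hash_in_child` is hypothetical. With the seed fixed, hash randomization is disabled, so string hashes are identical across processes; the trade-off is that the protection hash randomization offers against crafted hash collisions is lost for that process.

```python
import os
import subprocess
import sys


def string_hash_in_child(seed):
    """Run a child interpreter with PYTHONHASHSEED set; return its hash('abc')."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", "print(hash('abc'))"],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
    )
    return int(result.stdout.decode("ascii").strip())


# Two separate interpreters started with the same fixed seed agree on the
# hash of a string, which shows that no per-process random secret is in use.
first = string_hash_in_child(0)
second = string_hash_in_child(0)
print(first == second)
```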

> But I'll let someone else have the joys of negotiating with Lennart, and I won't blame the Python devs if using GRND_NONBLOCK unconditionally is less painful than having to work with the systemd folks.

In case it's of any relief: This has nothing to do with having to change anything in systemd itself -- none of the services that systemd ships use Python. The practical case where this bug appeared was cloud-init (which is written in Python), and that wants to run early in the boot sequence even before the network is up (so that tools like "pollinate" which gather entropy from the cloud host don't kick in yet). So if there's any change needed at all, it would be in cloud-init and similar services which run Python during early boot.

ColmBuckley · 2016-06-08T20:41:46Z

@pitti -

We already discussed this; there are cases where it's not practical to set an environment variable. The discussion eventually converged on "it is not desirable that Python should block on startup, regardless of system RNG status".

Re: the triggering bug; it was actually /lib/systemd/system-generators/systemd-crontab-generator (in systemd-cron) which caused the behavior to be noticed in Debian. It wasn't a change in systemd behavior, per se (that has been a Python script for some time), it was the fact that it was being called before the system PRNG had been initialized. With the change from /dev/urandom to getrandom() in 3.5.1, this caused a deadlock at boot.

larryhastings · 2016-06-08T20:49:31Z

I am increasingly convinced that I'm right.

--

First, consider that the functions in the os module, as a rule, are a thin shell over the equivalent function provided by the operating system. If Python exposes a function called os.XYZ(), and it calls the OS, then with few exceptions it does so by calling a function called XYZ().**

This has several ramifications, and these are effectively guarantees for the Python programmer:

You can read your local man pages (or equivalent) to see how the function behaves on your system. Python occasionally improves on the functionality provided; os.utime() provides a lot more functionality than POSIX utime. But it never *degrades* the functionality provided.
It's implied, and strongly preferred, that the function is atomic: it will make exactly one system call. I once proposed simulating behavior for an os module function using a series of system calls, and this approach was rejected because it wasn't atomic. So if you see a function os.XYZ(), you may predict that Python will call XYZ() exactly once, and with only a few exceptions you'll be right.

Now read this snippet of the documentation for os.urandom():

"The returned data should be unpredictable enough for cryptographic applications, though its exact quality depends on the OS implementation. On a Unix-like system this will query /dev/urandom, and on Windows it will use CryptGenRandom()."

That text has been in the documentation for os.urandom() since at least Python 2.6. (That's as old as we have on the web site; I didn't go hunting for older documentation.)

Thus the documentation for os.urandom():

explicitly says it uses /dev/urandom, and
explicitly *does not* guarantee cryptographic strength random numbers on all platforms at all times.

Thus, while it's laudable to try and give the user higher-quality random bits when they call os.urandom(), you cannot degrade the behavior of the system's /dev/urandom when doing so. On Linux /dev/urandom is *guaranteed* to never block. This guarantee is so strong, Mr. Ts'o had to add a separate facility to Linux (/dev/random) to permit blocking. os.urandom() *must* replicate this behavior.

What I'm proposing is that os.urandom() may use getrandom(GRND_NONBLOCK) to attempt to get higher-quality random bits, but it *must not block*. If it fails, it will use /dev/urandom, *exactly as it is documented to do*.

(Naturally this flunks the "atomic operation" test. But in the case of procuring random bits, the atomicity of its operation is obviously irrelevant.)
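
The proposal above can be sketched in pure Python, purely as an illustration; the actual change belongs in CPython's C code, and the function name `urandom_never_block` is an assumption. The sketch relies on os.getrandom() and os.GRND_NONBLOCK, which only exist on Linux with Python 3.6 or newer, so it checks for them at run time.

```python
import os


def urandom_never_block(size):
    """Try getrandom(GRND_NONBLOCK) first, then fall back to /dev/urandom."""
    getrandom = getattr(os, "getrandom", None)  # Linux only, Python 3.6+
    if getrandom is not None:
        try:
            data = getrandom(size, os.GRND_NONBLOCK)
            if len(data) == size:
                return data
            # Short read: use the device below rather than retrying here.
        except BlockingIOError:
            # The kernel RNG is not initialized yet; do not wait for it.
            pass
        except OSError:
            # For example ENOSYS on a kernel without the syscall.
            pass
    with open("/dev/urandom", "rb", buffering=0) as dev:
        chunks = []
        remaining = size
        while remaining > 0:
            chunk = dev.read(remaining)
            if not chunk:
                raise OSError("unexpected EOF on /dev/urandom")
            chunks.append(chunk)
            remaining -= len(chunk)
        return b"".join(chunks)
```

The fallback path may return lower-quality bytes early in boot; that is the documented /dev/urandom behavior the argument above wants to preserve.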

** The exception to this, naturally, is Windows. Internally the os module is called "posixmodule"--and this is no coincidence. AFAIK every platform supported by CPython is POSIX-based except Windows. The choice was made long ago to simulate POSIX behavior on Windows so as to present a consistent API to the programmer. If you're curious about this, and have the time, read the implementation of os.stat for Windows. What a rush!

--

Second, I invoke the "consenting adults" rule. Python provides well-documented behavior for os.urandom(). You cannot make assumptions about the use case of the caller and decide for them that they would prefer the function block in an unbounded fashion rather than provide low-quality random bits.

And yes, unbounded. As covered earlier in the thread, it only blocked for 90 seconds before systemd killed it. We don't know how long it would actually have blocked. This is completely unacceptable--for startup, for "import random", and for "os.urandom()" on Linux.

--

Third, because the os module is in general a thin wrapper over what the OS provides, I disapprove of "cryptorandom()" and "pseudorandom()" going into the os module. There are no functions with these names on any OS of which I'm aware. This is why I proposed "os.getrandom(n, block=True)". From its signature, the function it calls on your OS will be obvious, and its semantics on your OS will be documented by your OS.

Thus I am completely unwilling to add os.cryptorandom() and os.pseudorandom() in 3.5.2.
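
For illustration, the proposed signature can be approximated on top of what Python 3.6 later shipped as os.getrandom(size, flags=0). The wrapper name `getrandom_sketch` and its `block` parameter are hypothetical and mirror the proposal only; they are not part of the os module.

```python
import os


def getrandom_sketch(size, block=True):
    """Hypothetical wrapper mirroring the proposed os.getrandom(n, block=True)."""
    if not hasattr(os, "getrandom"):
        raise NotImplementedError("getrandom() is not exposed on this platform")
    # flags=0 waits until the kernel RNG is initialized; GRND_NONBLOCK
    # raises BlockingIOError instead of waiting.
    flags = 0 if block else os.GRND_NONBLOCK
    # Like the syscall, this may return fewer bytes than requested.
    return os.getrandom(size, flags)
```

A caller that must not stall during early boot would pass block=False and handle BlockingIOError itself, which keeps the policy decision with the caller rather than the library.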

malemburg · 2016-06-08T21:04:48Z

On 08.06.2016 22:49, Larry Hastings wrote:

> Third, because the os module is in general a thin wrapper over what the OS provides, I disapprove of "cryptorandom()" and "pseudorandom()" going into the os module. There are no functions with these names on any OS of which I'm aware. This is why I proposed "os.getrandom(n, block=True)". From its signature, the function it calls on your OS will be obvious, and its semantics on your OS will be documented by your OS.
>
> Thus I am completely unwilling to add os.cryptorandom() and os.pseudorandom() in 3.5.2.

That was a sketch for 3.6 to resolve the ambiguity between the
different use cases.

You're right, it's better to move such things to the random
module.

ColmBuckley · 2016-06-08T21:26:27Z

Larry -

Regardless of the behavior of os.urandom (and 'import random'), is it agreed that the current state of _PyRandom_Init is acceptable for 3.5.2?

The current behavior (as of 9de508dc4837) is that it will never block on Linux, but could still block on other OS if called before /dev/urandom is initialized. We have not determined a satisfactory solution for other operating systems. Note that no other OS have reported a problem 'in the wild', probably because of their extreme rarity in VM/container environments and the lack of Python in their early init sequence.

Colm

vstinner · 2016-06-08T22:19:26Z

I opened the issue bpo-27272: "random.Random should not read 2500 bytes from urandom".

vstinner · 2016-06-08T22:22:03Z

> The current behavior (as of 9de508dc4837) is that it will never block on Linux, but could still block on other OS if called before /dev/urandom is initialized.

In practice, only Linux is impacted. See the rationale:
https://haypo-notes.readthedocs.io/pep_random.html#scope-of-the-python-blocks-at-startup-issue

> We have not determined a satisfactory solution for other operating systems.

Stoooop. This issue is specific to Linux. If you want to fix the issue on other operating systems, please open a new issue.

Oh, you know what? I already opened such an issue :-) The issue bpo-27266 wants to fix the issue on all platforms, not only Linux. Open a second issue if you prefer.

larryhastings · 2016-06-09T00:07:01Z

> Regardless of the behavior of os.urandom (and 'import random'), is it agreed that the current state of _PyRandom_Init is acceptable for 3.5.2?

I'll get back to you with a specific yes or no. What I want is the removal of the behavior where "import random" can block unboundedly on Linux because it's waiting for the entropy pool to fill. If the code behaves like that, then yes, but I'm not giving it my official blessing until I read it.

larryhastings · 2016-06-09T11:26:39Z

I just posted to python-dev and asked Guido to make a BDFL ruling. I only represented my side, both because I worried I'd do a bad job of representing *cough* literally everybody else *cough*, and because it already took me so long to write the email. All of you who disagree with me, I'd appreciate it if you'd reply to my python-dev posting and state your point of view.

larryhastings · 2016-06-11T08:46:07Z

Colm Buckley: I've read the code, *and* stepped through it, and AFAICT it is no longer even possible for Python on Linux to call getrandom() in a blocking way. Thanks for doing this! I'm marking the issue as closed.

ncoghlan · 2016-06-14T23:35:06Z

One last fix needed to fully revert this is to remove the mention from the Python 3.5 What's New documentation: https://docs.python.org/3.5/whatsnew/3.5.html#os

vstinner · 2016-06-14T23:40:51Z

Nick Coghlan: "One last fix needed to fully revert this is to remove the mention from the Python 3.5 What's New documentation: https://docs.python.org/3.5/whatsnew/3.5.html#os"

This sentence?

"The urandom() function now uses the getrandom() syscall on Linux 3.17 or newer, and getentropy() on OpenBSD 5.6 and newer, removing the need to use /dev/urandom and avoiding failures due to potential file descriptor exhaustion."

Why remove it? It's still correct that getrandom() is used by os.urandom() in the common case.

The corner case (urandom entropy pool not initialized) is already documented (including a "Changed in version 3.5.2: ..."):
https://docs.python.org/3.5/library/os.html#os.urandom

I don't think that it's worth mentioning the corner case in What's New in Python 3.5.

ncoghlan · 2016-06-15T18:14:29Z

Sorry, with all the different proposals kicking around, I somehow got the impression we'd reverted entirely to just reading from /dev/urandom without ever using the new syscall.

Re-reviewing your patch, I agree the What's New comment is still accurate.

vstinner · 2016-06-15T20:00:44Z

> Re-reviewing your patch, I agree the What's New comment is still accurate.

Thanks for double checking ;-)

doko42 added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Apr 24, 2016

vstinner changed the title ~~python always calls getrandom() at start, causing long hang after boot~~ Python 3.5 running in a virtual machine blocks at startup or on importing the random module Apr 26, 2016

vstinner changed the title ~~Python 3.5 running in a virtual machine blocks at startup or on importing the random module~~ Python 3.5 running in a virtual machine with Linux kernel 3.17+ can block at startup or on importing the random module Apr 26, 2016

vstinner changed the title ~~Python 3.5 running in a virtual machine with Linux kernel 3.17+ can block at startup or on importing the random module~~ Python 3.5 running in a virtual machine with Linux kernel 3.17+ can block at startup or on importing the random module on getrandom() Apr 26, 2016

ColmBuckley mannequin added the type-bug An unexpected behavior, bug, or error label May 13, 2016

pitti mannequin changed the title ~~Python 3.5 running on Linux kernel 3.17+ can block at startup or on importing the random module on getrandom()~~ Python 3.5 running on Linux kernel 3.17+ can block at startup or on importing /arguinthe random module on getrandom() Jun 8, 2016

pppery mannequin changed the title ~~Python 3.5 running on Linux kernel 3.17+ can block at startup or on importing /arguinthe random module on getrandom()~~ Python 3.5 running on Linux kernel 3.17+ can block at startup or on importing the random module on getrandom() Jun 8, 2016

larryhastings closed this as completed Jun 11, 2016

ezio-melotti transferred this issue from another repository Apr 10, 2022
