Author izbyshev
Recipients gregory.p.smith, izbyshev, pablogsal, vstinner
Date 2019-01-25.02:03:25
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <>
This issue is to propose a (complementary) alternative to the usage of posix_spawn() in subprocess (see bpo-35537).

As mentioned by Victor Stinner in msg332236, posix_spawn() has the potential of being faster and safer than fork()/exec() approach. However, some of the currently available implementations of posix_spawn() have technical problems (this mostly summarizes discussions in bpo-35537):

* In glibc < 2.24 on Linux, posix_spawn() doesn't report errors to the parent properly, breaking existing subprocess behavior.

* In glibc >= 2.25 on Linux, posix_spawn() doesn't report errors to the parent in certain environments, such as QEMU user-mode emulation and Windows subsystem for Linux.

* In FreeBSD, as of this writing, posix_spawn() doesn't block signals in the child process, so a signal handler executed between vfork() and execve() may change memory shared with the parent [1].

Regardless of implementation, posix_spawn() is also unsuitable for some subprocess use cases:

* posix_spawnp() can't be used directly to implement file searching logic of subprocess because of different semantics, requiring workarounds.

* posix_spawn() has no standard way to specify the current working directory for the child.

* posix_spawn() has no way to close all file descriptors > 2 in the child, which is the *default* mode of operation of subprocess.Popen().

May be even more importantly, fundamentally, posix_spawn() will always be less flexible than fork()/exec() approach. Any additions will have to go through POSIX standardization or be unportable. Even if approved, a change will take years to get to actual users because of the requirement to update the C library, which may be more than a decade behind in enterprise Linux distros. This is in contrast to having an addition implemented in CPython. For example, a setrlimit() action for posix_spawn() is currently rejected in POSIX[2], despite being trivial to add.

I'm interested in avoiding posix_spawn() problems on Linux while still delivering comparable performance and safety. To that end I've studied implementations of posix_spawn() in glibc[3] and musl[4], which use vfork()/execve()-like approach, and investigated challenges of using vfork() safely on Linux (e.g. [5]) -- all of that for the purpose of using vfork()/exec() instead of fork()/exec() or posix_spawn() in subprocess where possible.

The unique property of vfork() is that the child shares the address space (including heap and stack) as well as thread-local storage with the parent, which means that the child must be very careful not to surprise the parent by changing the shared resources under its feet. The parent is suspended until the child performs execve(), _exit() or dies in any other way.

The most safe way to use vfork() is if one has access to the C library internals and can do the the following:

1) Disable thread cancellation before vfork() to ensure that the parent thread is not suddenly cancelled by another thread with pthread_cancel() while being in the middle of child creation.

2) Block all signals before vfork(). This ensures that no signal handlers are run in the child. But the signal mask is preserved by execve(), so the child must restore the original signal mask. To do that safely, it must reset dispositions of all non-ignored signals to the default, ensuring that no signal handlers are executed in the window between restoring the mask and execve().

Note that libc-internal signals should be blocked too, in particular, to avoid "setxid problem"[5].

3) Use a separate stack for the child via clone(CLONE_VM|CLONE_VFORK), which has exactly the same semantics as vfork(), but allows the caller to provide a separate stack. This way potential compiler bugs arising from the fact that vfork() returns twice to the same stack frame are avoided.

4) Call only async-signal-safe functions in the child.

In an application, only (1) and (4) can be done easily.

One can't disable internal libc signals for (2) without using syscall(), which requires knowledge of the kernel ABI for the particular architecture.

clone(CLONE_VM) can't be used at least before glibc 2.24 because it corrupts the glibc pid/tid cache in the parent process[6,7]. (As may be guessed, this problem was solved by glibc developers when they implemented posix_spawn() via clone()). Even now, the overall message seems to be that clone() is a low-level function not intended to be used by applications.

Even with the above, I still think that in context of subprocess/CPython the sufficient vfork()-safety requirements are provided by the following.

Despite being easy, (1) seems to be not necessary: CPython never uses pthread_cancel() internally, so Python code can't do that. A non-Python thread in an embedding app could try, but cancellation, in my knowledge, is not supported by CPython in any case (there is no way for an app to cleanup after the cancelled thread), so subprocess has no reason to care.

For (2), we don't have to worry about the internal signal used for thread cancellation because of the above. The only other internal signal is used for setxid syncronization[5]. The "setxid problem" is mitigated in Python because the spawning thread holds GIL, so Python code can't call os.setuid() concurrently. Again, a non-Python thread could, but I argue that an application that spawns a child and calls setuid() in non-synchronized manner is not worth supporting: a child will have "random" privileges depending on who wins the race, so this is hardly a good security practice. Even if such apps are considered worthy to support, we may limit vfork()/exec() path only to the non-embedded use case.

For (3), with production-quality compilers, using vfork() should be OK. Both GCC and Clang recognize it and handle in a special way (similar to setjmp(), which also has "returning twice" semantics). The supporting evidence is that Java has been using vfork() for ages, Go has migrated to vfork(), and, coincidentally, dotnet is doing it right now[8].

(4) is already done in _posixsubprocess on Linux.

I've implemented a simple proof-of-concept that uses vfork() in subprocess on Linux by default in all cases except if preexec_fn is not None. It passes all tests on OpenSUSE (Linux 4.15, glibc 2.27) and Ubuntu 14.04 (Linux 4.4, glibc 2.19), but triggers spurious GCC warnings, probably due to a long-standing GCC bug:

I've also run a variant of (by Victor Stinner from bpo-35537) with close_fds=False and restore_signals=False removed on OpenSUSE:

$ env/bin/python -m perf compare_to fork.json vfork.json
Mean +- std dev: [fork] 154 ms +- 18 ms -> [vfork] 1.23 ms +- 0.04 ms: 125.52x faster (-99%)

Compared to posix_spawn, the results on the same machine are similar:

$ env/bin/python -m perf compare_to posix_spawn.json vfork.json
Mean +- std dev: [posix_spawn] 1.24 ms +- 0.04 ms -> [vfork] 1.22 ms +- 0.05 ms: 1.02x faster (-2%)

Note that my implementation should work even for QEMU user-mode (and probably WSL) because it doesn't rely on address space sharing.

Things to do:

* Decide whether pthread_setcancelstate() should be used. I'd be grateful for opinions from Python threading experts.

* Decide whether "setxid problem"[5] is important enough to worry about.

* Deal with GCC warnings.

* Test in user-mode QEMU and WSL.

Date User Action Args
2019-01-25 02:03:28izbyshevsetrecipients: + izbyshev, gregory.p.smith, vstinner, pablogsal
2019-01-25 02:03:26izbyshevsetmessageid: <>
2019-01-25 02:03:25izbyshevlinkissue35823 messages
2019-01-25 02:03:25izbyshevcreate