This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Use vfork() in subprocess on Linux
Type: enhancement Stage: resolved
Components: Extension Modules Versions: Python 3.10
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: izbyshev Nosy List: Yonatan Goldschmidt, gregory.p.smith, izbyshev, koobs, pablogsal, ronaldoussoren, socketpair
Priority: normal Keywords: patch, patch, patch

Created on 2019-01-25 02:03 by izbyshev, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 11671 merged izbyshev, 2019-01-25 02:07
PR 11671 merged izbyshev, 2019-01-25 02:07
PR 11671 merged izbyshev, 2019-01-25 02:07
PR 11668 vstinner, 2019-01-25 14:00
PR 11668 vstinner, 2019-01-25 14:00
PR 22944 merged izbyshev, 2020-10-24 17:13
PR 22945 merged gregory.p.smith, 2020-10-24 18:34
Messages (30)
msg334336 - (view) Author: Alexey Izbyshev (izbyshev) * (Python triager) Date: 2019-01-25 02:03
This issue is to propose a (complementary) alternative to the usage of posix_spawn() in subprocess (see bpo-35537).

As mentioned by Victor Stinner in msg332236, posix_spawn() has the potential of being faster and safer than fork()/exec() approach. However, some of the currently available implementations of posix_spawn() have technical problems (this mostly summarizes discussions in bpo-35537):

* In glibc < 2.24 on Linux, posix_spawn() doesn't report errors to the parent properly, breaking existing subprocess behavior.

* In glibc >= 2.25 on Linux, posix_spawn() doesn't report errors to the parent in certain environments, such as QEMU user-mode emulation and Windows subsystem for Linux.

* In FreeBSD, as of this writing, posix_spawn() doesn't block signals in the child process, so a signal handler executed between vfork() and execve() may change memory shared with the parent [1].

Regardless of implementation, posix_spawn() is also unsuitable for some subprocess use cases:

* posix_spawnp() can't be used directly to implement file searching logic of subprocess because of different semantics, requiring workarounds.

* posix_spawn() has no standard way to specify the current working directory for the child.

* posix_spawn() has no way to close all file descriptors > 2 in the child, which is the *default* mode of operation of subprocess.Popen().

May be even more importantly, fundamentally, posix_spawn() will always be less flexible than fork()/exec() approach. Any additions will have to go through POSIX standardization or be unportable. Even if approved, a change will take years to get to actual users because of the requirement to update the C library, which may be more than a decade behind in enterprise Linux distros. This is in contrast to having an addition implemented in CPython. For example, a setrlimit() action for posix_spawn() is currently rejected in POSIX[2], despite being trivial to add.

I'm interested in avoiding posix_spawn() problems on Linux while still delivering comparable performance and safety. To that end I've studied implementations of posix_spawn() in glibc[3] and musl[4], which use vfork()/execve()-like approach, and investigated challenges of using vfork() safely on Linux (e.g. [5]) -- all of that for the purpose of using vfork()/exec() instead of fork()/exec() or posix_spawn() in subprocess where possible.

The unique property of vfork() is that the child shares the address space (including heap and stack) as well as thread-local storage with the parent, which means that the child must be very careful not to surprise the parent by changing the shared resources under its feet. The parent is suspended until the child performs execve(), _exit() or dies in any other way.

The most safe way to use vfork() is if one has access to the C library internals and can do the the following:

1) Disable thread cancellation before vfork() to ensure that the parent thread is not suddenly cancelled by another thread with pthread_cancel() while being in the middle of child creation.

2) Block all signals before vfork(). This ensures that no signal handlers are run in the child. But the signal mask is preserved by execve(), so the child must restore the original signal mask. To do that safely, it must reset dispositions of all non-ignored signals to the default, ensuring that no signal handlers are executed in the window between restoring the mask and execve().

Note that libc-internal signals should be blocked too, in particular, to avoid "setxid problem"[5].

3) Use a separate stack for the child via clone(CLONE_VM|CLONE_VFORK), which has exactly the same semantics as vfork(), but allows the caller to provide a separate stack. This way potential compiler bugs arising from the fact that vfork() returns twice to the same stack frame are avoided.

4) Call only async-signal-safe functions in the child.

In an application, only (1) and (4) can be done easily.

One can't disable internal libc signals for (2) without using syscall(), which requires knowledge of the kernel ABI for the particular architecture.

clone(CLONE_VM) can't be used at least before glibc 2.24 because it corrupts the glibc pid/tid cache in the parent process[6,7]. (As may be guessed, this problem was solved by glibc developers when they implemented posix_spawn() via clone()). Even now, the overall message seems to be that clone() is a low-level function not intended to be used by applications.

Even with the above, I still think that in context of subprocess/CPython the sufficient vfork()-safety requirements are provided by the following.

Despite being easy, (1) seems to be not necessary: CPython never uses pthread_cancel() internally, so Python code can't do that. A non-Python thread in an embedding app could try, but cancellation, in my knowledge, is not supported by CPython in any case (there is no way for an app to cleanup after the cancelled thread), so subprocess has no reason to care.

For (2), we don't have to worry about the internal signal used for thread cancellation because of the above. The only other internal signal is used for setxid syncronization[5]. The "setxid problem" is mitigated in Python because the spawning thread holds GIL, so Python code can't call os.setuid() concurrently. Again, a non-Python thread could, but I argue that an application that spawns a child and calls setuid() in non-synchronized manner is not worth supporting: a child will have "random" privileges depending on who wins the race, so this is hardly a good security practice. Even if such apps are considered worthy to support, we may limit vfork()/exec() path only to the non-embedded use case.

For (3), with production-quality compilers, using vfork() should be OK. Both GCC and Clang recognize it and handle in a special way (similar to setjmp(), which also has "returning twice" semantics). The supporting evidence is that Java has been using vfork() for ages, Go has migrated to vfork(), and, coincidentally, dotnet is doing it right now[8].

(4) is already done in _posixsubprocess on Linux.

I've implemented a simple proof-of-concept that uses vfork() in subprocess on Linux by default in all cases except if preexec_fn is not None. It passes all tests on OpenSUSE (Linux 4.15, glibc 2.27) and Ubuntu 14.04 (Linux 4.4, glibc 2.19), but triggers spurious GCC warnings, probably due to a long-standing GCC bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=21161

I've also run a variant of subprocess_bench.py (by Victor Stinner from bpo-35537) with close_fds=False and restore_signals=False removed on OpenSUSE:

$ env/bin/python -m perf compare_to fork.json vfork.json
Mean +- std dev: [fork] 154 ms +- 18 ms -> [vfork] 1.23 ms +- 0.04 ms: 125.52x faster (-99%)

Compared to posix_spawn, the results on the same machine are similar:

$ env/bin/python -m perf compare_to posix_spawn.json vfork.json
Mean +- std dev: [posix_spawn] 1.24 ms +- 0.04 ms -> [vfork] 1.22 ms +- 0.05 ms: 1.02x faster (-2%)

Note that my implementation should work even for QEMU user-mode (and probably WSL) because it doesn't rely on address space sharing.

Things to do:

* Decide whether pthread_setcancelstate() should be used. I'd be grateful for opinions from Python threading experts.

* Decide whether "setxid problem"[5] is important enough to worry about.

* Deal with GCC warnings.

* Test in user-mode QEMU and WSL.

[1] https://svnweb.freebsd.org/base/head/lib/libc/gen/posix_spawn.c?view=markup&pathrev=326193
[2] http://austingroupbugs.net/view.php?id=603
[3] https://sourceware.org/git/?p=glibc.git;a=history;f=sysdeps/unix/sysv/linux/spawni.c;h=353bcf5b333457d191320e358d35775a2e9b319b;hb=HEAD
[4] http://git.musl-libc.org/cgit/musl/log/src/process/posix_spawn.c
[5] https://ewontfix.com/7
[6] https://sourceware.org/bugzilla/show_bug.cgi?id=10311
[7] https://sourceware.org/bugzilla/show_bug.cgi?id=18862
[8] https://github.com/dotnet/corefx/pull/33289
msg334345 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2019-01-25 07:41
Issue #34663 contains some earlier discussion about vfork, in the context of supporting a specific posix_spawn flag on Linux.

W.r.t. closing all file descriptors > 2: posix_spawn_file_actions_addclose can do this when using posix_spawn. That would have a performance cost, you'd basically have to resort to closing all possible file descriptors and cannot use the smarter logic used in _posixsubprocess. However, the smarter closing code in _posixsubprocess is not safe w.r.t. vfork according to the comment above _close_open_fds_maybe_unsafe: that function uses some functions that aren't async-safe and one of those calls malloc.
msg334353 - (view) Author: Alexey Izbyshev (izbyshev) * (Python triager) Date: 2019-01-25 11:17
> W.r.t. closing all file descriptors > 2: posix_spawn_file_actions_addclose can do this when using posix_spawn. That would have a performance cost, you'd basically have to resort to closing all possible file descriptors and cannot use the smarter logic used in _posixsubprocess.

This is costly to the point of people reporting bugs: bpo-35757. If that really causes 0.1s delay like the reporter said, it would defeat the purpose of using posix_spawn() in the first place.

> However, the smarter closing code in _posixsubprocess is not safe w.r.t. vfork according to the comment above _close_open_fds_maybe_unsafe: that function uses some functions that aren't async-safe and one of those calls malloc.

No, it's not so on Linux: https://github.com/python/cpython/blob/62c35a8a8ff5854ed470b1c16a7a14f3bb80368c/Modules/_posixsubprocess.c#L314

Moreover, as I already argued in msg332235, using malloc() after vfork() should be *safer* than after fork() for a simple reason: all memory-based locks will still work as though the child is merely a thread in the parent process. This is true even for things like futex(FUTEX_WAKE_PRIVATE), despite that this operation is mistakenly documented as "process-private" in man pages. It's actually more like "private to tasks sharing the same address space".

This is in contrast with fork(): if it's called while another thread is holding some mutex in malloc(), nobody would unlock it in the child unless the C library has precautions against that.
msg334363 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2019-01-25 15:49
BTW. I hadn't noticed _close_open_fds_safe, that should be safe when using vfork(). 

Finally: I have no strong opinion on this patch, your explanation looks fine and I'm not up-to-speed w.r.t. implementation details of vfork and the like to feel comfortable about giving a proper review without spending a lot more time on it.
msg334402 - (view) Author: Alexey Izbyshev (izbyshev) * (Python triager) Date: 2019-01-26 21:51
I've checked subprocess.Popen() error reporting in QEMU user-mode and WSL and confirm that it works both with my patch (vfork/exec) and the traditional fork/exec, but doesn't work with glibc's posix_spawn.

The first command below uses posix_spawn() internally because close_fds=False. Under QEMU posix_spawn() from glibc doesn't report errors because it relies on address-space sharing of clone(CLONE_VM|CLONE_VFORK), which is not emulated by QEMU. The second command uses vfork()/exec() in _posixsubprocess, but lack of address-space sharing doesn't matter because the error data is transferred via a pipe.

$ qemu-x86_64 --version
qemu-x86_64 version 2.11.1
Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers
$ ldd --version
ldd (GNU libc) 2.27
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Roland McGrath and Ulrich Drepper.
$ qemu-x86_64 ./python3 -c 'import subprocess; subprocess.call("/xxx", close_fds=False)'
$ qemu-x86_64 ./python3 -c 'import subprocess; subprocess.call("/xxx")'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/izbyshev/install/python-3.8a/lib/python3.8/subprocess.py", line 324, in call
    with Popen(*popenargs, **kwargs) as p:
  File "/home/izbyshev/install/python-3.8a/lib/python3.8/subprocess.py", line 830, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/home/izbyshev/install/python-3.8a/lib/python3.8/subprocess.py", line 1648, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'xxx'

For WSL, I've tested Ubuntu 18.04 (glibc 2.27) on Windows 10 1709 (Fall Creators Update) and 1803.

In 1709, the result is the same as for QEMU (i.e. the first command silently succeeds). In 1803, both commands raise the exception because address-space sharing was fixed for clone(CLONE_VM|CLONE_VFORK) and vfork() in WSL: https://github.com/Microsoft/WSL/issues/1878

I've also run all subprocess tests under QEMU user-mode (by making sys.executable to point to a wrapper that runs python under QEMU) for the following code bases:
* current master (fork/exec or posix_spawn can be used)
* classic (fork/exec always)
* my patch + master (vfork/exec or posix_spawn)
* vfork/exec always
 All tests pass in all four flavors, which indicates that we don't have a test for error reporting with close_fds=False :)

Out of curiosity I also did the same on WSL in 1803. Only 3 groups of tests fail, all because of WSL bugs:

* It's not possible to send signals to zombies, e.g.:
======================================================================
ERROR: test_terminate_dead (test.test_subprocess.POSIXProcessTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/test/cpython/Lib/test/test_subprocess.py", line 1972, in test_terminate_dead
    self._kill_dead_process('terminate')
  File "/home/test/cpython/Lib/test/test_subprocess.py", line 1941, in _kill_dead_process
    getattr(p, method)(*args)
  File "/home/test/cpython/Lib/subprocess.py", line 1877, in terminate
    self.send_signal(signal.SIGTERM)
  File "/home/test/cpython/Lib/subprocess.py", line 1872, in send_signal
    os.kill(self.pid, sig)
ProcessLookupError: [Errno 3] No such process
----------------------------------------------------------------------

* getdents64 syscall doesn't work properly (doesn't return the rest of entries on the second call, so close_fds=True doesn't fully work with large number of open descriptors).

======================================================================
FAIL: test_close_fds_when_max_fd_is_lowered (test.test_subprocess.POSIXProcessTestCase)
Confirm that issue21618 is fixed (may fail under valgrind).
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/test/cpython/Lib/test/test_subprocess.py", line 2467, in test_close_fds_when_max_fd_is_lowered
    self.assertFalse(remaining_fds & opened_fds,
AssertionError: {34, 35, 36, 37, 38, 39, 40, 41, 42} is not false : Some fds were left open.
----------------------------------------------------------------------

* Signal masks in /proc/self/status are not correct (always zero)

======================================================================
FAIL: test_restore_signals (test.test_subprocess.POSIXProcessTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/test/cpython/Lib/test/test_subprocess.py", line 1643, in test_restore_signals
    self.assertNotEqual(default_sig_ign_mask, restored_sig_ign_mask,
AssertionError: b'SigIgn:\t0000000000000000' == b'SigIgn:\t0000000000000000' : restore_signals=True should've unblocked SIGPIPE and friends.

----------------------------------------------------------------------

None of the above is related to fork/vfork/posix_spawn -- the set of failing tests is the same in all four flavors.
msg334404 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2019-01-26 22:45
Thanks for your _extremely detailed_ analysii of the (often sad) state of posix_spawn() on platforms in the world today.

My first reaction to this was "but then we'll be owning our own custom posix_spawn-like implementation as if we'll do better at it than every individual libc variant."

After reading this through and looking at your PR... I now consider that a good thing. =)  We clearly can do better.  vfork() is pretty simple and allows us to keep our semantics; providing benefits to existing users at no cost.

The plethora of libc bugs surrounding posix_spawn() seem likely to persist within various environments in the world for years to come.  No sense in us waiting for that to settle.

As for your PR... a configure check for vfork, a news entry, and whatever other documentation updates seem appropriate.

With this in place we may want to make the _use_posix_spawn() logic in subprocess.py stricter?  That could be its own followup PR.
msg334407 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2019-01-26 23:18
> * Decide whether "setxid problem"[5] is important enough to worry about.
> [5] https://ewontfix.com/7

This is a scary issue.  But I think a reasonable approach could be to never use vfork when running as whatever we choose to define a "privileged user" to be.

getuid() or geteuid() return 0?  don't use vfork.

the concept of "privileged user" can obviously mean a lot more than that and likely goes beyond what we should attempt to ascertain ourselves.

How about also providing a disable-only global setting so that someone writing code they consider to have elevated privileges can prevent its use entirely.  subprocess.disable_use_of_vfork() and subprocess.is_vfork_enabled() calls perhaps (just setting/reading a static int vfork_disallowed = 0 flag within _posixsubprocess.c).

If we did that, on systems where posix_spawn() _might_ be implemented using vfork() we'd want to avoid using it based on is_vfork_enabled().

True setuid vs vfork attack security would suggest code needs to opt-in to vfork() or posix_spawn() rather than opt-out.  Which would destroy the benefit for most users (who won't bother) for the sake of an issue that just is not what most code ever does (setuid/setgid/etc. calls are very uncommon for most software).

I think documenting "HEY, if you are running as with elevated privileges, here's a reason why you might want to disable vfork, and how to do it." should be enough.  Hopefully not famous last words.
msg334534 - (view) Author: Alexey Izbyshev (izbyshev) * (Python triager) Date: 2019-01-29 18:06
Thank you for the review and your thoughts, Gregory.

> With this in place we may want to make the _use_posix_spawn() logic in subprocess.py stricter?  That could be its own followup PR.

Yes, I think we can always use vfork() on Linux unless we find some issues. (Victor Stinner doesn't seem to be against: https://github.com/python/cpython/pull/11668#issuecomment-457637207). Though C libraries are in a better position to provide safe vfork/exec, by using our implementation we will:
* avoid breaking QEMU user-mode (and preserve existing subprocess behavior)
* automatically support fast process creation on musl (which has a good posix_spawn, but doesn't have easy means (macros) on detect that we're on musl and thus avoid unacceptable posix_spawn).
* avoid having different code paths (and thus potential surprises) depending on arguments of subprocess.Popen()

> This is a scary issue.  But I think a reasonable approach could be to never use vfork when running as whatever we choose to define a "privileged user" to be.

> getuid() or geteuid() return 0?  don't use vfork.

I thought about something like this, but initially dismissed it because I (mistakenly) decided that an application may concurrently switch between arbitrary uids back and forth, so that checking getuid() becomes useless (if we're already talking about an exotic app that calls setuid() concurrently with spawning children, why stop there?). But now I realize that an app may use *arbitrary* uids only if one of its real, effective or saved uids is zero (that is, if we don't take capabilities into account), so at least we could call getresuid() to get all 3 uids in a race-free way and check whether the app is *definitely* privileged...

> the concept of "privileged user" can obviously mean a lot more than that and likely goes beyond what we should attempt to ascertain ourselves.

...but you're making a good point here. So I'm not sure that we want such checks, but if we do, we'd probably need to allow users to disable them -- what if some heavy privileged daemon wants a faster subprocess?

> How about also providing a disable-only global setting so that someone writing code they consider to have elevated privileges can prevent its use entirely.  subprocess.disable_use_of_vfork() and subprocess.is_vfork_enabled() calls perhaps (just setting/reading a static int vfork_disallowed = 0 flag within _posixsubprocess.c).

I think it's reasonable. I'll look into it when I'm done with current issues.

> True setuid vs vfork attack security would suggest code needs to opt-in to vfork() or posix_spawn() rather than opt-out.  Which would destroy the benefit for most users (who won't bother) for the sake of an issue that just is not what most code ever does (setuid/setgid/etc. calls are very uncommon for most software).

Agree, especially since we're not talking about "just" using setuid() but rather about using setuid() *in a multithreaded process in a racy manner while spawning children*. Honestly, I'm not sure such apps can blame Python if they get a security hole :)
msg334559 - (view) Author: Alexey Izbyshev (izbyshev) * (Python triager) Date: 2019-01-30 14:12
I've been struggling with fixing spurious -Wclobbered GCC warnings. Originally, I've got the following:

/scratch2/izbyshev/cpython/Modules/_posixsubprocess.c: In function ‘subprocess_fork_exec’:
/scratch2/izbyshev/cpython/Modules/_posixsubprocess.c:612:15: warning: variable ‘gc_module’ might be clobbered by ‘longjmp’ or ‘vfork’ [-Wclobbered]
     PyObject *gc_module = NULL;
               ^~~~~~~~~
/scratch2/izbyshev/cpython/Modules/_posixsubprocess.c:616:15: warning: variable ‘preexec_fn_args_tuple’ might be clobbered by ‘longjmp’ or ‘vfork’ [-Wclobbered]
     PyObject *preexec_fn_args_tuple = NULL;
               ^~~~~~~~~~~~~~~~~~~~~
/scratch2/izbyshev/cpython/Modules/_posixsubprocess.c:621:17: warning: variable ‘cwd’ might be clobbered by ‘longjmp’ or ‘vfork’ [-Wclobbered]
     const char *cwd;
                 ^~~
/scratch2/izbyshev/cpython/Modules/_posixsubprocess.c:623:9: warning: variable ‘need_to_reenable_gc’ might be clobbered by ‘longjmp’ or ‘vfork’ [-Wclobbered]
     int need_to_reenable_gc = 0;
         ^~~~~~~~~~~~~~~~~~~
/scratch2/izbyshev/cpython/Modules/_posixsubprocess.c:624:38: warning: variable ‘argv’ might be clobbered by ‘longjmp’ or ‘vfork’ [-Wclobbered]
     char *const *exec_array, *const *argv = NULL, *const *envp = NULL;
                                      ^~~~
/scratch2/izbyshev/cpython/Modules/_posixsubprocess.c:624:59: warning: variable ‘envp’ might be clobbered by ‘longjmp’ or ‘vfork’ [-Wclobbered]
     char *const *exec_array, *const *argv = NULL, *const *envp = NULL;
                                                           ^~~~
/scratch2/izbyshev/cpython/Modules/_posixsubprocess.c:626:9: warning: variable ‘need_after_fork’ might be clobbered by ‘longjmp’ or ‘vfork’ [-Wclobbered]
     int need_after_fork = 0;
         ^~~~~~~~~~~~~~~
/scratch2/izbyshev/cpython/Modules/_posixsubprocess.c:627:9: warning: variable ‘saved_errno’ might be clobbered by ‘longjmp’ or ‘vfork’ [-Wclobbered]
     int saved_errno = 0;
         ^~~~~~~~~~~

I've checked that all warnings are spurious: all flagged variables are either not modified in the child or modified and used only by the child. A simple way to suppress the warnings would be "volatile", but I don't want to spray "volatile" over the huge declaration block of subprocess_fork_exec().

Another way is to move vfork() to a separate function and ensure that this function does as little as possible with its local variables. I've implemented two versions of this approach, both are ugly in some sense. I've pushed the first into the PR branch and the second into a separate branch https://github.com/izbyshev/cpython/tree/single-do-fork-exec.

Yet another way would be to simply disable this diagnostic for _posixsubprocess (e.g. via #pragma GCC diagnostic), but I'm not sure we want to do that -- may be it'll be fixed in the future or a real defect will be introduced into our code.
msg372426 - (view) Author: Yonatan Goldschmidt (Yonatan Goldschmidt) * Date: 2020-06-26 14:40
With issue35537 merged, do we have any intention to get on with this? From what I can tell, it provides roughly the same performance benefit - but works also with close_fds=True, which is the default of Popen, so IMO it's a worthy addition.

I tested a rebase on current master, test_subprocess passes on my Linux box and there are no more -Wclobbered warnings after Alexey's latest change of the do_fork_exec() helper.
msg372429 - (view) Author: Alexey Izbyshev (izbyshev) * (Python triager) Date: 2020-06-26 15:48
I'd really like to get this merged eventually because vfork()-based solution is fundamentally more generic than posix_spawn(). Apart from having no issue with close_fds=True, it will also continue to allow subprocess to add any process context tweaks without waiting for availability of corresponding features in posix_spawn().

If there is no objection from Gregory or other devs, I can pick up from where I left in two weeks.
msg372444 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2020-06-26 22:02
No objections, it would be great to see this finished up and land.

I've encountered a minority of users who are using a wrapped vfork-based C/C++ library for process spawning as fork+exec isn't fast enough for them.
msg378664 - (view) Author: Alexey Izbyshev (izbyshev) * (Python triager) Date: 2020-10-15 05:15
Well, much later than promised, but I'm picking it up. Since in the meantime support for setting uid/gid/groups was merged, and I'm aware about potential issues with calling corresponding C library functions in a vfork()-child, I asked a question on musl mailing list: https://www.openwall.com/lists/musl/2020/10/12/1

So, it seems we'll need to fallback to fork() if set*id() is needed, which is in line with our previous discussion about avoidance of vfork() in privileged processes anyway.

I'm also discussing -Wclobbered warnings with a GCC developer. I wouldn't like to restructure code just to avoid GCC false positives, so currently I'm leaning towards disabling this warning entirely for subprocess_fork_exec() and documenting that arbitrary stores to local variables between vfork() and child_exec() are not allowed due to stack sharing, but we'll see if a better solution emerges.
msg378703 - (view) Author: Alexey Izbyshev (izbyshev) * (Python triager) Date: 2020-10-16 05:41
I've updated my PR.

* After a discussion with Alexander Monakov (a GCC developer), moved vfork() into a small function to isolate it from both subprocess_fork_exec() and child_exec(). This appears to be the best strategy to avoid -Wclobbered false positives as well as potential compiler bugs.

* Got rid of VFORK_USABLE checks in function parameter lists. Now `child_sigmask` parameter is always present, but is NULL if vfork() is not available or usable. The type is `void *` to avoid hard dependence on sigset_t, which other CPython source files appear to avoid.

* Disabled vfork() in case setuid()/setgid()/setgroups() needed.

* Added more comments explaining vfork()-related stuff.

* Added commit message and NEWS entry.

Potential improvements:

* To avoid repeating long parameter lists in several functions, we can move them to a struct. The downside is that we'd need to convert child_exec() to use that struct all over the place. I don't have a strong preference here.

* Are any documentation changes needed? We don't change any interfaces, so I'm not sure.

Potential future directions:

* If desired, a vfork() opt-out may be implemented. But it'd need to disable posix_spawn() on Linux as well, since it might be implemented via vfork(), and we have no good way to check that.
msg379503 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2020-10-24 00:47
New changeset 976da903a746a5455998e9ca45fbc4d3ad3479d8 by Alexey Izbyshev in branch 'master':
bpo-35823: subprocess: Use vfork() instead of fork() on Linux when safe (GH-11671)
https://github.com/python/cpython/commit/976da903a746a5455998e9ca45fbc4d3ad3479d8
msg379504 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2020-10-24 00:54
now waiting to see how happy all of the buildbots are...

We currently have a `__linux__` check in the code deciding VFORK_USABLE.

From what I can tell, vfork probably also works on macOS (darwin).

Lets let this run for a bit on Linux and it can be a separate issue to open vfork usage up to other platforms.
msg379507 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2020-10-24 01:15
> * To avoid repeating long parameter lists in several functions, we can move them to a struct. The downside is that we'd need to convert child_exec() to use that struct all over the place. I don't have a strong preference here.

Agreed that the long parameter lists are annoying.  I don't like shoving that much on the stack and by adding an additional call with so many params we've increase stack usage.  I consider this refactoring work that can be done on its own outside of this issue though.

> * Are any documentation changes needed? We don't change any interfaces, so I'm not sure.

I don't think so.  I looked things over and I think all that is needed is the existing Misc/NEWS entry.

> Potential future directions:
> 
> * If desired, a vfork() opt-out may be implemented. But it'd need to disable posix_spawn() on Linux as well, since it might be implemented via vfork(), and we have no good way to check that.

We have such an opt-out capabilty for posix_spawn via `subprocess._USE_POSIX_SPAWN` https://github.com/python/cpython/blob/master/Lib/subprocess.py#L651
along with code in there that determines the default value based on the detected runtime environment.

I'm not sure we'll need that for vfork() as it seems to pretty much be a low level raw system call wrapper rather than something to be implemented via libc that can have its own bugs.  If we do wind up wanting one, I'd structure it the same way.
msg379511 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2020-10-24 02:20
Thank you for taking this on!  I'm calling it fixed for now as the buildbots are looking happy with it.  If issues with it arise we can address them.
msg379527 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2020-10-24 09:47
> From what I can tell, vfork probably also works on macOS (darwin).
> 
> Lets let this run for a bit on Linux and it can be a separate issue to 
> open vfork usage up to other platforms.

I'd prefer to not use vfork on macOS. For one I don't particularly trust that vfork would work reliably when using higher level APIs, but more importantly posix_spawn on macOS has some options that are hard to achieve otherwise and might be useful to expose in subprocess. An example of this is selecting the CPU architecture to use for the launched process when using fat binaries and a system that supports CPU emulation, such as the intel emulation on the upcoming Apple Silicon systems.
msg379529 - (view) Author: Alexey Izbyshev (izbyshev) * (Python triager) Date: 2020-10-24 17:29
> Thank you for taking this on!  I'm calling it fixed for now as the buildbots are looking happy with it.  If issues with it arise we can address them.

Thank you for reviewing and merging!

Using POSIX_CALL for pthread_sigmask() is incorrect, however, since it *returns* the error instead of setting errno. I've opened a PR with a fix and additional handling of the first pthread_sigmask() call in the parent (which can be done easily).

I've also noticed a memory leak why doing the above: cwd_obj2 is not released on the error path. It probably slipped with patches adding user/groups support. I'll file a separate issue.

Would you also clarify why you disallowed vfork() in case setsid() is needed? Have you discovered any potential issues with setsid()? Note that setsid() doesn't suffer from thread sync issues like setuid()/setgid()/etc.
msg379530 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2020-10-24 17:47
New changeset 473db47747bb8bc986d88ad81799bcbd88153ac5 by Alexey Izbyshev in branch 'master':
bpo-35823: subprocess: Fix handling of pthread_sigmask() errors (GH-22944)
https://github.com/python/cpython/commit/473db47747bb8bc986d88ad81799bcbd88153ac5
msg379531 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2020-10-24 18:12
regarding excluding the setsid() case: I was being conservative as I couldn't find a reference of what was and wasn't allowed after vfork.

I found one thing suggesting that on macOS setsid() was not safe after vfork().  But that appeared to be a Darwin-ism.  I expect that is not true on Linux as it should just be a syscall updating a couple of fields in the process info.  Confirming, in glibc is appears to be a shim for the setsid syscall (based on not finding any code implementing anything special for it) and in uclibc (*much* easier to read) it is clearly just a setsid syscall shim.

I'll make a PR to undo the setsid restriction given we're Linux only.
msg379535 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2020-10-24 19:07
New changeset be3c3a0e468237430ad7d19a33c60d306199a7f2 by Gregory P. Smith in branch 'master':
bpo-35823: Allow setsid() after vfork() on Linux. (GH-22945)
https://github.com/python/cpython/commit/be3c3a0e468237430ad7d19a33c60d306199a7f2
msg379536 - (view) Author: Alexey Izbyshev (izbyshev) * (Python triager) Date: 2020-10-24 19:12
> regarding excluding the setsid() case: I was being conservative as I couldn't find a reference of what was and wasn't allowed after vfork.

Yes, there is no list of functions allowed after vfork(), except for the conservative POSIX.1 list consisting only of _exit() and execve(), so we can only take async-signal-safe functions as a first approximation and work from there. Thankfully, on Linux, C libraries don't do anything fancy in most cases. But, for example, it appears that on FreeBSD we wouldn't be able to use sigprocmask()/sigaction()[1]. BTW, commit[2] and the linked review are an interesting reading for anybody who would like to support posix_spawn() and/or vfork() in subprocess on FreeBSD.

> Confirming, in glibc is appears to be a shim for the setsid syscall (based on not finding any code implementing anything special for it) and in uclibc (*much* easier to read) it is clearly just a setsid syscall shim.

I also recommend musl[3] when simplicity (and correctness) is required :)

[1] https://svnweb.freebsd.org/base/head/lib/libc/gen/posix_spawn.c?view=markup&pathrev=362111#l126
[2] https://svnweb.freebsd.org/base?view=revision&revision=352712
[3] https://git.musl-libc.org/cgit/musl/tree/src/unistd/setsid.c?id=a5aff1972c9e3981566414b09a28e331ccd2be5d
msg379537 - (view) Author: Alexey Izbyshev (izbyshev) * (Python triager) Date: 2020-10-24 19:22
@ronaldoussoren

> I'd prefer to not use vfork on macOS. For one I don't particularly trust that vfork would work reliably when using higher level APIs, but more importantly posix_spawn on macOS has some options that are hard to achieve otherwise and might be useful to expose in subprocess.

I can't comment on vfork() usability on macOS myself, but given the number of issues/considerations described here, I expect that significant research would be needed to check that.

Regarding your second point about extra posix_spawn() options, I actually don't see why it would be an argument against vfork(). Even on Linux, subprocess prefers posix_spawn() to vfork()/fork() when possible, so vfork() support doesn't hinder posix_spawn().
msg379538 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2020-10-24 19:30
Nice links.  LOL, yes, musl source was going to be my next stop if the larger libc sources proved impossible for a mere mortal to reason about. :)

regarding macOS, agreed. If someone needs vfork() to work there, I believe it could be made to.

Options like selecting the architecture of the child process could be higher level options to the subprocess API.  Hiding the platform specific details of how that happens and deciding which underlying approach to use based on the flags.

multi-arch systems are a thing.  It is conceivable that we may even see non-esoteric multi-arch hardware at some point.
msg379568 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2020-10-25 06:38
Performance improvement measured on a 1.4Ghz quad aarch64 A57 (nvidia jetson nano):

#define VFORK_USABLE 1
 test_subprocess: 36.5 seconds

#undef VFORK_USABLE
 test_subprocess: 45 seconds

Nice.  I really didn't expect a 20% speedup on its testsuite alone!

Lets dive into that with a microbenchmark:

$ ./build-novfork/python ten_seconds_of_truth.py
Launching /bin/true for 10 seconds in a loop.
Launched 2713 subprocesses in 10.00194378796732 seconds.
271.247275281014 subprocesses/second.
Increased our mapped pages by 778240KiB.
Launching /bin/true for 10 seconds in a loop.
Launched 212 subprocesses in 10.006392606999725 seconds.
21.186456331095847 subprocesses/second.

$ ./build/python ten_seconds_of_truth.py
Launching /bin/true for 10 seconds in a loop.
Launched 3310 subprocesses in 10.001623224001378 seconds.
330.94628000551285 subprocesses/second.
Increased our mapped pages by 778240KiB.
Launching /bin/true for 10 seconds in a loop.
Launched 3312 subprocesses in 10.001519071985967 seconds.
331.1496959773679 subprocesses/second.

Demonstrating perfectly the benefit of vfork().  The more mapped pages, the slower regular fork() becomes.  With vfork() the size of the process's memory map does not matter.

ten_seconds_of_truth.py:

```python
from time import monotonic as now
import subprocess

def benchmark_for_ten():
    print('Launching /bin/true for 10 seconds in a loop.')

    count = 0
    start = now()
    while now() - start < 10.:
        subprocess.run(['/bin/true'])
        count += 1
    end = now()
    duration = end-start

    print(f'Launched {count} subprocesses in {duration} seconds.')
    print(f'{count/duration} subprocesses/second.')

benchmark_for_ten()

map_a_bunch_of_pages = '4agkahglahaa^#\0ag3\3'*1024*1024*40

print(f'Increased our mapped pages by {len(map_a_bunch_of_pages)//1024}KiB.')

benchmark_for_ten()
```
msg416901 - (view) Author: Марк Коренберг (socketpair) * Date: 2022-04-06 21:16
See #47245.

https://github.com/bminor/glibc/blob/master/sysdeps/unix/sysv/linux/spawni.c#L309

In short - do not use vfork(). Use clone(CLONE_VM | CLONE_VFORK). And build separate stack.

Current implementation is heavily broken.

Another guy has failed: https://bugzilla.kernel.org/show_bug.cgi?id=215813.


Please comment calling vfork() as soon as possible in order to fallback to fork() implementation. Until you (or me?) write good vfork children code.
msg416970 - (view) Author: Alexey Izbyshev (izbyshev) * (Python triager) Date: 2022-04-08 10:11
The preceding comment is wrong, see discussion in #47245 and https://bugzilla.kernel.org/show_bug.cgi?id=215813#c14 for explanation of why that bug report is irrelevant for CPython.
msg416975 - (view) Author: Марк Коренберг (socketpair) * Date: 2022-04-08 10:51
Yes, you are almost right. Error-path is not so clear (it is discussed in another issue), but in general, yes, my previous comment is wrong. So, finally, there are no bugs around the stack at all.
History
Date User Action Args
2022-04-11 14:59:10adminsetgithub: 80004
2022-04-08 10:51:23socketpairsetmessages: + msg416975
2022-04-08 10:11:25izbyshevsetkeywords: patch, patch, patch

messages: + msg416970
2022-04-06 21:16:41socketpairsetnosy: + socketpair
messages: + msg416901
2020-10-25 06:38:16gregory.p.smithsetkeywords: patch, patch, patch

messages: + msg379568
2020-10-24 19:30:59gregory.p.smithsetkeywords: patch, patch, patch

messages: + msg379538
2020-10-24 19:22:50izbyshevsetkeywords: patch, patch, patch

messages: + msg379537
2020-10-24 19:12:42izbyshevsetkeywords: patch, patch, patch

messages: + msg379536
2020-10-24 19:07:43gregory.p.smithsetmessages: + msg379535
2020-10-24 18:34:49gregory.p.smithsetpull_requests: + pull_request21863
2020-10-24 18:12:19gregory.p.smithsetkeywords: patch, patch, patch

messages: + msg379531
2020-10-24 17:47:44gregory.p.smithsetmessages: + msg379530
2020-10-24 17:29:41izbyshevsetkeywords: patch, patch, patch

messages: + msg379529
2020-10-24 17:13:06izbyshevsetpull_requests: + pull_request21862
2020-10-24 09:47:57ronaldoussorensetkeywords: patch, patch, patch

messages: + msg379527
2020-10-24 02:20:26gregory.p.smithsetstatus: open -> closed
messages: + msg379511

keywords: patch, patch, patch
resolution: fixed
stage: commit review -> resolved
2020-10-24 01:15:11gregory.p.smithsetkeywords: patch, patch, patch

messages: + msg379507
2020-10-24 00:54:32gregory.p.smithsetkeywords: patch, patch, patch

messages: + msg379504
stage: patch review -> commit review
2020-10-24 00:47:09gregory.p.smithsetmessages: + msg379503
2020-10-16 05:41:48izbyshevsetkeywords: patch, patch, patch

messages: + msg378703
2020-10-15 05:15:34izbyshevsetkeywords: patch, patch, patch
assignee: izbyshev
messages: + msg378664
2020-06-26 22:02:45gregory.p.smithsetkeywords: patch, patch, patch

messages: + msg372444
2020-06-26 15:48:04izbyshevsetkeywords: patch, patch, patch

messages: + msg372429
2020-06-26 14:40:45Yonatan Goldschmidtsetmessages: + msg372426
2020-06-24 23:04:17Yonatan Goldschmidtsetnosy: + Yonatan Goldschmidt
2020-06-08 22:30:52vstinnersetkeywords: patch, patch, patch
nosy: - vstinner
2020-06-08 22:12:32gregory.p.smithsetkeywords: patch, patch, patch
versions: + Python 3.10, - Python 3.8
2019-01-30 14:12:47izbyshevsetmessages: + msg334559
2019-01-29 18:06:31izbyshevsetmessages: + msg334534
2019-01-26 23:18:52gregory.p.smithsetkeywords: patch, patch, patch

messages: + msg334407
2019-01-26 22:45:04gregory.p.smithsetkeywords: patch, patch, patch

messages: + msg334404
2019-01-26 21:51:18izbyshevsetmessages: + msg334402
2019-01-25 15:49:25ronaldoussorensetkeywords: patch, patch, patch

messages: + msg334363
2019-01-25 14:00:59vstinnersetpull_requests: + pull_request11493
2019-01-25 14:00:46vstinnersetpull_requests: + pull_request11492
2019-01-25 11:17:10izbyshevsetmessages: + msg334353
2019-01-25 11:03:32koobssetkeywords: patch, patch, patch
nosy: + koobs
2019-01-25 07:41:52ronaldoussorensetkeywords: patch, patch, patch
nosy: + ronaldoussoren
messages: + msg334345

2019-01-25 02:08:02izbyshevsetkeywords: + patch
stage: patch review
pull_requests: + pull_request11486
2019-01-25 02:07:56izbyshevsetkeywords: + patch
stage: (no value)
pull_requests: + pull_request11485
2019-01-25 02:07:49izbyshevsetkeywords: + patch
stage: (no value)
pull_requests: + pull_request11484
2019-01-25 02:03:26izbyshevcreate