This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: potential undefined behavior with subprocess using vfork() on Linux?
Type: behavior Stage: resolved
Components: Extension Modules Versions: Python 3.11, Python 3.10
process
Status: closed Resolution: works for me
Dependencies: Superseder:
Assigned To: gregory.p.smith Nosy List: gregory.p.smith, izbyshev, socketpair
Priority: normal Keywords: 3.10regression

Created on 2022-04-06 20:24 by socketpair, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Messages (13)
msg416897 - (view) Author: Марк Коренберг (socketpair) * Date: 2022-04-06 20:24
Using vfork in #35823 is VERY tricky. Please comment out vfork() usage for now. Yes, we can (should) use vfork(), but we have to rewrite the child code.

https://bugzilla.kernel.org/show_bug.cgi?id=215813

I would say it's URGENT.
msg416898 - (view) Author: Марк Коренберг (socketpair) * Date: 2022-04-06 20:30
Rewriting in a way that guarantee no stack (and heap) usage. Because stack is shared between child and parent. It seems there is no crossplatform way. Happily we can use some code like I wrote by link in the first message. OR, stick to posix_nspawn which is cross-platform.
msg416899 - (view) Author: Марк Коренберг (socketpair) * Date: 2022-04-06 20:37
https://github.com/python/cpython/blob/4a08c4c469d36f99d3a5e0f17ad82ab35dcf2835/Modules/_posixsubprocess.c#L717

    child_exec(exec_array, argv, envp, cwd,
               p2cread, p2cwrite, c2pread, c2pwrite,
               errread, errwrite, errpipe_read, errpipe_write,
               close_fds, restore_signals, call_setsid,
               call_setgid, gid, call_setgroups, groups_size, groups,
               call_setuid, uid, child_umask, child_sigmask,
               py_fds_to_keep, preexec_fn, preexec_fn_args_tuple);

----------------

Py_NO_INLINE static void
child_exec(char *const exec_array[],
           char *const argv[],
           char *const envp[],
           const char *cwd,
           int p2cread, int p2cwrite,
           int c2pread, int c2pwrite,


-------------

So, calling child_exec pushes to stack GUARANTEED. This is the bug. In fact everything works, but is too weak and break at any moment.

Again. Please comment calling vfork() for now. And re-implement child part later.
msg416900 - (view) Author: Марк Коренберг (socketpair) * Date: 2022-04-06 21:13
Solution:

https://github.com/bminor/glibc/blob/master/sysdeps/unix/sysv/linux/spawni.c#L309

In short - do not use vfork(). Use clone(CLONE_VM | CLONE_VFORK). and do something with stack.
msg416904 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2022-04-06 22:59
Can you provide a reproducable way to demonstrate evidence of a problem in CPython's use of the Linux libc vfork() implementation?  A test case that causes a CPython parent or child process on Linux when built with HAVE_VFORK failing to function properly would help prioritize this because in practice nobody has reported problems in 3.10.

(we've deployed the subprocess vfork code into thousands production Python programs at work in the past year w/o issues being reported - though we have a constrained environment with use on only a couple of libc versions and limited set of kernels on a few very common architectures)

General thinking (possible dated and incorrect - against what https://man7.org/linux/man-pages/man2/vfork.2.html wording claims with its "or calls any other function" text):

Pushing additional data onto the stack in the child process _should_ not a problem.  That by definition lands in previously unused pre-allocated stack space.  If that page faults, that could map a new page into the process state shared by both the paused parent and running child. But this page mapping should be fine - the child exec that resumes the parent means the parent is the only one who sees it.

When the parent process resumes, sure, that data will be in that memory on the unallocated portion of stack, but critically the *stack pointer* in the parent process (a register) never changes.  As far as I understand things, registers are not shared between vfork()ed processes.  So the parent only sees some temporary data having been written to the unused region of the stack by the since-replaced by exec() child process.  No big deal.

**Untrue wishful thinking**: If a new stack were needed on a given platform for use in the vfork()ed child, I'd like it to be the job of libc to take care of that for us.  glibc sources do no such thing (every vfork supporting architecture has a vfork.S code that appears to make the syscall and jump back to the caller without stack manipulation).
msg416905 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2022-04-06 23:01
Immediate action item: Add a way for people to disable vfork at runtime by setting a flag in the subprocess module, just in case.

This can be backported to 3.10 - It'd provide an escape hatch for anyone without a need to rebuild Python to disable use of vfork() without resorting to LD_PRELOAD hacks.

This is not an urgent issue unless actual practical problems are being observed in the field.
msg416909 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2022-04-06 23:15
Our current assumptions around the use of vfork() are very much glibc specific.

Another useful reference for reasoning, comments, and history is https://github.com/golang/go/blob/master/src/syscall/exec_linux.go#L146 `forkAndExecInChild1`
msg416917 - (view) Author: Alexey Izbyshev (izbyshev) * (Python triager) Date: 2022-04-07 08:10
In short: both this bug report and [1] are invalid.

The reason why doing syscall(SYS_vfork) is illegal is explained by Florian Weimer in [2]:

>The syscall function in glibc does not protect the on-stack return address against overwriting, so it can't be used to call SYS_vfork on x86.

This is off-topic here because CPython calls vfork() via libc, but I'll still expand the Florian's comment. Suppose one wants to write my_vfork() wrapper over vfork syscall. When vfork syscall returns the first time, my_vfork() has to return to its caller. This necessarily involves knowing the return address. On some architectures this return address is stored on the stack by the caller (e.g. x86). The problem is that any data in my_vfork() stack frame can be overwritten by its caller once it returns the first time. Then, when vfork syscall returns the second time, my_vfork() could be unable to return to its caller because the data it fetches from its (now invalid) stack frame is garbage. This is precisely what happens when one implements my_vfork() as syscall(SYS_vfork). To avoid this, the most common strategy is to store the return address into a register that's guaranteed to be preserved around syscall by the OS ABI. For example, the x86-64 musl implementation [3] stores the return address in rdx (which is preserved around syscall) and then restores it after syscall (both on the first and the second return of the syscall).

Now back to CPython. The real problem with stack sharing between the child and the parent could be due to compiler bugs, e.g. if a variable stored on the stack is modified in the child branch of "if (vfork())", but the compiler assumes that some other variable sharing the stack slot with the first one is *not* modified in the parent branch (i.e. after vfork() returns the second time). But all production compilers handle vfork() (and setjmp(), which has a similar issue) in a special way to avoid this, and GCC has __attribute__((returns_twice)) that a programmer could use for custom functions behaving in this way (my_vfork() would have to use this attribute).

Regarding a separate stack for the child and clone(CLONE_VM|CLONE_VFORK), I considered this in #35823, but this has its own problems. The most important one is that CPython would be in business of choosing the correct stack size for the child's stack, but CPython is *not* in control of all the code executing in the child because it calls into libc. In practice, people use various LD_PRELOAD-based software that wraps various libc functions (e.g. Scratchbox 2 build environment for Sailfish OS is an LD_PRELOAD-based sandbox), so even common-sense assumptions like "execve() in libc can't use a lot of stack!" might turn out to be wrong. CPython *could* work around that by using direct syscalls for everything in the child, or by using some "large" size that "should be enough for everybody", but I wouldn't want to see that unless we have a real problem with vfork(). Note that vfork()-based posix_spawn() implementations in C libraries do *not* have this problem because they fully control all code in the child (e.g. they would use a non-interposable execve() symbol or a direct syscall).

> Immediate action item: Add a way for people to disable vfork at runtime by setting a flag in the subprocess module, just in case.

I don't think any action is needed at all, and I think this issue should be closed.

> Our current assumptions around the use of vfork() are very much glibc specific.

Could you clarify what glibc-specific assumptions you mean? In #35823 I tried to use as little assumptions as possible.

[1] https://bugzilla.kernel.org/show_bug.cgi?id=215813
[2] https://bugzilla.kernel.org/show_bug.cgi?id=215813#c2
[3] https://git.musl-libc.org/cgit/musl/tree/src/process/x86_64/vfork.s?id=ced75472d7e3d73d5b057e36ccbc7b7fcba95104
msg416946 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2022-04-07 21:46
Thanks!  I agree with you that this is probably not an actual problem on Linux.

_I did look at the various glibc architecture vfork.s implementations: Cute tricks used on some where they need to avoid a stack modifying traditional return from vfork()._

As for glibc specifics, I'm mostly thinking of the calls we do in the child.

According to the "Standard Description (POSIX.1)" calls to anything other than `_exit()` or `exec*()` are not allowed.  But the longer "Linux Description" in that vfork(2) man page does not say that.  Which implies merely by omission that calls to other things are okay so long as you understand everything they do to the process heap/stack/state.  (I wish it were *explicit* about that)

Some of the calls we do from our child_exec() code... many are likely "just" syscall shims and thus fine - but that is technically up to libc.

A few others are Py functions that go elsewhere in CPython and while they may be fine for practical reasons today with dangerous bits on conditional branches that technically should not be possible to hit given the state by the time we're at this point in _posixsubprocess, pose a future risk - anyone touching implementations of those is likely unaware of vfork'ed child limitations that must be met.

For example if one of the potential code paths that trigger an indirect Py_FatalError() is hit... that fatal exit code is definitely not post-vfork-child safe.  The pre-exec child dying via that could screw up the vfork parent process's state.
msg416964 - (view) Author: Alexey Izbyshev (izbyshev) * (Python triager) Date: 2022-04-08 08:28
> As for glibc specifics, I'm mostly thinking of the calls we do in the child.

> According to the "Standard Description (POSIX.1)" calls to anything other than `_exit()` or `exec*()` are not allowed.  But the longer "Linux Description" in that vfork(2) man page does not say that.  Which implies merely by omission that calls to other things are okay so long as you understand everything they do to the process heap/stack/state.  (I wish it were *explicit* about that)

If we're talking about the kernel side of things, sure, we rely on Linux being "sane" here, though I suppose on *BSDs the situation is similar.

> Some of the calls we do from our child_exec() code... many are likely "just" syscall shims and thus fine - but that is technically up to libc.

Yes, but I wouldn't say that "being just syscall shims" is specific for glibc. It's just a "natural" property that just about any libc is likely to possess. (Yeah, I know, those are vague words, but in my experience "glibc-specific" is usually applied to some functionality/bug present in glibc and absent in other libcs, and I don't think we rely on something like that).

Of course, there are also LD_PRELOAD things that could be called instead of libc, but good news here is that we don't create new constrains for them (CPython is not the only software that uses vfork()), and they're on their own otherwise.

> A few others are Py functions that go elsewhere in CPython and while they may be fine for practical reasons today with dangerous bits on conditional branches that technically should not be possible to hit given the state by the time we're at this point in _posixsubprocess, pose a future risk - anyone touching implementations of those is likely unaware of vfork'ed child limitations that must be met.

We already have async-signal-safety requirement for all such code because of fork(). Requirements of vfork() are a bit more strict, but at least the set of functions we have to watch for dangerous changes is the same. And I suspect that most practical violations of vfork()-safety also violate async-signal-safety.

> For example if one of the potential code paths that trigger an indirect Py_FatalError() is hit... that fatal exit code is definitely not post-vfork-child safe.  The pre-exec child dying via that could screw up the vfork parent process's state.

Yeah, and it can break the fork parent too, at least because it uses exit() (not _exit()), so stdio buffers will be flushed twice, in the child and in the parent.
msg416974 - (view) Author: Марк Коренберг (socketpair) * Date: 2022-04-08 10:47
So, finally: 

1. Regarding vfork() and stack - everything is nice. No bugs because libc has nasty hacks for stack restoration.

2. Having the ability to turn off vfork using environment variables is NICE. At least, one can easily compare the performance.

3. We have to fix error-path in order not to change heap state (contents and allocations), possibly do not touch locks. During vfork() child execution - the only parent THREAD (not the process) is blocked. For example, it's not allowed to touch GIL. Child process may die unexpectedly and leave GIL locked. Is it possible to rewrite children path for vfork() case without any Py* calls ? As an idea, we can prepare all low-level things (all the pointers to strings and plain values) before vfork(), so child code will use only that data.
msg416976 - (view) Author: Alexey Izbyshev (izbyshev) * (Python triager) Date: 2022-04-08 10:59
> 3. We have to fix error-path in order not to change heap state (contents and allocations), possibly do not touch locks. During vfork() child execution - the only parent THREAD (not the process) is blocked. For example, it's not allowed to touch GIL. Child process may die unexpectedly and leave GIL locked. Is it possible to rewrite children path for vfork() case without any Py* calls ? As an idea, we can prepare all low-level things (all the pointers to strings and plain values) before vfork(), so child code will use only that data.

What specifically do you propose to fix? There is no problem with GIL if the child dies because the GIL is locked and unlocked only by the parent and the child never touches it. Similarly, only Py_* calls known to be safe are used. As for "pointers to strings", it's not clear to me what you mean, but if you mean allocations, they are already done before (v)fork(), since the child code is required to be async-signal-safe even if plain fork() is used.
msg416984 - (view) Author: Марк Коренберг (socketpair) * Date: 2022-04-08 13:33
I have studied assembler output of _posixsubprocess.o compilation. Yes, everything seems safe. So, I'm closing the bug.
History
Date User Action Args
2022-04-11 14:59:58adminsetgithub: 91401
2022-04-08 13:33:50socketpairsetstatus: open -> closed
resolution: works for me
messages: + msg416984

stage: test needed -> resolved
2022-04-08 10:59:58izbyshevsetmessages: + msg416976
2022-04-08 10:47:24socketpairsetmessages: + msg416974
2022-04-08 08:28:46izbyshevsetmessages: + msg416964
2022-04-07 21:46:09gregory.p.smithsetmessages: + msg416946
title: potential undefined behavior with subprocess using vfork() on Linux -> potential undefined behavior with subprocess using vfork() on Linux?
2022-04-07 08:10:42izbyshevsetnosy: + izbyshev
messages: + msg416917
2022-04-06 23:15:31gregory.p.smithsetmessages: + msg416909
2022-04-06 23:01:56gregory.p.smithsetkeywords: + 3.10regression
2022-04-06 23:01:45gregory.p.smithsetmessages: + msg416905
2022-04-06 22:59:16gregory.p.smithsetassignee: gregory.p.smith
type: behavior
title: Subprocess with vfork() is broken -> potential undefined behavior with subprocess using vfork() on Linux
nosy: + gregory.p.smith

messages: + msg416904
stage: test needed
2022-04-06 21:13:05socketpairsetmessages: + msg416900
2022-04-06 20:37:58socketpairsetmessages: + msg416899
2022-04-06 20:30:00socketpairsetmessages: + msg416898
2022-04-06 20:24:52socketpaircreate