classification
Title: subprocess_fork_exec more stable with vfork
Type: enhancement Stage: resolved
Components: Extension Modules Versions: Python 3.8
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: gregory.p.smith Nosy List: Albert.Zeyer, gregory.p.smith, lukasz.langa
Priority: normal Keywords:

Created on 2017-10-18 16:24 by Albert.Zeyer, last changed 2018-04-04 07:13 by gregory.p.smith. This issue is now closed.

Messages (6)
msg304587 - (view) Author: Albert Zeyer (Albert.Zeyer) Date: 2017-10-18 16:24
subprocess_fork_exec currently calls fork().

I propose to use vfork() or posix_spawn() or syscall(SYS_clone, SIGCHLD, 0) instead if possible and if there is no preexec_fn. The difference would be that fork() will call any atfork handlers (registered via pthread_atfork()), while the suggested calls would not.

There are cases where atfork handlers are registered which are not save to be called e.g. in multi-threaded environments. In the case of subprocess_fork_exec without preexec_fn, there is no need to call those atfork handlers, so avoiding this could avoid potential problems. It's maybe acceptable if a pure fork() without exec() doesn't work in this case anymore, but there is no reason that a fork()+exec() should not work in any such cases. This is fixed by my proposed solution.

An example case is OpenBlas and OpenMP, which registers an atfork handler which is safe to be called if there are other threads running.
See here:
https://github.com/tensorflow/tensorflow/issues/13802
https://github.com/xianyi/OpenBLAS/issues/240
https://trac.sagemath.org/ticket/22021

About fork+exec without the atfork handlers, see here for alternatives (like vfork):
https://stackoverflow.com/questions/46810597/forkexec-without-atfork-handlers/
msg304613 - (view) Author: Albert Zeyer (Albert.Zeyer) Date: 2017-10-19 10:17
This is a related issue, although with different argumentation:
https://bugs.python.org/issue20104
msg304653 - (view) Author: Albert Zeyer (Albert.Zeyer) Date: 2017-10-20 13:04
Here is some more background for a case where this occurs:
https://stackoverflow.com/questions/46849566/multi-threaded-openblas-and-spawning-subprocesses

My proposal here would fix this.
msg304677 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2017-10-20 20:14
Using new syscalls for _posixsubprocess.c would be a feature so it would be limited to 3.7+ if done at all.

My gut feeling is that the real bug is in *any* library code that uses pthread_atfork() in a non thread safe manner.  That code is fundamentally broken as it is placing a burden upon the entire application that it lives within that the application and no other libraries may use threads or if it does that it may not launch subprocesses.

That is unrealistic.  So I'd blame OpenBlas and OpenMP.

Supporting alternate system calls seems like it would just paper over the issue of improper use of pthread_atfork rather than address the fundamental problem.
msg314844 - (view) Author: Ɓukasz Langa (lukasz.langa) * (Python committer) Date: 2018-04-03 00:57
Greg, the flip side is that now any Python user is at the mercy of third-party library providers to do the right thing in terms of pthread_atfork(). It won't always be feasible for the user to influence the third party to fix the problem. You're right that vfork() won't solve the fundamental issue but will at least remove one case where it currently causes crashes.
msg314928 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2018-04-04 07:13
A raw os.posix_spawn() API has been added to 3.7.

vfork() would likely all sorts of other problems.  It is extremely complicated and implementations have bugs.  CERT says never to use it.
  https://wiki.sei.cmu.edu/confluence/pages/viewpage.action?pageId=87152373

When forking or threading is involved, you are already always at the mercy of all other code running in your process.  This is not new.

That some C/C++ libraries misuse pthread_atfork() is not a surprise.  But the bug is theirs.

We should not try to paper over others C and system call mistakes.  Their code needs fixing.
History
Date User Action Args
2018-04-04 07:13:55gregory.p.smithsetstatus: open -> closed
assignee: gregory.p.smith
resolution: rejected
stage: resolved
2018-04-04 07:13:29gregory.p.smithsetmessages: + msg314928
2018-04-03 01:24:19lukasz.langasetversions: - Python 3.7
2018-04-03 00:57:37lukasz.langasetnosy: + lukasz.langa
messages: + msg314844
2017-10-20 20:14:47gregory.p.smithsettype: behavior -> enhancement
components: + Extension Modules, - Interpreter Core
2017-10-20 20:14:29gregory.p.smithsetmessages: + msg304677
versions: - Python 2.7, Python 3.4, Python 3.5, Python 3.6
2017-10-20 19:30:59ned.deilysetnosy: + gregory.p.smith
2017-10-20 13:04:35Albert.Zeyersetmessages: + msg304653
2017-10-19 10:17:55Albert.Zeyersetmessages: + msg304613
2017-10-18 16:24:24Albert.Zeyercreate