classification
Title: race condition in ThreadChildWatcher (default) and MultiLoopChildWatcher
Type: Stage:
Components: asyncio Versions: Python 3.10
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: asvetlov, sourcejedi, yselivanov
Priority: normal Keywords:

Created on 2020-10-21 18:44 by sourcejedi, last changed 2020-10-22 15:09 by sourcejedi.

Messages (3)
msg379229 - (view) Author: Alan Jenkins (sourcejedi) * Date: 2020-10-21 18:44
## Test program ##

import asyncio
import time
import os
import signal
import sys

# This bug happens with the default, ThreadedChildWatcher
# It also happens with MultiLoopChildWatcher,
# but not the other three watcher types.
#asyncio.set_child_watcher(asyncio.MultiLoopChildWatcher())

# Patch os.kill to call sleep(1) first,
# opening up the window for a race condition
os_kill = os.kill
def kill(p, n):
    time.sleep(1)
    os_kill(p, n)

os.kill = kill

async def main():
    p = await asyncio.create_subprocess_exec(sys.executable, '-c', 'import sys; sys.exit(0)')
    p.send_signal(signal.SIGTERM)
    # cleanup
    await p.wait()

asyncio.run(main())


## Test output ##

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/alan-sysop/src/cpython/Lib/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/home/alan-sysop/src/cpython/Lib/asyncio/base_events.py", line 642, in run_until_complete
    return future.result()
  File "<stdin>", line 3, in main
  File "/home/alan-sysop/src/cpython/Lib/asyncio/subprocess.py", line 138, in send_signal
    self._transport.send_signal(signal)
  File "/home/alan-sysop/src/cpython/Lib/asyncio/base_subprocess.py", line 146, in send_signal
    self._proc.send_signal(signal)
  File "/home/alan-sysop/src/cpython/Lib/subprocess.py", line 2081, in send_signal
    os.kill(self.pid, sig)
  File "<stdin>", line 3, in kill
ProcessLookupError: [Errno 3] No such process


## Tested versions ##

* v3.10.0a1-121-gc60394c7fc
* python39-3.9.0-1.fc32.x86_64
* python3-3.8.6-1.fc32.x86_64


## Race condition ##

main thread	vs	ThreadedChildWatcher._do_waitpid() thread

p=create_subprocess_exec(...)
			waitpid(...)  # wait for process exit
<Process p exits>
			<waitpid() returns.  Process p is reaped, and no longer exists>
p.send_signal(9)


## Result ##

A signal is sent to p.pid, after p.pid has been reaped by waitpid().  It might raise an error because p.pid no longer exists.

In the worst case the signal will be sent successfully - because an unrelated process has started with the same PID.


## How easy is it to reproduce? ##

It turns out the window for this race condition has been kept short, due to mitigations in the subprocess module.  IIUC, the mitigation protects against incorrect parallel use of a subprocess object by multiple threads.

        def send_signal(self, sig):
            # bpo-38630: Polling reduces the risk of sending a signal to the
            # wrong process if the process completed, the Popen.returncode
            # attribute is still None, and the pid has been reassigned
            # (recycled) to a new different process. This race condition can
            # happens in two cases [...]
            self.poll()
            if self.returncode is not None:
                # Skip signalling a process that we know has already died.
                return
            os.kill(self.pid, sig)




## Possible workarounds ##

* SafeChildWatcher and FastChildWatcher should not have this defect.  However we use ThreadedChildWatcher and MultiLoopChildWatcher to support running event loops in different threads.

* PidfdChildWatcher should not have this defect.  It is only available on Linux, kernel version 5.3 or above.

It would be possible to avoid the ThreadedChildWatcher race by using native code and pthread_cancel(), so that the corresponding waitpid() call is canceled before sending a signal.  Except the current implementation of pthread_cancel() is also unsound, because of race conditions.

* https://lwn.net/Articles/683118/ "This is why we can't have safe cancellation points"
* https://sourceware.org/bugzilla/show_bug.cgi?id=12683 "Race conditions in pthread cancellation"
msg379296 - (view) Author: Alan Jenkins (sourcejedi) * Date: 2020-10-22 14:29
There's one way to fix this in MultiLoopChildWatcher (but not ThreadedChildWatcher).  Make sure the waitpid() runs on the same thread that created the subprocess.  Prototype: https://github.com/sourcejedi/cpython/commit/92f979bce4582e807facb1c274a962b3caf0d2eb

The other approach would be to copy subprocess.wait(timeout) - keep waking up regularly and polling to see if the process has exited yet.

I'm not convinced ThreadedChildWatcher is fixable.  You can't call waitpid() in one thread, and let someone call kill() in a different thread.

You could try avoiding calling waitpid() until someone does `await asyncio.subprocess.Process.wait()`.  I think I didn't like it because - what happens if you cancel a wait() task?  I think you want to cancel the waitpid() so you could safely kill() again... And we don't have any way to cancel the blocking waitpid() call, at least not from python, definitely not since PEP-475.
msg379297 - (view) Author: Alan Jenkins (sourcejedi) * Date: 2020-10-22 15:09
Put the other way, if you wanted to fix this bug in ThreadedChildWatcher, and go as far as allowing cancelling Process.wait(), followed by kill() / send_signal(), then I think you need -

* siginterrupt(SIGCHLD, 1)
* not to mind about any random C code that doesn't really handle being interrupted.  Like printf(), ho hum.  https://groups.google.com/g/comp.unix.programmer/c/QZmFw1VytYs/m/BSBXBHTI1REJ
* add & use a new call like os.waitpid_interruptible(), which doesn't restart on EINTR, as a workaround for PEP-475.
* set a "stop" flag for the watcher thread
* use threading.pthread_kill() (available on *most* python platforms) to interrupt the watcher thread.  Spoof SIGCHLD, this will avoid conflicts with a python handler for any other signal.
* wait for the watcher thread to finish using Thread.join()
* now you can safely find out whether the child process has been reaped, or whether it's safe to kill it.
History
Date User Action Args
2020-10-22 15:09:14sourcejedisetmessages: + msg379297
2020-10-22 14:29:06sourcejedisetmessages: + msg379296
2020-10-21 18:44:25sourcejedicreate