This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author byllyfish
Recipients aeros, asvetlov, byllyfish, chris.jerdonek, jstasiak, kj, miss-islington, njs, pablogsal, shreyanavigyan, vstinner, yselivanov
Date 2021-11-05.17:45:33
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1636134333.74.0.0799731326707.issue38323@roundup.psfhosted.org>
In-reply-to
Content
asyncio.MultiLoopChildWatcher has two problems that create a race condition.

1. The SIGCHLD signal handler does not guard against interruption/re-entry.
2. The SIGCHLD signal handler can interrupt add_child_handler's `self._do_waitpid(pid)`.


Symptoms:

Log messages that look like this:

1634935451.761 WARNING Unknown child process pid 8747, will report returncode 255
...
1634935451.762 WARNING Child watcher got an unexpected pid: 8747
Traceback (most recent call last):
  File "/Users/runner/hostedtoolcache/Python/3.9.7/x64/lib/python3.9/asyncio/unix_events.py", line 1306, in _do_waitpid
    loop, callback, args = self._callbacks.pop(pid)
KeyError: 8747


Background:

I've been working on a library to make calling asyncio subprocesses more convenient. As part of my testing, I've been stress testing asyncio using different child watcher policies. My CI build runs more than 200 tests each on macOS, Linux and FreeBSD. I've noticed a small percentage of sporadic failures using MultiLoopChildWatcher.

My understanding of Python signal functions is that:

1. Upon receipt of a signal, the native "C" signal handler sets a flag that indicates the signal arrived.
2. The main thread checks the signal flags after each bytecode instruction. If a signal flag is set, Python saves the call stack, runs the signal handler on the main thread immediately, then pops the stack when it returns (assuming no exception raised by signal handler).
3. If you are in the middle of a signal handler running on the main thread and Python detects another signal flag, your signal handler can be interrupted.
4. Stacked signal handlers run in LIFO order. The last one that enters will run to completion without interruption.


Explanation:

I wrapped MultiLoopChildWatcher's sig_chld function in a decorator that logs when it is entered and exited. This clearly shows that _sig_chld is being re-entered. In the following log snippet, I'm running two commands in a pipeline "tr | cat".

1634935451.743 DEBUG process '/usr/bin/tr' created: pid 8747
...
1634935451.746 DEBUG process '/bin/cat' created: pid 8748
...
1634935451.761 DEBUG enter '_sig_chld' 20
1634935451.761 DEBUG enter '_sig_chld' 20
1634935451.761 WARNING Unknown child process pid 8747, will report returncode 255
1634935451.762 DEBUG process 8748 exited with returncode 0
1634935451.762 DEBUG exit '_sig_chld' 20
1634935451.762 WARNING Child watcher got an unexpected pid: 8747
Traceback (most recent call last):
  File "/Users/runner/hostedtoolcache/Python/3.9.7/x64/lib/python3.9/asyncio/unix_events.py", line 1306, in _do_waitpid
    loop, callback, args = self._callbacks.pop(pid)
KeyError: 8747
1634935451.763 WARNING Unknown child process pid 8748, will report returncode 255
1634935451.763 WARNING Child watcher got an unexpected pid: 8748
Traceback (most recent call last):
  File "/Users/runner/hostedtoolcache/Python/3.9.7/x64/lib/python3.9/asyncio/unix_events.py", line 1306, in _do_waitpid
    loop, callback, args = self._callbacks.pop(pid)
KeyError: 8748
1634935451.763 DEBUG exit '_sig_chld' 20


Here is the breakdown of what happens:

1. Pid 8747 exits and we enter _sig_chld #1.
2. sig_chld #1 calls os.waitpid which gives (pid, status) = (8747, 0).
3. Before sig_chld #1 has a chance to call `self._callbacks.pop(pid)`, it is interrupted.
4. sig_chld #2 calls os.waitpid for pid 8747. We get a ChildProcessError and then "Unknown child process pid 8747, will report returncode 255"
5. sig_chld #2 invokes the callback for pid 8747 saying the returncode=255.
6. sig_chld #2 continues to completion. It reaps pid 8748 normally.
7. sig_chld #1 picks up where it left off. We get an error when we try to pop the callback for 8747.
8. sig_chld #1 calls os.waitpid for pid 8748. This gives us failure messages because it was done by sig_chld #2.

The issue of interruption can also happen in the case of running a single process. If the _sig_chld interrupts the call to `self._do_waitpid(pid)` in add_child_watcher, a similar interleaving can occur.


Work-Around:

In my tests, I patched MultiLoopChildWatcher and so far, it appears to be more reliable. In add_child_handler, I call raise_signal(SIGCHLD) so that all the work is done in the signal handler.


class PatchedMultiLoopChildWatcher(asyncio.MultiLoopChildWatcher):
    "Test race condition fixes in MultiLoopChildWatcher."

    def add_child_handler(self, pid, callback, *args):
        loop = asyncio.get_running_loop()
        self._callbacks[pid] = (loop, callback, args)

        # Prevent a race condition in case signal was delivered before
        # callback added.
        signal.raise_signal(signal.SIGCHLD)

    @_serialize
    def _sig_chld(self, signum, frame):
        super()._sig_chld(signum, frame)


_serialize is a decorator that looks like this:


def _serialize(func):
    """Decorator to serialize a non-reentrant signal function.
    If one client is already in the critical section, set a flag to run the
    section one more time. Testing purposes only.
    """

    lock = threading.Lock()  # Used as atomic test-and-set.
    retry = False

    @functools.wraps(func)
    def _decorator(*args, **kwargs):
        nonlocal retry

        while True:
            if lock.acquire(blocking=False):  # pylint: disable=consider-using-with
                try:
                    retry = False
                    func(*args, **kwargs)
                finally:
                    lock.release()
                if retry:
                    continue
            else:
                # A signal handler that interrupts an existing handler will
                # run to completion (LIFO).
                retry = True
            break

    return _decorator
History
Date User Action Args
2021-11-05 17:45:33byllyfishsetrecipients: + byllyfish, vstinner, njs, asvetlov, chris.jerdonek, yselivanov, jstasiak, pablogsal, miss-islington, aeros, kj, shreyanavigyan
2021-11-05 17:45:33byllyfishsetmessageid: <1636134333.74.0.0799731326707.issue38323@roundup.psfhosted.org>
2021-11-05 17:45:33byllyfishlinkissue38323 messages
2021-11-05 17:45:33byllyfishcreate