New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
asyncio: Race condition in signal handling on FreeBSD #65844
Comments
http://buildbot.python.org/all/builders/AMD64%20FreeBSD%209.0%203.4/builds/191/steps/test/logs/stdio [ 8/389] test_asyncio |
Maybe see Tulip issue 168? That test was added to support that. Maybe the fix doesn't work for this platform? |
New changeset 69d474dab479 by Victor Stinner in branch 'default': |
New changeset a0e6a370755f by Victor Stinner in branch 'default': |
New changeset dbf991650441 by Victor Stinner in branch 'default': |
Buildbot where the issue occurs: |
I created a new repository just for this issue: I added a lot of debug. It looks like a race condition: the SIGCHLD signal is handled in a thread B (a C thread, not a Python), whereas the asyncio event loop is running in a thread A (the main thread, a Python thread). Debug of the test which blocks: PIPE: (7, 8) |
I'm running tests on the buildbot "AMD64 FreeBSD 9.0 custom": |
(1) the thread B gets the SIGCHLD signal: it writes a byte into the "self" pipe of the event loop C signal_handler: sig_num=20, thread=34468857856 (2) in the main thread, the event loop is awaken by the write in the pipe... but there is nothing to do _read_from_self -> b'\x14', thread=34384933888 (3) the thread B schedules the callback Py_AddPendingCall(checksignals_witharg), thread=34468857856 (I modified the output a little bit for readability.) IMO the problem is that asyncio relies on two events:
A solution would be to schedule the callback in the event loop. Since Python 3.3, the C signal handler writes the signal number, which should be enough to find and schedule the Python callback. |
New changeset 1ff9ce2204ee by Victor Stinner in branch 'default': |
"A solution would be to schedule the callback in the event loop. Since Python 3.3, the C signal handler writes the signal number, which should be enough to find and schedule the Python callback." I wrote a patch implementing this idea to fix the issue: |
I ran test_asyncio on my own FreeBSD VM, with the sandbox/issue21645 repository and I got a new (different) error on signal handling: RuntimeError('reentrant call inside') when writing into sys.stderr. To be fair: this bug was introduced by my changes in this repository adding debug traces ("logger.error("handle signal %s" % sig)"). IMO the Py_AddPendingCall() function in Python is dangerous: the callback should be "signal-safe" but this concept is not defined Python. In the C language, there is a strict list of allowed functions, in Python it's unclear: the reentrant call is just an example. It's probably easy to deadlock if you call the "wrong" function". Py_AddPendingCall() injects arbitrary Python code between *any* other Python bytecode, it's like the evil greenlet with greenthreads. That's why I consider that it would be safer to call _handle_signal() from the event loop, not using Py_AddPendingCall(): Output of the new bug on FreeBSD: Traceback (most recent call last):
File "/usr/home/haypo/prog/python/default/Lib/logging/__init__.py", line 980, in emit
stream.write(self.terminator)
RuntimeError: reentrant call inside <_io.BufferedWriter name='<stderr>'>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/home/haypo/prog/python/default/Lib/logging/__init__.py", line 980, in emit
stream.write(self.terminator)
File "/usr/home/haypo/prog/python/default/Lib/asyncio/unix_events.py", line 95, in _handle_signal
logger.error("handle signal %s" % sig)
File "/usr/home/haypo/prog/python/default/Lib/logging/__init__.py", line 1302, in error
self._log(ERROR, msg, args, **kwargs)
File "/usr/home/haypo/prog/python/default/Lib/logging/__init__.py", line 1408, in _log
self.handle(record)
File "/usr/home/haypo/prog/python/default/Lib/logging/__init__.py", line 1418, in handle
self.callHandlers(record)
File "/usr/home/haypo/prog/python/default/Lib/logging/__init__.py", line 1480, in callHandlers
hdlr.handle(record)
File "/usr/home/haypo/prog/python/default/Lib/logging/__init__.py", line 852, in handle
self.emit(record)
File "/usr/home/haypo/prog/python/default/Lib/logging/__init__.py", line 983, in emit
self.handleError(record)
File "/usr/home/haypo/prog/python/default/Lib/logging/__init__.py", line 904, in handleError
sys.stderr.write('--- Logging error ---\n')
RuntimeError: reentrant call inside <_io.BufferedWriter name='<stderr>'>
Call stack:
File "/usr/home/haypo/prog/python/default/Lib/runpy.py", line 170, in _run_module_as_main
"__main__", mod_spec)
File "/usr/home/haypo/prog/python/default/Lib/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/usr/home/haypo/prog/python/default/Lib/test/__main__.py", line 3, in <module>
regrtest.main_in_temp_cwd()
File "/usr/home/haypo/prog/python/default/Lib/test/regrtest.py", line 1562, in main_in_temp_cwd
main()
File "/usr/home/haypo/prog/python/default/Lib/test/regrtest.py", line 763, in main
match_tests=ns.match_tests)
File "/usr/home/haypo/prog/python/default/Lib/test/regrtest.py", line 978, in runtest
display_failure=not verbose)
File "/usr/home/haypo/prog/python/default/Lib/test/regrtest.py", line 1278, in runtest_inner
test_runner()
File "/usr/home/haypo/prog/python/default/Lib/test/test_asyncio/__init__.py", line 29, in test_main
run_unittest(suite())
File "/usr/home/haypo/prog/python/default/Lib/test/support/__init__.py", line 1758, in run_unittest
_run_suite(suite)
File "/usr/home/haypo/prog/python/default/Lib/test/support/__init__.py", line 1724, in _run_suite
result = runner.run(suite)
File "/usr/home/haypo/prog/python/default/Lib/unittest/runner.py", line 168, in run
test(result)
File "/usr/home/haypo/prog/python/default/Lib/unittest/suite.py", line 87, in __call__
return self.run(*args, **kwds)
File "/usr/home/haypo/prog/python/default/Lib/unittest/suite.py", line 125, in run
test(result)
File "/usr/home/haypo/prog/python/default/Lib/unittest/suite.py", line 87, in __call__
return self.run(*args, **kwds)
File "/usr/home/haypo/prog/python/default/Lib/unittest/suite.py", line 125, in run
test(result)
File "/usr/home/haypo/prog/python/default/Lib/unittest/suite.py", line 87, in __call__
return self.run(*args, **kwds)
File "/usr/home/haypo/prog/python/default/Lib/unittest/suite.py", line 125, in run
test(result)
File "/usr/home/haypo/prog/python/default/Lib/unittest/case.py", line 625, in __call__
return self.run(*args, **kwds)
File "/usr/home/haypo/prog/python/default/Lib/unittest/case.py", line 577, in run
testMethod()
File "/usr/home/haypo/prog/python/default/Lib/test/test_asyncio/test_streams.py", line 647, in test_read_all_from_pipe_reader
self.loop.run_until_complete(proc.wait())
File "/usr/home/haypo/prog/python/default/Lib/asyncio/base_events.py", line 265, in run_until_complete
self.run_forever()
File "/usr/home/haypo/prog/python/default/Lib/asyncio/base_events.py", line 238, in run_forever
self._run_once()
File "/usr/home/haypo/prog/python/default/Lib/asyncio/base_events.py", line 984, in _run_once
logger.error("[pid %s] _run_once: thread=%s" % (os.getpid(), threading.get_ident()))
Message: '[pid 1085] _run_once: thread=34380739584' |
IMO we must handle signals correctly when there are more than one thread. On Linux, it looks like the C signal handler is always called from the main thread. On FreeBSD, it looks like the C signal handler can be called in any thread, C thread, Python thread holding the GIL, Python thread not holding the GIL. I found why test_read_all_from_pipe_reader() is running with more than one thread. I reproduced the issue on my FreeBSD VM. run_test_server() and run_test_unix_server() functions of test_utils create Python threads, these functions are used in the test_asyncio. The problem is that the PyThreadState_DeleteCurrent() function (called from t_bootstrap() of _threadmodule.c) may yield the CPU to another thread. Other Python threads are running when the line "MUTEX_UNLOCK(gil_mutex);" is executed in the drop_gil() function, before drop_gil() exits. t_bootstrap() -> PyThreadState_DeleteCurrent() -> PyEval_ReleaseLock() -> drop_gil(NULL) It means that Python threads are not deleted immediatly, but may be deleted "later". |
I did my meditation. I now understand the race:
The intention if the code was that the Py_AddPendingCall() would complete before the thread switch. Perhaps a CPython fix could be to call it before writing a byte to the pipe??? But that would require someone else to meditate more on the C code. I think your patch to Tulip is fine. I wouldn't worry about missing a signal when things are busy -- AFAIK the child process reaper will reap multiple processes. |
In fact I also expected Py_AddPendingCall() to be called *before* writing the signal number into the "wakeup fd". I guess that before nobody was relying on the "wakeup fd" *and* and the callback.
It doesn't work for Tulip which targets Python 3.3. Oh... I just saw that Python 2.7 calls Py_AddPendingCall() and *then* writes "\0" into wakeup fd... So Python 2.7 does not have the issue! But yes, we may change the C signal handler in Python 3.5 because it makes more sense to call Py_AddPendingCall() to avoid this specific race condition.
Good :) I will test it on FreeBSD. If it fixes the issue, I will apply it. We will see if it makes all buildbots happy.
Well, Py_AddPendingCall() also has a limit which is smaller and it does *not* warn when signals are dropped :-( -- I don't know if it can be useful, but there is also the signal.pthread_sigmask() which can be used to "queue" pending signals. Unblocking signals will flush the queue. It might help to mitigate burst of signals in a short window.... I don't know if signal.pthread_sigmask() has a longer queue than Python. And it may be tricky to decide when and where the signals should be blocked and unblocked. If a signal is blocked and the process gets the signal more than once, the signal is only raised once when the signal is unblocked. I don't know if it is an issue or not if asyncio only calls the signal handler once in this case. |
Oh, I didn't remember that I opened the issue bpo-12060 to discuss "real time signals". The summary is that the wakeup fd should be used... Which is nice because it is the choice made by asyncio ;-) FYI I also opened the issue bpo-21998: "asyncio: a new self-pipe should be created in the child process after fork" which is somehow related to this issue. |
The more esoteric system calls you use the more you end up debugging it on On Thu, Jul 17, 2014 at 9:17 AM, STINNER Victor <report@bugs.python.org>
|
New changeset c4f053f1b47f by Victor Stinner in branch '3.4': New changeset 2176496951a4 by Victor Stinner in branch 'default': |
I commited the patch into Tulip (c149370c8027), Python 3.4 (c4f053f1b47f), Python 3.5 (2176496951a4). I'm now waiting for the buildbot. |
The test suite passed on FreeBSD 9 buildbot (3.4, 3.5 and my custom builds). I consider that the bug is now fixed. |
New changeset fe456770b454 by Yury Selivanov in branch 'default': |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: