New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Non-reentrant signal handler (test_multiprocessing_forkserver hangs) #74888
Comments
regrtest hangs while running test_multiprocessing_forkserver in a worker process on the "x86-64 Sierra 3.x" buildbot. Bug seen twice: build 343 and 337. http://buildbot.python.org/all/builders/x86-64%20Sierra%203.x/builds/337/steps/test/logs/stdio Run tests in parallel using 2 child processes command timed out: 1200 seconds without output running ['make', 'buildbottest', 'TESTOPTS=-j2', 'TESTPYTHONOPTS=', 'TESTTIMEOUT=900'], attempting to kill |
Different issue, but same behaviour: test hangs and then killed by buildbot, whereas a single test was still running. See also bpo-30351 which tracks similar bugs, but on Python 2.7. I really hate such bugs :-( http://buildbot.python.org/all/builders/AMD64%20Debian%20root%203.x/builds/928/steps/test/logs/stdio ... command timed out: 1200 seconds without output running ['make', 'buildbottest', 'TESTOPTS=-j2', 'TESTPYTHONOPTS=', 'TESTTIMEOUT=900'], attempting to kill |
I've been debugging this and I can repro on El Capitan on a different machine as well -- it's an infrequent hang, I've been running in a loop: mattb@mattb-mbp:~/src/misc/cpython master$ for i in $(seq 1000); do echo "Run: And in this case after ~30 successful runs: Run: 31 -- Tue Jun 20 05:12:19 PDT 2017 real 3m0.115s I also ran ~600 passes overnight on this test on an Ubuntu 16.04 machine without a hang, so it seems to be OSX specific. |
Hum, a timeout of 3 minutes seems short for test_multiprocessing_forkserver. How long does it take *usually* to run this test? At least, we have a first clue: test_many_processes() of Lib/test/_test_multiprocessing.py. |
It consistently takes between ~61 and ~73 seconds with this setup. |
Matt, if you try the following command, it will run the specific test in a loop in verbose mode: $ ./python -m test --timeout=30 -F -m test_many_processes -v test_multiprocessing_forkserver |
Cool -- do you need me to do something more to help debug this? |
It would be nice to know if there is additional output (for example exceptions happening in other threads or processes) when you run that command and manage to trigger a hang. |
I don't see anything odd -- it runs for awhile and then times out once it's deadlocked: 0:03:18 load avg: 3.20 [224] test_multiprocessing_forkserver ---------------------------------------------------------------------- OK (skipped=1) ---------------------------------------------------------------------- OK (skipped=1) |
Thanks for trying :-) Hmm, that's annoying. I don't know if you'd like to give me shell access to the machine (and a CPython checkout I can play with, perhaps)? |
Yes, I'll email you the details. |
Ok, I think I've managed to dig to the core issue. It is actually the same issue as https://bugs.python.org/issue11768 (which was wrongly closed as fixed, apparently :-)). Py_AddPendingCall() calls PyThread_acquire_lock() to try and take the pending calls lock. Unfortunately, PyThread_acquire_lock() is not reentrant in the case where semaphores are not used (e.g. on OS X). We can probably fix that first issue by calling pthread_mutex_trylock() instead of pthread_mutex_lock(). There is a second more fundamental issue, though, which is that PyThread_acquire_lock() calls into functions that are not async-signal-safe (see http://man7.org/linux/man-pages/man7/signal-safety.7.html for a list). So, depending on the particular OS and libc implementation, PyThread_acquire_lock() can fail in mysterious ways (including hang the process) when called from a signal handler. So perhaps the ultimate fix would be to remove the OS-based locking in Py_AddPendingCall and use a busy spinwait... The performance implications may be bad, though. |
Using pthread_mutex_trylock() *and* disabling the CHECK_STATUS_PTHREAD() calls (which use the non-async-signal-safe fprintf()) at least seems to suppress the hangs on Matt's OS X machine (after more than 1000 runs). |
Well, don't be confused by the issue title. The only made change is to only call Py_AddPendingCall() only once, instead of calling it for each received signal. I don't know if the signal handler is really reentrant or not. Signal handling is hard :-( |
If I understood corretly, the problem is that the Python C signal handler is not reentrant because it calls Py_AddPendingCall() which uses a lock and a list. Before, the signal handler queued a new call to checksignals_witharg() (which just calls PyErr_CheckSignals()) for each received signal. Now, we only queue a single call to checksignals_witharg(). To prevent reentrency issues, can't we hardcoded a call to PyErr_CheckSignals() in ceval.c when SIGNAL_PENDING_CALLS() is called? Py_AddPendingCall() feature is rarely used, it's mostly used for processing signals, no? Calling PyErr_CheckSignals() when no signal was received is cheap, so it shouldn't hurt. |
I think Kristjan uses Py_AddPendingCall, so ideally we would make it more reliable. However, we're right that for the purpose of signal delivery, we can hardcall a call inside Py_MakePendingCalls(). |
It seems #2408 alone solves the issue on Matt's machine. It's also a pleasantly simple patch :-) |
Thanks for the mention, @pitrou. |
For the record, #2415 is ready. |
I will wait a bit and then backport this to 3.6. |
Verified this by letting the test_many_processes loop overnight (master@42bc8beadd49)-- 34k passes later and no hangs. Nice work! |
I concur with Matt: nice job Antoine, thanks for making Python more reliable ;-) |
Thanks for running such a lengthy test, Matt :-) |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: