Non-reentrant signal handler (test_multiprocessing_forkserver hangs) #74888

vstinner · 2017-06-19T15:20:57Z

BPO	30703
Nosy	@ronaldoussoren, @pitrou, @kristjanvalur, @vstinner, @ned-deily, @njsmith, @ericsnowcurrently, @mattbillenstein
PRs	[WIP] bpo-30703: More reentrant signal handler #2408 bpo-30703: Improve signal delivery #2415 [3.6] bpo-30703: Improve signal delivery (GH-2415) #2527 bpo-30854: Fix compile error when --without-threads #2581

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2017-07-01.17:14:23.115>
created_at = <Date 2017-06-19.15:20:56.969>
labels = ['OS-mac', 'type-bug', 'tests', '3.7']
title = 'Non-reentrant signal handler (test_multiprocessing_forkserver hangs)'
updated_at = <Date 2019-03-29.16:08:52.590>
user = 'https://github.com/vstinner'

bugs.python.org fields:

activity = <Date 2019-03-29.16:08:52.590>
actor = 'eric.snow'
assignee = 'none'
closed = True
closed_date = <Date 2017-07-01.17:14:23.115>
closer = 'pitrou'
components = ['macOS', 'Tests']
creation = <Date 2017-06-19.15:20:56.969>
creator = 'vstinner'
dependencies = []
files = []
hgrepos = []
issue_num = 30703
keywords = ['buildbot']
message_count = 25.0
messages = ['296348', '296376', '296419', '296420', '296460', '296641', '296774', '296775', '296777', '296778', '296824', '296876', '296879', '296881', '296895', '296897', '296903', '296905', '297208', '297234', '297361', '297405', '297413', '297418', '297484']
nosy_count = 9.0
nosy_names = ['ronaldoussoren', 'pitrou', 'kristjan.jonsson', 'vstinner', 'ned.deily', 'njs', 'neologix', 'eric.snow', 'mattbillenstein']
pr_nums = ['2408', '2415', '2527', '2581']
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue30703'
versions = ['Python 3.6', 'Python 3.7']

vstinner · 2017-06-19T15:20:57Z

regrtest hangs while running test_multiprocessing_forkserver in a worker process on the "x86-64 Sierra 3.x" buildbot.

Bug seen twice: build 343 and 337.

http://buildbot.python.org/all/builders/x86-64%20Sierra%203.x/builds/337/steps/test/logs/stdio

Run tests in parallel using 2 child processes
...
0:29:36 load avg: 5.24 [402/406] test_deque passed -- running: test_multiprocessing_forkserver (1346 sec)
0:29:37 load avg: 5.24 [403/406] test_descrtut passed -- running: test_multiprocessing_forkserver (1346 sec)
0:29:37 load avg: 4.90 [404/406] test_weakset passed -- running: test_multiprocessing_forkserver (1347 sec)
0:29:40 load avg: 4.90 [405/406] test_distutils passed -- running: test_multiprocessing_forkserver (1350 sec)

command timed out: 1200 seconds without output running ['make', 'buildbottest', 'TESTOPTS=-j2', 'TESTPYTHONOPTS=', 'TESTTIMEOUT=900'], attempting to kill
process killed by signal 9
program finished with exit code -1
elapsedTime=2983.908679

vstinner · 2017-06-19T21:17:11Z

Different issue, but same behaviour: test hangs and then killed by buildbot, whereas a single test was still running.

See also bpo-30351 which tracks similar bugs, but on Python 2.7.

I really hate such bugs :-(

http://buildbot.python.org/all/builders/AMD64%20Debian%20root%203.x/builds/928/steps/test/logs/stdio

...
0:15:10 load avg: 0.52 [404/406] test_unicodedata passed -- running: test_pydoc (401 sec)
0:15:10 load avg: 0.52 [405/406] test_exception_hierarchy passed -- running: test_pydoc (401 sec)

command timed out: 1200 seconds without output running ['make', 'buildbottest', 'TESTOPTS=-j2', 'TESTPYTHONOPTS=', 'TESTTIMEOUT=900'], attempting to kill
process killed by signal 9
program finished with exit code -1
elapsedTime=2111.415970

mattbillenstein · 2017-06-20T12:49:28Z

I've been debugging this and I can repro on El Capitan on a different machine as well -- it's an infrequent hang, I've been running in a loop:

mattb@mattb-mbp:~/src/misc/cpython master$ for i in $(seq 1000); do echo "Run: $i -- $(date)"; time./python.exe -m test --timeout 180 test_multiprocessing_forkserver || break; sleep 10; done

And in this case after ~30 successful runs:

Run: 31 -- Tue Jun 20 05:12:19 PDT 2017
Run tests sequentially
0:00:00 load avg: 4.28 [1/1] test_multiprocessing_forkserver
Timeout (0:03:00)!
Thread 0x00007fffa1b3a3c0 (most recent call first):
File "/Users/mattb/src/misc/cpython/Lib/selectors.py", line 415 in select
File "/Users/mattb/src/misc/cpython/Lib/multiprocessing/connection.py", line 916 in wait
File "/Users/mattb/src/misc/cpython/Lib/multiprocessing/popen_fork.py", line 45 in wait
File "/Users/mattb/src/misc/cpython/Lib/multiprocessing/process.py", line 121 in join
File "/Users/mattb/src/misc/cpython/Lib/test/test_multiprocessing.py", line 431 in test_many_processes
File "/Users/mattb/src/misc/cpython/Lib/unittest/case.py", line 605 in run
File "/Users/mattb/src/misc/cpython/Lib/unittest/case.py", line 653 in __call_
File "/Users/mattb/src/misc/cpython/Lib/unittest/suite.py", line 122 in run
File "/Users/mattb/src/misc/cpython/Lib/unittest/suite.py", line 84 in __call__
File "/Users/mattb/src/misc/cpython/Lib/unittest/suite.py", line 122 in run
File "/Users/mattb/src/misc/cpython/Lib/unittest/suite.py", line 84 in __call__
File "/Users/mattb/src/misc/cpython/Lib/unittest/suite.py", line 122 in run
File "/Users/mattb/src/misc/cpython/Lib/unittest/suite.py", line 84 in __call__
File "/Users/mattb/src/misc/cpython/Lib/test/support/init.py", line 1772 in run
File "/Users/mattb/src/misc/cpython/Lib/test/support/init.py", line 1896 in _run_suite
File "/Users/mattb/src/misc/cpython/Lib/test/support/init.py", line 1936 in run_unittest
File "/Users/mattb/src/misc/cpython/Lib/test/libregrtest/runtest.py", line 168 in test_runner
File "/Users/mattb/src/misc/cpython/Lib/test/libregrtest/runtest.py", line 169 in runtest_inner
File "/Users/mattb/src/misc/cpython/Lib/test/libregrtest/runtest.py", line 137 in runtest
File "/Users/mattb/src/misc/cpython/Lib/test/libregrtest/main.py", line 374 in run_tests_sequential
File "/Users/mattb/src/misc/cpython/Lib/test/libregrtest/main.py", line 454 in run_tests
File "/Users/mattb/src/misc/cpython/Lib/test/libregrtest/main.py", line 530 in _main
File "/Users/mattb/src/misc/cpython/Lib/test/libregrtest/main.py", line 504 in main
File "/Users/mattb/src/misc/cpython/Lib/test/libregrtest/main.py", line 573 in main
File "/Users/mattb/src/misc/cpython/Lib/test/main.py", line 2 in <module>
File "/Users/mattb/src/misc/cpython/Lib/runpy.py", line 85 in _run_code
File "/Users/mattb/src/misc/cpython/Lib/runpy.py", line 193 in _run_module_as_main

real 3m0.115s
user 0m5.804s
sys 0m0.683s

I also ran ~600 passes overnight on this test on an Ubuntu 16.04 machine without a hang, so it seems to be OSX specific.

vstinner · 2017-06-20T12:53:02Z

I've been debugging this and I can repro on El Capitan on a different machine as well -- it's an infrequent hang, I've been running in a loop:

Hum, a timeout of 3 minutes seems short for test_multiprocessing_forkserver. How long does it take *usually* to run this test?

At least, we have a first clue: test_many_processes() of Lib/test/_test_multiprocessing.py.

mattbillenstein · 2017-06-20T15:40:00Z

It consistently takes between ~61 and ~73 seconds with this setup.

pitrou · 2017-06-22T16:11:26Z

Matt, if you try the following command, it will run the specific test in a loop in verbose mode:

$ ./python -m test --timeout=30 -F -m test_many_processes -v test_multiprocessing_forkserver

mattbillenstein · 2017-06-24T15:56:56Z

Cool -- do you need me to do something more to help debug this?

pitrou · 2017-06-24T15:57:56Z

It would be nice to know if there is additional output (for example exceptions happening in other threads or processes) when you run that command and manage to trigger a hang.

mattbillenstein · 2017-06-24T16:56:20Z

I don't see anything odd -- it runs for awhile and then times out once it's deadlocked:

0:03:18 load avg: 3.20 [224] test_multiprocessing_forkserver
test_many_processes (test.test_multiprocessing_forkserver.WithProcessesTestProcess) ... ok
test_many_processes (test.test_multiprocessing_forkserver.WithThreadsTestProcess) ... skipped 'test not appropriate for threads'

----------------------------------------------------------------------
Ran 2 tests in 0.849s

OK (skipped=1)
0:03:18 load avg: 3.11 [225] test_multiprocessing_forkserver
test_many_processes (test.test_multiprocessing_forkserver.WithProcessesTestProcess) ... ok
test_many_processes (test.test_multiprocessing_forkserver.WithThreadsTestProcess) ... skipped 'test not appropriate for threads'

----------------------------------------------------------------------
Ran 2 tests in 0.865s

OK (skipped=1)
0:03:19 load avg: 3.11 [226] test_multiprocessing_forkserver
test_many_processes (test.test_multiprocessing_forkserver.WithProcessesTestProcess) ... Timeout (0:00:30)!
Thread 0x00007fff790c1000 (most recent call first):
File "/Users/mattb/src/misc/cpython/Lib/selectors.py", line 415 in select
File "/Users/mattb/src/misc/cpython/Lib/multiprocessing/connection.py", line 916 in wait
File "/Users/mattb/src/misc/cpython/Lib/multiprocessing/popen_fork.py", line 45 in wait
File "/Users/mattb/src/misc/cpython/Lib/multiprocessing/process.py", line 121 in join
File "/Users/mattb/src/misc/cpython/Lib/test/test_multiprocessing.py", line 431 in test_many_processes
File "/Users/mattb/src/misc/cpython/Lib/unittest/case.py", line 605 in run
File "/Users/mattb/src/misc/cpython/Lib/unittest/case.py", line 653 in __call_
File "/Users/mattb/src/misc/cpython/Lib/unittest/suite.py", line 122 in run
File "/Users/mattb/src/misc/cpython/Lib/unittest/suite.py", line 84 in __call__
File "/Users/mattb/src/misc/cpython/Lib/unittest/suite.py", line 122 in run
File "/Users/mattb/src/misc/cpython/Lib/unittest/suite.py", line 84 in __call__
File "/Users/mattb/src/misc/cpython/Lib/unittest/suite.py", line 122 in run
File "/Users/mattb/src/misc/cpython/Lib/unittest/suite.py", line 84 in __call__
File "/Users/mattb/src/misc/cpython/Lib/unittest/runner.py", line 176 in run
File "/Users/mattb/src/misc/cpython/Lib/test/support/init.py", line 1896 in _run_suite
File "/Users/mattb/src/misc/cpython/Lib/test/support/init.py", line 1936 in run_unittest
File "/Users/mattb/src/misc/cpython/Lib/test/libregrtest/runtest.py", line 168 in test_runner
File "/Users/mattb/src/misc/cpython/Lib/test/libregrtest/runtest.py", line 169 in runtest_inner
File "/Users/mattb/src/misc/cpython/Lib/test/libregrtest/runtest.py", line 137 in runtest
File "/Users/mattb/src/misc/cpython/Lib/test/libregrtest/main.py", line 374 in run_tests_sequential
File "/Users/mattb/src/misc/cpython/Lib/test/libregrtest/main.py", line 454 in run_tests
File "/Users/mattb/src/misc/cpython/Lib/test/libregrtest/main.py", line 530 in _main
File "/Users/mattb/src/misc/cpython/Lib/test/libregrtest/main.py", line 504 in main
File "/Users/mattb/src/misc/cpython/Lib/test/libregrtest/main.py", line 573 in main
File "/Users/mattb/src/misc/cpython/Lib/test/main.py", line 2 in <module>
File "/Users/mattb/src/misc/cpython/Lib/runpy.py", line 85 in _run_code
File "/Users/mattb/src/misc/cpython/Lib/runpy.py", line 193 in _run_module_as_main

pitrou · 2017-06-24T16:59:17Z

Thanks for trying :-) Hmm, that's annoying. I don't know if you'd like to give me shell access to the machine (and a CPython checkout I can play with, perhaps)?

mattbillenstein · 2017-06-25T14:20:41Z

Yes, I'll email you the details.

pitrou · 2017-06-26T11:46:23Z

Ok, I think I've managed to dig to the core issue. It is actually the same issue as https://bugs.python.org/issue11768 (which was wrongly closed as fixed, apparently :-)).

Py_AddPendingCall() calls PyThread_acquire_lock() to try and take the pending calls lock. Unfortunately, PyThread_acquire_lock() is not reentrant in the case where semaphores are not used (e.g. on OS X). We can probably fix that first issue by calling pthread_mutex_trylock() instead of pthread_mutex_lock().

There is a second more fundamental issue, though, which is that PyThread_acquire_lock() calls into functions that are not async-signal-safe (see http://man7.org/linux/man-pages/man7/signal-safety.7.html for a list). So, depending on the particular OS and libc implementation, PyThread_acquire_lock() can fail in mysterious ways (including hang the process) when called from a signal handler.

So perhaps the ultimate fix would be to remove the OS-based locking in Py_AddPendingCall and use a busy spinwait... The performance implications may be bad, though.

pitrou · 2017-06-26T11:51:14Z

Using pthread_mutex_trylock() *and* disabling the CHECK_STATUS_PTHREAD() calls (which use the non-async-signal-safe fprintf()) at least seems to suppress the hangs on Matt's OS X machine (after more than 1000 runs).

vstinner · 2017-06-26T11:55:09Z

It is actually the same issue as https://bugs.python.org/issue11768 (which was wrongly closed as fixed, apparently :-))

Well, don't be confused by the issue title. The only made change is to only call Py_AddPendingCall() only once, instead of calling it for each received signal.

I don't know if the signal handler is really reentrant or not.

Signal handling is hard :-(

vstinner · 2017-06-26T13:44:41Z

If I understood corretly, the problem is that the Python C signal handler is not reentrant because it calls Py_AddPendingCall() which uses a lock and a list.

Before, the signal handler queued a new call to checksignals_witharg() (which just calls PyErr_CheckSignals()) for each received signal. Now, we only queue a single call to checksignals_witharg().

To prevent reentrency issues, can't we hardcoded a call to PyErr_CheckSignals() in ceval.c when SIGNAL_PENDING_CALLS() is called?

Py_AddPendingCall() feature is rarely used, it's mostly used for processing signals, no? Calling PyErr_CheckSignals() when no signal was received is cheap, so it shouldn't hurt.

pitrou · 2017-06-26T13:50:51Z

I think Kristjan uses Py_AddPendingCall, so ideally we would make it more reliable. However, we're right that for the purpose of signal delivery, we can hardcall a call inside Py_MakePendingCalls().

pitrou · 2017-06-26T14:25:49Z

It seems #2408 alone solves the issue on Matt's machine. It's also a pleasantly simple patch :-)

kristjanvalur · 2017-06-26T14:55:13Z

Thanks for the mention, @pitrou.
CCP was using Py_AddPendingCall but not from signal handlers, but external threads. Also on windows only.
You'll also be happy to know that I have left CCP and the Eve codebase is being kept stable while regularly adding security patches from the 2.7 codebase, as far as I know :)

pitrou · 2017-06-28T19:03:58Z

For the record, #2415 is ready.

pitrou · 2017-06-28T21:29:31Z

New changeset c08177a by Antoine Pitrou in branch 'master':
bpo-30703: Improve signal delivery (bpo-2415)
c08177a

pitrou · 2017-06-30T08:37:49Z

I will wait a bit and then backport this to 3.6.

mattbillenstein · 2017-06-30T14:14:49Z

Verified this by letting the test_many_processes loop overnight (master@42bc8beadd49)-- 34k passes later and no hangs. Nice work!

vstinner · 2017-06-30T14:47:33Z

I concur with Matt: nice job Antoine, thanks for making Python more reliable ;-)

pitrou · 2017-06-30T15:28:06Z

Thanks for running such a lengthy test, Matt :-)

pitrou · 2017-07-01T17:12:08Z

New changeset 3024c05 by Antoine Pitrou in branch '3.6':
[3.6] bpo-30703: Improve signal delivery (GH-2415) (bpo-2527)
3024c05

vstinner added 3.7 (EOL) end of life tests Tests in the Lib/test dir OS-mac labels Jun 19, 2017

vstinner changed the title ~~test_multiprocessing_forkserver hangs on x86-64 Sierra 3.x. buildbot~~ regrtest hangs on the master branch Jun 19, 2017

pitrou changed the title ~~regrtest hangs on the master branch~~ test_multiprocessing_forkserver hangs on the master branch Jun 22, 2017

pitrou added the type-bug An unexpected behavior, bug, or error label Jun 26, 2017

vstinner changed the title ~~test_multiprocessing_forkserver hangs on the master branch~~ Non-reentrant signal handler (test_multiprocessing_forkserver hangs) Jun 26, 2017

pitrou closed this as completed Jul 1, 2017

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Non-reentrant signal handler (test_multiprocessing_forkserver hangs) #74888

Non-reentrant signal handler (test_multiprocessing_forkserver hangs) #74888

vstinner commented Jun 19, 2017

vstinner commented Jun 19, 2017

vstinner commented Jun 19, 2017

mattbillenstein commented Jun 20, 2017

vstinner commented Jun 20, 2017

mattbillenstein commented Jun 20, 2017

pitrou commented Jun 22, 2017

mattbillenstein commented Jun 24, 2017

pitrou commented Jun 24, 2017

mattbillenstein commented Jun 24, 2017

pitrou commented Jun 24, 2017

mattbillenstein commented Jun 25, 2017

pitrou commented Jun 26, 2017

pitrou commented Jun 26, 2017

vstinner commented Jun 26, 2017

vstinner commented Jun 26, 2017

pitrou commented Jun 26, 2017

pitrou commented Jun 26, 2017

kristjanvalur mannequin commented Jun 26, 2017

pitrou commented Jun 28, 2017

pitrou commented Jun 28, 2017

pitrou commented Jun 30, 2017

mattbillenstein commented Jun 30, 2017

vstinner commented Jun 30, 2017

pitrou commented Jun 30, 2017

pitrou commented Jul 1, 2017

Non-reentrant signal handler (test_multiprocessing_forkserver hangs) #74888

Non-reentrant signal handler (test_multiprocessing_forkserver hangs) #74888

Comments

vstinner commented Jun 19, 2017

vstinner commented Jun 19, 2017

vstinner commented Jun 19, 2017

mattbillenstein commented Jun 20, 2017

vstinner commented Jun 20, 2017

mattbillenstein commented Jun 20, 2017

pitrou commented Jun 22, 2017

mattbillenstein commented Jun 24, 2017

pitrou commented Jun 24, 2017

mattbillenstein commented Jun 24, 2017

pitrou commented Jun 24, 2017

mattbillenstein commented Jun 25, 2017

pitrou commented Jun 26, 2017

pitrou commented Jun 26, 2017

vstinner commented Jun 26, 2017

vstinner commented Jun 26, 2017

pitrou commented Jun 26, 2017

pitrou commented Jun 26, 2017

kristjanvalur mannequin commented Jun 26, 2017

pitrou commented Jun 28, 2017

pitrou commented Jun 28, 2017

pitrou commented Jun 30, 2017

mattbillenstein commented Jun 30, 2017

vstinner commented Jun 30, 2017

pitrou commented Jun 30, 2017

pitrou commented Jul 1, 2017