Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Non-reentrant signal handler (test_multiprocessing_forkserver hangs) #74888

Closed
vstinner opened this issue Jun 19, 2017 · 25 comments
Closed

Non-reentrant signal handler (test_multiprocessing_forkserver hangs) #74888

vstinner opened this issue Jun 19, 2017 · 25 comments
Labels
3.7 (EOL) end of life OS-mac tests Tests in the Lib/test dir type-bug An unexpected behavior, bug, or error

Comments

@vstinner
Copy link
Member

BPO 30703
Nosy @ronaldoussoren, @pitrou, @kristjanvalur, @vstinner, @ned-deily, @njsmith, @ericsnowcurrently, @mattbillenstein
PRs
  • [WIP] bpo-30703: More reentrant signal handler #2408
  • bpo-30703: Improve signal delivery #2415
  • [3.6] bpo-30703: Improve signal delivery (GH-2415) #2527
  • bpo-30854: Fix compile error when --without-threads #2581
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2017-07-01.17:14:23.115>
    created_at = <Date 2017-06-19.15:20:56.969>
    labels = ['OS-mac', 'type-bug', 'tests', '3.7']
    title = 'Non-reentrant signal handler (test_multiprocessing_forkserver hangs)'
    updated_at = <Date 2019-03-29.16:08:52.590>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2019-03-29.16:08:52.590>
    actor = 'eric.snow'
    assignee = 'none'
    closed = True
    closed_date = <Date 2017-07-01.17:14:23.115>
    closer = 'pitrou'
    components = ['macOS', 'Tests']
    creation = <Date 2017-06-19.15:20:56.969>
    creator = 'vstinner'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 30703
    keywords = ['buildbot']
    message_count = 25.0
    messages = ['296348', '296376', '296419', '296420', '296460', '296641', '296774', '296775', '296777', '296778', '296824', '296876', '296879', '296881', '296895', '296897', '296903', '296905', '297208', '297234', '297361', '297405', '297413', '297418', '297484']
    nosy_count = 9.0
    nosy_names = ['ronaldoussoren', 'pitrou', 'kristjan.jonsson', 'vstinner', 'ned.deily', 'njs', 'neologix', 'eric.snow', 'mattbillenstein']
    pr_nums = ['2408', '2415', '2527', '2581']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue30703'
    versions = ['Python 3.6', 'Python 3.7']

    @vstinner
    Copy link
    Member Author

    regrtest hangs while running test_multiprocessing_forkserver in a worker process on the "x86-64 Sierra 3.x" buildbot.

    Bug seen twice: build 343 and 337.

    http://buildbot.python.org/all/builders/x86-64%20Sierra%203.x/builds/337/steps/test/logs/stdio

    Run tests in parallel using 2 child processes
    ...
    0:29:36 load avg: 5.24 [402/406] test_deque passed -- running: test_multiprocessing_forkserver (1346 sec)
    0:29:37 load avg: 5.24 [403/406] test_descrtut passed -- running: test_multiprocessing_forkserver (1346 sec)
    0:29:37 load avg: 4.90 [404/406] test_weakset passed -- running: test_multiprocessing_forkserver (1347 sec)
    0:29:40 load avg: 4.90 [405/406] test_distutils passed -- running: test_multiprocessing_forkserver (1350 sec)

    command timed out: 1200 seconds without output running ['make', 'buildbottest', 'TESTOPTS=-j2', 'TESTPYTHONOPTS=', 'TESTTIMEOUT=900'], attempting to kill
    process killed by signal 9
    program finished with exit code -1
    elapsedTime=2983.908679

    @vstinner vstinner added 3.7 (EOL) end of life tests Tests in the Lib/test dir OS-mac labels Jun 19, 2017
    @vstinner
    Copy link
    Member Author

    Different issue, but same behaviour: test hangs and then killed by buildbot, whereas a single test was still running.

    See also bpo-30351 which tracks similar bugs, but on Python 2.7.

    I really hate such bugs :-(

    http://buildbot.python.org/all/builders/AMD64%20Debian%20root%203.x/builds/928/steps/test/logs/stdio

    ...
    0:15:10 load avg: 0.52 [404/406] test_unicodedata passed -- running: test_pydoc (401 sec)
    0:15:10 load avg: 0.52 [405/406] test_exception_hierarchy passed -- running: test_pydoc (401 sec)

    command timed out: 1200 seconds without output running ['make', 'buildbottest', 'TESTOPTS=-j2', 'TESTPYTHONOPTS=', 'TESTTIMEOUT=900'], attempting to kill
    process killed by signal 9
    program finished with exit code -1
    elapsedTime=2111.415970

    @vstinner vstinner changed the title test_multiprocessing_forkserver hangs on x86-64 Sierra 3.x. buildbot regrtest hangs on the master branch Jun 19, 2017
    @mattbillenstein
    Copy link

    I've been debugging this and I can repro on El Capitan on a different machine as well -- it's an infrequent hang, I've been running in a loop:

    mattb@mattb-mbp:~/src/misc/cpython master$ for i in $(seq 1000); do echo "Run: $i -- $(date)"; time./python.exe -m test --timeout 180 test_multiprocessing_forkserver || break; sleep 10; done

    And in this case after ~30 successful runs:

    Run: 31 -- Tue Jun 20 05:12:19 PDT 2017
    Run tests sequentially
    0:00:00 load avg: 4.28 [1/1] test_multiprocessing_forkserver
    Timeout (0:03:00)!
    Thread 0x00007fffa1b3a3c0 (most recent call first):
    File "/Users/mattb/src/misc/cpython/Lib/selectors.py", line 415 in select
    File "/Users/mattb/src/misc/cpython/Lib/multiprocessing/connection.py", line 916 in wait
    File "/Users/mattb/src/misc/cpython/Lib/multiprocessing/popen_fork.py", line 45 in wait
    File "/Users/mattb/src/misc/cpython/Lib/multiprocessing/process.py", line 121 in join
    File "/Users/mattb/src/misc/cpython/Lib/test/test_multiprocessing.py", line 431 in test_many_processes
    File "/Users/mattb/src/misc/cpython/Lib/unittest/case.py", line 605 in run
    File "/Users/mattb/src/misc/cpython/Lib/unittest/case.py", line 653 in __call
    _
    File "/Users/mattb/src/misc/cpython/Lib/unittest/suite.py", line 122 in run
    File "/Users/mattb/src/misc/cpython/Lib/unittest/suite.py", line 84 in __call__
    File "/Users/mattb/src/misc/cpython/Lib/unittest/suite.py", line 122 in run
    File "/Users/mattb/src/misc/cpython/Lib/unittest/suite.py", line 84 in __call__
    File "/Users/mattb/src/misc/cpython/Lib/unittest/suite.py", line 122 in run
    File "/Users/mattb/src/misc/cpython/Lib/unittest/suite.py", line 84 in __call__
    File "/Users/mattb/src/misc/cpython/Lib/test/support/init.py", line 1772 in run
    File "/Users/mattb/src/misc/cpython/Lib/test/support/init.py", line 1896 in _run_suite
    File "/Users/mattb/src/misc/cpython/Lib/test/support/init.py", line 1936 in run_unittest
    File "/Users/mattb/src/misc/cpython/Lib/test/libregrtest/runtest.py", line 168 in test_runner
    File "/Users/mattb/src/misc/cpython/Lib/test/libregrtest/runtest.py", line 169 in runtest_inner
    File "/Users/mattb/src/misc/cpython/Lib/test/libregrtest/runtest.py", line 137 in runtest
    File "/Users/mattb/src/misc/cpython/Lib/test/libregrtest/main.py", line 374 in run_tests_sequential
    File "/Users/mattb/src/misc/cpython/Lib/test/libregrtest/main.py", line 454 in run_tests
    File "/Users/mattb/src/misc/cpython/Lib/test/libregrtest/main.py", line 530 in _main
    File "/Users/mattb/src/misc/cpython/Lib/test/libregrtest/main.py", line 504 in main
    File "/Users/mattb/src/misc/cpython/Lib/test/libregrtest/main.py", line 573 in main
    File "/Users/mattb/src/misc/cpython/Lib/test/main.py", line 2 in <module>
    File "/Users/mattb/src/misc/cpython/Lib/runpy.py", line 85 in _run_code
    File "/Users/mattb/src/misc/cpython/Lib/runpy.py", line 193 in _run_module_as_main

    real 3m0.115s
    user 0m5.804s
    sys 0m0.683s

    I also ran ~600 passes overnight on this test on an Ubuntu 16.04 machine without a hang, so it seems to be OSX specific.

    @vstinner
    Copy link
    Member Author

    I've been debugging this and I can repro on El Capitan on a different machine as well -- it's an infrequent hang, I've been running in a loop:

    Hum, a timeout of 3 minutes seems short for test_multiprocessing_forkserver. How long does it take *usually* to run this test?

    At least, we have a first clue: test_many_processes() of Lib/test/_test_multiprocessing.py.

    @mattbillenstein
    Copy link

    It consistently takes between ~61 and ~73 seconds with this setup.

    @pitrou pitrou changed the title regrtest hangs on the master branch test_multiprocessing_forkserver hangs on the master branch Jun 22, 2017
    @pitrou
    Copy link
    Member

    pitrou commented Jun 22, 2017

    Matt, if you try the following command, it will run the specific test in a loop in verbose mode:

    $ ./python -m test --timeout=30 -F -m test_many_processes -v test_multiprocessing_forkserver

    @mattbillenstein
    Copy link

    Cool -- do you need me to do something more to help debug this?

    @pitrou
    Copy link
    Member

    pitrou commented Jun 24, 2017

    It would be nice to know if there is additional output (for example exceptions happening in other threads or processes) when you run that command and manage to trigger a hang.

    @mattbillenstein
    Copy link

    I don't see anything odd -- it runs for awhile and then times out once it's deadlocked:

    0:03:18 load avg: 3.20 [224] test_multiprocessing_forkserver
    test_many_processes (test.test_multiprocessing_forkserver.WithProcessesTestProcess) ... ok
    test_many_processes (test.test_multiprocessing_forkserver.WithThreadsTestProcess) ... skipped 'test not appropriate for threads'

    ----------------------------------------------------------------------
    Ran 2 tests in 0.849s

    OK (skipped=1)
    0:03:18 load avg: 3.11 [225] test_multiprocessing_forkserver
    test_many_processes (test.test_multiprocessing_forkserver.WithProcessesTestProcess) ... ok
    test_many_processes (test.test_multiprocessing_forkserver.WithThreadsTestProcess) ... skipped 'test not appropriate for threads'

    ----------------------------------------------------------------------
    Ran 2 tests in 0.865s

    OK (skipped=1)
    0:03:19 load avg: 3.11 [226] test_multiprocessing_forkserver
    test_many_processes (test.test_multiprocessing_forkserver.WithProcessesTestProcess) ... Timeout (0:00:30)!
    Thread 0x00007fff790c1000 (most recent call first):
    File "/Users/mattb/src/misc/cpython/Lib/selectors.py", line 415 in select
    File "/Users/mattb/src/misc/cpython/Lib/multiprocessing/connection.py", line 916 in wait
    File "/Users/mattb/src/misc/cpython/Lib/multiprocessing/popen_fork.py", line 45 in wait
    File "/Users/mattb/src/misc/cpython/Lib/multiprocessing/process.py", line 121 in join
    File "/Users/mattb/src/misc/cpython/Lib/test/test_multiprocessing.py", line 431 in test_many_processes
    File "/Users/mattb/src/misc/cpython/Lib/unittest/case.py", line 605 in run
    File "/Users/mattb/src/misc/cpython/Lib/unittest/case.py", line 653 in __call
    _
    File "/Users/mattb/src/misc/cpython/Lib/unittest/suite.py", line 122 in run
    File "/Users/mattb/src/misc/cpython/Lib/unittest/suite.py", line 84 in __call__
    File "/Users/mattb/src/misc/cpython/Lib/unittest/suite.py", line 122 in run
    File "/Users/mattb/src/misc/cpython/Lib/unittest/suite.py", line 84 in __call__
    File "/Users/mattb/src/misc/cpython/Lib/unittest/suite.py", line 122 in run
    File "/Users/mattb/src/misc/cpython/Lib/unittest/suite.py", line 84 in __call__
    File "/Users/mattb/src/misc/cpython/Lib/unittest/runner.py", line 176 in run
    File "/Users/mattb/src/misc/cpython/Lib/test/support/init.py", line 1896 in _run_suite
    File "/Users/mattb/src/misc/cpython/Lib/test/support/init.py", line 1936 in run_unittest
    File "/Users/mattb/src/misc/cpython/Lib/test/libregrtest/runtest.py", line 168 in test_runner
    File "/Users/mattb/src/misc/cpython/Lib/test/libregrtest/runtest.py", line 169 in runtest_inner
    File "/Users/mattb/src/misc/cpython/Lib/test/libregrtest/runtest.py", line 137 in runtest
    File "/Users/mattb/src/misc/cpython/Lib/test/libregrtest/main.py", line 374 in run_tests_sequential
    File "/Users/mattb/src/misc/cpython/Lib/test/libregrtest/main.py", line 454 in run_tests
    File "/Users/mattb/src/misc/cpython/Lib/test/libregrtest/main.py", line 530 in _main
    File "/Users/mattb/src/misc/cpython/Lib/test/libregrtest/main.py", line 504 in main
    File "/Users/mattb/src/misc/cpython/Lib/test/libregrtest/main.py", line 573 in main
    File "/Users/mattb/src/misc/cpython/Lib/test/main.py", line 2 in <module>
    File "/Users/mattb/src/misc/cpython/Lib/runpy.py", line 85 in _run_code
    File "/Users/mattb/src/misc/cpython/Lib/runpy.py", line 193 in _run_module_as_main

    @pitrou
    Copy link
    Member

    pitrou commented Jun 24, 2017

    Thanks for trying :-) Hmm, that's annoying. I don't know if you'd like to give me shell access to the machine (and a CPython checkout I can play with, perhaps)?

    @mattbillenstein
    Copy link

    Yes, I'll email you the details.

    @pitrou
    Copy link
    Member

    pitrou commented Jun 26, 2017

    Ok, I think I've managed to dig to the core issue. It is actually the same issue as https://bugs.python.org/issue11768 (which was wrongly closed as fixed, apparently :-)).

    Py_AddPendingCall() calls PyThread_acquire_lock() to try and take the pending calls lock. Unfortunately, PyThread_acquire_lock() is not reentrant in the case where semaphores are not used (e.g. on OS X). We can probably fix that first issue by calling pthread_mutex_trylock() instead of pthread_mutex_lock().

    There is a second more fundamental issue, though, which is that PyThread_acquire_lock() calls into functions that are not async-signal-safe (see http://man7.org/linux/man-pages/man7/signal-safety.7.html for a list). So, depending on the particular OS and libc implementation, PyThread_acquire_lock() can fail in mysterious ways (including hang the process) when called from a signal handler.

    So perhaps the ultimate fix would be to remove the OS-based locking in Py_AddPendingCall and use a busy spinwait... The performance implications may be bad, though.

    @pitrou
    Copy link
    Member

    pitrou commented Jun 26, 2017

    Using pthread_mutex_trylock() *and* disabling the CHECK_STATUS_PTHREAD() calls (which use the non-async-signal-safe fprintf()) at least seems to suppress the hangs on Matt's OS X machine (after more than 1000 runs).

    @vstinner
    Copy link
    Member Author

    It is actually the same issue as https://bugs.python.org/issue11768 (which was wrongly closed as fixed, apparently :-))

    Well, don't be confused by the issue title. The only made change is to only call Py_AddPendingCall() only once, instead of calling it for each received signal.

    I don't know if the signal handler is really reentrant or not.

    Signal handling is hard :-(

    @pitrou pitrou added the type-bug An unexpected behavior, bug, or error label Jun 26, 2017
    @vstinner
    Copy link
    Member Author

    If I understood corretly, the problem is that the Python C signal handler is not reentrant because it calls Py_AddPendingCall() which uses a lock and a list.

    Before, the signal handler queued a new call to checksignals_witharg() (which just calls PyErr_CheckSignals()) for each received signal. Now, we only queue a single call to checksignals_witharg().

    To prevent reentrency issues, can't we hardcoded a call to PyErr_CheckSignals() in ceval.c when SIGNAL_PENDING_CALLS() is called?

    Py_AddPendingCall() feature is rarely used, it's mostly used for processing signals, no? Calling PyErr_CheckSignals() when no signal was received is cheap, so it shouldn't hurt.

    @vstinner vstinner changed the title test_multiprocessing_forkserver hangs on the master branch Non-reentrant signal handler (test_multiprocessing_forkserver hangs) Jun 26, 2017
    @pitrou
    Copy link
    Member

    pitrou commented Jun 26, 2017

    I think Kristjan uses Py_AddPendingCall, so ideally we would make it more reliable. However, we're right that for the purpose of signal delivery, we can hardcall a call inside Py_MakePendingCalls().

    @pitrou
    Copy link
    Member

    pitrou commented Jun 26, 2017

    It seems #2408 alone solves the issue on Matt's machine. It's also a pleasantly simple patch :-)

    @kristjanvalur
    Copy link
    Mannequin

    kristjanvalur mannequin commented Jun 26, 2017

    Thanks for the mention, @pitrou.
    CCP was using Py_AddPendingCall but not from signal handlers, but external threads. Also on windows only.
    You'll also be happy to know that I have left CCP and the Eve codebase is being kept stable while regularly adding security patches from the 2.7 codebase, as far as I know :)

    @pitrou
    Copy link
    Member

    pitrou commented Jun 28, 2017

    For the record, #2415 is ready.

    @pitrou
    Copy link
    Member

    pitrou commented Jun 28, 2017

    New changeset c08177a by Antoine Pitrou in branch 'master':
    bpo-30703: Improve signal delivery (bpo-2415)
    c08177a

    @pitrou
    Copy link
    Member

    pitrou commented Jun 30, 2017

    I will wait a bit and then backport this to 3.6.

    @mattbillenstein
    Copy link

    Verified this by letting the test_many_processes loop overnight (master@42bc8beadd49)-- 34k passes later and no hangs. Nice work!

    @vstinner
    Copy link
    Member Author

    I concur with Matt: nice job Antoine, thanks for making Python more reliable ;-)

    @pitrou
    Copy link
    Member

    pitrou commented Jun 30, 2017

    Thanks for running such a lengthy test, Matt :-)

    @pitrou
    Copy link
    Member

    pitrou commented Jul 1, 2017

    New changeset 3024c05 by Antoine Pitrou in branch '3.6':
    [3.6] bpo-30703: Improve signal delivery (GH-2415) (bpo-2527)
    3024c05

    @pitrou pitrou closed this as completed Jul 1, 2017
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life OS-mac tests Tests in the Lib/test dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants