test_3_join_in_forked_from_thread() of test_threading hangs 1 hour on "x86 Ubuntu Shared 3.x" #56079

Closed
vstinner opened this issue Apr 18, 2011 · 19 comments
Labels
tests Tests in the Lib/test dir

Comments

@vstinner
Member

BPO 11870
Nosy @vstinner
Files
  • test_threading_fork.diff
  • debug_stuck.diff
  • threading_reinit_lock.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2011-12-18.18:12:47.096>
    created_at = <Date 2011-04-18.20:17:59.861>
    labels = ['tests']
    title = 'test_3_join_in_forked_from_thread() of test_threading hangs 1 hour on "x86 Ubuntu Shared 3.x"'
    updated_at = <Date 2013-01-17.22:37:26.961>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2013-01-17.22:37:26.961>
    actor = 'python-dev'
    assignee = 'none'
    closed = True
    closed_date = <Date 2011-12-18.18:12:47.096>
    closer = 'neologix'
    components = ['Tests']
    creation = <Date 2011-04-18.20:17:59.861>
    creator = 'vstinner'
    dependencies = []
    files = ['22489', '23896', '24030']
    hgrepos = []
    issue_num = 11870
    keywords = ['patch']
    message_count = 19.0
    messages = ['133990', '139076', '139121', '139129', '139131', '139219', '139337', '139574', '139576', '139583', '139602', '148698', '148993', '148996', '149148', '149773', '149784', '149788', '180158']
    nosy_count = 4.0
    nosy_names = ['vstinner', 'gps', 'neologix', 'python-dev']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue11870'
    versions = ['Python 3.3']

    @vstinner
    Member Author

    test_3_join_in_forked_from_thread() of test_threading failed on "x86 Ubuntu Shared 3.x" buildbot:
    -----------------------------------
    [201/354] test_threading
    [41179 refs]
    [40407 refs]
    [40407 refs]
    [40407 refs]
    Timeout (1:00:00)!
    Thread 0x404218c0:
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/subprocess.py", line 466 in _eintr_retry_call
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/subprocess.py", line 1486 in _try_wait
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/subprocess.py", line 1528 in wait
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_threading.py", line 455 in _run_and_join
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_threading.py", line 518 in test_3_join_in_forked_from_thread
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/case.py", line 387 in _executeTestPart
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/case.py", line 442 in run
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/case.py", line 494 in __call__
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 105 in run
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 67 in __call__
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 105 in run
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 67 in __call__
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/support.py", line 1078 in run
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/support.py", line 1166 in _run_suite
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/support.py", line 1192 in run_unittest
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_threading.py", line 728 in test_main
    File "./Lib/test/regrtest.py", line 1041 in runtest_inner
    File "./Lib/test/regrtest.py", line 835 in runtest
    File "./Lib/test/regrtest.py", line 659 in main
    File "./Lib/test/regrtest.py", line 1619 in <module>
    make: *** [buildbottest] Error 1
    program finished with exit code 2
    elapsedTime=4426.776675
    http://www.python.org/dev/buildbot/all/builders/x86%20Ubuntu%20Shared%203.x/builds/3577/steps/test/logs/stdio
    -----------------------------------

    Code of the test:
    ----------------------------------

        def _run_and_join(self, script):
            script = """if 1:
                import sys, os, time, threading

                # a thread, which waits for the main program to terminate
                def joiningfunc(mainthread):
                    mainthread.join()
                    print('end of thread')
                    # stdout is fully buffered because not a tty, we have to flush
                    # before exit.
                    sys.stdout.flush()
            \n""" + script

            p = subprocess.Popen([sys.executable, "-c", script], stdout=subprocess.PIPE)
            rc = p.wait()  # <~~~ HANG HERE ~~~~
            data = p.stdout.read().decode().replace('\r', '')
            p.stdout.close()
            self.assertEqual(data, "end of main\nend of thread\n")
            self.assertFalse(rc == 2, "interpreter was blocked")
            self.assertTrue(rc == 0, "Unexpected error")

        @unittest.skipUnless(hasattr(os, 'fork'), "needs os.fork()")
        def test_3_join_in_forked_from_thread(self):
            # Like the test above, but fork() was called from a worker thread
            # In the forked process, the main Thread object must be marked as stopped.

            # Skip platforms with known problems forking from a worker thread.
            # See http://bugs.python.org/issue3863.
            if sys.platform in ('freebsd4', 'freebsd5', 'freebsd6', 'netbsd5',
                                'os2emx'):
                raise unittest.SkipTest('due to known OS bugs on ' + sys.platform)
            script = """if 1:
                main_thread = threading.current_thread()
                def worker():
                    childpid = os.fork()
                    if childpid != 0:
                        os.waitpid(childpid, 0)
                        sys.exit(0)

                    t = threading.Thread(target=joiningfunc,
                                         args=(main_thread,))
                    print('end of main')
                    t.start()
                    t.join() # Should not block: main_thread is already stopped

                w = threading.Thread(target=worker)
                w.start()
                """
            self._run_and_join(script)

    @vstinner vstinner added the tests Tests in the Lib/test dir label Apr 18, 2011
    @neologix
    Mannequin

    neologix mannequin commented Jun 25, 2011

    test_2_join_in_forked_process fails on FreeBSD 6.4 buildbot.
    http://www.python.org/dev/buildbot/all/builders/x86%20FreeBSD%206.4%203.x/builds/1606/steps/test/logs/stdio

    """
    ======================================================================
    FAIL: test_2_join_in_forked_process (test.test_threading.ThreadJoinOnShutdown)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/usr/home/db3l/buildarea/3.x.bolen-freebsd/build/Lib/test/test_threading.py", line 464, in test_2_join_in_forked_process
        self._run_and_join(script)
      File "/usr/home/db3l/buildarea/3.x.bolen-freebsd/build/Lib/test/test_threading.py", line 436, in _run_and_join
        self.assertEqual(data, "end of main\nend of thread\n")
    AssertionError: '' != 'end of main\nend of thread\n'
    + end of main
    + end of thread
    """

    I think it's the same problem as issue bpo-12316: in the child process, even calling pthread_create can segfault/abort on FreeBSD6 (async-safe blahblah...).
    Tests creating a thread from a fork()ed process should be skipped on FreeBSD6.
    Patch attached, along with some refactoring to use the skipIf idiom.

    As for test_3_join_in_forked_from_thread, well, it could be more or less the same problem. We're really doing something prohibited by POSIX, so things might break in unexpected ways. For example, calling pthread_cond_destroy from the child process can deadlock (see http://bugs.python.org/issue6721#msg136047).

    Victor: to debug this kind of problem, it would be great if faulthandler could also dump tracebacks of child processes. Do you mind if I create a new issue?

    @vstinner
    Member Author

    Victor: to debug this kind of problem, it would be great
    if faulthandler could also dump tracebacks of child processes.
    Do you mind if I create a new issue?

    Please open a new issue.

    @vstinner
    Member Author

    + @unittest.skipIf(sys.platform in ('freebsd4', 'freebsd5', 'freebsd6',
    + 'netbsd5', 'os2emx'), "due to known OS bug")

    This skip gives very little information, and it is duplicated for each function. I would prefer a constant listing the "broken OSes", with your following comment attached to it:

    + # Between fork() and exec(), only async-safe functions are allowed (issues
    + # bpo-12316 and bpo-11870), and fork() from a worker thread is known to trigger
    + # problems with some operating systems: skip problematic tests on platforms
    + # known to behave badly.

    Or split the test case into two testcases: one using fork and skipped on broken platforms, one not using fork?
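
A minimal sketch of the constant-based approach suggested above. The names `BROKEN_FORK_PLATFORMS`, `skip_on_broken_fork` and `ThreadAfterForkTests` are hypothetical and chosen only for illustration; the platform tuple and the comment text are taken from the patch excerpt quoted in this comment, not from the committed change.

```python
import sys
import threading
import unittest

# Between fork() and exec(), only async-safe functions are allowed (issues
# bpo-12316 and bpo-11870), and fork() from a worker thread is known to trigger
# problems with some operating systems: skip problematic tests on platforms
# known to behave badly.
# NOTE: hypothetical constant name, used here for illustration only.
BROKEN_FORK_PLATFORMS = ('freebsd4', 'freebsd5', 'freebsd6', 'netbsd5',
                         'os2emx')

# One reusable decorator, so the condition and the reason are written once.
skip_on_broken_fork = unittest.skipIf(
    sys.platform in BROKEN_FORK_PLATFORMS,
    "due to known OS bug when creating threads after fork()")


class ThreadAfterForkTests(unittest.TestCase):
    @skip_on_broken_fork
    def test_thread_can_start_and_join(self):
        # Placeholder body: a real test would fork first, then start a thread.
        t = threading.Thread(target=lambda: None)
        t.start()
        t.join()
        self.assertFalse(t.is_alive())
```

Keeping the tuple in one place means a newly discovered problematic platform only has to be added once, and the explanatory comment stays next to the data it describes.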

    ---

    As for test_3_join_in_forked_from_thread, well, it could be more
    or less the same problem. We're really doing something prohibited
    by POSIX, so things might break in unexpected ways.

    If the creation of a thread after a fork is reliable on some systems, we should not deny the creation of new threads after a fork. You can replace "creation of new threads" by any other non async-safe function in my previous sentence. Therefore I agree that the right answer to this issue is to skip the test on "broken systems" (or should we call them "POSIX compliant systems"? :-)).

    @neologix
    Mannequin

    neologix mannequin commented Jun 25, 2011

    This skip gives very little information, and it is duplicated for each
    function. I would prefer a constant listing the "broken OSes", with your
    following comment attached to it:

    Ok, I'll try to write something along those lines.

    If the creation of a thread after a fork is reliable on some systems,
    we should not deny the creation of new threads after a fork.

    Well, the problem is that it is not reliable on any platform, but happens to work "most of the time" on some platforms ;-)

    @neologix
    Mannequin

    neologix mannequin commented Jun 26, 2011

    Patch attached.

    @vstinner
    Member Author

    Your patch is linux3 compliant, go ahead!

    @python-dev
    Mannequin

    python-dev mannequin commented Jul 1, 2011

    New changeset 0ed5e6ff10f8 by Victor Stinner in branch '3.2':
    Issue bpo-11870: Skip test_threading.test_2_join_in_forked_process() on platforms
    http://hg.python.org/cpython/rev/0ed5e6ff10f8

    New changeset f43dee86fffd by Victor Stinner in branch 'default':
    (merge 3.2) Issue bpo-11870: Skip test_threading.test_2_join_in_forked_process()
    http://hg.python.org/cpython/rev/f43dee86fffd

    @python-dev
    Mannequin

    python-dev mannequin commented Jul 1, 2011

    New changeset ff36b8cadfd6 by Victor Stinner in branch '2.7':
    Issue bpo-11870: Skip test_threading.test_2_join_in_forked_process() on platforms
    http://hg.python.org/cpython/rev/ff36b8cadfd6

    @vstinner
    Member Author

    vstinner commented Jul 1, 2011

    The initial problem was test_3_join_in_forked_from_thread(), and the hang still happens:

    [324/356] test_threading
    Timeout (1:00:00)!
    Thread 0x404248c0:
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/subprocess.py", line 1498 in _communicate_with_poll
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/subprocess.py", line 1423 in _communicate
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/subprocess.py", line 836 in communicate
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/script_helper.py", line 32 in _assert_python
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/script_helper.py", line 50 in assert_python_ok
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_threading.py", line 434 in _run_and_join
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_threading.py", line 493 in test_3_join_in_forked_from_thread
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/case.py", line 407 in _executeTestPart
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/case.py", line 462 in run
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/case.py", line 514 in __call__
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 105 in run
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 67 in __call__
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 105 in run
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 67 in __call__
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/runner.py", line 168 in run
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/support.py", line 1259 in _run_suite
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/support.py", line 1285 in run_unittest
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_threading.py", line 774 in test_main
    File "./Lib/test/regrtest.py", line 1070 in runtest_inner
    File "./Lib/test/regrtest.py", line 861 in runtest
    File "./Lib/test/regrtest.py", line 669 in main
    File "./Lib/test/regrtest.py", line 1648 in <module>

    http://www.python.org/dev/buildbot/all/builders/x86%20Ubuntu%20Shared%203.x/builds/4081/steps/test/logs/stdio

    (neologix's patch doesn't change anything for the x86 Ubuntu Shared 3.x buildbot, which runs Linux, not FreeBSD.)

    I don't know why it only hangs on this Linux buildbot. Maybe it has an old Linux kernel, an old GNU libc version, or something like that?

    @neologix
    Mannequin

    neologix mannequin commented Jul 1, 2011

    The initial problem was test_3_join_in_forked_from_thread(), and the hang still happens:

    Yes, the patch was there to fix test_2_join_in_forked_process.

    [324/356] test_threading
    Timeout (1:00:00)!
    Thread 0x404248c0:
     File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/subprocess.py", line 1498 in _communicate_with_poll
     File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/subprocess.py", line 1423 in _communicate
     File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/subprocess.py", line 836 in communicate
     File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/script_helper.py", line 32 in _assert_python
     File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/script_helper.py", line 50 in assert_python_ok
     File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_threading.py", line 434 in _run_and_join
     File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_threading.py", line 493 in test_3_join_in_forked_from_thread
     File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/case.py", line 407 in _executeTestPart
     File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/case.py", line 462 in run
     File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/case.py", line 514 in __call__
     File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 105 in run
     File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 67 in __call__
     File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 105 in run
     File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 67 in __call__
     File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/runner.py", line 168 in run
     File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/support.py", line 1259 in _run_suite
     File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/support.py", line 1285 in run_unittest
     File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_threading.py", line 774 in test_main
     File "./Lib/test/regrtest.py", line 1070 in runtest_inner
     File "./Lib/test/regrtest.py", line 861 in runtest
     File "./Lib/test/regrtest.py", line 669 in main
     File "./Lib/test/regrtest.py", line 1648 in <module>

    http://www.python.org/dev/buildbot/all/builders/x86%20Ubuntu%20Shared%203.x/builds/4081/steps/test/logs/stdio

    This means that the subprocess hangs, but without a backtrace of the
    child process (issue bpo-12413), it's hard to analyse it further.
    I've had a look at the code, and couldn't find anything obviously
    wrong; Gregory's patches to sanitize threading's lock should have
    fixed this. I also tried running this test in a loop for 48 hours but
    couldn't reproduce it.
    One possible explanation (but it's just a wild guess) is that with a
    certain kernel/libc configuration, the lock deallocation code can
    deadlock (I've seen pthread_cond_destroy() block, this could maybe
    happen with sem_destroy()).

    So I suggest trying to come up with a solution to bpo-12413, which should
    help analyze this - and similar - issues.

    @vstinner
    Member Author

    vstinner commented Dec 1, 2011

    Gregory's patches to sanitize threading's lock should have fixed this

    The subprocess hang still occurs sometimes; it just happened:

    http://www.python.org/dev/buildbot/all/builders/x86%20Ubuntu%20Shared%203.x/builds/4898/steps/test/logs/stdio

    [110/363] test_threading
    Timeout (1:00:00)!
    Thread 0x404888c0:
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/subprocess.py", line 1513 in _communicate_with_poll
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/subprocess.py", line 1438 in _communicate
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/subprocess.py", line 850 in communicate
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/script_helper.py", line 32 in _assert_python
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/script_helper.py", line 50 in assert_python_ok
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_threading.py", line 441 in _run_and_join
    File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_threading.py", line 497 in test_3_join_in_forked_from_thread
    ...

    @vstinner
    Member Author

    vstinner commented Dec 7, 2011

    I removed my two previous messages (msg148991 and msg148992); they were unrelated to this issue: the test hangs in debug mode in the IO module because of a deadlock in IO related to the fork...

    @neologix
    Mannequin

    neologix mannequin commented Dec 7, 2011

    To debug this, we should probably make use of faulthandler (but not
    dump_tracebacks_later, since it creates a new thread). The way to go
    could be to make the parent process send a fatal signal to the child
    process if the latter takes too long to complete.
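
The idea above could be sketched roughly as follows. This is an illustration under stated assumptions, not the patch attached to this issue: the helper name `run_with_watchdog`, the `HANGING_CHILD` script and the timeout value are hypothetical, the `timeout` argument of `communicate()` requires Python 3.3 or later, and the code assumes a POSIX platform where `SIGABRT` can be sent to a child process. The child calls `faulthandler.enable()`, which installs handlers for fatal signals without creating a thread, so the parent can trigger a traceback dump by sending such a signal.

```python
import signal
import subprocess
import sys

# Hypothetical child script: time.sleep() stands in for the real hang.
HANGING_CHILD = """\
import faulthandler, time
faulthandler.enable()   # dump the traceback on SIGSEGV, SIGABRT, ...
time.sleep(60)
"""


def run_with_watchdog(script, timeout):
    """Run script in a child interpreter; abort it if it exceeds timeout.

    Returns (returncode, stdout, stderr). When the child is aborted, stderr
    carries the traceback that faulthandler wrote before the process died.
    """
    p = subprocess.Popen([sys.executable, "-c", script],
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    try:
        out, err = p.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Fatal signal: faulthandler dumps the traceback, then the default
        # action for SIGABRT terminates the child.
        p.send_signal(signal.SIGABRT)
        out, err = p.communicate()
    return p.returncode, out.decode(), err.decode()


if __name__ == "__main__":
    rc, out, err = run_with_watchdog(HANGING_CHILD, timeout=3.0)
    print("child exit status:", rc)
    print(err)
```

The timeout must be long enough for the child to reach `faulthandler.enable()`; a signal delivered before that point kills the child without any traceback.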

    @neologix
    Mannequin

    neologix mannequin commented Dec 10, 2011

    Here's a patch to help nail this down.

    @neologix
    Mannequin

    neologix mannequin commented Dec 18, 2011

    Victor, could you try the patch attached?

    @python-dev
    Mannequin

    python-dev mannequin commented Dec 18, 2011

    New changeset 775319cebaa3 by Charles-François Natali in branch '2.7':
    Issue bpo-11870: threading: Properly reinitialize threads internal locks and
    http://hg.python.org/cpython/rev/775319cebaa3

    New changeset de962ec0faaa by Charles-François Natali in branch '3.2':
    Issue bpo-11870: threading: Properly reinitialize threads internal locks and
    http://hg.python.org/cpython/rev/de962ec0faaa

    New changeset cec0d77d01c4 by Charles-François Natali in branch 'default':
    Issue bpo-11870: threading: Properly reinitialize threads internal locks and
    http://hg.python.org/cpython/rev/cec0d77d01c4
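
For readers unfamiliar with the general technique named in these commit messages, here is a small sketch of reinitializing a lock in a forked child. It is not the code of the changesets above: the class `ForkSafeCounter` and the helper `child_can_lock_after_fork` are hypothetical, and the sketch relies on `os.register_at_fork()`, which only exists in Python 3.7 and later and on platforms that provide `os.fork()`. The point it illustrates is that a lock held at fork time is copied into the child in the locked state, with no surviving thread to release it, so the child has to replace it with a fresh one.

```python
import os
import threading


class ForkSafeCounter:
    """Hypothetical object whose internal lock is replaced after fork()."""

    def __init__(self):
        self._lock = threading.Lock()
        self.value = 0
        # Run _reinit_lock in the child right after every fork().
        os.register_at_fork(after_in_child=self._reinit_lock)

    def _reinit_lock(self):
        # The inherited lock may be held by a thread that does not exist in
        # the child, so create a new, unlocked one instead of releasing it.
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:
            self.value += 1
            return self.value


def child_can_lock_after_fork(counter):
    """Fork while the lock is held; report whether the child could take it."""
    counter._lock.acquire()        # stands in for another thread holding it
    pid = os.fork()
    if pid == 0:
        # Child: the after_in_child hook has already replaced the lock.
        ok = counter._lock.acquire(blocking=False)
        os._exit(0 if ok else 1)
    # Parent: the hook did not run here, so this is still the original lock.
    counter._lock.release()
    _, status = os.waitpid(pid, 0)
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
```

Note that `os.register_at_fork()` keeps a reference to the registered callable for the life of the process, so this pattern suits long-lived objects rather than short-lived ones.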

    @neologix
    Mannequin

    neologix mannequin commented Dec 18, 2011

    Should be fixed now.

    @neologix neologix mannequin closed this as completed Dec 18, 2011
    @python-dev
    Mannequin

    python-dev mannequin commented Jan 17, 2013

    New changeset cd54b48946ca by Stefan Krah in branch '3.3':
    Issue bpo-11870: Skip test_3_join_in_forked_from_thread() on HP-UX.
    http://hg.python.org/cpython/rev/cd54b48946ca
