classification
Title: test_3_join_in_forked_from_thread() of test_threading hangs 1 hour on "x86 Ubuntu Shared 3.x"
Type: Stage: resolved
Components: Tests Versions: Python 3.3
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: gps, haypo, neologix, python-dev
Priority: normal Keywords: patch

Created on 2011-04-18 20:17 by haypo, last changed 2013-01-17 22:37 by python-dev. This issue is now closed.

Files
File name Uploaded Description Edit
test_threading_fork.diff neologix, 2011-06-26 20:55 review
debug_stuck.diff neologix, 2011-12-10 12:03 review
threading_reinit_lock.diff neologix, 2011-12-18 16:05 review
Messages (19)
msg133990 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-04-18 20:17
test_3_join_in_forked_from_thread() of test_threading failed on "x86 Ubuntu Shared 3.x" buildbot:
-----------------------------------
[201/354] test_threading
[41179 refs]
[40407 refs]
[40407 refs]
[40407 refs]
Timeout (1:00:00)!
Thread 0x404218c0:
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/subprocess.py", line 466 in _eintr_retry_call
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/subprocess.py", line 1486 in _try_wait
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/subprocess.py", line 1528 in wait
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_threading.py", line 455 in _run_and_join
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_threading.py", line 518 in test_3_join_in_forked_from_thread
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/case.py", line 387 in _executeTestPart
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/case.py", line 442 in run
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/case.py", line 494 in __call__
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 105 in run
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 67 in __call__
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 105 in run
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 67 in __call__
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/support.py", line 1078 in run
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/support.py", line 1166 in _run_suite
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/support.py", line 1192 in run_unittest
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_threading.py", line 728 in test_main
  File "./Lib/test/regrtest.py", line 1041 in runtest_inner
  File "./Lib/test/regrtest.py", line 835 in runtest
  File "./Lib/test/regrtest.py", line 659 in main
  File "./Lib/test/regrtest.py", line 1619 in <module>
make: *** [buildbottest] Error 1
program finished with exit code 2
elapsedTime=4426.776675
http://www.python.org/dev/buildbot/all/builders/x86%20Ubuntu%20Shared%203.x/builds/3577/steps/test/logs/stdio
-----------------------------------

Code of the test:
----------------------------------
    def _run_and_join(self, script):
        script = """if 1:
            import sys, os, time, threading

            # a thread, which waits for the main program to terminate
            def joiningfunc(mainthread):
                mainthread.join()
                print('end of thread')
                # stdout is fully buffered because not a tty, we have to flush
                # before exit.
                sys.stdout.flush()
        \n""" + script

        p = subprocess.Popen([sys.executable, "-c", script], stdout=subprocess.PIPE)
        rc = p.wait() <~~~ HANG HERE ~~~~
        data = p.stdout.read().decode().replace('\r', '')
        p.stdout.close()
        self.assertEqual(data, "end of main\nend of thread\n")
        self.assertFalse(rc == 2, "interpreter was blocked")
        self.assertTrue(rc == 0, "Unexpected error")

    @unittest.skipUnless(hasattr(os, 'fork'), "needs os.fork()")
    def test_3_join_in_forked_from_thread(self):
        # Like the test above, but fork() was called from a worker thread
        # In the forked process, the main Thread object must be marked as stopped.

        # Skip platforms with known problems forking from a worker thread.
        # See http://bugs.python.org/issue3863.
        if sys.platform in ('freebsd4', 'freebsd5', 'freebsd6', 'netbsd5',
                           'os2emx'):
            raise unittest.SkipTest('due to known OS bugs on ' + sys.platform)
        script = """if 1:
            main_thread = threading.current_thread()
            def worker():
                childpid = os.fork()
                if childpid != 0:
                    os.waitpid(childpid, 0)
                    sys.exit(0)

                t = threading.Thread(target=joiningfunc,
                                     args=(main_thread,))
                print('end of main')
                t.start()
                t.join() # Should not block: main_thread is already stopped

            w = threading.Thread(target=worker)
            w.start()
            """
        self._run_and_join(script)
----------------------------------
msg139076 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-06-25 14:51
test_2_join_in_forked_process fails on FreeBSD 6.4 buildbot.
http://www.python.org/dev/buildbot/all/builders/x86 FreeBSD 6.4 3.x/builds/1606/steps/test/logs/stdio

"""
======================================================================
FAIL: test_2_join_in_forked_process (test.test_threading.ThreadJoinOnShutdown)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/home/db3l/buildarea/3.x.bolen-freebsd/build/Lib/test/test_threading.py", line 464, in test_2_join_in_forked_process
    self._run_and_join(script)
  File "/usr/home/db3l/buildarea/3.x.bolen-freebsd/build/Lib/test/test_threading.py", line 436, in _run_and_join
    self.assertEqual(data, "end of main\nend of thread\n")
AssertionError: '' != 'end of main\nend of thread\n'
+ end of main
+ end of thread
"""

I think it's the same problem as issue #12316: in the child process, even calling pthread_create can segfault/abort on FreeBSD6 (async-safe blahblah...).
Tests creating a thread from a fork()ed process should be skipped on FreeBSD6.
Patch attached, along with some refactoring to use the skipIf idiom.

As for test_3_join_in_forked_from_thread, well, it could be more or less the same problem. We're really doing something prohibited by POSIX, so things might break in unexpected ways. For example, calling pthread_condition_destroy from the child process can deadlock (see http://bugs.python.org/issue6721#msg136047).

Victor: to debug this kind of problem, it would be great if faulthandler could also dump tracebacks of children processes. Do you mind if I create a new issue?
msg139121 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-06-25 21:50
> Victor: to debug this kind of problem, it would be great
> if faulthandler could also dump tracebacks of children processes.
> Do you mind if I create a new issue?

Please open a new issue.
msg139129 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-06-25 23:23
+    @unittest.skipIf(sys.platform in ('freebsd4', 'freebsd5', 'freebsd6',
+                     'netbsd5', 'os2emx'), "due to known OS bug")

This skip gives very few information, and it is duplicated for each function. I would prefer a constant of the "broken OSes" with your following comment attached to the constant:

+    # Between fork() and exec(), only async-safe functions are allowed (issues
+    # #12316 and #11870), and fork() from a worker thread is known to trigger
+    # problems with some operating systems: skip problematic tests on platforms
+    # known to behave badly.

Or split the test case into two testcases: one using fork and skipped on broken platforms, one not using fork?

---

> As for test_3_join_in_forked_from_thread, well, it could be more
> or less the same problem. We're really doing something prohibited
> by POSIX, so things might break in unexpected ways.

If the creation of a thread after a fork is reliable on some systems, we should not deny the creation of new threads after a fork. You can replace "creation of new threads" by any other non async-safe function in my previous sentence. Therefore I agree that the good answer to this issue is to skip the test on "broken systems" (or should we call them "POSIX compliant systems?" :-)).
msg139131 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-06-25 23:40
> This skip gives very few information, and it is duplicated for each 
> function. I would prefer a constant of the "broken OSes" with your 
> following comment attached to the constant:

Ok, I'll try to write something along those lines.

> If the creation of a thread after a fork is reliable on some systems,
> we should not deny the creation of new threads after a fork.

Well, the problem is that it is not reliable on any platform, but happens to work "most of the time" on some platforms ;-)
msg139219 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-06-26 20:55
Patch attached.
msg139337 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-06-27 23:14
Your patch is linux3 compliant, go ahead!
msg139574 - (view) Author: Roundup Robot (python-dev) Date: 2011-07-01 12:53
New changeset 0ed5e6ff10f8 by Victor Stinner in branch '3.2':
Issue #11870: Skip test_threading.test_2_join_in_forked_process() on platforms
http://hg.python.org/cpython/rev/0ed5e6ff10f8

New changeset f43dee86fffd by Victor Stinner in branch 'default':
(merge 3.2) Issue #11870: Skip test_threading.test_2_join_in_forked_process()
http://hg.python.org/cpython/rev/f43dee86fffd
msg139576 - (view) Author: Roundup Robot (python-dev) Date: 2011-07-01 13:05
New changeset ff36b8cadfd6 by Victor Stinner in branch '2.7':
Issue #11870: Skip test_threading.test_2_join_in_forked_process() on platforms
http://hg.python.org/cpython/rev/ff36b8cadfd6
msg139583 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-07-01 14:06
The initial problem was test_3_join_in_forked_from_thread() and the hangs does still happen:

[324/356] test_threading
Timeout (1:00:00)!
Thread 0x404248c0:
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/subprocess.py", line 1498 in _communicate_with_poll
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/subprocess.py", line 1423 in _communicate
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/subprocess.py", line 836 in communicate
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/script_helper.py", line 32 in _assert_python
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/script_helper.py", line 50 in assert_python_ok
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_threading.py", line 434 in _run_and_join
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_threading.py", line 493 in test_3_join_in_forked_from_thread
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/case.py", line 407 in _executeTestPart
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/case.py", line 462 in run
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/case.py", line 514 in __call__
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 105 in run
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 67 in __call__
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 105 in run
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 67 in __call__
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/runner.py", line 168 in run
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/support.py", line 1259 in _run_suite
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/support.py", line 1285 in run_unittest
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_threading.py", line 774 in test_main
  File "./Lib/test/regrtest.py", line 1070 in runtest_inner
  File "./Lib/test/regrtest.py", line 861 in runtest
  File "./Lib/test/regrtest.py", line 669 in main
  File "./Lib/test/regrtest.py", line 1648 in <module>

http://www.python.org/dev/buildbot/all/builders/x86%20Ubuntu%20Shared%203.x/builds/4081/steps/test/logs/stdio

(neologix's patch doesn't change anything for x86 Ubuntu Shared 3.x buildbot, which is a Linux, not a FreeBSD)

I don't know why it only hangs on this Linux buildbot. It's maybe an old Linux kernel, an old GNU libc version, or something like that?
msg139602 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-07-01 20:48
> The initial problem was test_3_join_in_forked_from_thread() and the hangs does still happen:
>

Yes, the patch was there to fix test_2_join_in_forked_from_thread.

> [324/356] test_threading
> Timeout (1:00:00)!
> Thread 0x404248c0:
>  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/subprocess.py", line 1498 in _communicate_with_poll
>  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/subprocess.py", line 1423 in _communicate
>  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/subprocess.py", line 836 in communicate
>  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/script_helper.py", line 32 in _assert_python
>  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/script_helper.py", line 50 in assert_python_ok
>  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_threading.py", line 434 in _run_and_join
>  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_threading.py", line 493 in test_3_join_in_forked_from_thread
>  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/case.py", line 407 in _executeTestPart
>  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/case.py", line 462 in run
>  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/case.py", line 514 in __call__
>  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 105 in run
>  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 67 in __call__
>  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 105 in run
>  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 67 in __call__
>  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/runner.py", line 168 in run
>  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/support.py", line 1259 in _run_suite
>  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/support.py", line 1285 in run_unittest
>  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_threading.py", line 774 in test_main
>  File "./Lib/test/regrtest.py", line 1070 in runtest_inner
>  File "./Lib/test/regrtest.py", line 861 in runtest
>  File "./Lib/test/regrtest.py", line 669 in main
>  File "./Lib/test/regrtest.py", line 1648 in <module>
>
> http://www.python.org/dev/buildbot/all/builders/x86%20Ubuntu%20Shared%203.x/builds/4081/steps/test/logs/stdio
>

This means that the subprocess hangs, but without a backtrace of the
child process (issue #12413), it's hard to analyse it further.
I've had a look at the code, and couldn't find anything obviously
wrong; Gregory's patches to sanitize threading's lock should have
fixed this. I also tried running this test in a loop for 48 hours but
couldn't reproduce it.
One possible explanation (but it's just a wild guess) is that with a
certain kernel/libc configuration, the lock deallocation code can
deadlock (I've seen pthread_cond_destroy() block, this could maybe
happen with sem_destroy()).

So I suggest to try to come up with a solution to #12413, which should
help analyzing this - and similar - issues.
msg148698 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-12-01 00:43
> Gregory's patches to sanitize threading's lock should have fixed this

The subprocess hang still occurs something, it just happened:

http://www.python.org/dev/buildbot/all/builders/x86%20Ubuntu%20Shared%203.x/builds/4898/steps/test/logs/stdio

[110/363] test_threading
Timeout (1:00:00)!
Thread 0x404888c0:
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/subprocess.py", line 1513 in _communicate_with_poll
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/subprocess.py", line 1438 in _communicate
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/subprocess.py", line 850 in communicate
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/script_helper.py", line 32 in _assert_python
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/script_helper.py", line 50 in assert_python_ok
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_threading.py", line 441 in _run_and_join
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_threading.py", line 497 in test_3_join_in_forked_from_thread
...
msg148993 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-12-07 21:35
I removed my two previous message (msg148991 and msg148992), there were unrelated to this issue: the test hangs in debug mode in the IO module because of a deadleak in IO related to the fork...
msg148996 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-12-07 22:04
To debug this, we should probably make use of faulthandler (but not
dump_tracebacks_later, since it creates a new thread). The way to go
could be to make the parent process send a fatal signal to the child
process if the latter takes too long to complete.
msg149148 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-12-10 12:02
Here's a patch to help nail this down.
msg149773 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-12-18 16:05
Victor, could you try the patch attached?
msg149784 - (view) Author: Roundup Robot (python-dev) Date: 2011-12-18 17:47
New changeset 775319cebaa3 by Charles-François Natali in branch '2.7':
Issue #11870: threading: Properly reinitialize threads internal locks and
http://hg.python.org/cpython/rev/775319cebaa3

New changeset de962ec0faaa by Charles-François Natali in branch '3.2':
Issue #11870: threading: Properly reinitialize threads internal locks and
http://hg.python.org/cpython/rev/de962ec0faaa

New changeset cec0d77d01c4 by Charles-François Natali in branch 'default':
Issue #11870: threading: Properly reinitialize threads internal locks and
http://hg.python.org/cpython/rev/cec0d77d01c4
msg149788 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-12-18 18:12
Should be fixed now.
msg180158 - (view) Author: Roundup Robot (python-dev) Date: 2013-01-17 22:37
New changeset cd54b48946ca by Stefan Krah in branch '3.3':
Issue #11870: Skip test_3_join_in_forked_from_thread() on HP-UX.
http://hg.python.org/cpython/rev/cd54b48946ca
History
Date User Action Args
2013-01-17 22:37:26python-devsetmessages: + msg180158
2011-12-18 18:12:47neologixsetstatus: open -> closed
resolution: fixed
messages: + msg149788

stage: resolved
2011-12-18 17:47:32python-devsetmessages: + msg149784
2011-12-18 16:05:27neologixsetfiles: + threading_reinit_lock.diff

messages: + msg149773
2011-12-10 12:03:00neologixsetfiles: + debug_stuck.diff

messages: + msg149148
2011-12-07 22:04:19neologixsetmessages: + msg148996
2011-12-07 21:35:34hayposetmessages: + msg148993
2011-12-07 21:34:25hayposetmessages: - msg148992
2011-12-07 21:34:23hayposetmessages: - msg148991
2011-12-07 21:26:20hayposetmessages: + msg148992
2011-12-07 21:19:07hayposetmessages: + msg148991
2011-12-01 00:43:31hayposetmessages: + msg148698
2011-07-01 20:48:47neologixsetmessages: + msg139602
2011-07-01 14:06:49hayposetmessages: + msg139583
2011-07-01 13:05:01python-devsetmessages: + msg139576
2011-07-01 12:53:42python-devsetnosy: + python-dev
messages: + msg139574
2011-06-27 23:14:10hayposetmessages: + msg139337
2011-06-26 20:55:31neologixsetfiles: - test_threading_fork.diff
2011-06-26 20:55:08neologixsetfiles: + test_threading_fork.diff

messages: + msg139219
2011-06-25 23:40:27neologixsetmessages: + msg139131
2011-06-25 23:23:05hayposetmessages: + msg139129
2011-06-25 21:50:24hayposetmessages: + msg139121
2011-06-25 14:51:09neologixsetfiles: + test_threading_fork.diff

nosy: + neologix
messages: + msg139076

keywords: + patch
2011-04-19 00:43:03pitrousetnosy: + gps
2011-04-18 20:17:59haypocreate