test_multiprocessing hangs intermittently on POSIX platforms #47338

Closed
warsaw opened this issue Jun 12, 2008 · 73 comments
Labels
release-blocker stdlib Python modules in the Lib dir

Comments

@warsaw
Member

warsaw commented Jun 12, 2008

BPO 3088
Nosy @gvanrossum, @warsaw, @amauryfa, @tebeka, @mdickinson, @ncoghlan, @benjaminp
Dependencies
  • bpo-874900: threading module can deadlock after fork
Files
  • test_get.output
  • test_multiprocessing_reduced.diff
  • threadlocal.patch
  • traceback.txt: Traceback from test_multiprocessing crash
  • traceback2.txt: A second traceback
  • multithread_traceback.txt: Traceback showing all threads
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2008-07-17.17:41:47.256>
    created_at = <Date 2008-06-12.04:48:42.642>
    labels = ['library', 'release-blocker']
    title = 'test_multiprocessing hangs intermittently on POSIX platforms'
    updated_at = <Date 2008-07-18.14:43:00.357>
    user = 'https://github.com/warsaw'

    bugs.python.org fields:

    activity = <Date 2008-07-18.14:43:00.357>
    actor = 'tebeka'
    assignee = 'jnoller'
    closed = True
    closed_date = <Date 2008-07-17.17:41:47.256>
    closer = 'jnoller'
    components = ['Library (Lib)']
    creation = <Date 2008-06-12.04:48:42.642>
    creator = 'barry'
    dependencies = ['874900']
    files = ['10604', '10633', '10766', '10795', '10796', '10800']
    hgrepos = []
    issue_num = 3088
    keywords = ['patch']
    message_count = 73.0
    messages = ['68053', '68055', '68059', '68062', '68063', '68064', '68065', '68066', '68067', '68068', '68072', '68077', '68136', '68150', '68153', '68198', '68209', '68237', '68240', '68295', '68384', '68466', '68467', '68485', '68856', '68877', '68929', '69025', '69026', '69106', '69108', '69109', '69110', '69119', '69120', '69121', '69122', '69123', '69124', '69125', '69127', '69128', '69130', '69131', '69133', '69138', '69142', '69149', '69151', '69154', '69155', '69156', '69158', '69159', '69160', '69181', '69182', '69185', '69188', '69414', '69424', '69425', '69431', '69433', '69434', '69435', '69737', '69738', '69740', '69747', '69754', '69895', '69955']
    nosy_count = 12.0
    nosy_names = ['gvanrossum', 'barry', 'amaury.forgeotdarc', 'tebeka', 'mark.dickinson', 'ncoghlan', 'Rhamphoryncus', 'donmez', 'paulmelis', 'roudkerk', 'benjamin.peterson', 'jnoller']
    pr_nums = []
    priority = 'release blocker'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue3088'
    versions = ['Python 2.6', 'Python 3.0']

    @warsaw
    Member Author

    warsaw commented Jun 12, 2008

    For me, test_multiprocessing hangs consistently on OS X 10.5.3. It
    passes just fine on Ubuntu 8.04.

    @warsaw warsaw added release-blocker stdlib Python modules in the Lib dir labels Jun 12, 2008
    @donmez
    Mannequin

    donmez mannequin commented Jun 12, 2008

    I can confirm this on Leopard too.

    @benjaminp
    Contributor

    It passes for me on Leopard. Can you post the test running in verbose
    mode so we can see where it hangs?

    @jnoller
    Mannequin

    jnoller mannequin commented Jun 12, 2008

    On python-3000 trunk, _multiprocessing doesn't even compile:

    /Users/jesse/open_source/subversion/python-3000/Modules/_multiprocessing/semaphore.c: In function ‘semlock_iszero’:
    /Users/jesse/open_source/subversion/python-3000/Modules/_multiprocessing/semaphore.c:515: warning: unused variable ‘sval’

    @warsaw
    Member Author

    warsaw commented Jun 12, 2008

    On Jun 12, 2008, at 9:02 AM, Benjamin Peterson wrote:

    > It passes for me on Leopard. Can you post the test running in verbose
    > mode so we can see where it hangs?

    It never hangs when run standalone, though it crashes about half the
    time. Running it under gdb doesn't help; it always gives me an
    interrupted system call in os.waitpid() in forking.py.

    The hang occurs during 'make test', and it's always the second run
    that hangs. Perhaps some lock isn't getting cleaned up properly the
    first time?

    -Barry

    @jnoller
    Mannequin

    jnoller mannequin commented Jun 12, 2008

    I did a make clean && ./configure && make and it started compiling for me
    again. Sorry for the noise.

    @gvanrossum
    Member

    If it's only failing during the second run of "make test", typically
    there's some implicit dependency on something that is disturbed by
    running a test that's later in the suite of tests. This could be either
    the fault of that other test (not restoring some global setting or
    environment var) or the fault of the test that fails (making unwarranted
    assumptions or not initializing some needed settings before starting).

    If it works for some folks and not for others, on the same platform,
    compare the set of extension modules that are not built, reported by
    "make" in a message starting with "Failed to find the necessary bits to
    build these modules:". Likely, Barry has the most complete set, while
    Jesse has a few more extensions missing.

    Finding this is usually a painful process of bisecting the set of tests
    run. Randomizing the tests with regrtest.py -r might also be helpful.
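
The bisection described above can be sketched roughly as follows. This is only an illustration: `find_culprit` and the `fails_after` predicate are hypothetical names, not part of regrtest. In practice the predicate would have to run the candidate tests followed by test_multiprocessing and report whether it failed. The sketch assumes a non-empty list, exactly one interfering test, and a deterministic failure; for an intermittent hang like this one, each probe would need to be repeated several times.

```python
from typing import Callable, Sequence


def find_culprit(tests: Sequence[str],
                 fails_after: Callable[[Sequence[str]], bool]) -> str:
    """Bisect `tests` down to the single test whose prior run triggers the failure.

    `fails_after(subset)` is a hypothetical predicate: it should run `subset`
    followed by the test under suspicion and return True if that test failed.
    Assumes exactly one culprit and a deterministic failure.
    """
    candidates = list(tests)
    while len(candidates) > 1:
        half = len(candidates) // 2
        first, second = candidates[:half], candidates[half:]
        # Keep whichever half still reproduces the failure.
        candidates = first if fails_after(first) else second
    return candidates[0]
```

With a stand-in predicate such as `lambda subset: "test_b" in subset`, the function narrows a list of earlier tests down to `"test_b"` in a logarithmic number of probes.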

    FWIW, when I tried (on Leopard) "make test
    TESTOPTS=test_multiprocessing" it hung on the first set. When I ran
    "./python Lib/test/test_multiprocessing.py -v" it reported 122 tests
    passed. But when I ran "./python Lib/test/regrtest.py -v
    test_multiprocessing" one test failed:

    ======================================================================
    ERROR: test_remote (test.test_multiprocessing.WithManagerTestRemoteManager)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/Users/guido/p/Lib/test/test_multiprocessing.py", line 1167, in test_remote
        queue = manager2.get_queue()
      File "/Users/guido/p/Lib/multiprocessing/managers.py", line 650, in temp
        authkey=self._authkey, exposed=exp
      File "/Users/guido/p/Lib/multiprocessing/managers.py", line 902, in AutoProxy
        incref=incref)
      File "/Users/guido/p/Lib/multiprocessing/managers.py", line 711, in __init__
        self._incref()
      File "/Users/guido/p/Lib/multiprocessing/managers.py", line 758, in _incref
        dispatch(conn, None, 'incref', (self._id,))
      File "/Users/guido/p/Lib/multiprocessing/managers.py", line 94, in dispatch
        raise convert_to_error(kind, result)
    RemoteError:
    ---------------------------------------------------------------------------
    Traceback (most recent call last):
      File "/Users/guido/p/Lib/multiprocessing/managers.py", line 196, in handle_request
        result = func(c, *args, **kwds)
      File "/Users/guido/p/Lib/multiprocessing/managers.py", line 412, in incref
        self.id_to_refcount[ident] += 1
    KeyError: '5f2828'

    @gvanrossum
    Member

    I should add this was in the trunk (2.6).

    @paulmelis
    Mannequin

    paulmelis mannequin commented Jun 12, 2008

    I think I'm having a similar lockup on Fedora Core 4 (SMP machine). This
    is with the py3k branch, freshly svn updated. When running "make test
    TESTOPTS=test_multiprocessing" the first of the two test runs always
    succeeds in something like 10-15 seconds, while the second one
    occasionally hangs. I tried running with -v as extra argument but can't
    get it to hang in that case (to figure out which test is the culprit).

    @paulmelis
    Mannequin

    paulmelis mannequin commented Jun 12, 2008

    After a few more runs with -v and redirecting output to a file it seems
    the lockup I get is in
    test_notify_all (test.test_multiprocessing.WithManagerTestCondition)

    @jnoller
    Mannequin

    jnoller mannequin commented Jun 12, 2008

    It's taking me longer to get to this than I planned, any help is
    appreciated.

    @jnoller
    Mannequin

    jnoller mannequin commented Jun 12, 2008

    I can get an intermittent (1 every 15 or so runs) lock in:
    test_get (__main__.WithProcessesTestQueue) ...

    Executed like this:
    ./python Lib/test/test_multiprocessing.py

    When I control-c it the stack looks like this:
    ...snip
    File "/root/py/python-3000/Lib/multiprocessing/pool.py", line 57, in
    worker
    task = get()
    File "/root/py/python-3000/Lib/multiprocessing/queues.py", line 337,
    in get
    task = get()
    File "/root/py/python-3000/Lib/multiprocessing/queues.py", line 339,
    in get
    racquire()
    KeyboardInterrupt
    task = get()
    File "/root/py/python-3000/Lib/multiprocessing/queues.py", line 337,
    in get
    task = get()
    File "/root/py/python-3000/Lib/multiprocessing/queues.py", line 337,
    in get
    return recv()
    File "/root/py/python-3000/Lib/pickle.py", line 1327, in loads
    racquire()
    KeyboardInterrupt
    racquire()
    KeyboardInterrupt
    def loads(s, *, encoding="ASCII", errors="strict"):
    KeyboardInterrupt

    I'm not seeing frequent locks/failures when run with regrtest, but I am
    seeing them with "make test TESTOPTS=test_multiprocessing"

    I've attached full output. Still trying to figure it out

    @paulmelis
    Mannequin

    paulmelis mannequin commented Jun 13, 2008

    I made a copy of test_multiprocessing.py (to test_mp.py) and basically
    removed all test classes except _TestCondition. In it, I commented out all
    test methods except test_notify_all. When run with make test
    TESTOPTS="-v test_mp" there are lockups every few runs, but I also got
    the following (only three times in 40 or so runs so far):

    9:57|paul@tabu:~/c/py3k-svn> make test TESTOPTS="-v test_mp"

    Failed to find the necessary bits to build these modules:
    _gestalt
    To find the necessary bits, look in setup.py in detect_modules() for the
    module's name.

    find ./Lib -name '*.py[co]' -print | xargs rm -f
    ./python -E -bb ./Lib/test/regrtest.py -v test_mp
    test_mp
    test_notify_all (test.test_mp.WithProcessesTestCondition) ... ok
    test_notify_all (test.test_mp.WithThreadsTestCondition) ... ok
    test_notify_all (test.test_mp.WithManagerTestCondition) ... ok

    ----------------------------------------------------------------------
    Ran 3 tests in 1.087s

    OK
    1 test OK.
    CAUTION:  stdout isn't compared in verbose mode:
    a test that passes in verbose mode may fail without it.
    ./python -E -bb ./Lib/test/regrtest.py -v test_mp
    test_mp
    test_notify_all (test.test_mp.WithProcessesTestCondition) ... ok
    test_notify_all (test.test_mp.WithThreadsTestCondition) ... ok
    test_notify_all (test.test_mp.WithManagerTestCondition) ... Exception in thread Thread-28:
    Traceback (most recent call last):
      File "/home/paul/c/py3k-svn/Lib/threading.py", line 492, in _bootstrap_inner
        self.run()
      File "/home/paul/c/py3k-svn/Lib/threading.py", line 447, in run
        self._target(*self._args, **self._kwargs)
      File "/home/paul/c/py3k-svn/Lib/test/test_mp.py", line 208, in f
        cond.wait(timeout)
      File "/home/paul/c/py3k-svn/Lib/multiprocessing/managers.py", line 973, in wait
        return self._callmethod('wait', (timeout,))
      File "/home/paul/c/py3k-svn/Lib/multiprocessing/managers.py", line 748, in _callmethod
        raise convert_to_error(kind, result)
    RuntimeError: cannot wait on un-aquired lock

    @jnoller
    Mannequin

    jnoller mannequin commented Jun 13, 2008

    FWIW: In order to boost the logging level within the test(s) do the
    following:

    Search for LOG_LEVEL, set it to:
    LOG_LEVEL=util.SUBDEBUG

    And then in the main() replace:
    multiprocessing.get_logger().setLevel(LOG_LEVEL)
    With:
    multiprocessing.log_to_stderr(level=LOG_LEVEL)
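
Outside the test file, the same logging setup can be sketched as below. `multiprocessing.log_to_stderr()` and `multiprocessing.util.SUBDEBUG` are real names from the module; the `enable_mp_logging` wrapper is a hypothetical helper added here only for illustration.

```python
import logging
import multiprocessing
from multiprocessing import util

# SUBDEBUG (numeric level 5) is the most verbose level multiprocessing defines.
LOG_LEVEL = util.SUBDEBUG


def enable_mp_logging(level: int = LOG_LEVEL) -> logging.Logger:
    """Attach a stderr handler to multiprocessing's logger and set its level."""
    return multiprocessing.log_to_stderr(level)
```

Calling `enable_mp_logging()` before starting any processes or managers should make the module's internal debug messages appear on stderr.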

    @jnoller
    Mannequin

    jnoller mannequin commented Jun 13, 2008

    I also isolated the test(s) like Paul did, and it looks like a
    semi-consistent lockup in:
    File "/root/py/python-3000/Lib/multiprocessing/queues.py", line 337, in
    get
    racquire()

    This is running only the test_event test. The racquire traces back to
    synchronize.SemLock which calls into _multiprocessing.SemLock

    @donmez
    Mannequin

    donmez mannequin commented Jun 14, 2008

    Seems to work fine for me now with latest py3k branch.

    @roudkerk
    Mannequin

    roudkerk mannequin commented Jun 14, 2008

    I suspect the problems with WithManagerTestCondition.notify_all() may
    have to do with the thread safety of the proxies. If you replace
    Thread(...) by self.Process(...) in that test then the problem may go away.

    @jnoller
    Mannequin

    jnoller mannequin commented Jun 15, 2008

    After talking with Richard, I think the best way to attack this issue
    (and the other ones around suite unreliability) is to remove the
    unreliable test cases for the first beta, and then refactor the suite
    post beta with an eye towards reliability and clarity. Personally, I
    would like to break the suites up in the test_multiprocessing.py
    script to be more in the vein of other tests in Lib/test/...

    I removed the more unreliable test cases while keeping the core ones and
    wrote a quick bash script to do a burn-in of the "make test" command.
    Over 100 loops, I was unable to get the tests to hang with the reduced
    suite.

    I ran the same thing on trunk and py3k just to make sure I could not get
    it to hang/crash.

    @jnoller
    Mannequin

    jnoller mannequin commented Jun 15, 2008

    Here is the loop I ran the tests with:

    #!/bin/bash
    # The (( ... )) loop syntax is a bash extension, so run this with bash.

    for (( i=1; i<=100; i+=1 )); do
        make test TESTOPTS="-v test_multiprocessing"
    done
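
A variant of this burn-in loop that also notices hangs could look roughly like the sketch below. It is not the script used in this issue: `burn_in` is a hypothetical helper, and the choice of timeout is an assumption the caller has to make.

```python
import subprocess
from typing import Sequence


def burn_in(cmd: Sequence[str], runs: int, timeout: float) -> dict:
    """Run `cmd` repeatedly and count passes, failures, and hangs (timeouts)."""
    counts = {"ok": 0, "failed": 0, "hung": 0}
    for _ in range(runs):
        try:
            result = subprocess.run(cmd, timeout=timeout,
                                    stdout=subprocess.DEVNULL,
                                    stderr=subprocess.DEVNULL)
        except subprocess.TimeoutExpired:
            # subprocess.run kills the direct child before re-raising.
            counts["hung"] += 1
        else:
            counts["ok" if result.returncode == 0 else "failed"] += 1
    return counts
```

For the loop above, the call would be something like `burn_in(["make", "test", "TESTOPTS=-v test_multiprocessing"], runs=100, timeout=600)`, where the 600-second limit is an arbitrary guess. Note that only the direct child is killed on timeout, so processes spawned by make may linger after a hang.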

    @jnoller
    Mannequin

    jnoller mannequin commented Jun 16, 2008

    I don't have commit rights, so I can't apply the
    test_multiprocessing_reduced.diff myself. Anyone willing? I think this
    should help the buildbots.

    @warsaw
    Member Author

    warsaw commented Jun 19, 2008

    I'm going to knock this one down to critical since it's working for me
    now on OS X and buildbot looks green. We can address any additional
    patches after the beta release.

    @jnoller jnoller mannequin self-assigned this Jun 19, 2008
    @tebeka
    Mannequin

    tebeka mannequin commented Jun 20, 2008

    Still hangs for me on the 2.6 trunk on Ubuntu 8.04

    @jnoller
    Mannequin

    jnoller mannequin commented Jun 20, 2008

    Where exactly does it hang, Miki?

    @tebeka
    Mannequin

    tebeka mannequin commented Jun 20, 2008

    Jesse,

    I just ran "make test"; it runs until test_multiprocessing and then
    hangs there.

    @mdickinson
    Member

    test_multiprocessing is also still hanging for me, perhaps 30% of the
    times I run the test suite.

    When running the test by itself it seems to pass much more often, but
    not always. I just got the following output (on OS X 10.5.3/Intel).
    There was a hang at test_get; after around half-an-hour I hit Ctrl-C.

    Macintosh-3:trunk dickinsm$ ./python.exe Lib/test/test_multiprocessing.py
    test_array (__main__.WithProcessesTestArray) ... ok
    test_getobj_getlock_obj (__main__.WithProcessesTestArray) ... ok
    test_rawarray (__main__.WithProcessesTestArray) ... ok
    test_notify (__main__.WithProcessesTestCondition) ... ok
    test_notify_all (__main__.WithProcessesTestCondition) ... ok
    test_timeout (__main__.WithProcessesTestCondition) ... ok
    test_connection (__main__.WithProcessesTestConnection) ... ok
    test_duplex_false (__main__.WithProcessesTestConnection) ... ok
    test_sendbytes (__main__.WithProcessesTestConnection) ... ok
    test_spawn_close (__main__.WithProcessesTestConnection) ... ok
    test_event (__main__.WithProcessesTestEvent) ... ok
    test_finalize (__main__.WithProcessesTestFinalize) ... ok
    test_heap (__main__.WithProcessesTestHeap) ... ok
    test_import (__main__.WithProcessesTestImportStar) ... ok
    test_lock (__main__.WithProcessesTestLock) ... ok
    test_rlock (__main__.WithProcessesTestLock) ... ok
    test_enable_logging (__main__.WithProcessesTestLogging) ... ok
    test_level (__main__.WithProcessesTestLogging) ... ok
    test_active_children (__main__.WithProcessesTestProcess) ... ok
    test_cpu_count (__main__.WithProcessesTestProcess) ... ok
    test_current (__main__.WithProcessesTestProcess) ... ok
    test_process (__main__.WithProcessesTestProcess) ... ok
    test_recursion (__main__.WithProcessesTestProcess) ... ok
    test_terminate (__main__.WithProcessesTestProcess) ... ok
    test_fork (__main__.WithProcessesTestQueue) ... ok
    test_get (__main__.WithProcessesTestQueue) ... ^CTraceback (most recent call last):
      File "Lib/test/test_multiprocessing.py", line 1799, in <module>
        main()
      File "Lib/test/test_multiprocessing.py", line 1796, in main
        test_main(unittest.TextTestRunner(verbosity=2).run)
      File "Lib/test/test_multiprocessing.py", line 1786, in test_main
        run(suite)
      File "/Users/dickinsm/python_source/trunk/Lib/unittest.py", line 743, in run
        test(result)
      File "/Users/dickinsm/python_source/trunk/Lib/unittest.py", line 454, in __call__
        return self.run(*args, **kwds)
      File "/Users/dickinsm/python_source/trunk/Lib/unittest.py", line 450, in run
        test(result)
      File "/Users/dickinsm/python_source/trunk/Lib/unittest.py", line 454, in __call__
        return self.run(*args, **kwds)
      File "/Users/dickinsm/python_source/trunk/Lib/unittest.py", line 450, in run
        test(result)
      File "/Users/dickinsm/python_source/trunk/Lib/unittest.py", line 293, in __call__
        return self.run(*args, **kwds)
      File "/Users/dickinsm/python_source/trunk/Lib/unittest.py", line 272, in run
        testMethod()
      File "Lib/test/test_multiprocessing.py", line 415, in test_get
        parent_can_continue.wait()
      File "/Users/dickinsm/python_source/trunk/Lib/multiprocessing/synchronize.py", line 292, in wait
        self._cond.wait(timeout)
      File "/Users/dickinsm/python_source/trunk/Lib/multiprocessing/synchronize.py", line 201, in wait
        self._wait_semaphore.acquire(True, timeout)
    KeyboardInterrupt
    [50284 refs]
    Macintosh-3:trunk dickinsm$

    @amauryfa
    Member

    I think I narrowed the problem to a race condition in *subclasses* of
    threading.local:
    In threadmodule.c::local_getattro, there is a chance that self->dict is
    changed before PyObject_GenericGetAttr is called.
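
A Python-level stress test in the spirit of this diagnosis might look like the sketch below. It does not reproduce the C-level race in local_getattro directly; it only checks the invariant that race would break, namely that each thread sees its own state on a subclass of threading.local. `PerThread` and `stress_local` are hypothetical names.

```python
import threading


class PerThread(threading.local):
    """Subclass of threading.local; __init__ runs once in each thread that uses it."""

    def __init__(self):
        self.owner = threading.get_ident()


def stress_local(num_threads: int = 8, iterations: int = 2000) -> list:
    """Return a list of mismatches where a thread saw another thread's state."""
    shared = PerThread()
    mismatches = []

    def worker():
        me = threading.get_ident()
        for i in range(iterations):
            shared.value = (me, i)
            # Both attributes must come from this thread's own dict.
            if shared.owner != me or shared.value != (me, i):
                mismatches.append((me, shared.owner, shared.value))

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return mismatches
```

On an interpreter without the race, `stress_local()` should return an empty list; any entry would mean a thread read attributes from another thread's dict.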

    @Rhamphoryncus
    Mannequin

    Rhamphoryncus mannequin commented Jul 2, 2008

    On Wed, Jul 2, 2008 at 5:08 PM, Mark Dickinson <report@bugs.python.org> wrote:

    > Mark Dickinson <dickinsm@gmail.com> added the comment:
    >
    > Okay. I just got about 5 perfect runs of the test suite, followed by:
    >
    > Macintosh-3:trunk dickinsm$ ./python.exe -m test.regrtest
    > [...]
    > test_multiprocessing
    > Assertion failed: (bp != NULL), function PyObject_Malloc, file
    > Objects/obmalloc.c, line 746.
    > Abort trap (core dumped)
    >
    > I then did:
    >
    > gdb -c /cores/core.16235
    >
    > I've attached the traceback as traceback.txt

    Are you sure that's right? That traceback has no mention of
    PyObject_Malloc or obmalloc.c. Try checking the date. Also, if you
    use "gdb ./python.exe <corefile>" to start gdb it should print a
    warning if the program doesn't match the core.

    @tebeka
    Mannequin

    tebeka mannequin commented Jul 2, 2008

    I just ran "make test" and it never moves past test_multiprocessing.

    Maybe it's my machine which is dual cpu quad core (total of 8 cores)?

    @jnoller
    Mannequin

    jnoller mannequin commented Jul 3, 2008

    Doubtful Miki - I do the work on the module on an 8 Core Gentoo, 8 Core
    Mac Pro and Dual Core Macbook Pro - it's not a # of cores issue, unless
    it's simply a >1 issue.

    @mdickinson
    Member

    > Are you sure that's right?

    Not at all. :-)

    > That traceback has no mention of
    > PyObject_Malloc or obmalloc.c. Try checking the date. Also, if you
    > use "gdb ./python.exe <corefile>" to start gdb it should print a
    > warning if the program doesn't match the core.

    The date and time on the core file look right (Jul 2, 23:52 GMT+1), and
    gdb ./python.exe ... doesn't give any warning. So I'm not sure what I
    did wrong. I'll try again and see if I get the same thing.

    @mdickinson
    Member

    Here's a new traceback (a different error again, this time: a negative
    refcount in Objects/tupleobject.c.)

    @Rhamphoryncus
    Mannequin

    Rhamphoryncus mannequin commented Jul 3, 2008

    That looks better. It crashed while deleting an exception whose args
    tuple has a bogus refcount. It could be a refcount issue with the
    exception or the args, or with something that references them, or a
    dangling pointer, or a buffer overrun, etc.

    Things to try:

    1. Run "pystack" in gdb, from Misc/gdbinit
    2. Print the exception type. Use "up" until you reach
      BaseException_clear, then do "print self->ob_type->tp_name". Also do
      "print *self" and make sure the ob_refcnt is at 0 and the other fields
      look sane.
    3. Compile using --without-pymalloc and throw it at a real memory
      debugger. I'd suggest starting with your libc's own debugging
      options, as they tend to be less invasive:
      http://developer.apple.com/documentation/Performance/Conceptual/ManagingMemory/Articles/MallocDebug.html
      If that doesn't work, look at Electric Fence, Valgrind, or your
      tool of choice.

    @Rhamphoryncus
    Mannequin

    Rhamphoryncus mannequin commented Jul 3, 2008

    Also, make sure you have done a "make clean" since you last updated the
    tree, touched any file, or ran configure. The automatic dependency
    checking isn't 100% reliable.

    @jnoller
    Mannequin

    jnoller mannequin commented Jul 3, 2008

    Barring the segfaults Mark is seeing, I went through and removed all of
    the tests, and I have been incrementally adding them back one by one.
    _TestQueue seems to be the one (at least, the first) which is hanging
    intermittently in a racquire(). If anyone else who is seeing hangs
    doesn't mind, please try removing _TestQueue and see if you can still
    get it to hang.

    @amauryfa
    Member

    amauryfa commented Jul 3, 2008

    The two tracebacks provided by Mark seem to correspond to the following
    python stack (innermost last):

    Lib/test/test_multiprocessing.py, line 1005, in _test_map_unordered
    self.assertEqual(sorted(it), map(sqr, range(1000)))
    Lib/multiprocessing/pool.py, line 500, in IMapIterator.next()
    self._cond.acquire()
    Lib/threading.py, line 123, in _RLock.acquire():
    rc = self.__block.acquire(blocking)

    @donmez
    Mannequin

    donmez mannequin commented Jul 3, 2008

    The test hung for me on the first try but worked fine on the second.
    Weird.

    @paulmelis
    Mannequin

    paulmelis mannequin commented Jul 3, 2008

    On a Linux system (FC4) with r64686 of the Py3k branch I also still get
    occasional hangs (with ./python -E -bb ./Lib/test/regrtest.py -v
    test_multiprocessing). Mostly this seems to occur with the very first
    test executed, i.e. before any of the "test_... " lines have been generated.

    The following may or may not be related. Some time ago I decided to give
    valgrind a try to see if it could detect anything strange going on with
    the multiprocessing tests, specifically using the 'helgrind'
    thread-debugging tool that comes with it.

    Valgrind reports as its first error:

    ==9719== Thread #1: Bug in libpthread: sem_wait succeeded on semaphore
    without prior sem_post
    ==9719== at 0x4007FFF: sem_wait_WRK (hg_intercepts.c:1057)
    ==9719== by 0x4008094: sem_wait@* (hg_intercepts.c:1073)
    ==9719== by 0x46A0087: semlock_acquire (semaphore.c:310)
    ==9719== by 0x808C121: PyEval_EvalFrameEx (ceval.c:3371)
    ==9719== by 0x808D0FE: PyEval_EvalCodeEx (ceval.c:2808)
    ==9719== by 0x808B9D0: PyEval_EvalFrameEx (ceval.c:3469)
    ==9719== by 0x808D0FE: PyEval_EvalCodeEx (ceval.c:2808)
    ==9719== by 0x80F4B65: function_call (funcobject.c:628)
    ==9719== by 0x80D1207: PyObject_Call (abstract.c:2178)
    ==9719== by 0x80890EC: PyEval_EvalFrameEx (ceval.c:3672)
    ==9719== by 0x808C1A9: PyEval_EvalFrameEx (ceval.c:3459)
    ==9719== by 0x808C1A9: PyEval_EvalFrameEx (ceval.c:3459)
    ==9716== Thread #1 is the program's root thread

    I've been hesitant to report this as the claim that libpthread is broken
    is pretty bold. I contacted the valgrind devs about this, see [1].
    More recently, someone on the valgrind list reported problems that do
    seem to indicate there are broken libpthreads out there (see [2]), as
    this individual reports a semaphore wait not blocking where it should.

    Could it be that the multiprocessing tests are exposing one or more bugs
    in libpthread?

    [1] http://thread.gmane.org/gmane.comp.debugging.valgrind/8345
    [2] http://thread.gmane.org/gmane.comp.debugging.valgrind/8384

    @mdickinson
    Member

    > Are you sure that's right? That traceback has no mention of
    > PyObject_Malloc or obmalloc.c.

    So now I think that the traceback was right. There was no mention of
    PyObject_Malloc or obmalloc.c because the traceback only showed 1 of the 9
    threads, and the failed assert occurred in a different thread.

    I've attached another traceback, showing all the threads, and applying 'bt
    full' to the relevant thread. (This was from a different run, but with
    the same failed assertion at line 746 of Objects/obmalloc.c.)
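
The same pitfall, looking at one thread when the failure is in another, exists at the Python level. A minimal sketch of collecting the stack of every live thread is shown below. It covers only Python frames, not the C-level backtrace from gdb discussed here, and `format_all_thread_stacks` is a hypothetical helper name.

```python
import sys
import threading
import traceback


def format_all_thread_stacks() -> dict:
    """Map each live thread's name to its formatted Python stack."""
    names = {t.ident: t.name for t in threading.enumerate()}
    stacks = {}
    for ident, frame in sys._current_frames().items():
        # Threads not created through the threading module have no name here.
        name = names.get(ident, f"unknown-{ident}")
        stacks[name] = "".join(traceback.format_stack(frame))
    return stacks
```

Printing the values of the returned dict gives one traceback per thread, which makes it harder to miss a thread that is doing something unexpected.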

    @donmez
    Mannequin

    donmez mannequin commented Jul 8, 2008

    I managed to hang on Ubuntu, here is the backtrace that I got with CTRL-C:

    Process PoolWorker-5:1:
    Traceback (most recent call last):
      File "/home/cartman/Sources/py3k/Lib/multiprocessing/process.py", line
    232, in _bootstrap
        test_bsddb test_bsddb3 test_cProfile test_kqueue test_lib2to3
    2 skips unexpected on linux2:
        test_bsddb3 test_bsddb
    Process PoolWorker-5:3:
    Traceback (most recent call last):
      File "/home/cartman/Sources/py3k/Lib/multiprocessing/process.py", line
    232, in _bootstrap
        self.run()
      File "/home/cartman/Sources/py3k/Lib/multiprocessing/process.py", line
    88, in run
        self._target(*self._args, **self._kwargs)
      File "/home/cartman/Sources/py3k/Lib/multiprocessing/pool.py", line
    57, in worker
        self.run()
      File "/home/cartman/Sources/py3k/Lib/multiprocessing/process.py", line
    88, in run
        self._target(*self._args, **self._kwargs)
      File "/home/cartman/Sources/py3k/Lib/multiprocessing/pool.py", line
    57, in worker
        task = get()
      File "/home/cartman/Sources/py3k/Lib/multiprocessing/queues.py", line
    339, in get
        task = get()
      File "/home/cartman/Sources/py3k/Lib/multiprocessing/queues.py", line
    337, in get
        return recv()
      File "/home/cartman/Sources/py3k/Lib/pickle.py", line 1327, in loads
        racquire()
    KeyboardInterrupt
    Process PoolWorker-5:2:
    Traceback (most recent call last):
      File "/home/cartman/Sources/py3k/Lib/multiprocessing/process.py", line
    232, in _bootstrap
        self.run()
      File "/home/cartman/Sources/py3k/Lib/multiprocessing/process.py", line
    88, in run
        self._target(*self._args, **self._kwargs)
      File "/home/cartman/Sources/py3k/Lib/multiprocessing/pool.py", line
    57, in worker
        def loads(s, *, encoding="ASCII", errors="strict"):
    KeyboardInterrupt
    Process PoolWorker-5:4:
    Traceback (most recent call last):
      File "/home/cartman/Sources/py3k/Lib/multiprocessing/process.py", line
    232, in _bootstrap
        self.run()
      File "/home/cartman/Sources/py3k/Lib/multiprocessing/process.py", line
    88, in run
        self._target(*self._args, **self._kwargs)
      File "/home/cartman/Sources/py3k/Lib/multiprocessing/pool.py", line
    57, in worker
        task = get()
      File "/home/cartman/Sources/py3k/Lib/multiprocessing/queues.py", line
    337, in get
        racquire()
    KeyboardInterrupt
        task = get()
      File "/home/cartman/Sources/py3k/Lib/multiprocessing/queues.py", line
    337, in get
        racquire()
    KeyboardInterrupt
    ^CError in atexit._run_exitfuncs:
    make: *** [testall] Segmentation fault

    @amauryfa
    Member

    amauryfa commented Jul 8, 2008

    I found that on my Debian64, running test_multiprocessing under gdb
    hangs even before the first test is started - somewhere in the
    installation of the Manager.

    And it appears that the problem is described in bpo-874900: "threading
    module can deadlock after fork".
    I don't know if it's a good idea to mix fork and threads, but the patch
    I attached to bpo-874900 seems to correct the hang.
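
The general hazard of mixing fork and threads can be illustrated with the POSIX-only sketch below. It is not the specific lock implicated in bpo-874900, only a demonstration of the underlying mechanism: a lock held by another thread at fork time is copied into the child in its locked state, while the thread that would release it does not exist there. `child_can_acquire` is a hypothetical helper, and the timeout exists only so the child reports the problem instead of deadlocking.

```python
import os
import threading


def child_can_acquire(timeout: float = 0.5) -> bool:
    """Fork while another thread holds a lock; report whether the child can take it."""
    lock = threading.Lock()
    held = threading.Event()
    done = threading.Event()

    def holder():
        with lock:
            held.set()
            done.wait()  # keep the lock held across the fork

    t = threading.Thread(target=holder)
    t.start()
    held.wait()

    pid = os.fork()
    if pid == 0:
        # Child: only the forking thread survives, but the lock was copied
        # in its locked state and its owner does not exist in this process.
        acquired = lock.acquire(timeout=timeout)
        os._exit(0 if acquired else 1)

    _, status = os.waitpid(pid, 0)
    done.set()
    t.join()
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
```

On CPython this is expected to return False. Without the timeout the child would block forever, which matches the kind of hang reported in this issue. Recent Python versions may also emit a DeprecationWarning when forking a multi-threaded process.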

    @jnoller
    Mannequin

    jnoller mannequin commented Jul 8, 2008

    Thanks Amaury - I've been working through the tests and identifying
    the "problem children" - I'll finish that up and then try re-running
    them with the 874900 patch.

    @ncoghlan
    Contributor

    ncoghlan commented Jul 8, 2008

    I'm still seeing intermittent lockups on Ubuntu 7.10 - traceback on
    ctrl-C is similar to that posted by Ismail above.

    Since Jesse seems to be on top of this, I'll stick to using -x
    test_multiprocessing for the moment.

    @ncoghlan
    Contributor

    ncoghlan commented Jul 8, 2008

    Bumping back to release blocker for beta 2 (Barry may choose to defer it
    again, but it should at least be on his radar).

    @ncoghlan
    Contributor

    ncoghlan commented Jul 8, 2008

    Updated issue title to more accurately reflect scope of the problem.

    @ncoghlan ncoghlan changed the title from "test_multiprocessing hangs on OS X 10.5.3" to "test_multiprocessing hangs intermittently on POSIX platforms" Jul 8, 2008
    @ncoghlan
    Contributor

    ncoghlan commented Jul 8, 2008

    I forgot to mention that I am seeing the intermittent hangs on the trunk
    (2.6). I haven't been testing it on Py3k.

    @warsaw
    Member Author

    warsaw commented Jul 16, 2008

    Sadly _multiprocessing apparently doesn't even build on my Ubuntu 8.04
    machine and it still hangs on my 10.5 machine.

    @jnoller
    Mannequin

    jnoller mannequin commented Jul 16, 2008

    On Jul 15, 2008, at 8:38 PM, "Barry A. Warsaw"
    <report@bugs.python.org> wrote:

    > Barry A. Warsaw <barry@python.org> added the comment:
    >
    > Sadly _multiprocessing apparently doesn't even build on my Ubuntu 8.04
    > machine and it still hangs on my 10.5 machine.

    There is no reason it shouldn't compile on Ubuntu. Without the patch
    for the bug I added as a dependency (bpo-874900), we will keep seeing hangs.

    @jnoller
    Mannequin

    jnoller mannequin commented Jul 16, 2008

    Barry - can you email the compile errors?

    @warsaw
    Member Author

    warsaw commented Jul 16, 2008

    Something's very strange. The first make after configure fails to build
    _multiprocessing, but a subsequent make succeeds. I'll see if I can
    capture the compilation error message.

    @warsaw
    Member Author

    warsaw commented Jul 16, 2008

    Here's the 'make' output. What's strange is that after renaming
    _multiprocessing{_failed,}.so (that is, _multiprocessing_failed.so back
    to _multiprocessing.so), the import works just fine.

    building '_multiprocessing' extension
    creating
    build/temp.linux-i686-3.0/home/barry/projects/python/python30/Modules/_multiprocessing
    gcc -pthread -fPIC -fno-strict-aliasing -DNDEBUG -g -fwrapv -O3 -Wall
    -Wstrict-prototypes -DHAVE_SEM_OPEN=1 -DHAVE_FD_TRANSFER=1
    -DHAVE_SEM_TIMEDWAIT=1 -IModules/_multiprocessing -I.
    -I/home/barry/projects/python/python30/./Include -I. -IInclude
    -I./Include -I/usr/local/include
    -I/home/barry/projects/python/python30/Include
    -I/home/barry/projects/python/python30 -c
    /home/barry/projects/python/python30/Modules/_multiprocessing/multiprocessing.c
    -o
    build/temp.linux-i686-3.0/home/barry/projects/python/python30/Modules/_multiprocessing/multiprocessing.o
    gcc -pthread -fPIC -fno-strict-aliasing -DNDEBUG -g -fwrapv -O3 -Wall
    -Wstrict-prototypes -DHAVE_SEM_OPEN=1 -DHAVE_FD_TRANSFER=1
    -DHAVE_SEM_TIMEDWAIT=1 -IModules/_multiprocessing -I.
    -I/home/barry/projects/python/python30/./Include -I. -IInclude
    -I./Include -I/usr/local/include
    -I/home/barry/projects/python/python30/Include
    -I/home/barry/projects/python/python30 -c
    /home/barry/projects/python/python30/Modules/_multiprocessing/socket_connection.c
    -o
    build/temp.linux-i686-3.0/home/barry/projects/python/python30/Modules/_multiprocessing/socket_connection.o
    gcc -pthread -fPIC -fno-strict-aliasing -DNDEBUG -g -fwrapv -O3 -Wall
    -Wstrict-prototypes -DHAVE_SEM_OPEN=1 -DHAVE_FD_TRANSFER=1
    -DHAVE_SEM_TIMEDWAIT=1 -IModules/_multiprocessing -I.
    -I/home/barry/projects/python/python30/./Include -I. -IInclude
    -I./Include -I/usr/local/include
    -I/home/barry/projects/python/python30/Include
    -I/home/barry/projects/python/python30 -c
    /home/barry/projects/python/python30/Modules/_multiprocessing/semaphore.c -o
    build/temp.linux-i686-3.0/home/barry/projects/python/python30/Modules/_multiprocessing/semaphore.o
    gcc -pthread -shared
    build/temp.linux-i686-3.0/home/barry/projects/python/python30/Modules/_multiprocessing/multiprocessing.o
    build/temp.linux-i686-3.0/home/barry/projects/python/python30/Modules/_multiprocessing/socket_connection.o
    build/temp.linux-i686-3.0/home/barry/projects/python/python30/Modules/_multiprocessing/semaphore.o
    -L/usr/local/lib -o build/lib.linux-i686-3.0/_multiprocessing.so
    *** WARNING: renaming "_multiprocessing" since importing it failed: No
    module named _multiprocessing

    @jnoller
    Mannequin

    jnoller mannequin commented Jul 17, 2008

    bpo-874900's patch seems to have resolved the hangs. I am closing this
    issue as fixed.

    @jnoller jnoller mannequin closed this as completed Jul 17, 2008
    @tebeka
    Mannequin

    tebeka mannequin commented Jul 18, 2008

    I can confirm this is solved for me in beta 2.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022