classification
Title: buildbot: test_multiprocessing timeout (test_notify_all? test_pool_worker_lifetime?)
Type: behavior Stage: resolved
Components: Tests Versions: Python 3.2, Python 3.3, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: flox, haypo, jnoller, neologix, pitrou, python-dev
Priority: normal Keywords: buildbot, patch

Created on 2010-04-16 20:56 by haypo, last changed 2011-04-11 12:45 by haypo. This issue is now closed.

Files
File name Uploaded Description Edit
pool_shutdown_race.diff neologix, 2011-04-10 22:00 review
Messages (18)
msg103359 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2010-04-16 20:56
Example: http://www.python.org/dev/buildbot/3.x/builders/x86 FreeBSD 7.2 3.x/builds/480/steps/test/logs/stdio

-----------
test_multiprocessing
test test_multiprocessing failed -- Traceback (most recent call last):
  File "/usr/home/db3l/buildarea/3.x.bolen-freebsd7/build/Lib/test/test_multiprocessing.py", line 737, in test_notify_all
    self.assertReturnsIfImplemented(6, get_value, woken)
  File "/usr/home/db3l/buildarea/3.x.bolen-freebsd7/build/Lib/test/test_multiprocessing.py", line 111, in assertReturnsIfImplemented
    return self.assertEqual(value, res)
AssertionError: 6 != 5

Re-running test test_multiprocessing in verbose mode
test_array (test.test_multiprocessing.WithProcessesTestArray) ... skipped 'requires _ctypes'
test_getobj_getlock_obj (test.test_multiprocessing.WithProcessesTestArray) ... skipped 'requires _ctypes'
test_rawarray (test.test_multiprocessing.WithProcessesTestArray) ... skipped 'requires _ctypes'
test_notify (test.test_multiprocessing.WithProcessesTestCondition) ... ok
test_notify_all (test.test_multiprocessing.WithProcessesTestCondition) ... FAIL
test_timeout (test.test_multiprocessing.WithProcessesTestCondition) ... ok
test_connection (test.test_multiprocessing.WithProcessesTestConnection) ... ok
test_duplex_false (test.test_multiprocessing.WithProcessesTestConnection) ... ok
test_sendbytes (test.test_multiprocessing.WithProcessesTestConnection) ... ok
test_spawn_close (test.test_multiprocessing.WithProcessesTestConnection) ... ok
test_event (test.test_multiprocessing.WithProcessesTestEvent) ... ok
test_finalize (test.test_multiprocessing.WithProcessesTestFinalize) ... ok
test_heap (test.test_multiprocessing.WithProcessesTestHeap) ... ok
test_import (test.test_multiprocessing.WithProcessesTestImportStar) ... ok
test_listener_client (test.test_multiprocessing.WithProcessesTestListenerClient) ... ok
test_lock (test.test_multiprocessing.WithProcessesTestLock) ... ok
test_lock_context (test.test_multiprocessing.WithProcessesTestLock) ... ok
test_rlock (test.test_multiprocessing.WithProcessesTestLock) ... ok
test_enable_logging (test.test_multiprocessing.WithProcessesTestLogging) ... ok
test_level (test.test_multiprocessing.WithProcessesTestLogging) ... ok
test_rapid_restart (test.test_multiprocessing.WithProcessesTestManagerRestart) ... ok
test_apply (test.test_multiprocessing.WithProcessesTestPool) ... ok
test_async (test.test_multiprocessing.WithProcessesTestPool) ... ok
test_async_timeout (test.test_multiprocessing.WithProcessesTestPool) ... ok
test_imap (test.test_multiprocessing.WithProcessesTestPool) ... ok
test_imap_unordered (test.test_multiprocessing.WithProcessesTestPool) ... ok
test_make_pool (test.test_multiprocessing.WithProcessesTestPool) ... ok
test_map (test.test_multiprocessing.WithProcessesTestPool) ... ok
test_map_chunksize (test.test_multiprocessing.WithProcessesTestPool) ... ok
test_terminate (test.test_multiprocessing.WithProcessesTestPool) ... ok
test_pool_worker_lifetime (test.test_multiprocessing.WithProcessesTestPoolWorkerLifetime) ... 
command timed out: 1800 seconds without output, killing pid 87365
process killed by signal 9
program finished with exit code -1
elapsedTime=6317.267055
-----------
msg103367 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2010-04-16 21:53
The error occurs on buildbots:
 - i386 Ubuntu 3.x (r80115)
 - x86 FreeBSD 7.2 3.x (r80116, r80075)
 - x86 XP-4 3.x (r80115, r80106)
msg103372 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2010-04-16 22:00
See also issue #8429.
msg103374 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2010-04-16 22:09
All builds ends with "test_multiprocessing" and then the buildbot kills the proces. Only "x86 FreeBSD 7.2 3.x" last build (480, not the others) show us that test_pool_worker_lifetime (test.test_multiprocessing.WithProcessesTestPoolWorkerLifetime) hung.
msg103398 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2010-04-17 11:58
Same issue on "sparc solaris10 gcc 3.x":
http://www.python.org/dev/buildbot/builders/sparc solaris10 gcc 3.x/builds/639/steps/test/logs/stdio
msg113899 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-08-14 14:09
Still failing on 3.2 and 2.7

 - x86 FreeBSD 7.2 3.x r83981, r83971 ...
     http://www.python.org/dev/buildbot/all/builders/x86%20FreeBSD%207.2%203.x/builds/823
 - x86 FreeBSD 7.2 2.7 r83985, r83806 ...
     http://www.python.org/dev/buildbot/all/builders/x86%20FreeBSD%207.2%202.7/builds/67
 - x86 XP-4 2.7 r83985
     http://www.python.org/dev/buildbot/all/builders/x86%20XP-4%202.7/builds/146
 - sparc Ubuntu 2.7 r82939
     http://www.python.org/dev/buildbot/all/builders/sparc%20Ubuntu%202.7/builds/11


Traceback on "x86 XP-4 2.7":
test test_multiprocessing failed -- Traceback (most recent call last):
  File "D:\cygwin\home\db3l\buildarea\2.7.bolen-windows\build\lib\test\test_multiprocessing.py", line 746, in test_notify_all
    self.assertReturnsIfImplemented(6, get_value, woken)
  File "D:\cygwin\home\db3l\buildarea\2.7.bolen-windows\build\lib\test\test_multiprocessing.py", line 120, in assertReturnsIfImplemented
    return self.assertEqual(value, res)
AssertionError: 6 != 5


Traceback on "x86 FreeBSD 7.2 3.x":
======================================================================
FAIL: test_notify_all (test.test_multiprocessing.WithProcessesTestCondition)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/home/db3l/buildarea/3.x.bolen-freebsd7/build/Lib/test/test_multiprocessing.py", line 740, in test_notify_all
    self.assertReturnsIfImplemented(6, get_value, woken)
  File "/usr/home/db3l/buildarea/3.x.bolen-freebsd7/build/Lib/test/test_multiprocessing.py", line 114, in assertReturnsIfImplemented
    return self.assertEqual(value, res)
AssertionError: 6 != 3


Additionally, these 2.7 buildbots hang on test_multiprocessing:
 - PPC Leopard 2.7 r83996, r83961 ...
   http://www.python.org/dev/buildbot/all/builders/PPC%20Leopard%202.7/builds/84
 - PPC Tiger 2.7 r83961, r83879 ...
     http://www.python.org/dev/buildbot/all/builders/PPC%20Tiger%202.7/builds/100
 - x86 FreeBSD 7.2 2.7 r83765
     http://www.python.org/dev/buildbot/all/builders/x86%20FreeBSD%207.2%202.7/builds/54
msg113958 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-08-15 13:42
Seen on Windows 7 3.1:

test test_multiprocessing failed -- Traceback (most recent call last):
  File "D:\cygwin\home\db3l\buildarea\3.1.bolen-windows7\build\lib\test\test_multiprocessing.py", line 746, in test_notify_all
    self.assertReturnsIfImplemented(6, get_value, woken)
  File "D:\cygwin\home\db3l\buildarea\3.1.bolen-windows7\build\lib\test\test_multiprocessing.py", line 120, in assertReturnsIfImplemented
    return self.assertEqual(value, res)
AssertionError: 6 != 2


There's a second issue at the end of the log, it seems to be the case 3 of issue #9592 (RuntimeError: maximum recursion depth exceeded while calling a Python object).

http://www.python.org/dev/buildbot/all/builders/x86%20Windows7%203.1/builds/676
msg113965 - (view) Author: Jesse Noller (jnoller) * (Python committer) Date: 2010-08-15 15:09
Is this intermittent, or consistently failing? Updating it with more buildbot failures doesn't help.
msg113967 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-08-15 15:18
It is intermittent on most buildbots.
The exception is "x86 FreeBSD 7.2 3.x" where it occurs on each run.

http://www.python.org/dev/buildbot/all/builders/x86%20FreeBSD%207.2%203.x/
msg132943 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-04-04 13:38
Here is a nice trace on "PPC Leopard 3.x" thanks to faulthandler + regrtest timeout (30 minutes):
-------------------------------
...
[218/354] test_multiprocessing
Thread 0xf0617000:
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/threading.py", line 235 in wait
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/queue.py", line 185 in get
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/multiprocessing/pool.py", line 372 in _handle_results
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/threading.py", line 688 in run
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/threading.py", line 735 in _bootstrap_inner
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/threading.py", line 708 in _bootstrap

Thread 0xf0595000:
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/threading.py", line 235 in wait
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/queue.py", line 185 in get
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/multiprocessing/pool.py", line 331 in _handle_tasks
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/threading.py", line 688 in run
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/threading.py", line 735 in _bootstrap_inner
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/threading.py", line 708 in _bootstrap

Thread 0xf0513000:
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/multiprocessing/pool.py", line 324 in _handle_workers
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/threading.py", line 688 in run
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/threading.py", line 735 in _bootstrap_inner
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/threading.py", line 708 in _bootstrap

Thread 0xf0491000:
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/threading.py", line 235 in wait
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/queue.py", line 185 in get
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/multiprocessing/pool.py", line 102 in worker
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/threading.py", line 688 in run
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/threading.py", line 735 in _bootstrap_inner
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/threading.py", line 708 in _bootstrap

Thread 0xf040f000:
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/threading.py", line 235 in wait
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/queue.py", line 185 in get
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/multiprocessing/pool.py", line 102 in worker
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/threading.py", line 688 in run
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/threading.py", line 735 in _bootstrap_inner
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/threading.py", line 708 in _bootstrap

Thread 0xf038d000:
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/threading.py", line 235 in wait
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/queue.py", line 185 in get
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/multiprocessing/pool.py", line 102 in worker
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/threading.py", line 688 in run
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/threading.py", line 735 in _bootstrap_inner
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/threading.py", line 708 in _bootstrap

Thread 0xf030b000:
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/threading.py", line 235 in wait
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/queue.py", line 185 in get
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/multiprocessing/pool.py", line 102 in worker
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/threading.py", line 688 in run
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/threading.py", line 735 in _bootstrap_inner
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/threading.py", line 708 in _bootstrap

Thread 0xa09e1820:
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/multiprocessing/forking.py", line 134 in poll
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/multiprocessing/forking.py", line 149 in wait
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/multiprocessing/process.py", line 149 in join
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/multiprocessing/pool.py", line 458 in join
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/test/test_multiprocessing.py", line 1195 in test_pool_worker_lifetime
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/unittest/case.py", line 387 in _executeTestPart
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/unittest/case.py", line 442 in run
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/unittest/case.py", line 494 in __call__
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/unittest/suite.py", line 105 in run
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/unittest/suite.py", line 67 in __call__
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/unittest/suite.py", line 105 in run
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/unittest/suite.py", line 67 in __call__
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/unittest/suite.py", line 105 in run
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/unittest/suite.py", line 67 in __call__
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/test/support.py", line 1078 in run
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/test/support.py", line 1166 in _run_suite
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/test/support.py", line 1192 in run_unittest
  File "/Users/buildbot/buildarea/3.x.parc-leopard-1/build/Lib/test/test_multiprocessing.py", line 2141 in test_main
  File "./Lib/test/regrtest.py", line 1032 in runtest_inner
  File "./Lib/test/regrtest.py", line 826 in runtest
  File "./Lib/test/regrtest.py", line 650 in main
  File "./Lib/test/regrtest.py", line 1607 in <module>
make: *** [buildbottest] Error 1
-------------------------------
http://www.python.org/dev/buildbot/all/builders/PPC%20Leopard%203.x/builds/1634/steps/test/logs/stdio

 * The main thread (thread 0xa09e1820) is joining the pool (p.join()), wait in Popen.poll(): os.waitpid(self.pid, 0)
 * There are 4 workers waiting on their get() function, _Condition.wait(): if timeout is None: waiter.acquire()
 * The multiprocess Pool has 3 threads:

   * _handle_workers() (Thread 0xf0513000) is waiting in its polling loop:
       while pool._worker_handler._state == RUN and pool._state == RUN:
            pool._maintain_pool()
            time.sleep(0.1)
   * _handle_tasks() (thread 0xf0595000) is waiting on taskqueue.get()
   * _handle_results() (thread 0xf0617000) is waiting on get() method of another Queue

The main thread is waiting a child process, but we don't have the state of this process.
msg133184 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-04-07 00:03
Another test_multiprocessing failure (30 min timeout) on x86 XP-4 3.x:
----------------------
[125/354] test_multiprocessing
Thread 0x00000e5c:
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\threading.py", line 235 in wait
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\queue.py", line 185 in get
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\multiprocessing\pool.py", line 372 in _handle_results
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\threading.py", line 688 in run
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\threading.py", line 735 in _bootstrap_inner
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\threading.py", line 708 in _bootstrap

Thread 0x00000c14:
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\threading.py", line 235 in wait
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\queue.py", line 185 in get
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\multiprocessing\pool.py", line 331 in _handle_tasks
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\threading.py", line 688 in run
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\threading.py", line 735 in _bootstrap_inner
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\threading.py", line 708 in _bootstrap

Thread 0x00000d78:
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\multiprocessing\pool.py", line 324 in _handle_workers
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\threading.py", line 688 in run
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\threading.py", line 735 in _bootstrap_inner
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\threading.py", line 708 in _bootstrap

Thread 0x000003b8:
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\threading.py", line 235 in wait
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\queue.py", line 185 in get
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\multiprocessing\pool.py", line 102 in worker
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\threading.py", line 688 in run
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\threading.py", line 735 in _bootstrap_inner
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\threading.py", line 708 in _bootstrap

Thread 0x000001e8:
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\threading.py", line 235 in wait
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\queue.py", line 185 in get
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\multiprocessing\pool.py", line 102 in worker
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\threading.py", line 688 in run
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\threading.py", line 735 in _bootstrap_inner
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\threading.py", line 708 in _bootstrap

Thread 0x000002ac:
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\threading.py", line 235 in wait
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\queue.py", line 185 in get
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\multiprocessing\pool.py", line 102 in worker
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\threading.py", line 688 in run
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\threading.py", line 735 in _bootstrap_inner
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\threading.py", line 708 in _bootstrap

Thread 0x000005a4:
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\threading.py", line 235 in wait
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\queue.py", line 185 in get
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\multiprocessing\pool.py", line 102 in worker
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\threading.py", line 688 in run
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\threading.py", line 735 in _bootstrap_inner
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\threading.py", line 708 in _bootstrap

Thread 0x00000c84:
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\multiprocessing\forking.py", line 287 in wait
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\multiprocessing\process.py", line 149 in join
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\multiprocessing\pool.py", line 458 in join
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\test\test_multiprocessing.py", line 1143 in test_async_error_callback
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\unittest\case.py", line 387 in _executeTestPart
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\unittest\case.py", line 442 in run
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\unittest\case.py", line 494 in __call__
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\unittest\suite.py", line 105 in run
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\unittest\suite.py", line 67 in __call__
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\unittest\suite.py", line 105 in run
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\unittest\suite.py", line 67 in __call__
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\unittest\suite.py", line 105 in run
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\unittest\suite.py", line 67 in __call__
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\test\support.py", line 1078 in run
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\test\support.py", line 1166 in _run_suite
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\test\support.py", line 1192 in run_unittest
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\test\test_multiprocessing.py", line 2152 in test_main
  File "../lib/test/regrtest.py", line 1032 in runtest_inner
  File "../lib/test/regrtest.py", line 826 in runtest
  File "../lib/test/regrtest.py", line 650 in main
  File "../lib/test/regrtest.py", line 1607 in <module>

remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
]
----------------------
http://www.python.org/dev/buildbot/all/builders/x86%20XP-4%203.x/builds/4349/steps/test/logs/stdio
msg133462 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-04-10 12:41
I think those lockups are due to a race in the Pool shutdown code.
In Lib/multiprocessing/pool.py:

    def close(self):
        debug('closing pool')
        if self._state == RUN:
            self._state = CLOSE
            self._worker_handler._state = CLOSE
            self._taskqueue.put(None) 

We set the current state to CLOSE, and send None to the taskqueue, so that task_handler detects that we want to shut down the queue and sends None (sentinel) to the inqueue for each worker process.
When a worker process receives this sentinel, it exists, and when Pool's join method is called, each process is joined successfully.
Now, there's a problem, because of the worker_hanler thread.
This thread constantly starts new threads if existing one exited after having completed their work:

    def _handle_workers(pool):
        while pool._worker_handler._state == RUN and pool._state == RUN:
            pool._maintain_pool()
            time.sleep(0.1)
        debug('worker handler exiting')

where 

    def _maintain_pool(self):
        """Clean up any exited workers and start replacements for them.
        """
        if self._join_exited_workers():
            self._repopulate_pool()

Imagine the following happens:

worker_handler checks that the pool is still running (state == RUN), but before calling maintain_pool, it's preempted (releasal of the GIL), and Pool's close() methode is called :
state is set to CLOSE, None is put to taskqueue, and worker threads exit.
Then, Pool's join is called:

    def join(self):
        debug('joining pool')
        assert self._state in (CLOSE, TERMINATE)
        self._worker_handler.join()
        self._task_handler.join()
        self._result_handler.join()
        for p in self._pool:
            p.join()


this blocks until worker_handler exits. This thread sooner or later resumes and calls maintain_pool.
maintain_pool calls repopulate_pool, which recreates new worker threads/processes.
Then, worker_handler checks the current state, sees CLOSE, and exists.
Then, Pool's join blocks  there:
        for p in self._pool:
            p.join()

since the newly created processes never receive the sentinels (already consumed by the previous worker processes)...

This race can be reproduced almost every time by just adding:


    def _handle_workers(pool):
        while pool._worker_handler._state == RUN and pool._state == RUN:
+            time.sleep(1)
            pool._maintain_pool()
            time.sleep(0.1)
        debug('worker handler exiting')

Then something as simple as this will block:

p = multiprocessing.Pool(3)
p.close()
p.join()

I still have to think of a clean way to solve this.
msg133483 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-04-10 22:00
Attached is a patch fixing this race, and a similar one in Pool's terminate.
msg133484 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-04-10 22:03
Nice! See also issue11814.
msg133485 - (view) Author: Roundup Robot (python-dev) Date: 2011-04-10 22:23
New changeset d5e43afeede6 by Antoine Pitrou in branch '3.2':
Issue #8428: Fix a race condition in multiprocessing.Pool when terminating
http://hg.python.org/cpython/rev/d5e43afeede6
msg133487 - (view) Author: Roundup Robot (python-dev) Date: 2011-04-10 22:28
New changeset dfc61dc14f59 by Antoine Pitrou in branch '2.7':
Issue #8428: Fix a race condition in multiprocessing.Pool when terminating
http://hg.python.org/cpython/rev/dfc61dc14f59
msg133488 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-04-10 22:30
Should be fixed now, thank you Charles-François.
As for the TestCondition failure, there's a separate issue11790 open.
(Victor, please don't file many bugs in a single issue!)
msg133517 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-04-11 12:45
> (Victor, please don't file many bugs in a single issue!)

I thought that these issues were the same.

Victor
History
Date User Action Args
2011-04-11 12:45:30hayposetmessages: + msg133517
2011-04-10 22:30:28pitrousetstatus: open -> closed
versions: + Python 3.3
messages: + msg133488

resolution: fixed
stage: resolved
2011-04-10 22:28:25python-devsetmessages: + msg133487
2011-04-10 22:23:30python-devsetnosy: + python-dev
messages: + msg133485
2011-04-10 22:03:19pitrousetnosy: + pitrou
messages: + msg133484
2011-04-10 22:00:40neologixsetfiles: + pool_shutdown_race.diff
keywords: + patch
messages: + msg133483
2011-04-10 12:41:06neologixsetnosy: + neologix
messages: + msg133462
2011-04-07 00:03:12hayposetmessages: + msg133184
2011-04-04 13:38:41hayposetmessages: + msg132943
2010-08-15 15:18:57floxsetmessages: + msg113967
2010-08-15 15:09:58jnollersetmessages: + msg113965
2010-08-15 13:42:40floxsetmessages: + msg113958
2010-08-14 14:09:32floxsetversions: + Python 2.7
nosy: + flox

messages: + msg113899

components: + Tests
type: behavior
2010-04-19 10:56:31hayposetnosy: + jnoller
2010-04-19 10:51:01hayposettitle: buildbot: test_multiprocessing timeout -> buildbot: test_multiprocessing timeout (test_notify_all? test_pool_worker_lifetime?)
2010-04-17 11:58:47hayposetmessages: + msg103398
2010-04-16 22:09:17hayposetmessages: + msg103374
2010-04-16 22:00:41hayposetmessages: + msg103372
2010-04-16 21:53:14hayposetmessages: + msg103367
2010-04-16 20:56:51haypocreate