This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: test_concurrent_futures: ProcessPoolSpawnExecutorDeadlockTest.test_crash() fails with OSError: [Errno 9] Bad file descriptor
Type: Stage: resolved
Components: Tests Versions: Python 3.9
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: vstinner Nosy List: aeros, corona10, pitrou, tomMoral, vstinner
Priority: normal Keywords: patch

Created on 2020-03-17 17:01 by vstinner, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 19739 merged vstinner, 2020-04-27 19:06
PR 19751 closed aeros, 2020-04-28 08:44
PR 19758 closed vstinner, 2020-04-28 15:58
PR 19760 closed vstinner, 2020-04-28 16:37
PR 19788 open tomMoral, 2020-04-29 09:23
Messages (34)
msg364451 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-03-17 17:01
AMD64 Ubuntu Shared 3.x:
https://buildbot.python.org/all/#/builders/101/builds/532

test_crash (test.test_concurrent_futures.ProcessPoolSpawnExecutorDeadlockTest) ... 
Stdout:
15.51s 

Stderr:
Warning -- threading_cleanup() failed to cleanup 0 threads (count: 0, dangling: 3)
Dangling thread: <Thread(QueueFeederThread, stopped daemon 140540322510592)>
Dangling thread: <_MainThread(MainThread, started 140540525950528)>
Dangling thread: <_ExecutorManagerThread(Thread-111, stopped daemon 140540401628928)>

(...)

======================================================================
ERROR: test_crash (test.test_concurrent_futures.ProcessPoolSpawnExecutorDeadlockTest) [exit at task unpickle]
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_concurrent_futures.py", line 1119, in test_crash
    executor.shutdown(wait=True)
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/concurrent/futures/process.py", line 721, in shutdown
    self._executor_manager_thread_wakeup.wakeup()
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/concurrent/futures/process.py", line 93, in wakeup
    self._writer.send_bytes(b"")
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/multiprocessing/connection.py", line 205, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/multiprocessing/connection.py", line 416, in _send_bytes
    self._send(header + buf)
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/multiprocessing/connection.py", line 373, in _send
    n = write(self._handle, buf)
OSError: [Errno 9] Bad file descriptor

Stdout:
15.51s 

Stderr:
Warning -- threading_cleanup() failed to cleanup 0 threads (count: 0, dangling: 3)
Dangling thread: <Thread(QueueFeederThread, stopped daemon 140540322510592)>
Dangling thread: <_MainThread(MainThread, started 140540525950528)>
Dangling thread: <_ExecutorManagerThread(Thread-111, stopped daemon 140540401628928)>

--

On the same build, test_concurrent_futures timed out after 15 min, while running test_ressources_gced_in_workers():

0:29:08 load avg: 1.46 Re-running test_concurrent_futures in verbose mode
(...)
test_map_exception (test.test_concurrent_futures.ProcessPoolForkProcessPoolExecutorTest) ... ok
test_map_timeout (test.test_concurrent_futures.ProcessPoolForkProcessPoolExecutorTest) ... ok
test_max_workers_negative (test.test_concurrent_futures.ProcessPoolForkProcessPoolExecutorTest) ... ok
test_max_workers_too_large (test.test_concurrent_futures.ProcessPoolForkProcessPoolExecutorTest) ... skipped 'Windows-only process limit'
test_no_stale_references (test.test_concurrent_futures.ProcessPoolForkProcessPoolExecutorTest) ... ok
Timeout (0:15:00)!
Thread 0x00007f38bf766700 (most recent call first):
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/threading.py", line 303 in wait
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/multiprocessing/queues.py", line 227 in _feed
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/threading.py", line 882 in run
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/threading.py", line 944 in _bootstrap_inner
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/threading.py", line 902 in _bootstrap

Thread 0x00007f38bff67700 (most recent call first):
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/selectors.py", line 415 in select
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/multiprocessing/connection.py", line 936 in wait
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/concurrent/futures/process.py", line 372 in wait_result_broken_or_wakeup
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/concurrent/futures/process.py", line 319 in run
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/threading.py", line 944 in _bootstrap_inner
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/threading.py", line 902 in _bootstrap

Thread 0x00007f38c7128640 (most recent call first):
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/threading.py", line 303 in wait
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/concurrent/futures/_base.py", line 434 in result
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_concurrent_futures.py", line 955 in test_ressources_gced_in_workers
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/case.py", line 616 in _callTestMethod
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/case.py", line 659 in run
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/case.py", line 719 in __call__
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 122 in run
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 84 in __call__
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 122 in run
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 84 in __call__
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 122 in run
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 84 in __call__
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/runner.py", line 176 in run
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/support/__init__.py", line 2079 in _run_suite
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/support/__init__.py", line 2201 in run_unittest
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/libregrtest/runtest.py", line 209 in _test_module
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/libregrtest/runtest.py", line 234 in _runtest_inner2
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/libregrtest/runtest.py", line 270 in _runtest_inner
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/libregrtest/runtest.py", line 153 in _runtest
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/libregrtest/runtest.py", line 193 in runtest
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/libregrtest/main.py", line 318 in rerun_failed_tests
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/libregrtest/main.py", line 691 in _main
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/libregrtest/main.py", line 634 in main
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/libregrtest/main.py", line 712 in main
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/__main__.py", line 2 in <module>
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/runpy.py", line 87 in _run_code
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/runpy.py", line 194 in _run_module_as_main
test_ressources_gced_in_workers (test.test_concurrent_futures.ProcessPoolForkProcessPoolExecutorTest) ... Makefile:1171: recipe for target 'buildbottest' failed
make: *** [buildbottest] Error 1

command timed out: 1200 seconds without output running [b'make', b'buildbottest', b'TESTOPTS=-j2 --junit-xml test-results.xml ${BUILDBOT_TESTOPTS}', b'TESTPYTHONOPTS=', b'TESTTIMEOUT=900'], attempting to kill
program finished with exit code 2
elapsedTime=3849.667593
msg364546 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-03-18 18:24
Same bug on AMD64 FreeBSD Non-Debug 3.x:
https://buildbot.python.org/all/#/builders/214/builds/472

======================================================================
ERROR: test_crash (test.test_concurrent_futures.ProcessPoolSpawnExecutorDeadlockTest) [crash during func execution on worker]
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/home/buildbot/python/3.x.koobs-freebsd-9e36.nondebug/build/Lib/test/test_concurrent_futures.py", line 1119, in test_crash
    executor.shutdown(wait=True)
  File "/usr/home/buildbot/python/3.x.koobs-freebsd-9e36.nondebug/build/Lib/concurrent/futures/process.py", line 721, in shutdown
    self._executor_manager_thread_wakeup.wakeup()
  File "/usr/home/buildbot/python/3.x.koobs-freebsd-9e36.nondebug/build/Lib/concurrent/futures/process.py", line 93, in wakeup
    self._writer.send_bytes(b"")
  File "/usr/home/buildbot/python/3.x.koobs-freebsd-9e36.nondebug/build/Lib/multiprocessing/connection.py", line 205, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/home/buildbot/python/3.x.koobs-freebsd-9e36.nondebug/build/Lib/multiprocessing/connection.py", line 416, in _send_bytes
    self._send(header + buf)
  File "/usr/home/buildbot/python/3.x.koobs-freebsd-9e36.nondebug/build/Lib/multiprocessing/connection.py", line 373, in _send
    n = write(self._handle, buf)
OSError: [Errno 9] Bad file descriptor

Stdout:
8.35s 

Stderr:
Warning -- threading_cleanup() failed to cleanup 0 threads (count: 0, dangling: 3)
Dangling thread: <_MainThread(MainThread, started 34369908736)>
Dangling thread: <_ExecutorManagerThread(Thread-114, stopped daemon 34396434432)>
Dangling thread: <Thread(QueueFeederThread, stopped daemon 34377508352)>

----------------------------------------------------------------------
msg364552 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-03-18 18:31
> Same bug on AMD64 FreeBSD Non-Debug 3.x:
> https://buildbot.python.org/all/#/builders/214/builds/472

Oh, test_crash failed twice, but not on the same test case:

* ERROR: test_crash (test.test_concurrent_futures.ProcessPoolSpawnExecutorDeadlockTest) [crash during func execution on worker]
* ERROR: test_crash (test.test_concurrent_futures.ProcessPoolForkExecutorDeadlockTest) [crash during func execution on worker]

The second failure was when test_concurrent_futures was re-run sequentially.
msg367434 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-04-27 16:02
See also bpo-30966 "Add multiprocessing.SimpleQueue.close()".
msg367449 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-04-27 19:36
New changeset 5d1f32d33ba24d0aa87235ae40207bb57778388b by Victor Stinner in branch 'master':
bpo-39995: Split test_concurrent_futures.test_crash() into sub-tests (GH-19739)
https://github.com/python/cpython/commit/5d1f32d33ba24d0aa87235ae40207bb57778388b
msg367450 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-04-27 19:46
AMD64 Ubuntu Shared 3.x:
https://buildbot.python.org/all/#/builders/101/builds/809

======================================================================
ERROR: test_shutdown_no_wait (test.test_concurrent_futures.ProcessPoolForkserverProcessPoolShutdownTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_concurrent_futures.py", line 542, in test_shutdown_no_wait
    executor.shutdown(wait=False)
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/concurrent/futures/process.py", line 724, in shutdown
    self._executor_manager_thread_wakeup.wakeup()
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/concurrent/futures/process.py", line 80, in wakeup
    self._writer.send_bytes(b"")
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/multiprocessing/connection.py", line 188, in send_bytes
    self._check_closed()
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/multiprocessing/connection.py", line 141, in _check_closed
    raise OSError("handle is closed")
OSError: handle is closed

(...)
0:32:37 load avg: 1.64 Re-running test_concurrent_futures in verbose mode
(...)

======================================================================
ERROR: test_shutdown_no_wait (test.test_concurrent_futures.ProcessPoolForkProcessPoolShutdownTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_concurrent_futures.py", line 542, in test_shutdown_no_wait
    executor.shutdown(wait=False)
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/concurrent/futures/process.py", line 724, in shutdown
    self._executor_manager_thread_wakeup.wakeup()
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/concurrent/futures/process.py", line 80, in wakeup
    self._writer.send_bytes(b"")
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/multiprocessing/connection.py", line 205, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/multiprocessing/connection.py", line 416, in _send_bytes
    self._send(header + buf)
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/multiprocessing/connection.py", line 373, in _send
    n = write(self._handle, buf)
OSError: [Errno 9] Bad file descriptor
msg367451 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-04-27 19:54
> See also bpo-30966 "Add multiprocessing.SimpleQueue.close()".

I pushed a commit 1a275013d1ecc2e3778d64fda86174b2f13d6969: 
"Process.shutdown(wait=True) of concurrent.futures now closes
explicitly the result queue."

test_shutdown_deadlock_pickle() still rely on the queue to be closed implicitly. Queue created at:

  (...)
  File "/home/vstinner/python/master/Lib/test/test_concurrent_futures.py", lineno 1196
    with self.executor_type(max_workers=2,
  File "/home/vstinner/python/master/Lib/concurrent/futures/process.py", lineno 637
    self._result_queue = mp_context.SimpleQueue()
  File "/home/vstinner/python/master/Lib/multiprocessing/context.py", lineno 113
    return SimpleQueue(ctx=self.get_context())
  File "/home/vstinner/python/master/Lib/multiprocessing/queues.py", lineno 341
    self._reader, self._writer = connection.Pipe(duplex=False)
  File "/home/vstinner/python/master/Lib/multiprocessing/connection.py", lineno 539
    c2 = Connection(fd2, readable=False)
msg367453 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-04-27 20:09
AMD64 Fedora Stable Clang Installed 3.x:
https://buildbot.python.org/all/#/builders/127/builds/679

0:04:21 load avg: 1.29 [423/423/1] test_concurrent_futures failed (2 min 39 sec)
Warning -- threading_cleanup() failed to cleanup -1 threads (count: 0, dangling: 3)
Warning -- Dangling thread: <_MainThread(MainThread, started 139673296918336)>
Warning -- Dangling thread: <Thread(QueueFeederThread, stopped daemon 139673045362432)>
Warning -- Dangling thread: <_ExecutorManagerThread(Thread-145, stopped 139673053914880)>
Warning -- threading_cleanup() failed to cleanup 0 threads (count: 0, dangling: 3)
Warning -- Dangling thread: <_MainThread(MainThread, started 139673296918336)>
Warning -- Dangling thread: <Thread(QueueFeederThread, stopped daemon 139673045362432)>
Warning -- Dangling thread: <_ExecutorManagerThread(Thread-145, stopped 139673053914880)>
/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 5 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
Warning -- multiprocessing.process._dangling was modified by test_concurrent_futures
Warning -- threading._dangling was modified by test_concurrent_futures
test_cancel (test.test_concurrent_futures.FutureTests) ... ok
test_cancelled (test.test_concurrent_futures.FutureTests) ... ok
test_done (test.test_concurrent_futures.FutureTests) ... ok
(...)
test_first_exception_some_already_complete (test.test_concurrent_futures.ThreadPoolWaitTests) ... 1.60s ok
test_pending_calls_race (test.test_concurrent_futures.ThreadPoolWaitTests) ... 0.11s ok
test_timeout (test.test_concurrent_futures.ThreadPoolWaitTests) ... 6.11s ok
Traceback (most recent call last):
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/util.py", line 300, in _run_finalizers
    finalizer()
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/synchronize.py", line 87, in _cleanup
    sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
Traceback (most recent call last):
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/util.py", line 300, in _run_finalizers
    finalizer()
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/synchronize.py", line 87, in _cleanup
    sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
Traceback (most recent call last):
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/util.py", line 300, in _run_finalizers
    finalizer()
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/synchronize.py", line 87, in _cleanup
    sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
Traceback (most recent call last):
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/util.py", line 300, in _run_finalizers
    finalizer()
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/synchronize.py", line 87, in _cleanup
    sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
Traceback (most recent call last):
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/util.py", line 300, in _run_finalizers
    finalizer()
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/synchronize.py", line 87, in _cleanup
    sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory

======================================================================
ERROR: test_killed_child (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/test/test_concurrent_futures.py", line 130, in tearDown
    self.executor.shutdown(wait=True)
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/concurrent/futures/process.py", line 724, in shutdown
    self._executor_manager_thread_wakeup.wakeup()
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/concurrent/futures/process.py", line 80, in wakeup
    self._writer.send_bytes(b"")
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/connection.py", line 205, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/connection.py", line 416, in _send_bytes
    self._send(header + buf)
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/connection.py", line 373, in _send
    n = write(self._handle, buf)
OSError: [Errno 9] Bad file descriptor

----------------------------------------------------------------------

Ran 193 tests in 159.805s

FAILED (errors=1, skipped=6)
  Before: set()
  After:  {<weakref at 0x7f0828b17bd0; to 'SpawnProcess' at 0x7f082ace7400>, <weakref at 0x7f0828b17b80; to 'SpawnProcess' at 0x7f082ace7880>, <weakref at 0x7f0828b17b30; to 'SpawnProcess' at 0x7f0828bbc0a0>, <weakref at 0x7f0828b17720; to 'SpawnProcess' at 0x7f082ace72e0>, <weakref at 0x7f0828b17a90; to 'SpawnProcess' at 0x7f082ace7910>} 
  Before: {<weakref at 0x7f082b5ad400; to '_MainThread' at 0x7f082bcd8850>}
  After:  {<weakref at 0x7f0828b4d900; to 'Thread' at 0x7f0828bbc4c0>, <weakref at 0x7f0828b4d0e0; to '_MainThread' at 0x7f082bcd8850>, <weakref at 0x7f0828b17a40; to '_ExecutorManagerThread' at 0x7f0828bbc8e0>} 
test test_concurrent_futures failed

== Tests result: FAILURE ==
msg367455 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-04-27 20:34
x86 Gentoo Installed with X 3.x:
https://buildbot.python.org/all/#/builders/128/builds/726

test_del_shutdown (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolShutdownTest) ...

Warning -- Unraisable exception
Exception ignored in: <function _ExecutorManagerThread.__init__.<locals>.weakref_cb at 0xb5067898>
Traceback (most recent call last):
  File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.installed/build/target/lib/python3.9/concurrent/futures/process.py", line 281, in weakref_cb
    thread_wakeup.wakeup()
  File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.installed/build/target/lib/python3.9/concurrent/futures/process.py", line 80, in wakeup
    self._writer.send_bytes(b"")
  File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.installed/build/target/lib/python3.9/multiprocessing/connection.py", line 205, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.installed/build/target/lib/python3.9/multiprocessing/connection.py", line 416, in _send_bytes
    self._send(header + buf)
  File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.installed/build/target/lib/python3.9/multiprocessing/connection.py", line 373, in _send
    n = write(self._handle, buf)
OSError: [Errno 9] Bad file descriptor

0.04s ok
msg367458 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-04-27 21:15
ERROR: test_killed_child (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest)
  (...)
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/connection.py", line 373, in _send
    n = write(self._handle, buf)
OSError: [Errno 9] Bad file descriptor

It seems like Connection.close() was called while Connection._send() was called. I added debug logs:

* self._handle was equal to 4 at the function entry
* self._handle was equal to None when write() was called
msg367460 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-04-27 21:26
> It seems like Connection.close() was called while Connection._send() was called. I added debug logs:

The connection was closed by terminate_broken() called by _ExecutorManagerThread.run() thread:

test_killed_child (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest) ... close handle 4
  File "/home/vstinner/python/master/Lib/threading.py", line 908, in _bootstrap
    self._bootstrap_inner()
  File "/home/vstinner/python/master/Lib/threading.py", line 950, in _bootstrap_inner
    self.run()
  File "/home/vstinner/python/master/Lib/concurrent/futures/process.py", line 313, in run
    self.terminate_broken(cause)
  File "/home/vstinner/python/master/Lib/concurrent/futures/process.py", line 456, in terminate_broken
    self.join_executor_internals()
  File "/home/vstinner/python/master/Lib/concurrent/futures/process.py", line 503, in join_executor_internals
    self.thread_wakeup.close()
  File "/home/vstinner/python/master/Lib/concurrent/futures/process.py", line 75, in close
    self._writer.close()
msg367462 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-04-27 21:46
terminate_broken() method was added by:

commit 0e89076247580ba0e570c4816f0e5628a7e36e83
Author: Thomas Moreau <thomas.moreau.2010@gmail.com>
Date:   Sun Mar 1 21:49:14 2020 +0100

    bpo-39678: refactor queue manager thread (GH-18551)
msg367463 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-04-27 21:48
> ERROR: test_killed_child (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest)

The patch below makes this test failure more likely:

diff --git a/Lib/multiprocessing/connection.py b/Lib/multiprocessing/connection.py
index 510e4b5aba..63518e55d9 100644
--- a/Lib/multiprocessing/connection.py
+++ b/Lib/multiprocessing/connection.py
@@ -370,6 +370,7 @@ class Connection(_ConnectionBase):
     def _send(self, buf, write=_write):
         remaining = len(buf)
         while True:
+            time.sleep(0.050)
             n = write(self._handle, buf)
             remaining -= n
             if remaining == 0:
msg367465 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-04-27 22:05
It seems like test_killed_child() race condition was introduced by:

commit a5cbab552d294d99fde864306632d7e511a75d3c (refs/bisect/bad)
Author: Thomas Moreau <thomas.moreau.2010@gmail.com>
Date:   Sun Feb 16 19:09:26 2020 +0100

    bpo-39104: Fix hanging ProcessPoolExecutor on shutdown nowait with pickling failure (GH-17670)
msg367486 - (view) Author: Kyle Stanley (aeros) * (Python committer) Date: 2020-04-28 03:09
I can't be certain for the other failures, but I'm currently exploring a potential solution for addressing the `test_killed_child` failure. To me, it seems to be a race condition with attempting to call _ThreadWakeup.close() while there are still bytes being sent. IMO, we should wait until closing the pipe's reader and writer until it's finished sending or receiving bytes. Here's one way to implement that w/ threading.Event:

diff --git a/Lib/concurrent/futures/process.py b/Lib/concurrent/futures/process.py
index 8e9b69a8f0..9bf073fc34 100644
--- a/Lib/concurrent/futures/process.py
+++ b/Lib/concurrent/futures/process.py
@@ -68,21 +68,30 @@ class _ThreadWakeup:
     def __init__(self):
         self._closed = False
         self._reader, self._writer = mp.Pipe(duplex=False)
+        # Used to ensure pipe is not closed while sending or receiving bytes
+        self._not_running = threading.Event()
+        # Initialize event as True
+        self._not_running.set()
 
     def close(self):
         if not self._closed:
+            self._not_running.wait()
             self._closed = True
             self._writer.close()
             self._reader.close()
 
     def wakeup(self):
         if not self._closed:
+            self._not_running.clear()
             self._writer.send_bytes(b"")
+            self._not_running.set()
 
     def clear(self):
         if not self._closed:
+            self._not_running.clear()
             while self._reader.poll():
                 self._reader.recv_bytes()
+            self._not_running.set()


From using Victor's method of replicating the failure with inserting a `time.sleep(0.050)` in multiprocessing.Connection._send(), it appears to fix the failure in test_killed_child. I believe it would also fix the other failures since they appear to be caused by the same core issue, but I've been unable to replicate them locally so I'm not 100% certain.
msg367487 - (view) Author: Kyle Stanley (aeros) * (Python committer) Date: 2020-04-28 03:16
After writing the above out and a bit of further consideration, I think it might make more sense to wait for the event after setting `self._closed = True` so that it prevents future wakeup() and clear() calls from reading/writing to the pipe, while still allowing ones that are currently ongoing to finish.

Thoughts?
msg367503 - (view) Author: Thomas Moreau (tomMoral) * Date: 2020-04-28 07:05
Sorry I just saw this. It seems that I introduced this regression.

One of the goal of having a `ThreadWakeup` and not a `SimpleQueue` is to avoid using locks that can hinder the performance of the executor. I don't remember the exact details but I think I did some benchmark and it was giving lower performances for large set of small tasks (not so sure anymore). If a fully synchronous method is chosen, maybe it is safer to rely on a `SimpleQueue` as it will be lower maintenance. If the solution proposed by aeros is chosen, the event can probably be replaced by a lock no? It would be easier to understand I guess.

From the failures, it seems to be a race condition between `shutdown` and `terminate_broken`. The race condition should not be possible in `submit` as in principle, the `join_executor_internals` is only called when no new task can be submitted.
One solution would be to use the `self._shutdown_lock` from the executor to protect the call to `close` in `terminate_broken` and the call to `self._thread_wakeup.wakeup` in `shutdown`. That way, the lock is only acquired at critical points without being used all the time. This could also be done by adding `lock=True/False` to only lock the potentially dangerous calls.
msg367508 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2020-04-28 08:08
How about the following (untested):

diff --git a/Lib/concurrent/futures/process.py b/Lib/concurrent/futures/process.py
index 8e9b69a8f0..c0c2eb3032 100644
--- a/Lib/concurrent/futures/process.py
+++ b/Lib/concurrent/futures/process.py
@@ -66,23 +66,29 @@ _global_shutdown = False
 
 class _ThreadWakeup:
     def __init__(self):
-        self._closed = False
         self._reader, self._writer = mp.Pipe(duplex=False)
 
     def close(self):
-        if not self._closed:
-            self._closed = True
-            self._writer.close()
-            self._reader.close()
+        r, w = self._reader, self._writer
+        self._reader = self._writer = None
+        if r is not None:
+            r.close()
+            w.close()
 
     def wakeup(self):
-        if not self._closed:
+        try:
             self._writer.send_bytes(b"")
+        except AttributeError:
+            # Closed
+            pass
 
     def clear(self):
-        if not self._closed:
+        try:
             while self._reader.poll():
                 self._reader.recv_bytes()
+        except AttributeError:
+            # Closed
+            pass
 
 
 def _python_exit():
msg367512 - (view) Author: Kyle Stanley (aeros) * (Python committer) Date: 2020-04-28 09:10
Oops, it seems that I opened PR-19751 a bit preemptively. When I get the chance, I'll see if Antoine's implementation can address the failures and do some comparisons.
msg367513 - (view) Author: Kyle Stanley (aeros) * (Python committer) Date: 2020-04-28 09:21
I decided to close PR-19751. Both because it does not correctly address the race condition (due to an oversight on my part) and it would add substantial overhead to _ThreadWakeup. Instead, I agree that we should explore a non-locking solution.
msg367539 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-04-28 16:06
Antoine Pitrou: "How about the following (untested): (...)"

Using Antoine's patch, test_killed_child() still fails (I used my msg367463 patch to make the failure more likely).
msg367540 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2020-04-28 16:09
With the same traceback and error message?
msg367542 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-04-28 16:38
With my msg367463 patch (add sleep), test_cancel_futures() fails. Example:

======================================================================
FAIL: test_cancel_futures (test.test_concurrent_futures.ProcessPoolForkProcessPoolShutdownTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/vstinner/python/master/Lib/test/test_concurrent_futures.py", line 353, in test_cancel_futures
    self.assertTrue(len(cancelled) >= 35, msg=f"{len(cancelled)=}")
AssertionError: False is not true : len(cancelled)=0
msg367543 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-04-28 16:43
Thomas Moreau: "One solution would be to use the `self._shutdown_lock` from the executor to protect the call to `close` in `terminate_broken` and the call to `self._thread_wakeup.wakeup` in `shutdown`. That way, the lock is only acquired at critical points without being used all the time. This could also be done by adding `lock=True/False` to only lock the potentially dangerous calls."

I wrote a conservative PR 19760 which always lock ProcessPoolExecutor._shutdown_lock while accessing _ThreadWakeup.

PR 19760 fix test_killed_child(): it doesn't fail anymore, even with my msg367463 patch (add sleep).
msg367544 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-04-28 17:00
> With my msg367463 patch (add sleep), test_cancel_futures() fails.

The test uses sleep() as a synchronization primitive:

   executor.submit(time.sleep, .1).

That's bad, but it doesn't *have to* be fixed now. My msg367463 patch adds an artifical sleep: the test looks fine in practice. I prefer to wait until it fails on a buildbot worker before spending time to make the test more reliable.
msg367605 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-04-29 01:32
New changeset a4dfe8ede5a37576e17035dccfe109ba7752237e by Victor Stinner in branch 'master':
bpo-39995: Fix concurrent.futures _ThreadWakeup (GH-19760)
https://github.com/python/cpython/commit/a4dfe8ede5a37576e17035dccfe109ba7752237e
msg367608 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-04-29 01:34
I'm still getting more and more buildbot emails about test_concurrent_futures, so I merged my PR 19760 to fix buildbots.

Please revert or modify my PR 19760 if you have a better approach, but please check that test_killed_child() and ProcessPoolForkExecutorDeadlockTest tests don't fail with my msg367463 patch.

I would still appreciated a post-commit review of my change, since I don't know well concurrent.futures code :

bpo-39995: Fix concurrent.futures _ThreadWakeup (GH-19760)
https://github.com/python/cpython/commit/a4dfe8ede5a37576e17035dccfe109ba7752237e
msg367616 - (view) Author: Kyle Stanley (aeros) * (Python committer) Date: 2020-04-29 03:00
I am a bit concerned about the performance implication of accessing the thread wakeup through a lock in the call queue, but for now I think it's reasonable to address the buildbot failure with a locking solution while we try to find a better one.

I'm not certain if we'll be able to find one that correctly addresses the failures with zero additional locking, but I think we may be able to reduce the number of times we use the lock compared to the implementation in GH-19760. I'll spend some time exploring that as I find the time to, and report back with any significant findings.
msg367617 - (view) Author: Kyle Stanley (aeros) * (Python committer) Date: 2020-04-29 03:05
Also, thanks to Victor for coming up with an interim solution, and for the feedback from Antoine and Thomas. GH-19760 is a significant improvement from my initial proposal in GH-19751.
msg367631 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2020-04-29 08:02
I looked at the change and it seemed ok to me. Perhaps Thomas can give it a look too.
msg367632 - (view) Author: Thomas Moreau (tomMoral) * Date: 2020-04-29 08:20
I think this is a reasonable way to move on.Some of the locks can probably be removed but this needs careful investigation and in the mean time, it hinders everyone. Thanks victor for the fast fix up! 

To me, an interesting observation is that the failure seems to only happen when the executor is in broken state. If we can find a way to adapt the behavior to be more conservative on these states (which are not impacting perf) that would be nice.

I will try to look at it today and see if I can remove some of the locks while still not failing with Victor's patch. We can probably remove the lock on `clear` safely. For the others, it might be more complex.
msg367635 - (view) Author: Thomas Moreau (tomMoral) * Date: 2020-04-29 09:27
I did GH 19788 with a few modifications. There is only one lock that seems to mater for the perf, and I actually added one other (the one in _python_exit, which necessitate another bug fix for fork context).

I did not benchmark to see if it was worth it in term of perf.
msg394494 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-05-26 22:49
"test_concurrent_futures: ProcessPoolSpawnExecutorDeadlockTest.test_crash() fails with OSError: [Errno 9] Bad file descriptor"

I didn't see this failure recently, I close the issue. Since changes were pushed, I mark the issue as fixed.

If someone has ideas to enhance the code, I suggest to open a new more specific issue. I consider the initial issue (buildot failure) as fixed.
msg394707 - (view) Author: Kyle Stanley (aeros) * (Python committer) Date: 2021-05-29 02:25
Thanks for closing up the issue, Victor :)
History
Date User Action Args
2022-04-11 14:59:28adminsetgithub: 84176
2021-05-29 02:25:33aerossetmessages: + msg394707
2021-05-26 22:49:56vstinnersetstatus: open -> closed
resolution: fixed
messages: + msg394494

stage: patch review -> resolved
2020-04-29 09:27:53tomMoralsetmessages: + msg367635
2020-04-29 09:23:39tomMoralsetpull_requests: + pull_request19111
2020-04-29 08:20:15tomMoralsetmessages: + msg367632
2020-04-29 08:02:04pitrousetassignee: aeros -> vstinner
messages: + msg367631
2020-04-29 03:05:55aerossetmessages: + msg367617
2020-04-29 03:00:37aerossetmessages: + msg367616
2020-04-29 02:39:57corona10setnosy: + corona10
2020-04-29 01:34:52vstinnersetmessages: + msg367608
2020-04-29 01:32:14vstinnersetmessages: + msg367605
2020-04-28 17:00:11vstinnersetmessages: + msg367544
2020-04-28 16:43:01vstinnersetmessages: + msg367543
2020-04-28 16:38:42vstinnersetmessages: + msg367542
2020-04-28 16:37:48vstinnersetpull_requests: + pull_request19082
2020-04-28 16:09:22pitrousetmessages: + msg367540
2020-04-28 16:06:39vstinnersetmessages: + msg367539
2020-04-28 15:58:40vstinnersetpull_requests: + pull_request19080
2020-04-28 09:21:17aerossetmessages: + msg367513
2020-04-28 09:10:32aerossetmessages: + msg367512
2020-04-28 08:44:37aerossetpull_requests: + pull_request19072
2020-04-28 08:08:36pitrousetmessages: + msg367508
2020-04-28 07:05:56tomMoralsetnosy: + tomMoral
messages: + msg367503
2020-04-28 03:16:43aerossetmessages: + msg367487
2020-04-28 03:09:33aerossetassignee: aeros
messages: + msg367486
2020-04-28 01:08:28aerossetnosy: + pitrou
2020-04-28 00:52:59aerossetnosy: + aeros
2020-04-27 22:05:35vstinnersetmessages: + msg367465
2020-04-27 21:48:28vstinnersetmessages: + msg367463
2020-04-27 21:46:19vstinnersetmessages: + msg367462
2020-04-27 21:26:50vstinnersetmessages: + msg367460
2020-04-27 21:15:57vstinnersetmessages: + msg367458
2020-04-27 20:34:21vstinnersetmessages: + msg367455
2020-04-27 20:09:04vstinnersetmessages: + msg367453
2020-04-27 19:54:18vstinnersetmessages: + msg367451
2020-04-27 19:46:17vstinnersetmessages: + msg367450
2020-04-27 19:36:58vstinnersetmessages: + msg367449
2020-04-27 19:06:22vstinnersetkeywords: + patch
stage: patch review
pull_requests: + pull_request19061
2020-04-27 16:02:06vstinnersetmessages: + msg367434
2020-03-18 18:31:26vstinnersetmessages: + msg364552
2020-03-18 18:24:03vstinnersetmessages: + msg364546
2020-03-17 17:02:14vstinnersettitle: test_concurrent_futures: ProcessPoolSpawnExecutorDeadlockTest.test_crash() fails with -> test_concurrent_futures: ProcessPoolSpawnExecutorDeadlockTest.test_crash() fails with OSError: [Errno 9] Bad file descriptor
2020-03-17 17:01:54vstinnercreate