test_concurrent_futures: ProcessPoolSpawnExecutorDeadlockTest.test_crash() fails with OSError: [Errno 9] Bad file descriptor #84176
Comments
AMD64 Ubuntu Shared 3.x: test_crash (test.test_concurrent_futures.ProcessPoolSpawnExecutorDeadlockTest) ... Stderr: (...) ====================================================================== Traceback (most recent call last):
File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_concurrent_futures.py", line 1119, in test_crash
executor.shutdown(wait=True)
File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/concurrent/futures/process.py", line 721, in shutdown
self._executor_manager_thread_wakeup.wakeup()
File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/concurrent/futures/process.py", line 93, in wakeup
self._writer.send_bytes(b"")
File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/multiprocessing/connection.py", line 205, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/multiprocessing/connection.py", line 416, in _send_bytes
self._send(header + buf)
File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/multiprocessing/connection.py", line 373, in _send
n = write(self._handle, buf)
OSError: [Errno 9] Bad file descriptor Stdout: Stderr: -- On the same build, test_concurrent_futures timed out after 15 min, while running test_ressources_gced_in_workers(): 0:29:08 load avg: 1.46 Re-running test_concurrent_futures in verbose mode Thread 0x00007f38bff67700 (most recent call first): Thread 0x00007f38c7128640 (most recent call first): command timed out: 1200 seconds without output running [b'make', b'buildbottest', b'TESTOPTS=-j2 --junit-xml test-results.xml ${BUILDBOT_TESTOPTS}', b'TESTPYTHONOPTS=', b'TESTTIMEOUT=900'], attempting to kill |
Same bug on AMD64 FreeBSD Non-Debug 3.x: ====================================================================== Traceback (most recent call last):
File "/usr/home/buildbot/python/3.x.koobs-freebsd-9e36.nondebug/build/Lib/test/test_concurrent_futures.py", line 1119, in test_crash
executor.shutdown(wait=True)
File "/usr/home/buildbot/python/3.x.koobs-freebsd-9e36.nondebug/build/Lib/concurrent/futures/process.py", line 721, in shutdown
self._executor_manager_thread_wakeup.wakeup()
File "/usr/home/buildbot/python/3.x.koobs-freebsd-9e36.nondebug/build/Lib/concurrent/futures/process.py", line 93, in wakeup
self._writer.send_bytes(b"")
File "/usr/home/buildbot/python/3.x.koobs-freebsd-9e36.nondebug/build/Lib/multiprocessing/connection.py", line 205, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/usr/home/buildbot/python/3.x.koobs-freebsd-9e36.nondebug/build/Lib/multiprocessing/connection.py", line 416, in _send_bytes
self._send(header + buf)
File "/usr/home/buildbot/python/3.x.koobs-freebsd-9e36.nondebug/build/Lib/multiprocessing/connection.py", line 373, in _send
n = write(self._handle, buf)
OSError: [Errno 9] Bad file descriptor Stdout: Stderr: ---------------------------------------------------------------------- |
Oh, test_crash failed twice, but not on the same test case:
The second failure was when test_concurrent_futures was re-run sequentially. |
See also bpo-30966 "Add multiprocessing.SimpleQueue.close()". |
AMD64 Ubuntu Shared 3.x: ====================================================================== Traceback (most recent call last):
File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_concurrent_futures.py", line 542, in test_shutdown_no_wait
executor.shutdown(wait=False)
File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/concurrent/futures/process.py", line 724, in shutdown
self._executor_manager_thread_wakeup.wakeup()
File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/concurrent/futures/process.py", line 80, in wakeup
self._writer.send_bytes(b"")
File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/multiprocessing/connection.py", line 188, in send_bytes
self._check_closed()
File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/multiprocessing/connection.py", line 141, in _check_closed
raise OSError("handle is closed")
OSError: handle is closed (...) ====================================================================== Traceback (most recent call last):
File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_concurrent_futures.py", line 542, in test_shutdown_no_wait
executor.shutdown(wait=False)
File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/concurrent/futures/process.py", line 724, in shutdown
self._executor_manager_thread_wakeup.wakeup()
File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/concurrent/futures/process.py", line 80, in wakeup
self._writer.send_bytes(b"")
File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/multiprocessing/connection.py", line 205, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/multiprocessing/connection.py", line 416, in _send_bytes
self._send(header + buf)
File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/multiprocessing/connection.py", line 373, in _send
n = write(self._handle, buf)
OSError: [Errno 9] Bad file descriptor |
I pushed commit 1a27501: test_shutdown_deadlock_pickle() still relies on the queue being closed implicitly. Queue created at: (...) |
AMD64 Fedora Stable Clang Installed 3.x: 0:04:21 load avg: 1.29 [423/423/1] test_concurrent_futures failed (2 min 39 sec)
Warning -- threading_cleanup() failed to cleanup -1 threads (count: 0, dangling: 3)
Warning -- Dangling thread: <_MainThread(MainThread, started 139673296918336)>
Warning -- Dangling thread: <Thread(QueueFeederThread, stopped daemon 139673045362432)>
Warning -- Dangling thread: <_ExecutorManagerThread(Thread-145, stopped 139673053914880)>
Warning -- threading_cleanup() failed to cleanup 0 threads (count: 0, dangling: 3)
Warning -- Dangling thread: <_MainThread(MainThread, started 139673296918336)>
Warning -- Dangling thread: <Thread(QueueFeederThread, stopped daemon 139673045362432)>
Warning -- Dangling thread: <_ExecutorManagerThread(Thread-145, stopped 139673053914880)>
/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 5 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
Warning -- multiprocessing.process._dangling was modified by test_concurrent_futures
Warning -- threading._dangling was modified by test_concurrent_futures
test_cancel (test.test_concurrent_futures.FutureTests) ... ok
test_cancelled (test.test_concurrent_futures.FutureTests) ... ok
test_done (test.test_concurrent_futures.FutureTests) ... ok
(...)
test_first_exception_some_already_complete (test.test_concurrent_futures.ThreadPoolWaitTests) ... 1.60s ok
test_pending_calls_race (test.test_concurrent_futures.ThreadPoolWaitTests) ... 0.11s ok
test_timeout (test.test_concurrent_futures.ThreadPoolWaitTests) ... 6.11s ok
Traceback (most recent call last):
File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/util.py", line 300, in _run_finalizers
finalizer()
File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/util.py", line 224, in __call__
res = self._callback(*self._args, **self._kwargs)
File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/synchronize.py", line 87, in _cleanup
sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
Traceback (most recent call last):
File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/util.py", line 300, in _run_finalizers
finalizer()
File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/util.py", line 224, in __call__
res = self._callback(*self._args, **self._kwargs)
File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/synchronize.py", line 87, in _cleanup
sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
Traceback (most recent call last):
File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/util.py", line 300, in _run_finalizers
finalizer()
File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/util.py", line 224, in __call__
res = self._callback(*self._args, **self._kwargs)
File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/synchronize.py", line 87, in _cleanup
sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
Traceback (most recent call last):
File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/util.py", line 300, in _run_finalizers
finalizer()
File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/util.py", line 224, in __call__
res = self._callback(*self._args, **self._kwargs)
File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/synchronize.py", line 87, in _cleanup
sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
Traceback (most recent call last):
File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/util.py", line 300, in _run_finalizers
finalizer()
File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/util.py", line 224, in __call__
res = self._callback(*self._args, **self._kwargs)
File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/synchronize.py", line 87, in _cleanup
sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory ====================================================================== Traceback (most recent call last):
File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/test/test_concurrent_futures.py", line 130, in tearDown
self.executor.shutdown(wait=True)
File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/concurrent/futures/process.py", line 724, in shutdown
self._executor_manager_thread_wakeup.wakeup()
File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/concurrent/futures/process.py", line 80, in wakeup
self._writer.send_bytes(b"")
File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/connection.py", line 205, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/connection.py", line 416, in _send_bytes
self._send(header + buf)
File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/connection.py", line 373, in _send
n = write(self._handle, buf)
OSError: [Errno 9] Bad file descriptor Ran 193 tests in 159.805s FAILED (errors=1, skipped=6) == Tests result: FAILURE == |
x86 Gentoo Installed with X 3.x: test_del_shutdown (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolShutdownTest) ... Warning -- Unraisable exception
Exception ignored in: <function _ExecutorManagerThread.__init__.<locals>.weakref_cb at 0xb5067898>
Traceback (most recent call last):
File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.installed/build/target/lib/python3.9/concurrent/futures/process.py", line 281, in weakref_cb
thread_wakeup.wakeup()
File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.installed/build/target/lib/python3.9/concurrent/futures/process.py", line 80, in wakeup
self._writer.send_bytes(b"")
File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.installed/build/target/lib/python3.9/multiprocessing/connection.py", line 205, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.installed/build/target/lib/python3.9/multiprocessing/connection.py", line 416, in _send_bytes
self._send(header + buf)
File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.installed/build/target/lib/python3.9/multiprocessing/connection.py", line 373, in _send
n = write(self._handle, buf)
OSError: [Errno 9] Bad file descriptor 0.04s ok |
ERROR: test_killed_child (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest) It seems that Connection.close() was called while Connection._send() was still running. I added debug logs:
|
The connection was closed by terminate_broken() called by _ExecutorManagerThread.run() thread: test_killed_child (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest) ... close handle 4 |
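The two error messages in the tracebacks above can be reproduced in isolation. The following is a minimal sketch, not the executor code itself: the helper names send_after_close() and write_to_closed_fd() are hypothetical. It assumes that "handle is closed" is what a Connection reports when close() finished before send_bytes() started, while "[Errno 9] Bad file descriptor" is what the underlying write reports when the descriptor is already gone by the time the write happens.

```python
import errno
import multiprocessing as mp
import os


def send_after_close():
    """Send on a Connection that was closed beforehand (hypothetical helper)."""
    reader, writer = mp.Pipe(duplex=False)
    writer.close()
    try:
        # send_bytes() checks the closed flag before touching the descriptor.
        writer.send_bytes(b"")
    except OSError as exc:
        return str(exc)
    finally:
        reader.close()


def write_to_closed_fd():
    """Write to a raw descriptor that was closed beforehand (hypothetical helper)."""
    r, w = os.pipe()
    os.close(w)
    try:
        # No Python-level closed flag protects this call, so the OS rejects it.
        os.write(w, b"x")
    except OSError as exc:
        return exc.errno
    finally:
        os.close(r)


if __name__ == "__main__":
    print(send_after_close())
    print(write_to_closed_fd() == errno.EBADF)
```

In the race discussed here, the closed check in send_bytes() presumably passes and the descriptor is closed by another thread before write() runs, which would explain why the same code path produces either message depending on timing.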
terminate_broken() method was added by: commit 0e89076
|
The patch below makes this test failure more likely:
diff --git a/Lib/multiprocessing/connection.py b/Lib/multiprocessing/connection.py
index 510e4b5aba..63518e55d9 100644
--- a/Lib/multiprocessing/connection.py
+++ b/Lib/multiprocessing/connection.py
@@ -370,6 +370,7 @@ class Connection(_ConnectionBase):
     def _send(self, buf, write=_write):
         remaining = len(buf)
         while True:
+            time.sleep(0.050)
             n = write(self._handle, buf)
             remaining -= n
             if remaining == 0:
|
It seems like test_killed_child() race condition was introduced by: commit a5cbab5 (refs/bisect/bad)
|
I can't be certain for the other failures, but I'm currently exploring a potential solution for addressing the
diff --git a/Lib/concurrent/futures/process.py b/Lib/concurrent/futures/process.py
index 8e9b69a8f0..9bf073fc34 100644
--- a/Lib/concurrent/futures/process.py
+++ b/Lib/concurrent/futures/process.py
@@ -68,21 +68,30 @@ class _ThreadWakeup:
     def __init__(self):
         self._closed = False
         self._reader, self._writer = mp.Pipe(duplex=False)
+        # Used to ensure pipe is not closed while sending or receiving bytes
+        self._not_running = threading.Event()
+        # Initialize event as True
+        self._not_running.set()
     def close(self):
         if not self._closed:
+            self._not_running.wait()
             self._closed = True
             self._writer.close()
             self._reader.close()
     def wakeup(self):
         if not self._closed:
+            self._not_running.clear()
             self._writer.send_bytes(b"")
+            self._not_running.set()
     def clear(self):
         if not self._closed:
+            self._not_running.clear()
             while self._reader.poll():
                 self._reader.recv_bytes()
+            self._not_running.set()
From using Victor's method of replicating the failure with inserting a |
After writing the above out and a bit of further consideration, I think it might make more sense to wait for the event after setting Thoughts? |
Sorry, I just saw this. It seems that I introduced this regression. One of the goals of having a From the failures, it seems to be a race condition between |
How about the following (untested):
diff --git a/Lib/concurrent/futures/process.py b/Lib/concurrent/futures/process.py
index 8e9b69a8f0..c0c2eb3032 100644
--- a/Lib/concurrent/futures/process.py
+++ b/Lib/concurrent/futures/process.py
@@ -66,23 +66,29 @@ _global_shutdown = False
 class _ThreadWakeup:
     def __init__(self):
-        self._closed = False
         self._reader, self._writer = mp.Pipe(duplex=False)
     def close(self):
-        if not self._closed:
-            self._closed = True
-            self._writer.close()
-            self._reader.close()
+        r, w = self._reader, self._writer
+        self._reader = self._writer = None
+        if r is not None:
+            r.close()
+            w.close()
     def wakeup(self):
-        if not self._closed:
+        try:
             self._writer.send_bytes(b"")
+        except AttributeError:
+            # Closed
+            pass
     def clear(self):
-        if not self._closed:
+        try:
             while self._reader.poll():
                 self._reader.recv_bytes()
+        except AttributeError:
+            # Closed
+            pass
 def _python_exit():
|
Oops, it seems that I opened PR-19751 a bit preemptively. When I get the chance, I'll see if Antoine's implementation can address the failures and do some comparisons. |
I decided to close PR-19751, both because it does not correctly address the race condition (due to an oversight on my part) and because it would add substantial overhead to _ThreadWakeup. Instead, I agree that we should explore a non-locking solution. |
Antoine Pitrou: "How about the following (untested): (...)" Using Antoine's patch, test_killed_child() still fails (I used my msg367463 patch to make the failure more likely). |
With the same traceback and error message? |
With my msg367463 patch (add sleep), test_cancel_futures() fails. Example: ====================================================================== Traceback (most recent call last):
File "/home/vstinner/python/master/Lib/test/test_concurrent_futures.py", line 353, in test_cancel_futures
self.assertTrue(len(cancelled) >= 35, msg=f"{len(cancelled)=}")
AssertionError: False is not true : len(cancelled)=0 |
Thomas Moreau: "One solution would be to use the I wrote a conservative PR 19760 which always locks ProcessPoolExecutor._shutdown_lock while accessing _ThreadWakeup. PR 19760 fixes test_killed_child(): it doesn't fail anymore, even with my msg367463 patch (add sleep). |
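The locking idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the class name LockedWakeup is hypothetical, and the lock lives inside the class to keep the example self-contained, whereas PR 19760 is described above as holding ProcessPoolExecutor._shutdown_lock around each access to _ThreadWakeup.

```python
import multiprocessing as mp
import threading


class LockedWakeup:
    """Hypothetical stand-in for _ThreadWakeup with every access serialized."""

    def __init__(self):
        # Assumption: a private lock here; the real change reuses the
        # executor's shutdown lock instead.
        self._lock = threading.Lock()
        self._closed = False
        self._reader, self._writer = mp.Pipe(duplex=False)

    def close(self):
        with self._lock:
            if not self._closed:
                self._closed = True
                self._writer.close()
                self._reader.close()

    def wakeup(self):
        # The closed check and the write happen under the same lock,
        # so close() cannot run between them.
        with self._lock:
            if not self._closed:
                self._writer.send_bytes(b"")

    def clear(self):
        with self._lock:
            if not self._closed:
                while self._reader.poll():
                    self._reader.recv_bytes()
```

With this layout, wakeup() and clear() become silent no-ops once close() has run, at the cost of taking a lock on every call, which is the overhead raised later in the thread.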
The test uses sleep() as a synchronization primitive: executor.submit(time.sleep, .1). That's bad, but it doesn't *have to* be fixed now. My msg367463 patch adds an artificial sleep: the test looks fine in practice. I prefer to wait until it fails on a buildbot worker before spending time to make the test more reliable. |
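For illustration, here is one way a cancellation test could avoid sleep() as a synchronization primitive. It is a sketch, not the actual test: it uses ThreadPoolExecutor rather than ProcessPoolExecutor to stay self-contained, the names count_cancelled, started and gate are hypothetical, and it assumes Python 3.9 or later for the cancel_futures argument.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def count_cancelled(total=10, workers=2):
    """Return how many of `total` futures get cancelled at shutdown."""
    started = threading.Semaphore(0)  # released once per task that begins running
    gate = threading.Event()          # keeps running tasks blocked until set

    def task():
        started.release()
        gate.wait()

    executor = ThreadPoolExecutor(max_workers=workers)
    futures = [executor.submit(task) for _ in range(total)]

    # Wait until every worker thread is provably busy, instead of sleeping.
    for _ in range(workers):
        started.acquire()

    # Only the tasks still sitting in the queue can be cancelled.
    executor.shutdown(wait=False, cancel_futures=True)
    gate.set()
    executor.shutdown(wait=True)
    return sum(f.cancelled() for f in futures)


if __name__ == "__main__":
    print(count_cancelled())
```

Because the main thread waits for both workers to report in before shutting down, the number of cancelled futures does not depend on timing: with the default arguments, eight tasks are still queued and get cancelled.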
I'm still getting more and more buildbot emails about test_concurrent_futures, so I merged my PR 19760 to fix buildbots. Please revert or modify my PR 19760 if you have a better approach, but please check that test_killed_child() and ProcessPoolForkExecutorDeadlockTest tests don't fail with my msg367463 patch. I would still appreciate a post-commit review of my change, since I don't know the concurrent.futures code well: bpo-39995: Fix concurrent.futures _ThreadWakeup (GH-19760) |
I am a bit concerned about the performance implication of accessing the thread wakeup through a lock in the call queue, but for now I think it's reasonable to address the buildbot failure with a locking solution while we try to find a better one. I'm not certain if we'll be able to find one that correctly addresses the failures with zero additional locking, but I think we may be able to reduce the number of times we use the lock compared to the implementation in #63959. I'll spend some time exploring that as I find the time to, and report back with any significant findings. |
I looked at the change and it seemed ok to me. Perhaps Thomas can give it a look too. |
I think this is a reasonable way to move on. Some of the locks can probably be removed, but this needs careful investigation, and in the meantime it hinders everyone. Thanks, Victor, for the fast fix! To me, an interesting observation is that the failure seems to happen only when the executor is in a broken state. If we can find a way to make the behavior more conservative in these states (which do not impact perf), that would be nice. I will try to look at it today and see if I can remove some of the locks while still not failing with Victor's patch. We can probably remove the lock on |
I did GH 19788 with a few modifications. There is only one lock that seems to matter for perf, and I actually added another one (the one in _python_exit, which necessitates another bug fix for the fork context). I did not benchmark to see whether it was worth it in terms of perf. |
"test_concurrent_futures: ProcessPoolSpawnExecutorDeadlockTest.test_crash() fails with OSError: [Errno 9] Bad file descriptor" I haven't seen this failure recently, so I'm closing the issue. Since changes were pushed, I'm marking the issue as fixed. If someone has ideas to enhance the code, I suggest opening a new, more specific issue. I consider the initial issue (buildbot failure) as fixed. |
Thanks for closing up the issue, Victor :) |