concurrent.futures deadlock #80047
The attached test program hangs eventually (it may need a few thousand iterations). Tested with Python 3.7.2 on Linux, amd64.
I've only got 3.7.1 Ubuntu bash on Windows (also amd64) immediately available, but I'm not seeing a hang, nor is there any obvious memory leak that might eventually lead to problems (memory regularly drops back to under 10 MB shared, 24 KB private working set). I modified your code to add a sys.stdout.flush() after the write so it would actually echo the dots as they were written instead of waiting for a few thousand of them to build up in the buffer, but otherwise it's the same code. Are you sure you're actually hanging, and it's not just the output getting buffered?
You're right that sys.stdout.flush() is missing in my code; but on Linux it doesn't make a big difference, because multiprocessing flushes stdout before fork()ing. And yes, it really hangs.
This seems related to https://bugs.python.org/issue35809
Could you use gdb/lldb to attach to the process hanging and give us a stack trace? |
There are two processes running (parent and child) when the thing hangs. |
@jwilk: thanks for creating cf-deadlock.py

I can replicate the test program hang on Fedora 29 with python3-3.7.2-4.fc29.x86_64. The test program hasn't yet hung on Fedora 29 with older packages.

My interest is due to the fact that the libreswan.org test suite has started to hang and we don't know why. It might well be this bug.
I've filed a Fedora bug report that points to this one: <https://bugzilla.redhat.com/show_bug.cgi?id=1691434>
Any update on this issue? I don't understand why the example hangs. |
I've attached a variation on cf-deadlock.py that, should nothing happen for 2 minutes, will kill itself. Useful with git bisect. |
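The attachment itself isn't included in the migrated thread; a minimal sketch of such a watchdog, assuming a POSIX system with SIGALRM (the names here are illustrative, not the actual attachment):

```python
import signal
import sys

TIMEOUT = 120  # seconds without progress before the script kills itself

def on_alarm(signum, frame):
    # No progress for TIMEOUT seconds: assume a deadlock and exit non-zero,
    # which is what a `git bisect run` script needs to mark a revision bad.
    sys.stderr.write("no progress for %d seconds, aborting\n" % TIMEOUT)
    sys.exit(1)

signal.signal(signal.SIGALRM, on_alarm)

def heartbeat():
    # Re-arm the watchdog; call this after every successful iteration.
    signal.alarm(TIMEOUT)
```

Each call to `heartbeat()` pushes the deadline out another 2 minutes; if the main loop stops making progress, the pending alarm fires and the process exits with status 1.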
I'm seeing cf-deadlock-alarm.py hang on vanilla Python 3.7.[0123] with:

Linux 5.0.5-100.fc28.x86_64 #1 SMP Wed Mar 27 22:16:29 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Can anyone reproduce this? I also wonder if this is connected to bpo-6721, where a recent "fix" made things worse - the Fedora versions that work for libreswan don't have the "fix".
More info from adding a faulthandler ...
```
Thread 0x00007f1ce7fff700 (most recent call first):
Thread 0x00007f1cec917700 (most recent call first):
Current thread 0x00007f1cfd9486c0 (most recent call first):
```
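For the record, dumps like the above can be produced with the stdlib faulthandler module; a sketch (the exact invocation used in this thread isn't shown):

```python
import faulthandler
import signal
import sys

# Dump the Python tracebacks of all threads to stderr when the process
# receives SIGUSR1, so a hung process can be inspected with
# `kill -USR1 <pid>` without attaching gdb.
faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)

# Alternative: schedule an automatic dump if the process is still running
# after a timeout, then cancel it once the run completes normally.
faulthandler.dump_traceback_later(timeout=120, exit=False)
faulthandler.cancel_dump_traceback_later()
```

The `Thread 0x... (most recent call first):` headers above are exactly what `faulthandler` prints for each thread.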
Here's the children; yes, there are somehow 4 children sitting around. Hopefully this is enough to figure out where things deadlock.

```
29970  8752  8752 29970 pts/6   8752 Sl+ 1000  1:00 |  |  \_ ./v3.7.3/bin/python3 cf-deadlock.py
8975 Current thread 0x00007f3be65126c0 (most recent call first):
8976 Current thread 0x00007f3be65126c0 (most recent call first):
8977 Current thread 0x00007f3be65126c0 (most recent call first):
8978 Current thread 0x00007f3be65126c0 (most recent call first):
```
Reverting 3b69993 makes the problem go away.
@hroncok see comment msg339370

Vanilla 3.7.0 (re-confirmed) didn't contain the change, nor did 3.6.8 (ok, that isn't vanilla), but both can hang using the test. It can take a while and, subjectively, it seems to depend on machine load. I've even struggled to get 3.7.3 to fail without load. Presumably there's a race, and grinding the test machine into the ground increases the odds of it happening. The patch for bpo-6721 could be causing many things, but two come to mind:
My hunch is the latter as the stack dumps look nothing like those I analyzed for bpo-36533 (see messages msg339454 and msg339458). |
Gregory: It seems like 3b69993 is causing deadlocks, which is not a good thing. What do you think of reverting this change?
At least 2 projects were broken by the logging change: libreswan and Anaconda.
That's the libreswan project. Last year, there was another regression in Anaconda, the Fedora installer: the workaround/fix was to revert 3b69993 in Python. Anaconda has since been modified, and we were able to revert the revert of 3b69993 :-) I'm not sure what the Anaconda fix was. Maybe this change?
(disclaimer: I'm mashing my high-level backtraces in with @jwilk's low-level backtraces)

The Python backtrace shows the deadlocked process called 'f' which then 'called':

The low-level backtrace shows it was trying to acquire a lock (no surprises there); but the surprise is that it is inside of dlopen() trying to load '_ctypes...so'!

```
#11 __dlopen (file=file@entry=0x7f398da4b050 "_ctypes.cpython-37m-x86_64-linux-gnu.so", mode=<optimized out>) at dlopen.c:87
```

and the lock in question (assuming my sources roughly match the above) seems to be:

```
/* We modify the list of loaded objects. */
```

Presumably a thread in the parent held this lock at the time of the fork. If one of the other children also has the lock pre-acquired, then this is confirmed (unfortunately, not having the lock won't rebut the theory). So, any guesses as to what dl-related operation was being performed by the parent?

----

I don't think the remaining processes are involved (and I've probably got 4 in total because my machine has 4 cores).

8976 - this acquired the multi-process semaphore and is blocked in '_recv' awaiting further instructions
Please do not blindly revert that. See my PR in https://bugs.python.org/issue36533 which is specific to this "issue" with logging. |
Here's a possible stack taken during the fork():

```
Thread 1 "python3" hit Breakpoint 1, 0x00007ffff7124734 in fork () from /lib64/libc.so.6
Thread 1814 (Thread 0x7fffe69d5700 (LWP 23574)):
Thread 1 (Thread 0x7ffff7fca080 (LWP 20524)):
```

where, in my source code, dl_iterate_phdr() starts with something like:

```
/* Make sure nobody modifies the list of loaded objects. */
```

i.e., when the fork occurs, the non-fork thread has acquired dl_load_write_lock - the same lock that the child will later try to acquire (and hang on). No clue as to what that thread is doing, though; other than it looks like it is trying to generate a backtrace?
run ProcessPoolExecutor with one fixed child (override default of #cores)
script to capture the stack backtrace at the time of fork; the last backtrace printed will be for the hang
I am unable to get cf-deadlock.py to hang on my own builds of pure CPython 3.7.2+ d7cb203 or 3.6.8+ be77fb7 (versions I had in a local git clone). Which specific Python builds are seeing the hang? Which specific platform/distro version? "3.7.2" isn't enough; if you are using a distro-supplied interpreter, please try and reproduce this with a build from the CPython tree itself. Distros always apply their own patches to their own interpreters.

...

Do realize that while working on this it is fundamentally *impossible* per POSIX for os.fork() to be safely used at the Python level in a process also using pthreads. That this _ever_ appeared to work is a pure accident of implementations of the underlying libc, malloc, system libraries, and kernel behaviors. POSIX considers it undefined behavior. Nothing done in CPython can avoid that. Any "fix" for these kinds of issues is merely working around the inevitable, which will re-occur.

concurrent.futures.ProcessPoolExecutor uses multiprocessing for its process management. As of 3.7, ProcessPoolExecutor accepts a mp_context parameter to specify the multiprocessing start method. Alternatively, the default appears to be controllable. Use the 'spawn' start method and the problem should go away, as it'll no longer be misusing os.fork().

You _might_ be able to get the 'forkserver' start method to work, but only reliably if you make sure the forkserver is spawned _before_ any threads in the process (such as ProcessPoolExecutor's own queue management thread - which appears to be spawned upon the first call to .submit()).
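A sketch of that mp_context suggestion, assuming Python 3.7+ (the `work` function is illustrative):

```python
import concurrent.futures
import multiprocessing

def work(x):
    return x * x

if __name__ == "__main__":
    # 'spawn' starts each worker as a fresh interpreter (fork + exec), so no
    # lock held by another thread in the parent is inherited mid-acquisition.
    ctx = multiprocessing.get_context("spawn")
    with concurrent.futures.ProcessPoolExecutor(mp_context=ctx) as executor:
        print(executor.submit(work, 7).result())  # prints 49
```

Note that with 'spawn' the submitted callable must be importable from the child (a module-level function, not a lambda), and the pool creation must sit under the `__main__` guard.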
We're discussing vanilla Python; for instance, v3.7.0 is: git clone .../cpython (my 3.6.x wasn't vanilla, but I clearly stated that). Like I also mentioned, loading down the machine also helps. Try something like running #cores*2 copies of the script in parallel?
@gregory.p.smith, I'm puzzled by your references to POSIX and/or os.fork(). The code in question looks like:

```python
import concurrent.futures
import sys

def f():
    import ctypes

while True:
    with concurrent.futures.ProcessPoolExecutor() as executor:
        ftr = executor.submit(f)
        ftr.result()
```

which, to me, looks like pure Python. Are you saying that this code can't work on GNU/Linux systems?
concurrent.futures.ProcessPoolExecutor uses both multiprocessing and threading. multiprocessing defaults to using os.fork(). |
So:

#1 we've a bug: the single-threaded ProcessPoolExecutor test program should work 100% reliably - it does not

#2 we've a cause: ProcessPoolExecutor is implemented internally using an unfortunate combination of fork and threads, which is causing the deadlock

#3 we've got a workaround - something like:

```python
ProcessPoolExecutor(mp_context=multiprocessing.get_context('spawn'))
```

(note that the context must be passed as the mp_context keyword; the first positional parameter is max_workers) but I'm guessing; the documentation is scant. As for a fix, maybe:
I doubt it is single-threaded; the .submit() method appears to spawn a thread internally.
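That internal thread can be observed from Python; a sketch (the helper thread's exact name varies across Python versions):

```python
import concurrent.futures
import threading

before = {t.ident for t in threading.enumerate()}
executor = concurrent.futures.ProcessPoolExecutor(max_workers=1)
assert executor.submit(abs, -1).result() == 1

# After the first submit, at least one new helper thread (queue/executor
# management) is running inside the parent process alongside the workers.
new_threads = [t for t in threading.enumerate() if t.ident not in before]
print([t.name for t in new_threads])
executor.shutdown()
```

So even the "single-threaded" test program is multi-threaded by the time ProcessPoolExecutor forks its workers, which is exactly the fork-plus-threads combination under discussion.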
FYI, I'm getting a similar deadlock in a child Python process which is stuck on locking a mutex from the dl library. See the attached stack. I'm not using concurrent.futures, however; the parent Python process is a test driver that uses threading.Thread and subprocess.Popen to spawn new processes... I'm not using os.fork(). This occurred on Arch Linux with Python 3.7.4.
The thread_run() function (previously "t_bootstrap()"), the low-level C function that runs a thread started with _thread.start_new_thread(), no longer calls PyThread_exit_thread(): see bpo-44434. I'm able to reproduce the issue using the attached cf-deadlock.py with Python 3.8:
In the main branch (with the bpo-44434 fix), I can no longer reproduce the issue: I ran cf-deadlock.py in 4 terminals in parallel with "./python -m test -r -j2" in a 5th terminal for 5 minutes without a hang. On Python 3.8, I reproduced the issue in less than 1 minute. Can someone please confirm that the issue is now fixed? Can we mark this issue as a duplicate of bpo-44434?
I can no longer reproduce the bug with Python from git. |
Great! I'm closing the issue.