
concurrent.futures deadlock #80047

Closed

jwilk mannequin opened this issue Jan 31, 2019 · 34 comments
Labels
3.7 (EOL) end of life · 3.8 only security fixes · stdlib Python modules in the Lib dir · type-bug An unexpected behavior, bug, or error

Comments

jwilk mannequin commented Jan 31, 2019

BPO 35866
Nosy @gpshead, @brianquinlan, @pitrou, @vstinner, @jwilk, @MojoVampire, @cagney, @hroncok, @pablogsal
Files
  • cf-deadlock.py
  • gdb-bt-parent.txt
  • gdb-bt-child.txt
  • cf-deadlock-alarm.py: Version of cf-deadlock.py that tries to always exit (eventually)
  • cf-deadlock-1.py
  • gdb.sh
  • stack-python.txt
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2021-07-03.21:52:31.634>
    created_at = <Date 2019-01-31.10:05:11.069>
    labels = ['3.7', '3.8', 'type-bug', 'library']
    title = 'concurrent.futures deadlock'
    updated_at = <Date 2021-07-03.21:53:10.556>
    user = 'https://github.com/jwilk'

    bugs.python.org fields:

    activity = <Date 2021-07-03.21:53:10.556>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2021-07-03.21:52:31.634>
    closer = 'vstinner'
    components = ['Library (Lib)']
    creation = <Date 2019-01-31.10:05:11.069>
    creator = 'jwilk'
    dependencies = []
    files = ['48090', '48223', '48224', '48242', '48271', '48272', '48611']
    hgrepos = []
    issue_num = 35866
    keywords = []
    message_count = 34.0
    messages = ['334618', '334648', '334652', '338376', '338377', '338402', '338501', '338549', '339334', '339361', '339370', '339451', '339464', '340263', '340278', '340343', '340344', '340358', '340359', '340360', '340362', '340430', '340431', '340432', '340442', '340451', '340456', '340459', '340464', '340468', '352707', '396675', '396910', '396929']
    nosy_count = 11.0
    nosy_names = ['gregory.p.smith', 'bquinlan', 'pitrou', 'vstinner', 'bentoi', 'jwilk', 'josh.r', 'cagney', 'hroncok', 'pablogsal', 'hugh']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue35866'
    versions = ['Python 3.7', 'Python 3.8']

    jwilk mannequin commented Jan 31, 2019

    The attached test program eventually hangs (it may need a few thousand iterations).

    Tested with Python v3.7.2 on Linux, amd64.

    jwilk mannequin added the 3.7 (EOL), stdlib, and type-bug labels on Jan 31, 2019
    MojoVampire mannequin commented Feb 1, 2019

    I've only got 3.7.1 Ubuntu bash on Windows (also amd64) immediately available, but I'm not seeing a hang, nor is there any obvious memory leak that might eventually lead to problems (memory regularly drops back to under 10 MB shared, 24 KB private working set). I modified your code to add a sys.stdout.flush() after the write so it would actually echo the dots as they were written instead of waiting for a few thousand of them to build up in the buffer, but otherwise it's the same code.

    Are you sure you're actually hanging, and it's not just the output getting buffered?
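
    For reference, the flush tweak is presumably just this (a sketch, not the exact diff):

        sys.stdout.write('.')
        sys.stdout.flush()  # echo each dot immediately instead of letting stdio buffer it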

    jwilk mannequin commented Feb 1, 2019

    You're right that sys.stdout.flush() is missing in my code; but on Linux it doesn't make a big difference, because multiprocessing flushes stdout before fork()ing.

    And yes, it really hangs.

    pablogsal (Member) commented:

    This seems related to https://bugs.python.org/issue35809

    pablogsal (Member) commented:

    Could you use gdb/lldb to attach to the hanging process and give us a stack trace?

    jwilk mannequin commented Mar 19, 2019

    There are two processes running (parent and child) when the thing hangs.
    I'm attaching GDB backtraces for both.

    hugh mannequin commented Mar 20, 2019

    @jwilk: thanks for creating cf-deadlock.py

    I can replicate the test program hang on Fedora 29 with python3-3.7.2-4.fc29.x86_64

    The test program hasn't yet hung on Fedora 29 with older packages, in particular
    python3-3.7.1-4.fc29.x86_64

    My interest is due to the fact that the libreswan.org test suite has started to hang and we don't know why. It might well be this bug.

    hugh mannequin commented Mar 21, 2019

    I've filed a Fedora bug report that points to this one: <https://bugzilla.redhat.com/show_bug.cgi?id=1691434>

    vstinner commented Apr 2, 2019

    Any update on this issue? I don't understand why the example hangs.

    cagney mannequin commented Apr 2, 2019

    I've attached a variation on cf-deadlock.py that, should nothing happen for 2 minutes, will kill itself. Useful with git bisect.
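
    A self-contained reconstruction of that variant, pieced together from the excerpt quoted in a later comment (the imports and the final alarm reset are assumptions):

        import concurrent.futures
        import signal
        import sys

        def f():
            import ctypes

        for i in range(0, 50):
            sys.stdout.write("\r%d" % i)
            sys.stdout.flush()
            # SIGALRM's default disposition terminates the process, so re-arming
            # the alarm each round acts as a two-minute watchdog against a hang.
            signal.alarm(60 * 2)
            for j in range(0, 1000):
                with concurrent.futures.ProcessPoolExecutor() as executor:
                    ftr = executor.submit(f)
                    ftr.result()
        signal.alarm(0)  # cancel the watchdog after a clean run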

    cagney mannequin commented Apr 2, 2019

    I'm seeing cf-deadlock-alarm.py hang on vanilla Python 3.7.[0123] with:

    Linux 5.0.5-100.fc28.x86_64 #1 SMP Wed Mar 27 22:16:29 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
    glibc-2.27-37.fc28.x86_64

    Can anyone reproduce this?

    I also wonder if this is connected to bpo-6721, where a recent "fix" made things worse - the Fedora versions that work for libreswan don't have the "fix".

    @cagney
    Copy link
    Mannequin

    cagney mannequin commented Apr 4, 2019

    More info from adding a faulthandler ...

    15	def f():
    16	    import ctypes
    17	
    18	for i in range(0,50):
    19	    sys.stdout.write("\r%d" % i)
    20	    sys.stdout.flush()
    21	    signal.alarm(60*2)
    22	    for j in range(0,1000):
    23	        with concurrent.futures.ProcessPoolExecutor() as executor:
    24	            ftr = executor.submit(f)
    25	            ftr.result()
    

    Thread 0x00007f1ce7fff700 (most recent call first):
    File "/home/python/v3.7.3/lib/python3.7/threading.py", line 296 in wait
    File "/home/python/v3.7.3/lib/python3.7/multiprocessing/queues.py", line 224 in _feed
    File "/home/python/v3.7.3/lib/python3.7/threading.py", line 865 in run
    File "/home/python/v3.7.3/lib/python3.7/threading.py", line 917 in _bootstrap_inner
    File "/home/python/v3.7.3/lib/python3.7/threading.py", line 885 in _bootstrap

    Thread 0x00007f1cec917700 (most recent call first):
    File "/home/python/v3.7.3/lib/python3.7/selectors.py", line 415 in select
    File "/home/python/v3.7.3/lib/python3.7/multiprocessing/connection.py", line 920 in wait
    File "/home/python/v3.7.3/lib/python3.7/concurrent/futures/process.py", line 354 in _queue_management_worker
    File "/home/python/v3.7.3/lib/python3.7/threading.py", line 865 in run
    File "/home/python/v3.7.3/lib/python3.7/threading.py", line 917 in _bootstrap_inner
    File "/home/python/v3.7.3/lib/python3.7/threading.py", line 885 in _bootstrap

    Current thread 0x00007f1cfd9486c0 (most recent call first):
    File "/home/python/v3.7.3/lib/python3.7/threading.py", line 296 in wait
    File "/home/python/v3.7.3/lib/python3.7/concurrent/futures/_base.py", line 427 in result
    File "cf-deadlock.py", line 25 in <module>

    cagney mannequin commented Apr 4, 2019

    Here are the children; yes, there are somehow 4 children sitting around. Hopefully this is enough to figure out where things deadlock.

    29970 8752 8752 29970 pts/6 8752 Sl+ 1000 1:00 | | \_ ./v3.7.3/bin/python3 cf-deadlock.py
    8752 8975 8752 29970 pts/6 8752 S+ 1000 0:00 | | \_ ./v3.7.3/bin/python3 cf-deadlock.py
    8752 8976 8752 29970 pts/6 8752 S+ 1000 0:00 | | \_ ./v3.7.3/bin/python3 cf-deadlock.py
    8752 8977 8752 29970 pts/6 8752 S+ 1000 0:00 | | \_ ./v3.7.3/bin/python3 cf-deadlock.py
    8752 8978 8752 29970 pts/6 8752 S+ 1000 0:00 | | \_ ./v3.7.3/bin/python3 cf-deadlock.py

    8975

    Current thread 0x00007f3be65126c0 (most recent call first):
    File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
    File "<frozen importlib._bootstrap_external>", line 1043 in create_module
    File "<frozen importlib._bootstrap>", line 583 in module_from_spec
    File "<frozen importlib._bootstrap>", line 670 in _load_unlocked
    File "<frozen importlib._bootstrap>", line 967 in _find_and_load_unlocked
    File "<frozen importlib._bootstrap>", line 983 in _find_and_load
    File "/home/python/v3.7.3/lib/python3.7/ctypes/init.py", line 7 in <module>
    File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
    File "<frozen importlib._bootstrap_external>", line 728 in exec_module
    File "<frozen importlib._bootstrap>", line 677 in _load_unlocked
    File "<frozen importlib._bootstrap>", line 967 in _find_and_load_unlocked
    File "<frozen importlib._bootstrap>", line 983 in _find_and_load
    File "cf-deadlock.py", line 17 in f
    File "/home/python/v3.7.3/lib/python3.7/concurrent/futures/process.py", line 232 in _process_worker
    File "/home/python/v3.7.3/lib/python3.7/multiprocessing/process.py", line 99 in run
    File "/home/python/v3.7.3/lib/python3.7/multiprocessing/process.py", line 297 in _bootstrap
    File "/home/python/v3.7.3/lib/python3.7/multiprocessing/popen_fork.py", line 74 in _launch
    File "/home/python/v3.7.3/lib/python3.7/multiprocessing/popen_fork.py", line 20 in __init__
    File "/home/python/v3.7.3/lib/python3.7/multiprocessing/context.py", line 277 in _Popen
    File "/home/python/v3.7.3/lib/python3.7/multiprocessing/process.py", line 112 in start
    File "/home/python/v3.7.3/lib/python3.7/concurrent/futures/process.py", line 593 in _adjust_process_count
    File "/home/python/v3.7.3/lib/python3.7/concurrent/futures/process.py", line 569 in _start_queue_management_thread
    File "/home/python/v3.7.3/lib/python3.7/concurrent/futures/process.py", line 615 in submit
    File "cf-deadlock.py", line 25 in <module>

    8976

    Current thread 0x00007f3be65126c0 (most recent call first):
    File "/home/python/v3.7.3/lib/python3.7/multiprocessing/connection.py", line 379 in _recv
    File "/home/python/v3.7.3/lib/python3.7/multiprocessing/connection.py", line 407 in _recv_bytes
    File "/home/python/v3.7.3/lib/python3.7/multiprocessing/connection.py", line 216 in recv_bytes
    File "/home/python/v3.7.3/lib/python3.7/multiprocessing/queues.py", line 94 in get
    File "/home/python/v3.7.3/lib/python3.7/concurrent/futures/process.py", line 226 in _process_worker
    File "/home/python/v3.7.3/lib/python3.7/multiprocessing/process.py", line 99 in run
    File "/home/python/v3.7.3/lib/python3.7/multiprocessing/process.py", line 297 in _bootstrap
    File "/home/python/v3.7.3/lib/python3.7/multiprocessing/popen_fork.py", line 74 in _launch
    File "/home/python/v3.7.3/lib/python3.7/multiprocessing/popen_fork.py", line 20 in __init__
    File "/home/python/v3.7.3/lib/python3.7/multiprocessing/context.py", line 277 in _Popen
    File "/home/python/v3.7.3/lib/python3.7/multiprocessing/process.py", line 112 in start
    File "/home/python/v3.7.3/lib/python3.7/concurrent/futures/process.py", line 593 in _adjust_process_count
    File "/home/python/v3.7.3/lib/python3.7/concurrent/futures/process.py", line 569 in _start_queue_management_thread
    File "/home/python/v3.7.3/lib/python3.7/concurrent/futures/process.py", line 615 in submit
    File "cf-deadlock.py", line 25 in <module>

    8977

    Current thread 0x00007f3be65126c0 (most recent call first):
    File "/home/python/v3.7.3/lib/python3.7/multiprocessing/synchronize.py", line 95 in __enter__
    File "/home/python/v3.7.3/lib/python3.7/multiprocessing/queues.py", line 93 in get
    File "/home/python/v3.7.3/lib/python3.7/concurrent/futures/process.py", line 226 in _process_worker
    File "/home/python/v3.7.3/lib/python3.7/multiprocessing/process.py", line 99 in run
    File "/home/python/v3.7.3/lib/python3.7/multiprocessing/process.py", line 297 in _bootstrap
    File "/home/python/v3.7.3/lib/python3.7/multiprocessing/popen_fork.py", line 74 in _launch
    File "/home/python/v3.7.3/lib/python3.7/multiprocessing/popen_fork.py", line 20 in __init__
    File "/home/python/v3.7.3/lib/python3.7/multiprocessing/context.py", line 277 in _Popen
    File "/home/python/v3.7.3/lib/python3.7/multiprocessing/process.py", line 112 in start
    File "/home/python/v3.7.3/lib/python3.7/concurrent/futures/process.py", line 593 in _adjust_process_count
    File "/home/python/v3.7.3/lib/python3.7/concurrent/futures/process.py", line 569 in _start_queue_management_thread
    File "/home/python/v3.7.3/lib/python3.7/concurrent/futures/process.py", line 615 in submit
    File "cf-deadlock.py", line 25 in <module>

    8978

    Current thread 0x00007f3be65126c0 (most recent call first):
    File "/home/python/v3.7.3/lib/python3.7/multiprocessing/synchronize.py", line 95 in __enter__
    File "/home/python/v3.7.3/lib/python3.7/multiprocessing/queues.py", line 93 in get
    File "/home/python/v3.7.3/lib/python3.7/concurrent/futures/process.py", line 226 in _process_worker
    File "/home/python/v3.7.3/lib/python3.7/multiprocessing/process.py", line 99 in run
    File "/home/python/v3.7.3/lib/python3.7/multiprocessing/process.py", line 297 in _bootstrap
    File "/home/python/v3.7.3/lib/python3.7/multiprocessing/popen_fork.py", line 74 in _launch
    File "/home/python/v3.7.3/lib/python3.7/multiprocessing/popen_fork.py", line 20 in __init__
    File "/home/python/v3.7.3/lib/python3.7/multiprocessing/context.py", line 277 in _Popen
    File "/home/python/v3.7.3/lib/python3.7/multiprocessing/process.py", line 112 in start
    File "/home/python/v3.7.3/lib/python3.7/concurrent/futures/process.py", line 593 in _adjust_process_count
    File "/home/python/v3.7.3/lib/python3.7/concurrent/futures/process.py", line 569 in _start_queue_management_thread
    File "/home/python/v3.7.3/lib/python3.7/concurrent/futures/process.py", line 615 in submit
    File "cf-deadlock.py", line 25 in <module>

    hroncok mannequin commented Apr 15, 2019

    Reverting 3b69993 makes the problem go away.

    cagney mannequin commented Apr 15, 2019

    @hroncok see comment msg339370

    Vanilla 3.7.0 (re-confirmed) didn't contain the change, nor did 3.6.8 (OK, that isn't vanilla), but both can hang using the test. It can take a while and, subjectively, it seems to depend on machine load. I've even struggled to get 3.7.3 to fail without load.

    Presumably there's a race and grinding the test machine into the ground increases the odds of it happening.

    The patch for bpo-6721 could be causing many things, but two come to mind:

    • turning this bug into bpo-36533 (aka bpo-6721 caused a regression)
    • slowed down the fork (spending time acquiring locks), which increased the odds of this hang

    My hunch is the latter as the stack dumps look nothing like those I analyzed for bpo-36533 (see messages msg339454 and msg339458).

    vstinner (Member) commented:

    Gregory: It seems like 3b69993 is causing deadlocks, which is not a good thing. What do you think of reverting this change?

    vstinner (Member) commented:

    At least 2 projects were broken by the logging change: libreswan and Anaconda.

    I've filed a Fedora bug report that points to this one: <https://bugzilla.redhat.com/show_bug.cgi?id=1691434>

    That's related to the libreswan project.

    Last year, there was another regression in Anaconda: Fedora installer:

    https://bugzilla.redhat.com/show_bug.cgi?id=1644936

    The workaround/fix was to revert 3b69993 in Python. Anaconda has since been modified, and we were able to revert the revert of 3b69993 :-)

    I'm not sure what the Anaconda fix was. Maybe this change?
    rhinstaller/anaconda#1721

    cagney mannequin commented Apr 16, 2019

    (disclaimer: I'm mashing my high-level backtraces in with @jwilk's low-level backtraces)

    The Python backtrace shows the deadlocked process called 'f' which then 'called':
    import ctypes
    which, in turn 'called':
    from _ctypes import Union, Structure, Array
    and that hung.

    The low-level backtrace shows it was trying to acquire a lock (no surprise there); but the surprise is that it is inside of dlopen(), trying to load '_ctypes...so'!

    #11 __dlopen (file=file@entry=0x7f398da4b050 "_ctypes.cpython-37m-x86_64-linux-gnu.so", mode=<optimized out>) at dlopen.c:87
    ...
    #3 _dl_map_object_from_fd (name="_ctypes.cpython-37m-x86_64-linux-gnu.so", origname=origname@entry=0x0, fd=-1, fbp=<optimized out>, realname=<optimized out>, loader=loader@entry=0x0, l_type=<optimized out>, mode=<optimized out>, stack_endp=<optimized out>, nsid=<optimized out>) at dl-load.c:1413
    #2 _dl_add_to_namespace_list (new=0x55f8b8f34540, nsid=0) at dl-object.c:34
    #1 __GI___pthread_mutex_lock (mutex=0x7f3991fb9970 <_rtld_global+2352>) at ../nptl/pthread_mutex_lock.c:115

    and the lock in question (assuming my sources roughly match above) seems to be:

    /* We modify the list of loaded objects. */
    __rtld_lock_lock_recursive (GL(dl_load_write_lock));

    Presumably a thread in the parent held this lock at the time of the fork.

    If one of the other children also has the lock pre-acquired then this is confirmed (unfortunately not having the lock won't rebut the theory).

    So, any guesses as to what dl related operation was being performed by the parent?
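
    The general mechanism is easy to demonstrate in pure Python - a toy illustration, with a threading.Lock standing in for glibc's dl_load_write_lock:

        import os
        import sys
        import threading
        import time

        lock = threading.Lock()

        def holder():
            with lock:
                time.sleep(3)  # hold the lock while the main thread forks

        threading.Thread(target=holder, daemon=True).start()
        time.sleep(0.5)  # make sure the helper thread owns the lock

        pid = os.fork()
        if pid == 0:
            # The holder thread does not exist in the child, but the copied
            # lock is still marked as held -- nothing will ever release it.
            acquired = lock.acquire(timeout=5)
            sys.stdout.write("child acquired lock: %s\n" % acquired)  # False
            os._exit(0)
        os.waitpid(pid, 0)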

    ----

    I don't think the remaining processes are involved (and I've probably got 4 in total because my machine has 4 cores).

    8976 - this acquired the multi-process semaphore and is blocked in '_recv' awaiting further instructions
    8978, 8977 - these are blocked waiting for the above to free the multi-process semaphore

    gpshead commented Apr 16, 2019

    Please do not blindly revert that. See my PR in https://bugs.python.org/issue36533 which is specific to this "issue" with logging.

    gpshead added the 3.8 (only security fixes) label on Apr 16, 2019
    gpshead commented Apr 16, 2019

    I'd appreciate it if someone with an application running into the issue could test it with my PR from bpo-36533 (#12704) applied.

    jwilk mannequin commented Apr 16, 2019

    #12704 doesn't fix the bug for me.
    Reverting 3b69993 doesn't fix it either.

    cagney mannequin commented Apr 17, 2019

    Here's a possible stack taken during the fork():

    Thread 1 "python3" hit Breakpoint 1, 0x00007ffff7124734 in fork () from /lib64/libc.so.6

    Thread 1814 (Thread 0x7fffe69d5700 (LWP 23574)):
    #0 0x00007ffff7bc24e5 in __pthread_mutex_unlock_usercnt () from /lib64/libpthread.so.0
    #1 0x00007ffff71928e3 in dl_iterate_phdr () from /lib64/libc.so.6
    #2 0x00007fffe5fcfe55 in _Unwind_Find_FDE () from /lib64/libgcc_s.so.1
    #3 0x00007fffe5fcc403 in uw_frame_state_for () from /lib64/libgcc_s.so.1
    #4 0x00007fffe5fcd90f in _Unwind_ForcedUnwind_Phase2 () from /lib64/libgcc_s.so.1
    #5 0x00007fffe5fcdf30 in _Unwind_ForcedUnwind () from /lib64/libgcc_s.so.1
    #6 0x00007ffff7bc7712 in __pthread_unwind () from /lib64/libpthread.so.0
    #7 0x00007ffff7bbf7e7 in pthread_exit () from /lib64/libpthread.so.0
    #8 0x000000000051b2fc in PyThread_exit_thread () at Python/thread_pthread.h:238
    #9 0x000000000055ed16 in t_bootstrap (boot_raw=0x7fffe8da0e40) at ./Modules/_threadmodule.c:1021
    #10 0x00007ffff7bbe594 in start_thread () from /lib64/libpthread.so.0
    #11 0x00007ffff7157e5f in clone () from /lib64/libc.so.6

    Thread 1 (Thread 0x7ffff7fca080 (LWP 20524)):
    #0 0x00007ffff7124734 in fork () from /lib64/libc.so.6
    #1 0x0000000000532c8a in os_fork_impl (module=<optimized out>) at ./Modules/posixmodule.c:5423
    #2 os_fork (module=<optimized out>, _unused_ignored=<optimized out>) at ./Modules/clinic/posixmodule.c.h:1913

    where, in my source code, dl_iterate_phdr() starts with something like:

    /* Make sure nobody modifies the list of loaded objects. */
    __rtld_lock_lock_recursive (GL(dl_load_write_lock));

    i.e., when the fork occurs, the non-fork thread has acquired dl_load_write_lock - the same lock that the child will later try to acquire (and hang on).

    No clue as to what that thread is doing, though, other than that it looks like it is trying to generate a backtrace?

    cagney mannequin commented Apr 17, 2019

    run ProcessPoolExecutor with one fixed child (override the default of #cores)

    cagney mannequin commented Apr 17, 2019

    script to capture a stack backtrace at the time of the fork; the last backtrace printed will be the one for the hang

    gpshead commented Apr 17, 2019

    I am unable to get cf-deadlock.py to hang on my own builds of pure CPython 3.7.2+ d7cb203 or 3.6.8+ be77fb7 (versions I had in a local git clone).

    Which specific Python builds are you seeing the hang with? Which specific platform/distro version? "3.7.2" isn't enough; if you are using a distro-supplied interpreter, please try to reproduce this with a build from the CPython tree itself. Distros always apply their own patches to their interpreters.

    ...

    Do realize that while working on this it is fundamentally *impossible* per POSIX for os.fork() to be safely used at the Python level in a process also using pthreads. That this _ever_ appeared to work is a pure accident of the implementations of the underlying libc, malloc, system libraries, and kernel behaviors. POSIX considers it undefined behavior. Nothing done in CPython can avoid that. Any "fix" for these kinds of issues is merely working around the inevitable, which will re-occur.

    concurrent.futures.ProcessPoolExecutor uses multiprocessing for its process management. As of 3.7, ProcessPoolExecutor accepts an mp_context parameter to specify the multiprocessing start method. Alternatively, the default appears to be controllable as a global setting: https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods.

    Use the 'spawn' start method and the problem should go away as it'll no longer be misusing os.fork(). You _might_ be able to get the 'forkserver' start method to work, but only reliably if you make sure the forkserver is spawned _before_ any threads in the process (such as ProcessPoolExecutor's own queue management thread - which appears to be spawned upon the first call to .submit()).
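
    Applied to the reproducer, that suggestion presumably looks like the sketch below ('spawn' re-imports the main module in each worker process, hence the __main__ guard):

        import concurrent.futures
        import multiprocessing

        def f():
            import ctypes

        if __name__ == '__main__':
            ctx = multiprocessing.get_context('spawn')
            while True:
                with concurrent.futures.ProcessPoolExecutor(mp_context=ctx) as executor:
                    executor.submit(f).result()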

    cagney mannequin commented Apr 17, 2019

    We're discussing vanilla Python; for instance, v3.7.0 is:

    git clone .../cpython
    cd cpython
    git checkout v3.7.0
    ./configure --prefix=/home/python/v3.7.0
    make -j && make -j install

    (my 3.6.x wasn't vanilla, but I clearly stated that)

    Like I also mentioned, loading down the machine helps. Try something like running #cores*2 copies of the script in parallel?
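
    For example, with a throwaway driver along these lines (the cf-deadlock.py path is an assumption):

        import os
        import subprocess
        import sys

        # Launch two copies of the reproducer per core to grind the machine down.
        n = (os.cpu_count() or 1) * 2
        procs = [subprocess.Popen([sys.executable, "cf-deadlock.py"]) for _ in range(n)]
        for p in procs:
            p.wait()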

    cagney mannequin commented Apr 17, 2019

    @gregory.p.smith, I'm puzzled by your references to POSIX and/or os.fork().

    The code in question looks like:

    import concurrent.futures
    import sys
    
    def f():
        import ctypes
    
    while True:
        with concurrent.futures.ProcessPoolExecutor() as executor:
            ftr = executor.submit(f)
            ftr.result()

    which, to me, looks like pure Python.

    Are you saying that this code can't work on GNU/Linux systems?

    gpshead commented Apr 17, 2019

    concurrent.futures.ProcessPoolExecutor uses both multiprocessing and threading. multiprocessing defaults to using os.fork().
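
    That default is easy to check; on Linux the 3.7 default is 'fork', while Windows always uses 'spawn':

        import multiprocessing
        print(multiprocessing.get_start_method())  # 'fork' on Linux in 3.7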

    cagney mannequin commented Apr 18, 2019

    So:

    #1 we've a bug: the single-threaded ProcessPoolExecutor test program should work 100% reliably - it does not

    #2 we've a cause: ProcessPoolExecutor is implemented internally using an unfortunate combination of fork and threads, and this is causing the deadlock

    #3 we've got a workaround - something like:
       ProcessPoolExecutor(multiprocessing.get_context('spawn'))
    but I'm guessing; the documentation is scant.
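
    For what it's worth, in 3.7 the context is passed via the mp_context keyword - the first positional parameter is max_workers - so the corrected call would presumably be:

        import multiprocessing
        from concurrent.futures import ProcessPoolExecutor

        executor = ProcessPoolExecutor(mp_context=multiprocessing.get_context('spawn'))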

    As for a fix, maybe:

    • have ProcessPoolExecutor use 'spawn' by default; this way things always work
    • have ProcessPoolExecutor properly synchronize its threads before "spawning"/"forking"/... so that "single-threaded" code works
    • document that combining ProcessPoolExecutor's "fork" option and user threads isn't a good idea

    gpshead commented Apr 18, 2019

    "the single-threaded ProcessPoolExecutor test program"

    I doubt it is single-threaded; the .submit() method appears to spawn a thread internally.
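
    That is easy to observe (a quick probe; thread names vary by version):

        import concurrent.futures
        import threading

        def noop():
            pass

        if __name__ == '__main__':
            with concurrent.futures.ProcessPoolExecutor() as ex:
                print([t.name for t in threading.enumerate()])  # just MainThread
                ex.submit(noop).result()
                # After the first submit(), helper threads (queue management,
                # queue feeder) have appeared alongside MainThread.
                print([t.name for t in threading.enumerate()])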

    bentoi mannequin commented Sep 18, 2019

    FYI, I'm getting a similar deadlock in a child Python process which is stuck on locking a mutex from the dl library. See attached stack.

    I'm not using concurrent.futures however, the parent Python process is a test driver that uses threading.Thread and subprocess.Popen to spawn new processes... I'm not using os.fork().

    This occurred on ArchLinux with Python 3.7.4.

    vstinner (Member) commented:

    #6 0x00007ffff7bc7712 in __pthread_unwind () from /lib64/libpthread.so.0
    #7 0x00007ffff7bbf7e7 in pthread_exit () from /lib64/libpthread.so.0
    #8 0x000000000051b2fc in PyThread_exit_thread () at Python/thread_pthread.h:238
    #9 0x000000000055ed16 in t_bootstrap (boot_raw=0x7fffe8da0e40) at ./Modules/_threadmodule.c:1021

    The thread_run() function (previously "t_bootstrap()"), the low-level C function that runs a thread started with _thread.start_new_thread(), no longer calls PyThread_exit_thread(): see bpo-44434.

    I'm able to reproduce the issue using the attached cf-deadlock.py with Python 3.8:

    • Run 3 processes running cf-deadlock.py
    • Run the Python test suite to stress the machine: ./python -m test -r -j4 # try different -j values to stress the machine more or less

    In the main branch (with bpo-44434 fix), I can no longer reproduce the issue. I ran cf-deadlock.py in 4 terminals in parallel with "./python -m test -r -j2" in a 5th terminal for 5 minutes. I couldn't reproduce the issue. On Python 3.8, I reproduced the issue in less than 1 minute.

    Can someone please confirm that the issue is now fixed? Can we mark this issue as a duplicate of bpo-44434?

    jwilk mannequin commented Jul 3, 2021

    I can no longer reproduce the bug with Python from git.

    vstinner closed this as completed Jul 3, 2021
    vstinner commented Jul 3, 2021

    Great! I close the issue.

    ezio-melotti transferred this issue from another repository Apr 10, 2022