
python3.8 regression - ThreadPool join via __del__ hangs forever #83541

Closed
gaborbernat mannequin opened this issue Jan 16, 2020 · 30 comments
Labels
3.8 (only security fixes) · 3.9 (only security fixes) · stdlib (Python modules in the Lib dir)

Comments

@gaborbernat
Mannequin

gaborbernat mannequin commented Jan 16, 2020

BPO 39360
Nosy @arekm, @vstinner, @pablogsal, @miss-islington, @gaborbernat, @isidentical
PRs
  • bpo-39360: Ensure all workers exit when finalizing a multiprocessing Pool #19009
  • [3.8] bpo-39360: Ensure all workers exit when finalizing a multiprocessing Pool (GH-19009) #19023
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2020-03-20.17:49:21.674>
    created_at = <Date 2020-01-16.13:23:44.464>
    labels = ['3.8', 'library', '3.9']
    title = 'python3.8 regression - ThreadPool join via __del__ hangs forever'
    updated_at = <Date 2020-03-20.18:11:24.957>
    user = 'https://github.com/gaborbernat'

    bugs.python.org fields:

    activity = <Date 2020-03-20.18:11:24.957>
    actor = 'arekm'
    assignee = 'none'
    closed = True
    closed_date = <Date 2020-03-20.17:49:21.674>
    closer = 'pablogsal'
    components = ['Library (Lib)']
    creation = <Date 2020-01-16.13:23:44.464>
    creator = 'gaborjbernat'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 39360
    keywords = ['patch']
    message_count = 30.0
    messages = ['360119', '360128', '360130', '360131', '360133', '364215', '364225', '364226', '364251', '364252', '364253', '364254', '364259', '364261', '364262', '364263', '364265', '364268', '364269', '364270', '364300', '364307', '364308', '364309', '364310', '364692', '364695', '364696', '364698', '364700']
    nosy_count = 6.0
    nosy_names = ['arekm', 'vstinner', 'pablogsal', 'miss-islington', 'gaborjbernat', 'BTaskaya']
    pr_nums = ['19009', '19023']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue39360'
    versions = ['Python 3.8', 'Python 3.9']

    @gaborbernat
    Mannequin Author

    gaborbernat mannequin commented Jan 16, 2020

    Assume the following code:

    from multiprocessing.pool import ThreadPool
    class A(object):
        def __init__(self):
            self.pool = ThreadPool()
        def __del__(self):
            self.pool.close()
            self.pool.join()
    a = A()
    print(a)

    The code snippet above hangs forever on Python 3.8+ (it works fine on Python 3.7 and earlier). Example output, where I've added some extra prints around the thread joins:

    <__main__.A object at 0x1104d6070>
    join thread <Thread(Thread-13, started daemon 123145699672064)> None
    done
    join thread <Thread(Thread-14, stopped daemon 123145716461568)> None
    done
    join thread <Thread(Thread-15, stopped daemon 123145733251072)> None
    done
    join thread <Thread(Thread-13, stopped daemon 123145699672064)> None
    done
    join thread <Thread(Thread-14, stopped daemon 123145716461568)> None
    done
    join thread <Thread(Thread-15, stopped daemon 123145733251072)> None
    done
    join thread <DummyProcess(Thread-1, stopped daemon 123145498198016)> None
    done
    join thread <DummyProcess(Thread-2, stopped daemon 123145514987520)> None
    done
    join thread <DummyProcess(Thread-3, stopped daemon 123145531777024)> None
    done
    join thread <DummyProcess(Thread-4, stopped daemon 123145548566528)> None
    done
    join thread <DummyProcess(Thread-5, started daemon 123145565356032)> None
    

    I've tested on macOS, but could replicate it on Linux too in CI.

    @gaborbernat gaborbernat mannequin added the 3.8 (only security fixes), 3.9 (only security fixes), and stdlib (Python modules in the Lib dir) labels Jan 16, 2020
    @gaborbernat
    Mannequin Author

    gaborbernat mannequin commented Jan 16, 2020

    An FYI update: the code is used by swagger-codegen to generate HTTP clients (https://github.com/swagger-api/swagger-codegen/blob/master/modules/swagger-codegen/src/main/resources/python/api_client.mustache#L73). The example above basically replicates creating such a client as a global.

    @vstinner
    Member

    This code snippet seems to trigger a crash in the _struct module:
    https://bugs.python.org/issue38076#msg360127

    @gaborbernat
    Mannequin Author

    gaborbernat mannequin commented Jan 16, 2020

    Reported the issue downstream too under swagger-api/swagger-codegen#9991

    @pablogsal
    Member

    This is not a regression: the code only worked on 3.7 because there is a sleep in the pool maintenance loop that makes the pool more likely to be in a consistent state, but the same race condition exists there too. For example, reducing the sleep in the loop for the 3.7 version (notice that 3.8 does not use a sleep but instead a more resilient system of sentinels) reproduces the problem:

    diff --git a/Lib/multiprocessing/pool.py b/Lib/multiprocessing/pool.py
    index 3e9a0d6b48..f8d438d87c 100644
    --- a/Lib/multiprocessing/pool.py
    +++ b/Lib/multiprocessing/pool.py
    @@ -410,7 +410,7 @@ class Pool(object):
             # is terminated.
             while thread._state == RUN or (pool._cache and thread._state != TERMINATE):
                 pool._maintain_pool()
    -            time.sleep(0.1)
    +            time.sleep(0.00000001)
             # send sentinel to stop workers
             pool._taskqueue.put(None)
             util.debug('worker handler exiting')

    With that patch, 3.7 hangs as well. The problem here is that the locks inside the SimpleQueue of inbound tasks are not in a consistent state when Python finalization calls __del__.
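
    To illustrate the difference, here is a rough editor's sketch with made-up names (not CPython's actual pool internals): the 3.7-style handler wakes on a timer and re-checks state, while the 3.8-style handler blocks until a state change is signalled explicitly, which is why the patch quoted further below has to put a token on the change notifier during finalization.

    import queue
    import threading
    import time

    state = {"running": True}
    change_notifier = queue.SimpleQueue()

    def polling_handler():
        # 3.7-style: wake up on a timer and re-check the pool state.
        # The 0.1 s sleep usually hides the finalization race; shrinking
        # it (as in the diff above) makes 3.7 hang too.
        while state["running"]:
            time.sleep(0.1)

    def sentinel_handler():
        # 3.8-style: block until a state change is signalled explicitly.
        # If nobody puts a token on the notifier, this blocks forever.
        while state["running"]:
            change_notifier.get()

    t = threading.Thread(target=sentinel_handler)
    t.start()
    state["running"] = False
    change_notifier.put(None)  # without this wake-up, t.join() would hang
    t.join()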

    @pablogsal
    Member

    Check out also https://bugs.python.org/issue38744

    @arekm
    Mannequin

    arekm mannequin commented Mar 15, 2020

    I've applied the pull request patch (which looks a bit weird to me: it removes something, then adds the same thing back, etc.; https://patch-diff.githubusercontent.com/raw/python/cpython/pull/19009.patch). Essentially the change is:

    diff --git a/Lib/multiprocessing/pool.py b/Lib/multiprocessing/pool.py
    index b223d6aa72..8bd9608b0e 100644
    --- a/Lib/multiprocessing/pool.py
    +++ b/Lib/multiprocessing/pool.py
    @@ -682,7 +682,15 @@ class Pool(object):
             # this is guaranteed to only be called once
             util.debug('finalizing pool')

    -        worker_handler._state = TERMINATE
    +        # Explicitly do the cleanup here if it didn't come from terminate()
    +        # (required for if the queue will block, if it is already closed)
    +        if worker_handler._state != TERMINATE:
    +            # Notify that the worker_handler state has been changed so the
    +            # _handle_workers loop can be unblocked (and exited) in order to
    +            # send the finalization sentinel all the workers.
    +            worker_handler._state = TERMINATE
    +            change_notifier.put(None)
    +
             task_handler._state = TERMINATE
             util.debug('helping task handler/workers to finish')

    Unfortunately, the test case from the first message in this bug report still hangs for me with this applied.

    @arekm
    Mannequin

    arekm mannequin commented Mar 15, 2020

    (note, testing on 3.8 branch + pull request patch)

    @pablogsal
    Member

    which looks a bit weird to me

    It is explained in https://bugs.python.org/msg364211:

    What happens is that it moves that code so it is executed in both code paths: explicit termination and multiprocessing finalization.

    Unfortunately, the test case from the first message in this bug report still hangs for me with this applied.

    Hummm, the test case from this bug is precisely the test case in PR 19009 so it should not hang.

    @pablogsal
    Member

    New changeset ac10e0c by Batuhan Taşkaya in branch 'master':
    bpo-39360: Ensure all workers exit when finalizing a multiprocessing Pool (GH-19009)
    ac10e0c

    @pablogsal
    Member

    Hummm, the test case from this bug is precisely the test case in PR 19009 so it should not hang.

    Oh, actually I am wrong; I had a different issue in mind. PR 19009 should fix the case reported in https://bugs.python.org/issue38744 and this equivalent:

    from multiprocessing import Pool

    class A:
        def __init__(self):
            self.pool = Pool(processes=1)

    def test():
        problem = A()
        problem.pool.map(float, tuple(range(10)))

    if __name__ == "__main__":
        test()

    Notice that in this code there is no __del__ call.

    @pablogsal
    Member

    I tried and I cannot reproduce the hang with the code provided in the first message here after PR 19009. (Tested on Linux, Windows, and macOS.)

    @arekm
    Mannequin

    arekm mannequin commented Mar 15, 2020

    OK, so two test cases:

    1)

    from multiprocessing.pool import ThreadPool
    class A(object):
        def __init__(self):
            self.pool = ThreadPool()
        def __del__(self):
            self.pool.close()
            self.pool.join()
    a = A()
    print(a)

    2)

    from multiprocessing import Pool
    class A:
        def __init__(self):
            self.pool = Pool(processes=1)
    def test():
        problem = A()
        problem.pool.map(float, tuple(range(10)))
    if __name__ == "__main__":
        test()

    On my two test environments with Linux (5.5.2 and 4.9.208 kernels, glibc 2.31; userspace is mostly the same on both), test 1) hangs and test 2) doesn't hang.

    Testing done on 3.8 branch + https://patch-diff.githubusercontent.com/raw/python/cpython/pull/19023.patch

    For test 1) there is no traceback on Ctrl+C and also no trace logged with faulthandler installed, so no idea where it hangs.
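
    For reference, a minimal sketch (an editor's illustration, not from the report) of arming faulthandler's watchdog to dump all thread stacks if the process is still alive after a timeout; a plain faulthandler.enable() only fires on faults, not hangs, and whether the watchdog can still fire this deep into finalization is not guaranteed. The 10-second timeout is an arbitrary choice:

    import faulthandler

    # Dump every thread's traceback if the process is still running after
    # 10 seconds, then force-exit; the watchdog runs on its own C thread.
    faulthandler.dump_traceback_later(timeout=10, exit=True)

    from multiprocessing.pool import ThreadPool

    class A(object):
        def __init__(self):
            self.pool = ThreadPool()
        def __del__(self):
            self.pool.close()
            self.pool.join()

    a = A()
    print(a)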

    strace shows
    [...]
    [pid 232031] munmap(0x7f1baaffe000, 8392704 <unfinished ...>
    [pid 232017] write(5, "\0\0\0\4\200\4N.", 8 <unfinished ...>
    [pid 232031] <... munmap resumed>) = 0
    [pid 232017] <... write resumed>) = 8
    [pid 232031] exit(0) = ?
    [pid 232017] futex(0x7f1b94000b60, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
    [pid 232031] +++ exited with 0 +++
    ^C ************************* ctrl + c pressed
    [pid 232017] <... futex resumed>) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
    strace: Process 232017 detached
    strace: Process 232032 detached
    strace: Process 232033 detached
    strace: Process 232034 detached
    strace: Process 232035 detached

    @pablogsal
    Member

    Sadly, I am unable to reproduce, so without more clarification or a reliable reproducer, we cannot start debugging the issue.

    For test 1) there is no traceback on Ctrl+C and also no trace logged with faulthandler installed, so no idea where it hangs.

    Can you attach with gdb and get the C trace and the Python trace? (Check the gdb helpers in the CPython repo for the latter.)

    @miss-islington
    Contributor

    New changeset 7f5302f by Miss Islington (bot) in branch '3.8':
    bpo-39360: Ensure all workers exit when finalizing a multiprocessing Pool (GH-19009)
    7f5302f

    @arekm
    Mannequin

    arekm mannequin commented Mar 15, 2020

    http://ixion.pld-linux.org/~arekm/python-gdb1.txt

    It's the result of "thread apply all bt" and "thread apply all bt full"; the pystack command shows nothing.

    ps. I'm arekm@freenode and can give an ssh account for playing with the issue.

    @pablogsal
    Member

    ps. I'm arekm@freenode and can give an ssh account for playing with the issue.

    Can you send me an email to pablogsal@gmail.com to set up ssh access?

    @pablogsal
    Member

    I still cannot reproduce it on your system:

    [test@ixion-pld cpython]$ hostname
    ixion-pld
    [test@ixion-pld cpython]$ uname -a
    Linux ixion-pld 5.5.2-1 #1 SMP Wed Feb 5 19:26:43 CET 2020 x86_64 Common_KVM_processor PLD Linux

    [test@ixion-pld cpython]$ cat lel.py
    from multiprocessing.pool import ThreadPool
    class A(object):
        def __init__(self):
            self.pool = ThreadPool()
        def __del__(self):
            self.pool.close()
            self.pool.join()
    a = A()
    print(a)
    [test@ixion-pld cpython]$ time ./python lel.py
    <__main__.A object at 0x7f9130becd20>

    real 0m0.124s
    user 0m0.105s
    sys 0m0.021s

    @pablogsal
    Member

    I tested it on the master branch with commit 5b66ec1

    @arekm
    Mannequin

    arekm mannequin commented Mar 15, 2020

    master works fine
    3.8 branch hangs

    so there is some other/related issue.

    @arekm
    Mannequin

    arekm mannequin commented Mar 16, 2020

    On master, test 1) hangs before the commit below and works after it. Unfortunately, applying that commit to the 3.8 branch doesn't help; 3.8 still hangs. Some other fix is also needed, I guess.

    commit 9ad58ac
    Author: Victor Stinner <vstinner@python.org>
    Date: Mon Mar 9 23:37:49 2020 +0100

    bpo-19466: Py_Finalize() clears daemon threads earlier (GH-18848)
    
    Clear the frames of daemon threads earlier during the Python shutdown to
    call objects destructors. So "unclosed file" resource warnings are now
    emitted for daemon threads in a more reliable way.
    
    Cleanup _PyThreadState_DeleteExcept() code: rename "garbage" to
    "list".
    

    @arekm
    Mannequin

    arekm mannequin commented Mar 16, 2020

    This one is also needed on 3.8 to keep it from hanging with test case 1).

    So 3.8 branch + 9ad58ac + 4d96b46 is working for me.

    commit 4d96b46
    Author: Victor Stinner <vstinner@python.org>
    Date: Sat Feb 1 02:30:25 2020 +0100

    bpo-39511: PyThreadState_Clear() calls on_delete (GH-18296)
    
    PyThreadState.on_delete is a callback used to notify Python when a
    thread completes. _thread._set_sentinel() function creates a lock
    which is released when the thread completes. It sets on_delete
    callback to the internal release_sentinel() function. This lock is
    known as Threading._tstate_lock in the threading module.
    
    The release_sentinel() function uses the Python C API. The problem is
    that on_delete is called late in the Python finalization, when the C
    API is no longer fully working.
    
    The PyThreadState_Clear() function now calls the
    PyThreadState.on_delete callback. Previously, that happened in
    PyThreadState_Delete().
    
    The release_sentinel() function is now called when the C API is still
    fully working.
    

    @pablogsal
    Member

    Can you try applying 4f384af and 4d96b46 together?

    @pablogsal
    Member

    Oh, sorry, I missed your last message.

    @pablogsal
    Member

    So 3.8 branch + 9ad58ac + 4d96b46 is working for me.

    I can confirm that these commits together fix the problem.

    Victor, are you OK if we backport both changes to 3.8?

    @vstinner
    Member

    Victor, are you OK if we backport both changes to 3.8?

    Let me look at commit 9ad58ac:
    "bpo-19466: Py_Finalize() clears daemon threads earlier (GH-18848)"

    Calling _PyThreadState_DeleteExcept() in Py_FinalizeEx() is really dangerous. It frees the PyThreadState memory of daemon threads. Daemon threads continue to run while Py_FinalizeEx() is running (which takes an unknown amount of time; we only know that it's longer than 0 seconds). When a daemon thread attempts to acquire the GIL, it will likely crash if its PyThreadState memory has been freed: the memory can be overwritten by another allocation, or dereferencing the pointer can trigger a segmentation fault.

    This change caused multiple regressions in the master branch. I had a hard time fixing all the crashes: I modified take_gil() 4 times, and I'm still not sure that my fix is correct. take_gil() is the function which acquires the GIL: it is really fragile, and I would prefer not to touch it in a stable branch. See the bpo-39877 changes to get an idea of the complexity of the problem.

    Python finalization is really fragile: https://pythondev.readthedocs.io/finalization.html

    @vstinner
    Member

    commit 4d96b46
    "bpo-39511: PyThreadState_Clear() calls on_delete (GH-18296)"

    "(...) The release_sentinel() function is now called when the C API is still
    fully working."

    This change is "safer" than the other one, but there is still a risk of introducing a regression.

    The problem is that this commit is part of a larger effort to make Python finalization more reliable. The commit makes sense in the series of commits pushed to the master branch, but I'm not sure that it makes sense taken alone.

    I don't have the bandwidth right now to investigate this issue, identify the root cause, and suggest which commits should or should not be backported.

    @pablogsal
    Member

    OK, so sadly it seems that we need to close this issue, as it is fixed in master but the backport is risky.

    @vstinner
    Member

    If you want reliable behavior, don't rely on destructors. Python finalization is not deterministic, and destructors can be called while Python is no longer fully functional.

    Release resources explicitly. For example, use multiprocessing.Pool with a context manager, or call the close() and join() methods explicitly.
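
    A minimal sketch of both patterns (the map() call is just a placeholder workload):

    from multiprocessing.pool import ThreadPool

    # Deterministic cleanup via the context manager protocol
    # (__exit__ calls terminate()):
    with ThreadPool() as pool:
        print(pool.map(str, range(3)))

    # Or release the pool explicitly instead of relying on __del__:
    pool = ThreadPool()
    try:
        print(pool.map(str, range(3)))
    finally:
        pool.close()
        pool.join()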

    @arekm
    Mannequin

    arekm mannequin commented Mar 20, 2020

    So only this test (from the first commit) should be added on master to check for further regressions in this area, right?
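
    A sketch of how such a regression test might be driven (an editor's illustration, not the actual test added in GH-19009): run the reproducer in a subprocess with a timeout, so a hang fails the run instead of blocking the suite. The 60-second timeout is an arbitrary choice.

    import subprocess
    import sys
    import textwrap

    CODE = textwrap.dedent("""
        from multiprocessing.pool import ThreadPool

        class A(object):
            def __init__(self):
                self.pool = ThreadPool()
            def __del__(self):
                self.pool.close()
                self.pool.join()

        a = A()
        print(a)
    """)

    # A hang surfaces as subprocess.TimeoutExpired instead of blocking forever.
    result = subprocess.run([sys.executable, "-c", CODE], timeout=60)
    assert result.returncode == 0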

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022