
python3.8 regression - ThreadPool join via __del__ hangs forever #83541

Closed
gaborbernat mannequin opened this issue Jan 16, 2020 · 30 comments
Labels
3.8 (only security fixes) · 3.9 (only security fixes) · stdlib (Python modules in the Lib dir)

Comments

@gaborbernat
Mannequin

gaborbernat mannequin commented Jan 16, 2020

BPO 39360
Nosy @arekm, @vstinner, @pablogsal, @miss-islington, @gaborbernat, @isidentical
PRs
  • bpo-39360: Ensure all workers exit when finalizing a multiprocessing Pool #19009
  • [3.8] bpo-39360: Ensure all workers exit when finalizing a multiprocessing Pool (GH-19009) #19023
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2020-03-20.17:49:21.674>
    created_at = <Date 2020-01-16.13:23:44.464>
    labels = ['3.8', 'library', '3.9']
    title = 'python3.8 regression - ThreadPool join via __del__ hangs forever'
    updated_at = <Date 2020-03-20.18:11:24.957>
    user = 'https://github.com/gaborbernat'

    bugs.python.org fields:

    activity = <Date 2020-03-20.18:11:24.957>
    actor = 'arekm'
    assignee = 'none'
    closed = True
    closed_date = <Date 2020-03-20.17:49:21.674>
    closer = 'pablogsal'
    components = ['Library (Lib)']
    creation = <Date 2020-01-16.13:23:44.464>
    creator = 'gaborjbernat'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 39360
    keywords = ['patch']
    message_count = 30.0
    messages = ['360119', '360128', '360130', '360131', '360133', '364215', '364225', '364226', '364251', '364252', '364253', '364254', '364259', '364261', '364262', '364263', '364265', '364268', '364269', '364270', '364300', '364307', '364308', '364309', '364310', '364692', '364695', '364696', '364698', '364700']
    nosy_count = 6.0
    nosy_names = ['arekm', 'vstinner', 'pablogsal', 'miss-islington', 'gaborjbernat', 'BTaskaya']
    pr_nums = ['19009', '19023']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue39360'
    versions = ['Python 3.8', 'Python 3.9']

    @gaborbernat
    Mannequin Author

    gaborbernat mannequin commented Jan 16, 2020

    Assume the following code:

    from multiprocessing.pool import ThreadPool
    class A(object):
        def __init__(self):
            self.pool = ThreadPool()
        def __del__(self):
            self.pool.close()
            self.pool.join()
    a = A()
    print(a)

    The code snippet above hangs forever on Python 3.8+ (it works fine on Python 3.7 and earlier). Example output, where I've added some extra prints around the thread joins:

    <__main__.A object at 0x1104d6070>
    join thread <Thread(Thread-13, started daemon 123145699672064)> None
    done
    join thread <Thread(Thread-14, stopped daemon 123145716461568)> None
    done
    join thread <Thread(Thread-15, stopped daemon 123145733251072)> None
    done
    join thread <Thread(Thread-13, stopped daemon 123145699672064)> None
    done
    join thread <Thread(Thread-14, stopped daemon 123145716461568)> None
    done
    join thread <Thread(Thread-15, stopped daemon 123145733251072)> None
    done
    join thread <DummyProcess(Thread-1, stopped daemon 123145498198016)> None
    done
    join thread <DummyProcess(Thread-2, stopped daemon 123145514987520)> None
    done
    join thread <DummyProcess(Thread-3, stopped daemon 123145531777024)> None
    done
    join thread <DummyProcess(Thread-4, stopped daemon 123145548566528)> None
    done
    join thread <DummyProcess(Thread-5, started daemon 123145565356032)> None
    

    I've tested on macOS, but could replicate it on Linux too in CI.

    @gaborbernat gaborbernat mannequin added the 3.8 (only security fixes), 3.9 (only security fixes), and stdlib (Python modules in the Lib dir) labels Jan 16, 2020
    @gaborbernat
    Mannequin Author

    gaborbernat mannequin commented Jan 16, 2020

    An FYI update: the code is used by swagger-codegen to generate HTTP clients (https://github.com/swagger-api/swagger-codegen/blob/master/modules/swagger-codegen/src/main/resources/python/api_client.mustache#L73). The example above basically replicates creating such a client as a global.

    @vstinner
    Member

    This code snippet seems to trigger a crash in the _struct module:
    https://bugs.python.org/issue38076#msg360127

    @gaborbernat
    Mannequin Author

    gaborbernat mannequin commented Jan 16, 2020

    Reported the issue downstream too under swagger-api/swagger-codegen#9991

    @pablogsal
    Member

    This is not a regression: the code only worked on 3.7 because there is a sleep in the pool maintenance loop that makes the pool more likely to be in a consistent state, but the same race condition exists there too. For example, reducing the sleep in the loop for the 3.7 version (notice that 3.8 does not use a sleep but instead a more resilient system of sentinels) reproduces the problem:

    diff --git a/Lib/multiprocessing/pool.py b/Lib/multiprocessing/pool.py
    index 3e9a0d6b48..f8d438d87c 100644
    --- a/Lib/multiprocessing/pool.py
    +++ b/Lib/multiprocessing/pool.py
    @@ -410,7 +410,7 @@ class Pool(object):
             # is terminated.
             while thread._state == RUN or (pool._cache and thread._state != TERMINATE):
                 pool._maintain_pool()
    -            time.sleep(0.1)
    +            time.sleep(0.00000001)
             # send sentinel to stop workers
             pool._taskqueue.put(None)
             util.debug('worker handler exiting')

    With that patch, 3.7 hangs as well. The problem here is that the locks inside the SimpleQueue of inbound tasks are not in a consistent state when Python finalization calls __del__.
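
    To illustrate the difference, here is a rough editor's sketch with made-up names (not CPython's actual pool internals): the 3.7-style handler wakes on a timer and re-checks state, while the 3.8-style handler blocks until a state change is signalled explicitly, which is why the patch quoted further below has to put a token on the change notifier during finalization.

    import queue
    import threading
    import time

    state = {"running": True}
    change_notifier = queue.SimpleQueue()

    def polling_handler():
        # 3.7-style: wake up on a timer and re-check the pool state.
        # The 0.1 s sleep usually hides the finalization race; shrinking
        # it (as in the diff above) makes 3.7 hang too.
        while state["running"]:
            time.sleep(0.1)

    def sentinel_handler():
        # 3.8-style: block until a state change is signalled explicitly.
        # If nobody puts a token on the notifier, this blocks forever.
        while state["running"]:
            change_notifier.get()

    t = threading.Thread(target=sentinel_handler)
    t.start()
    state["running"] = False
    change_notifier.put(None)  # without this wake-up, t.join() would hang
    t.join()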

    @pablogsal
    Member

    Check out also https://bugs.python.org/issue38744

    @arekm
    Mannequin

    arekm mannequin commented Mar 15, 2020

    I've applied the pull request patch (which looks a bit weird to me: it removes something, then adds the same thing back, etc.; https://patch-diff.githubusercontent.com/raw/python/cpython/pull/19009.patch). Essentially the change is:

    diff --git a/Lib/multiprocessing/pool.py b/Lib/multiprocessing/pool.py
    index b223d6aa72..8bd9608b0e 100644
    --- a/Lib/multiprocessing/pool.py
    +++ b/Lib/multiprocessing/pool.py
    @@ -682,7 +682,15 @@ class Pool(object):
             # this is guaranteed to only be called once
             util.debug('finalizing pool')

    -        worker_handler._state = TERMINATE
    +        # Explicitly do the cleanup here if it didn't come from terminate()
    +        # (required for if the queue will block, if it is already closed)
    +        if worker_handler._state != TERMINATE:
    +            # Notify that the worker_handler state has been changed so the
    +            # _handle_workers loop can be unblocked (and exited) in order to
    +            # send the finalization sentinel all the workers.
    +            worker_handler._state = TERMINATE
    +            change_notifier.put(None)
    +
             task_handler._state = TERMINATE
             util.debug('helping task handler/workers to finish')

    Unfortunately, the test case from the first message in this bug report still hangs for me with this applied.

    @arekm
    Mannequin

    arekm mannequin commented Mar 15, 2020

    (note, testing on 3.8 branch + pull request patch)

    @pablogsal
    Member

    which looks a bit weird to me

    It is explained in https://bugs.python.org/msg364211:

    What happens is that it moves that code so it is executed in both code paths: explicit termination and multiprocessing finalization.

    Unfortunately, the test case from the first message in this bug report still hangs for me with this applied.

    Hummm, the test case from this bug is precisely the test case in PR 19009 so it should not hang.

    @pablogsal
    Member

    New changeset ac10e0c by Batuhan Taşkaya in branch 'master':
    bpo-39360: Ensure all workers exit when finalizing a multiprocessing Pool (GH-19009)
    ac10e0c

    @pablogsal
    Member

    Hummm, the test case from this bug is precisely the test case in PR 19009 so it should not hang.

    Oh, actually I am wrong; I had a different issue in mind. PR 19009 should fix the case reported in https://bugs.python.org/issue38744 and this equivalent:

    from multiprocessing import Pool

    class A:
        def __init__(self):
            self.pool = Pool(processes=1)

    def test():
        problem = A()
        problem.pool.map(float, tuple(range(10)))

    if __name__ == "__main__":
        test()

    Notice that in this code there is no __del__ call.

    @pablogsal
    Member

    I tried and I cannot reproduce the hang with the code provided in the first message here after PR 19009. (Tested on Linux, Windows, and macOS.)

    @arekm
    Mannequin

    arekm mannequin commented Mar 15, 2020

    OK, so two test cases:

    1)

    from multiprocessing.pool import ThreadPool
    class A(object):
        def __init__(self):
            self.pool = ThreadPool()
        def __del__(self):
            self.pool.close()
            self.pool.join()
    a = A()
    print(a)

    2)

    from multiprocessing import Pool
    class A:
        def __init__(self):
            self.pool = Pool(processes=1)
    def test():
        problem = A()
        problem.pool.map(float, tuple(range(10)))
    if __name__ == "__main__":
        test()

    On my two test environments with Linux (5.5.2 and 4.9.208 kernels, glibc 2.31; userspace is mostly the same on both), test 1) hangs and test 2) doesn't hang.

    Testing done on 3.8 branch + https://patch-diff.githubusercontent.com/raw/python/cpython/pull/19023.patch

    For test 1) there is no traceback on Ctrl+C and also no trace logged with faulthandler installed, so no idea where it hangs.
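
    For reference, a minimal sketch (an editor's illustration, not from the report) of arming faulthandler's watchdog to dump all thread stacks if the process is still alive after a timeout; a plain faulthandler.enable() only fires on faults, not hangs, and whether the watchdog can still fire this deep into finalization is not guaranteed. The 10-second timeout is an arbitrary choice:

    import faulthandler

    # Dump every thread's traceback if the process is still running after
    # 10 seconds, then force-exit; the watchdog runs on its own C thread.
    faulthandler.dump_traceback_later(timeout=10, exit=True)

    from multiprocessing.pool import ThreadPool

    class A(object):
        def __init__(self):
            self.pool = ThreadPool()
        def __del__(self):
            self.pool.close()
            self.pool.join()

    a = A()
    print(a)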

    strace shows
    [...]
    [pid 232031] munmap(0x7f1baaffe000, 8392704 <unfinished ...>
    [pid 232017] write(5, "\0\0\0\4\200\4N.", 8 <unfinished ...>
    [pid 232031] <... munmap resumed>) = 0
    [pid 232017] <... write resumed>) = 8
    [pid 232031] exit(0) = ?
    [pid 232017] futex(0x7f1b94000b60, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
    [pid 232031] +++ exited with 0 +++
    ^C ************************* ctrl + c pressed
    [pid 232017] <... futex resumed>) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
    strace: Process 232017 detached
    strace: Process 232032 detached
    strace: Process 232033 detached
    strace: Process 232034 detached
    strace: Process 232035 detached

    @pablogsal
    Member

    Sadly, I am unable to reproduce, so without more clarification or a reliable reproducer, we cannot start debugging the issue.

    For test 1) there is no traceback on Ctrl+C and also no trace logged with faulthandler installed, so no idea where it hangs.

    Can you attach with gdb and get the C trace and the Python trace? (Check the gdb helpers in the CPython repo for the latter.)

    @miss-islington
    Contributor

    New changeset 7f5302f by Miss Islington (bot) in branch '3.8':
    bpo-39360: Ensure all workers exit when finalizing a multiprocessing Pool (GH-19009)
    7f5302f

    @arekm
    Mannequin

    arekm mannequin commented Mar 15, 2020

    http://ixion.pld-linux.org/~arekm/python-gdb1.txt

    It's the result of "thread apply all bt" and "thread apply all bt full"; the pystack command shows nothing.

    ps. I'm arekm@freenode and can give an ssh account for playing with the issue.

    @pablogsal
    Member

    ps. I'm arekm@freenode and can give an ssh account for playing with the issue.

    Can you send me an email to pablogsal@gmail.com to set up ssh access?

    @pablogsal
    Member

    I still cannot reproduce it on your system:

    [test@ixion-pld cpython]$ hostname
    ixion-pld
    [test@ixion-pld cpython]$ uname -a
    Linux ixion-pld 5.5.2-1 #1 SMP Wed Feb 5 19:26:43 CET 2020 x86_64 Common_KVM_processor PLD Linux

    [test@ixion-pld cpython]$ cat lel.py
    from multiprocessing.pool import ThreadPool
    class A(object):
        def __init__(self):
            self.pool = ThreadPool()
        def __del__(self):
            self.pool.close()
            self.pool.join()
    a = A()
    print(a)
    [test@ixion-pld cpython]$ time ./python lel.py
    <__main__.A object at 0x7f9130becd20>

    real 0m0.124s
    user 0m0.105s
    sys 0m0.021s

    @pablogsal
    Member

    I tested it on the master branch with commit 5b66ec1

    @arekm
    Mannequin

    arekm mannequin commented Mar 15, 2020

    master works fine
    3.8 branch hangs

    so there is some other/related issue.

    @arekm
    Mannequin

    arekm mannequin commented Mar 16, 2020

    On master, test 1) hangs before the commit below and works after it. Unfortunately, applying that commit to the 3.8 branch doesn't help; 3.8 still hangs. Some other fix is also needed, I guess.

    commit 9ad58ac
    Author: Victor Stinner <vstinner@python.org>
    Date: Mon Mar 9 23:37:49 2020 +0100

    bpo-19466: Py_Finalize() clears daemon threads earlier (GH-18848)
    
    Clear the frames of daemon threads earlier during the Python shutdown to
    call objects destructors. So "unclosed file" resource warnings are now
    emitted for daemon threads in a more reliable way.
    
    Cleanup _PyThreadState_DeleteExcept() code: rename "garbage" to
    "list".
    

    @arekm
    Mannequin

    arekm mannequin commented Mar 16, 2020

    This one is also needed on 3.8 to keep it from hanging with test case 1).

    So 3.8 branch + 9ad58ac + 4d96b46 is working for me.

    commit 4d96b46
    Author: Victor Stinner <vstinner@python.org>
    Date: Sat Feb 1 02:30:25 2020 +0100

    bpo-39511: PyThreadState_Clear() calls on_delete (GH-18296)
    
    PyThreadState.on_delete is a callback used to notify Python when a
    thread completes. _thread._set_sentinel() function creates a lock
    which is released when the thread completes. It sets on_delete
    callback to the internal release_sentinel() function. This lock is
    known as Threading._tstate_lock in the threading module.
    
    The release_sentinel() function uses the Python C API. The problem is
    that on_delete is called late in the Python finalization, when the C
    API is no longer fully working.
    
    The PyThreadState_Clear() function now calls the
    PyThreadState.on_delete callback. Previously, that happened in
    PyThreadState_Delete().
    
    The release_sentinel() function is now called when the C API is still
    fully working.
    

    @pablogsal
    Member

    Can you try applying 4f384af and 4d96b46 together?

    @pablogsal
    Member

    Oh, sorry, I missed your last message.

    @pablogsal
    Member

    So 3.8 branch + 9ad58ac + 4d96b46 is working for me.

    I can confirm that these commits together fix the problem.

    Victor, are you OK if we backport both changes to 3.8?

    @vstinner
    Member

    Victor, are you OK if we backport both changes to 3.8?

    Let me look at commit 9ad58ac:
    "bpo-19466: Py_Finalize() clears daemon threads earlier (GH-18848)"

    Calling _PyThreadState_DeleteExcept() in Py_FinalizeEx() is really dangerous. It frees the PyThreadState memory of daemon threads. Daemon threads continue to run while Py_FinalizeEx() is running (which takes an unknown amount of time; we only know that it's longer than 0 seconds). When a daemon thread attempts to acquire the GIL, it will likely crash if its PyThreadState memory has been freed: the memory can be overwritten by another allocation, or dereferencing the pointer can trigger a segmentation fault.

    This change caused multiple regressions in the master branch. I had a hard time fixing all the crashes: I modified take_gil() 4 times, and I'm still not sure that my fix is correct. take_gil() is the function which acquires the GIL: it is really fragile, and I would prefer not to touch it in a stable branch. See the bpo-39877 changes to get an idea of the complexity of the problem.

    Python finalization is really fragile: https://pythondev.readthedocs.io/finalization.html

    @vstinner
    Member

    commit 4d96b46
    "bpo-39511: PyThreadState_Clear() calls on_delete (GH-18296)"

    "(...) The release_sentinel() function is now called when the C API is still
    fully working."

    This change is "safer" than the other one, but there is still a risk of introducing a regression.

    The problem is that this commit is part of a larger effort to make Python finalization more reliable. The commit makes sense in the series of commits pushed to the master branch, but I'm not sure that it makes sense taken alone.

    I don't have the bandwidth right now to investigate this issue, identify the root cause, and suggest which commits should or should not be backported.

    @pablogsal
    Member

    OK, so sadly it seems that we need to close this issue, as it is fixed in master but the backport is risky.

    @vstinner
    Member

    If you want reliable behavior, don't rely on destructors. Python finalization is not deterministic, and destructors can be called while Python is no longer fully functional.

    Release resources explicitly. For example, use multiprocessing.Pool with a context manager, or call the close() and join() methods explicitly.
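
    A minimal sketch of both patterns (the map() call is just a placeholder workload):

    from multiprocessing.pool import ThreadPool

    # Deterministic cleanup via the context manager protocol
    # (__exit__ calls terminate()):
    with ThreadPool() as pool:
        print(pool.map(str, range(3)))

    # Or release the pool explicitly instead of relying on __del__:
    pool = ThreadPool()
    try:
        print(pool.map(str, range(3)))
    finally:
        pool.close()
        pool.join()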

    @arekm
    Mannequin

    arekm mannequin commented Mar 20, 2020

    So only this test (from the first commit) should be added on master to check for further regressions in this area, right?
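
    A sketch of how such a regression test might be driven (an editor's illustration, not the actual test added in GH-19009): run the reproducer in a subprocess with a timeout, so a hang fails the run instead of blocking the suite. The 60-second timeout is an arbitrary choice.

    import subprocess
    import sys
    import textwrap

    CODE = textwrap.dedent("""
        from multiprocessing.pool import ThreadPool

        class A(object):
            def __init__(self):
                self.pool = ThreadPool()
            def __del__(self):
                self.pool.close()
                self.pool.join()

        a = A()
        print(a)
    """)

    # A hang surfaces as subprocess.TimeoutExpired instead of blocking forever.
    result = subprocess.run([sys.executable, "-c", CODE], timeout=60)
    assert result.returncode == 0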

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022