classification
Title: python3.8 regression - ThreadPool join via __del__ hangs forever
Type: Stage: resolved
Components: Library (Lib) Versions: Python 3.9, Python 3.8
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: BTaskaya, arekm, gaborbernat, miss-islington, pablogsal, vstinner
Priority: normal Keywords: patch

Created on 2020-01-16 13:23 by gaborbernat, last changed 2020-03-20 18:11 by arekm. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 19009 merged BTaskaya, 2020-03-15 02:47
PR 19023 merged miss-islington, 2020-03-15 19:46
Messages (30)
msg360119 - (view) Author: gaborbernat (gaborbernat) * Date: 2020-01-16 13:23
Assume the following code:

```python
from multiprocessing.pool import ThreadPool
class A(object):
    def __init__(self):
        self.pool = ThreadPool()
    def __del__(self):
        self.pool.close()
        self.pool.join()
a = A()
print(a)
```

The code snippet above hangs forever on Python 3.8+ (works ok on Python 3.7 and earlier). An example output where I've added some extra prints on to the thread joins:

```
<__main__.A object at 0x1104d6070>
join thread <Thread(Thread-13, started daemon 123145699672064)> None
done
join thread <Thread(Thread-14, stopped daemon 123145716461568)> None
done
join thread <Thread(Thread-15, stopped daemon 123145733251072)> None
done
join thread <Thread(Thread-13, stopped daemon 123145699672064)> None
done
join thread <Thread(Thread-14, stopped daemon 123145716461568)> None
done
join thread <Thread(Thread-15, stopped daemon 123145733251072)> None
done
join thread <DummyProcess(Thread-1, stopped daemon 123145498198016)> None
done
join thread <DummyProcess(Thread-2, stopped daemon 123145514987520)> None
done
join thread <DummyProcess(Thread-3, stopped daemon 123145531777024)> None
done
join thread <DummyProcess(Thread-4, stopped daemon 123145548566528)> None
done
join thread <DummyProcess(Thread-5, started daemon 123145565356032)> None
```

I've tested on MacOs, but could replicate on Linux too within the CI.
msg360128 - (view) Author: gaborbernat (gaborbernat) * Date: 2020-01-16 16:30
An FYI update, the code is used by swagger-codegen to generate HTTP clients (https://github.com/swagger-api/swagger-codegen/blob/master/modules/swagger-codegen/src/main/resources/python/api_client.mustache#L73). The example below basically replicates creating such client as a global.
msg360130 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-01-16 16:48
This code snippet seems to trigger a crash in the _struct module:
https://bugs.python.org/issue38076#msg360127
msg360131 - (view) Author: gaborbernat (gaborbernat) * Date: 2020-01-16 17:08
Reported the issue downstream too under https://github.com/swagger-api/swagger-codegen/issues/9991
msg360133 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-01-16 17:35
This is not a regression: the code only worked on 3.7 because there is a sleep in the pool maintainance loop that makes the pool more likely being more consistent, but is the same problem as there is a race condition. For example, reducing the sleep in the loop (notice 3.8 does not use a sleep but instead uses a more resilient system of sentinels) for the 3.7 version reproduces the problem:

diff --git a/Lib/multiprocessing/pool.py b/Lib/multiprocessing/pool.py
index 3e9a0d6b48..f8d438d87c 100644
--- a/Lib/multiprocessing/pool.py
+++ b/Lib/multiprocessing/pool.py
@@ -410,7 +410,7 @@ class Pool(object):
         # is terminated.
         while thread._state == RUN or (pool._cache and thread._state != TERMINATE):
             pool._maintain_pool()
-            time.sleep(0.1)
+            time.sleep(0.00000001)
         # send sentinel to stop workers
         pool._taskqueue.put(None)
         util.debug('worker handler exiting')

With that patch, 3.7 hangs as well. The problem here is that something regarding the locks inside the SimpleQueue of inbound tasks is not in a consistent state when Python finalization calls __del__.
msg364215 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-03-15 03:28
Check out also https://bugs.python.org/issue38744
msg364225 - (view) Author: Arkadiusz Miśkiewicz (arekm) Date: 2020-03-15 09:40
I've applied pull request patch (which looks a bit weird to me, removes something then adds the same thing back etc https://patch-diff.githubusercontent.com/raw/python/cpython/pull/19009.patch). Essentially the change is:

diff --git a/Lib/multiprocessing/pool.py b/Lib/multiprocessing/pool.py
index b223d6aa72..8bd9608b0e 100644
--- a/Lib/multiprocessing/pool.py
+++ b/Lib/multiprocessing/pool.py
@@ -682,7 +682,15 @@ class Pool(object):
         # this is guaranteed to only be called once
         util.debug('finalizing pool')

-        worker_handler._state = TERMINATE
+        # Explicitly do the cleanup here if it didn't come from terminate()
+        # (required for if the queue will block, if it is already closed)
+        if worker_handler._state != TERMINATE:
+            # Notify that the worker_handler state has been changed so the
+            # _handle_workers loop can be unblocked (and exited) in order to
+            # send the finalization sentinel all the workers.
+            worker_handler._state = TERMINATE
+            change_notifier.put(None)
+
         task_handler._state = TERMINATE

         util.debug('helping task handler/workers to finish')


Unfortunately test case from first message in this bug report still hangs for me with this applied.
msg364226 - (view) Author: Arkadiusz Miśkiewicz (arekm) Date: 2020-03-15 10:07
(note, testing on 3.8 branch + pull request patch)
msg364251 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-03-15 19:44
> which looks a bit weird to me

Is explained in the message in https://bugs.python.org/msg364211:

What happens is that is moving that code so is executed in both code paths: explicit termination and multiprocessing finalization.

> Unfortunately test case from first message in this bug report still hangs for me with this applied.

Hummm, the test case from this bug is precisely the test case in PR 19009 so it should not hang.
msg364252 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-03-15 19:46
New changeset ac10e0c93218627d1a639db0b7b41714c5f6a883 by Batuhan Taşkaya in branch 'master':
bpo-39360: Ensure all workers exit when finalizing a multiprocessing Pool (GH-19009)
https://github.com/python/cpython/commit/ac10e0c93218627d1a639db0b7b41714c5f6a883
msg364253 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-03-15 19:47
> Hummm, the test case from this bug is precisely the test case in PR 19009 so it should not hang.

Oh, actually I am wrong as this is a different issue I had in mind. PR 19009 should fix the case reported in https://bugs.python.org/issue38744 and this equivalent:

          from multiprocessing import Pool
            class A:
                def __init__(self):
                    self.pool = Pool(processes=1)
            def test():
                problem = A()
                problem.pool.map(float, tuple(range(10)))
            if __name__ == "__main__":
                test()

Notice that in this code there is no __del__ call.
msg364254 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-03-15 19:52
I tried and I cannot reproduce the hang with the code provided in the first message here after PR 19009. (Tested on Linux, Windows and MacOS)
msg364259 - (view) Author: Arkadiusz Miśkiewicz (arekm) Date: 2020-03-15 20:48
Ok, so two test cases

1)

from multiprocessing.pool import ThreadPool
class A(object):
    def __init__(self):
        self.pool = ThreadPool()
    def __del__(self):
        self.pool.close()
        self.pool.join()
a = A()
print(a)

2)

from multiprocessing import Pool
class A:
    def __init__(self):
        self.pool = Pool(processes=1)
def test():
    problem = A()
    problem.pool.map(float, tuple(range(10)))
if __name__ == "__main__":
    test()

On two my test environments with Linux, 5.5.2 and 4.9.208 kernel, glibc 2.31 (userspace is mostly the same on both) test 1) hangs, 2) doesn't hang.

Testing done on 3.8 branch + https://patch-diff.githubusercontent.com/raw/python/cpython/pull/19023.patch


For test 1) no traceback on ctrl+c and also no trace logged with faulthandler installed, so no idea where it hangt.

strace shows
[...]
[pid 232031] munmap(0x7f1baaffe000, 8392704 <unfinished ...>
[pid 232017] write(5, "\0\0\0\4\200\4N.", 8 <unfinished ...>
[pid 232031] <... munmap resumed>)      = 0
[pid 232017] <... write resumed>)       = 8
[pid 232031] exit(0)                    = ?
[pid 232017] futex(0x7f1b94000b60, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 232031] +++ exited with 0 +++
^C ************************* ctrl + c pressed
[pid 232017] <... futex resumed>)       = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
strace: Process 232017 detached
strace: Process 232032 detached
strace: Process 232033 detached
strace: Process 232034 detached
strace: Process 232035 detached
msg364261 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-03-15 21:26
Sadly, I am unable to reproduce, so without more clarification or a reliable reproducer, we cannot start debugging the issue.

> For test 1) no traceback on ctrl+c and also no trace logged with faulthandler installed, so no idea where it hangt.

Can you attach with gdb and tell the C trace and the python trace (check the gdb helpers on the CPython repo for the later).
msg364262 - (view) Author: miss-islington (miss-islington) Date: 2020-03-15 21:26
New changeset 7f5302fed4ff0cc8042e639b29a0664a16bc2702 by Miss Islington (bot) in branch '3.8':
bpo-39360: Ensure all workers exit when finalizing a multiprocessing Pool (GH-19009)
https://github.com/python/cpython/commit/7f5302fed4ff0cc8042e639b29a0664a16bc2702
msg364263 - (view) Author: Arkadiusz Miśkiewicz (arekm) Date: 2020-03-15 21:44
http://ixion.pld-linux.org/~arekm/python-gdb1.txt

it's a result of thread apply all bt, thread apply all bt full, where

pystack command shows nothing.

ps. I'm arekm@freenode and can give ssh account for playing with the issue
msg364265 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-03-15 23:06
> ps. I'm arekm@freenode and can give ssh account for playing with the issue

Can you send me an email to pablogsal@gmail.com to set up ssh access?
msg364268 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-03-15 23:16
I still cannot reproduce it in your system:

[test@ixion-pld cpython]$ hostname
ixion-pld
[test@ixion-pld cpython]$ uname -a
Linux ixion-pld 5.5.2-1 #1 SMP Wed Feb 5 19:26:43 CET 2020 x86_64 Common_KVM_processor PLD Linux

[test@ixion-pld cpython]$ cat lel.py
from multiprocessing.pool import ThreadPool
class A(object):
    def __init__(self):
        self.pool = ThreadPool()
    def __del__(self):
        self.pool.close()
        self.pool.join()
a = A()
print(a)
[test@ixion-pld cpython]$ time ./python lel.py
<__main__.A object at 0x7f9130becd20>

real    0m0.124s
user    0m0.105s
sys     0m0.021s
msg364269 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-03-15 23:17
I tested it on the master branch with commit 5b66ec166b81c8a77286da2c0d17be3579c3069a
msg364270 - (view) Author: Arkadiusz Miśkiewicz (arekm) Date: 2020-03-15 23:29
master works fine
3.8 branch hangs

so there is some other/related issue.
msg364300 - (view) Author: Arkadiusz Miśkiewicz (arekm) Date: 2020-03-16 09:40
On master test 1) hangs before commit below and works after commit below. Unfortunately applying that commit to 3.8 branch doesn't help - 3.8 still hangs. Some other fix is also needed I guess

commit 9ad58acbe8b90b4d0f2d2e139e38bb5aa32b7fb6
Author: Victor Stinner <vstinner@python.org>
Date:   Mon Mar 9 23:37:49 2020 +0100

    bpo-19466: Py_Finalize() clears daemon threads earlier (GH-18848)

    Clear the frames of daemon threads earlier during the Python shutdown to
    call objects destructors. So "unclosed file" resource warnings are now
    emitted for daemon threads in a more reliable way.

    Cleanup _PyThreadState_DeleteExcept() code: rename "garbage" to
    "list".
msg364307 - (view) Author: Arkadiusz Miśkiewicz (arekm) Date: 2020-03-16 10:43
This one is also needed on 3.8 to get it not hang with 1) test case.

So 3.8 branch + 9ad58acbe8b90b4d0f2d2e139e38bb5aa32b7fb6 + 4d96b4635aeff1b8ad41d41422ce808ce0b971c8 is working for me.

commit 4d96b4635aeff1b8ad41d41422ce808ce0b971c8
Author: Victor Stinner <vstinner@python.org>
Date:   Sat Feb 1 02:30:25 2020 +0100

    bpo-39511: PyThreadState_Clear() calls on_delete (GH-18296)

    PyThreadState.on_delete is a callback used to notify Python when a
    thread completes. _thread._set_sentinel() function creates a lock
    which is released when the thread completes. It sets on_delete
    callback to the internal release_sentinel() function. This lock is
    known as Threading._tstate_lock in the threading module.

    The release_sentinel() function uses the Python C API. The problem is
    that on_delete is called late in the Python finalization, when the C
    API is no longer fully working.

    The PyThreadState_Clear() function now calls the
    PyThreadState.on_delete callback. Previously, that happened in
    PyThreadState_Delete().

    The release_sentinel() function is now called when the C API is still
    fully working.
msg364308 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-03-16 11:04
Can you try applying 4f384af067d05b16a554bfd976934fca9f87a1cf and 4d96b4635aeff1b8ad41d41422ce808ce0b971c8 together?
msg364309 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-03-16 11:04
Oh, sorry, I missed your last message.
msg364310 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-03-16 11:07
> So 3.8 branch + 9ad58acbe8b90b4d0f2d2e139e38bb5aa32b7fb6 + 4d96b4635aeff1b8ad41d41422ce808ce0b971c8 is working for me.

I can confirm that these commits together fix the problem.

Victor, are you OK if we backport both changes to 3.8?
msg364692 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-03-20 17:42
> Victor, are you OK if we backport both changes to 3.8?

Let me look at commit 9ad58acbe8b90b4d0f2d2e139e38bb5aa32b7fb6:
"bpo-19466: Py_Finalize() clears daemon threads earlier (GH-18848)"

Calling _PyThreadState_DeleteExcept() in Py_FinalizeEx() is really dangerous. It frees PyThreadState memory of daemon threads. Daemon threads continue to run while Py_FinalizeEx() is running (which takes an unknown amount of time, we only know that it's larger than 0 seconds). When a daemon thread attempts to acquire the GIL, it will likely crash if its PyThreadState memory is freed. This memory can be overriden by another memory allocation, or dereferencing the pointer can trigger a segmentation fault.

This change caused multiple regressions in the master branch. I had hard time to fix all crashes: I modified take_gil() 4 times, and I'm still not sure that my fix is correct. I had to modify take_gil() function which acquire the GIL: this function is really fragile and I would prefer to not touch it in a stable branch. See bpo-39877 changes to have an idea of the complexity of the problem.

Python finalization is really fragile: https://pythondev.readthedocs.io/finalization.html
msg364695 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-03-20 17:47
commit 4d96b4635aeff1b8ad41d41422ce808ce0b971c8
"bpo-39511: PyThreadState_Clear() calls on_delete (GH-18296)"

"(...) The release_sentinel() function is now called when the C API is still
fully working."

This change is "safer" than the other one, but there is still a risk of introducing a regression.

The problem is that this commit is part of a larger effort to make the Python finalization more reliable. The commit makes sense in the serie of commits pushed into the master branch, but I'm not sure that it makes sense if it's taken alone.

I don't have the bandwidth right now to investigate this issue, to identify the root issue, and suggest which commits should be backported or not.
msg364696 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-03-20 17:49
Ok, so sadly it seems that we need to close this issue as is fixed in master but the backport is risky.
msg364698 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-03-20 17:58
If you want to get a reliable behavior, don't rely on destructors. Python finalization is not determistic and destructors can be called while Python is not longer fully functional.

Release ressources explicitly. For example, use multiprocessing.Pool with a context manager, or even call close() and join() methods explicitly.
msg364700 - (view) Author: Arkadiusz Miśkiewicz (arekm) Date: 2020-03-20 18:11
So only this test (from first commit) should be added on master to check for further regressions in the area, right?
History
Date User Action Args
2020-03-20 18:11:24arekmsetmessages: + msg364700
2020-03-20 17:58:16vstinnersetmessages: + msg364698
2020-03-20 17:49:26pablogsalsetresolution: fixed
2020-03-20 17:49:21pablogsalsetstatus: open -> closed

messages: + msg364696
stage: patch review -> resolved
2020-03-20 17:47:10vstinnersetmessages: + msg364695
2020-03-20 17:42:46vstinnersetmessages: + msg364692
2020-03-16 11:07:24pablogsalsetmessages: + msg364310
2020-03-16 11:04:57pablogsalsetmessages: + msg364309
2020-03-16 11:04:15pablogsalsetmessages: + msg364308
2020-03-16 10:43:15arekmsetmessages: + msg364307
2020-03-16 09:40:25arekmsetmessages: + msg364300
2020-03-15 23:29:19arekmsetmessages: + msg364270
2020-03-15 23:17:57pablogsalsetmessages: + msg364269
2020-03-15 23:16:38pablogsalsetmessages: + msg364268
2020-03-15 23:06:06pablogsalsetmessages: + msg364265
2020-03-15 21:44:40arekmsetmessages: + msg364263
2020-03-15 21:26:55miss-islingtonsetnosy: + miss-islington
messages: + msg364262
2020-03-15 21:26:28pablogsalsetmessages: + msg364261
2020-03-15 20:48:55arekmsetmessages: + msg364259
2020-03-15 19:52:05pablogsalsetmessages: + msg364254
2020-03-15 19:47:55pablogsalsetnosy: - miss-islington
messages: + msg364253
2020-03-15 19:46:08miss-islingtonsetnosy: + miss-islington
pull_requests: + pull_request18372
2020-03-15 19:46:03pablogsalsetmessages: + msg364252
2020-03-15 19:44:47pablogsalsetmessages: + msg364251
2020-03-15 10:07:17arekmsetmessages: + msg364226
2020-03-15 09:40:36arekmsetmessages: + msg364225
2020-03-15 03:28:36pablogsalsetmessages: + msg364215
2020-03-15 02:47:50BTaskayasetkeywords: + patch
nosy: + BTaskaya

pull_requests: + pull_request18354
stage: patch review
2020-03-14 16:03:23arekmsetnosy: + arekm
2020-01-16 17:35:03pablogsalsetmessages: + msg360133
2020-01-16 17:08:15gaborbernatsetmessages: + msg360131
2020-01-16 16:48:05vstinnersetnosy: + vstinner
messages: + msg360130
2020-01-16 16:30:13gaborbernatsetmessages: + msg360128
2020-01-16 13:42:41vstinnersetnosy: - vstinner
2020-01-16 13:24:17gaborbernatsetnosy: + pablogsal
2020-01-16 13:23:44gaborbernatcreate