Issue 39360
Created on 2020-01-16 13:23 by gaborjbernat, last changed 2022-04-11 14:59 by admin. This issue is now closed.
Pull Requests

URL | Status | Linked | Edit
---|---|---|---
PR 19009 | merged | BTaskaya, 2020-03-15 02:47 |
PR 19023 | merged | miss-islington, 2020-03-15 19:46 |
Messages (30)
msg360119 - Author: gaborjbernat (gaborjbernat) - Date: 2020-01-16 13:23

Assume the following code:

```python
from multiprocessing.pool import ThreadPool

class A(object):
    def __init__(self):
        self.pool = ThreadPool()

    def __del__(self):
        self.pool.close()
        self.pool.join()

a = A()
print(a)
```

The code snippet above hangs forever on Python 3.8+ (it works OK on Python 3.7 and earlier). An example output where I've added some extra prints around the thread joins:

```
<__main__.A object at 0x1104d6070>
join thread <Thread(Thread-13, started daemon 123145699672064)> None
done
join thread <Thread(Thread-14, stopped daemon 123145716461568)> None
done
join thread <Thread(Thread-15, stopped daemon 123145733251072)> None
done
join thread <Thread(Thread-13, stopped daemon 123145699672064)> None
done
join thread <Thread(Thread-14, stopped daemon 123145716461568)> None
done
join thread <Thread(Thread-15, stopped daemon 123145733251072)> None
done
join thread <DummyProcess(Thread-1, stopped daemon 123145498198016)> None
done
join thread <DummyProcess(Thread-2, stopped daemon 123145514987520)> None
done
join thread <DummyProcess(Thread-3, stopped daemon 123145531777024)> None
done
join thread <DummyProcess(Thread-4, stopped daemon 123145548566528)> None
done
join thread <DummyProcess(Thread-5, started daemon 123145565356032)> None
```

I've tested on macOS, but could replicate it on Linux too within the CI.

---
msg360128 - Author: gaborjbernat (gaborjbernat) - Date: 2020-01-16 16:30

An FYI update: the code is used by swagger-codegen to generate HTTP clients (https://github.com/swagger-api/swagger-codegen/blob/master/modules/swagger-codegen/src/main/resources/python/api_client.mustache#L73). The example below basically replicates creating such a client as a global.
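For illustration, here is a minimal, hedged sketch of that pattern (not the actual generated code; `ApiClient` and the attribute names are made up): a ThreadPool created in `__init__`, torn down in `__del__`, and the instance stored as a module-level global.

```python
from multiprocessing.pool import ThreadPool

class ApiClient:                      # hypothetical name, for illustration only
    def __init__(self):
        # Thread pool the generated client uses for asynchronous requests.
        self.pool = ThreadPool()

    def __del__(self):
        # Cleanup relies on the destructor running while the interpreter is
        # still fully functional, which is exactly what breaks at shutdown.
        self.pool.close()
        self.pool.join()

client = ApiClient()                  # module-level global, as in the report
```

---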
msg360130 - Author: STINNER Victor (vstinner) - Date: 2020-01-16 16:48

This code snippet seems to trigger a crash in the _struct module: https://bugs.python.org/issue38076#msg360127

---
msg360131 - Author: gaborjbernat (gaborjbernat) - Date: 2020-01-16 17:08

Reported the issue downstream too, under https://github.com/swagger-api/swagger-codegen/issues/9991

---
msg360133 - Author: Pablo Galindo Salgado (pablogsal) - Date: 2020-01-16 17:35

This is not a regression: the code only worked on 3.7 because there is a sleep in the pool maintenance loop that makes the pool more likely to be in a consistent state, but it is the same problem, as there is a race condition. For example, reducing the sleep in the loop (notice that 3.8 does not use a sleep but instead uses a more resilient system of sentinels) in the 3.7 version reproduces the problem:

```
diff --git a/Lib/multiprocessing/pool.py b/Lib/multiprocessing/pool.py
index 3e9a0d6b48..f8d438d87c 100644
--- a/Lib/multiprocessing/pool.py
+++ b/Lib/multiprocessing/pool.py
@@ -410,7 +410,7 @@ class Pool(object):
         # is terminated.
         while thread._state == RUN or (pool._cache and thread._state != TERMINATE):
             pool._maintain_pool()
-            time.sleep(0.1)
+            time.sleep(0.00000001)
         # send sentinel to stop workers
         pool._taskqueue.put(None)
         util.debug('worker handler exiting')
```

With that patch, 3.7 hangs as well. The problem here is that something regarding the locks inside the SimpleQueue of inbound tasks is not in a consistent state when Python finalization calls __del__.
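To make the polling-versus-sentinel distinction concrete, here is a small, self-contained sketch (not CPython source; `state`, `maintain` and `change_notifier` are illustrative names) of the two handler-loop shapes being compared; only the sentinel-style one is driven at the bottom:

```python
import queue
import threading
import time

RUN, TERMINATE = "RUN", "TERMINATE"

def handler_polling(state, maintain):
    # 3.7-style loop: poll and sleep; shrinking the sleep widens the race window.
    while state["value"] == RUN:
        maintain()
        time.sleep(0.1)

def handler_sentinel(state, maintain, change_notifier):
    # 3.8-style loop: block until something signals a change, then re-check.
    while state["value"] == RUN:
        maintain()
        change_notifier.get()

# Shutting down the sentinel-based handler needs two steps: flip the state
# *and* put a sentinel; adding that second step to the finalization path is
# what the fix in PR 19009 does.
state = {"value": RUN}
notifier = queue.Queue()
t = threading.Thread(target=handler_sentinel, args=(state, lambda: None, notifier))
t.start()
state["value"] = TERMINATE
notifier.put(None)
t.join()
```

---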
msg364215 - Author: Pablo Galindo Salgado (pablogsal) - Date: 2020-03-15 03:28

Check out also https://bugs.python.org/issue38744

---
msg364225 - Author: Arkadiusz Miśkiewicz (arekm) - Date: 2020-03-15 09:40

I've applied the pull request patch (which looks a bit weird to me; it removes something and then adds the same thing back, etc.: https://patch-diff.githubusercontent.com/raw/python/cpython/pull/19009.patch). Essentially the change is:

```
diff --git a/Lib/multiprocessing/pool.py b/Lib/multiprocessing/pool.py
index b223d6aa72..8bd9608b0e 100644
--- a/Lib/multiprocessing/pool.py
+++ b/Lib/multiprocessing/pool.py
@@ -682,7 +682,15 @@ class Pool(object):
         # this is guaranteed to only be called once
         util.debug('finalizing pool')
 
-        worker_handler._state = TERMINATE
+        # Explicitly do the cleanup here if it didn't come from terminate()
+        # (required for if the queue will block, if it is already closed)
+        if worker_handler._state != TERMINATE:
+            # Notify that the worker_handler state has been changed so the
+            # _handle_workers loop can be unblocked (and exited) in order to
+            # send the finalization sentinel all the workers.
+            worker_handler._state = TERMINATE
+            change_notifier.put(None)
+
         task_handler._state = TERMINATE
 
         util.debug('helping task handler/workers to finish')
```

Unfortunately the test case from the first message in this bug report still hangs for me with this applied.

---
msg364226 - Author: Arkadiusz Miśkiewicz (arekm) - Date: 2020-03-15 10:07

(Note: testing on the 3.8 branch + the pull request patch.)

---
msg364251 - Author: Pablo Galindo Salgado (pablogsal) - Date: 2020-03-15 19:44

> which looks a bit weird to me

It is explained in https://bugs.python.org/msg364211: the change moves that code so it is executed in both code paths, explicit termination and multiprocessing finalization.

> Unfortunately the test case from the first message in this bug report still hangs for me with this applied.

Hummm, the test case from this bug is precisely the test case in PR 19009, so it should not hang.
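In other words, both the explicit termination path and the finalization path should funnel through the same wake-up step. A minimal sketch of that idea (illustrative only, not the pool implementation; `HandlerState` and `request_terminate` are made-up names):

```python
import queue

RUN, TERMINATE = "RUN", "TERMINATE"

class HandlerState:
    """Tiny stand-in for the worker-handler bookkeeping."""

    def __init__(self):
        self.state = RUN
        self.change_notifier = queue.Queue()

    def request_terminate(self):
        # Called from *both* explicit terminate() and the finalizer, so a
        # handler loop blocked on change_notifier.get() is always woken up.
        if self.state != TERMINATE:
            self.state = TERMINATE
            self.change_notifier.put(None)

hs = HandlerState()
hs.request_terminate()   # idempotent: safe to reach from either code path
hs.request_terminate()
assert hs.change_notifier.qsize() == 1
```

---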
msg364252 - Author: Pablo Galindo Salgado (pablogsal) - Date: 2020-03-15 19:46

New changeset ac10e0c93218627d1a639db0b7b41714c5f6a883 by Batuhan Taşkaya in branch 'master':
bpo-39360: Ensure all workers exit when finalizing a multiprocessing Pool (GH-19009)
https://github.com/python/cpython/commit/ac10e0c93218627d1a639db0b7b41714c5f6a883

---
msg364253 - Author: Pablo Galindo Salgado (pablogsal) - Date: 2020-03-15 19:47

> Hummm, the test case from this bug is precisely the test case in PR 19009 so it should not hang.

Oh, actually I am wrong, as this is a different issue I had in mind. PR 19009 should fix the case reported in https://bugs.python.org/issue38744 and this equivalent:

```python
from multiprocessing import Pool

class A:
    def __init__(self):
        self.pool = Pool(processes=1)

def test():
    problem = A()
    problem.pool.map(float, tuple(range(10)))

if __name__ == "__main__":
    test()
```

Notice that in this code there is no __del__ call.

---
msg364254 - Author: Pablo Galindo Salgado (pablogsal) - Date: 2020-03-15 19:52

I tried and I cannot reproduce the hang with the code provided in the first message here after PR 19009. (Tested on Linux, Windows and macOS.)

---
msg364259 - Author: Arkadiusz Miśkiewicz (arekm) - Date: 2020-03-15 20:48

Ok, so two test cases:

1)

```python
from multiprocessing.pool import ThreadPool

class A(object):
    def __init__(self):
        self.pool = ThreadPool()

    def __del__(self):
        self.pool.close()
        self.pool.join()

a = A()
print(a)
```

2)

```python
from multiprocessing import Pool

class A:
    def __init__(self):
        self.pool = Pool(processes=1)

def test():
    problem = A()
    problem.pool.map(float, tuple(range(10)))

if __name__ == "__main__":
    test()
```

On my two test environments with Linux (5.5.2 and 4.9.208 kernels, glibc 2.31; userspace is mostly the same on both), test 1) hangs and test 2) doesn't hang.

Testing done on the 3.8 branch + https://patch-diff.githubusercontent.com/raw/python/cpython/pull/19023.patch

For test 1) there is no traceback on Ctrl+C and also no trace logged with faulthandler installed, so no idea where it hangs. strace shows:

```
[...]
[pid 232031] munmap(0x7f1baaffe000, 8392704 <unfinished ...>
[pid 232017] write(5, "\0\0\0\4\200\4N.", 8 <unfinished ...>
[pid 232031] <... munmap resumed>) = 0
[pid 232017] <... write resumed>) = 8
[pid 232031] exit(0) = ?
[pid 232017] futex(0x7f1b94000b60, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 232031] +++ exited with 0 +++
^C              ************************* ctrl + c pressed
[pid 232017] <... futex resumed>) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
strace: Process 232017 detached
strace: Process 232032 detached
strace: Process 232033 detached
strace: Process 232034 detached
strace: Process 232035 detached
```

---
msg364261 - Author: Pablo Galindo Salgado (pablogsal) - Date: 2020-03-15 21:26

Sadly, I am unable to reproduce, so without more clarification or a reliable reproducer, we cannot start debugging the issue.

> For test 1) no traceback on ctrl+c and also no trace logged with faulthandler installed, so no idea where it hangs.

Can you attach with gdb and get the C trace and the Python trace (check the gdb helpers in the CPython repo for the latter)?

---
msg364262 - Author: miss-islington (miss-islington) - Date: 2020-03-15 21:26

New changeset 7f5302fed4ff0cc8042e639b29a0664a16bc2702 by Miss Islington (bot) in branch '3.8':
bpo-39360: Ensure all workers exit when finalizing a multiprocessing Pool (GH-19009)
https://github.com/python/cpython/commit/7f5302fed4ff0cc8042e639b29a0664a16bc2702

---
msg364263 - Author: Arkadiusz Miśkiewicz (arekm) - Date: 2020-03-15 21:44

http://ixion.pld-linux.org/~arekm/python-gdb1.txt

It's the result of "thread apply all bt" and "thread apply all bt full"; the pystack command shows nothing.

ps. I'm arekm@freenode and can give an ssh account for playing with the issue.

---
msg364265 - Author: Pablo Galindo Salgado (pablogsal) - Date: 2020-03-15 23:06

> ps. I'm arekm@freenode and can give an ssh account for playing with the issue

Can you send me an email to pablogsal@gmail.com to set up ssh access?

---
msg364268 - Author: Pablo Galindo Salgado (pablogsal) - Date: 2020-03-15 23:16

I still cannot reproduce it on your system:

```
[test@ixion-pld cpython]$ hostname
ixion-pld
[test@ixion-pld cpython]$ uname -a
Linux ixion-pld 5.5.2-1 #1 SMP Wed Feb 5 19:26:43 CET 2020 x86_64 Common_KVM_processor PLD Linux
[test@ixion-pld cpython]$ cat lel.py
from multiprocessing.pool import ThreadPool

class A(object):
    def __init__(self):
        self.pool = ThreadPool()

    def __del__(self):
        self.pool.close()
        self.pool.join()

a = A()
print(a)
[test@ixion-pld cpython]$ time ./python lel.py
<__main__.A object at 0x7f9130becd20>

real    0m0.124s
user    0m0.105s
sys     0m0.021s
```

---
msg364269 - Author: Pablo Galindo Salgado (pablogsal) - Date: 2020-03-15 23:17

I tested it on the master branch with commit 5b66ec166b81c8a77286da2c0d17be3579c3069a.

---
msg364270 - Author: Arkadiusz Miśkiewicz (arekm) - Date: 2020-03-15 23:29

master works fine; the 3.8 branch hangs, so there is some other/related issue.

---
msg364300 - Author: Arkadiusz Miśkiewicz (arekm) - Date: 2020-03-16 09:40

On master, test 1) hangs before the commit below and works after it. Unfortunately, applying that commit to the 3.8 branch doesn't help: 3.8 still hangs. Some other fix is also needed, I guess.

```
commit 9ad58acbe8b90b4d0f2d2e139e38bb5aa32b7fb6
Author: Victor Stinner <vstinner@python.org>
Date:   Mon Mar 9 23:37:49 2020 +0100

    bpo-19466: Py_Finalize() clears daemon threads earlier (GH-18848)

    Clear the frames of daemon threads earlier during the Python shutdown to
    call objects destructors. So "unclosed file" resource warnings are now
    emitted for daemon threads in a more reliable way.

    Cleanup _PyThreadState_DeleteExcept() code: rename "garbage" to "list".
```

---
msg364307 - Author: Arkadiusz Miśkiewicz (arekm) - Date: 2020-03-16 10:43

The commit below is also needed on 3.8 to get it not to hang with the 1) test case. So the 3.8 branch + 9ad58acbe8b90b4d0f2d2e139e38bb5aa32b7fb6 + 4d96b4635aeff1b8ad41d41422ce808ce0b971c8 is working for me.

```
commit 4d96b4635aeff1b8ad41d41422ce808ce0b971c8
Author: Victor Stinner <vstinner@python.org>
Date:   Sat Feb 1 02:30:25 2020 +0100

    bpo-39511: PyThreadState_Clear() calls on_delete (GH-18296)

    PyThreadState.on_delete is a callback used to notify Python when a
    thread completes. _thread._set_sentinel() function creates a lock which
    is released when the thread completes. It sets on_delete callback to
    the internal release_sentinel() function. This lock is known as
    Threading._tstate_lock in the threading module.

    The release_sentinel() function uses the Python C API. The problem is
    that on_delete is called late in the Python finalization, when the C
    API is no longer fully working.

    The PyThreadState_Clear() function now calls the
    PyThreadState.on_delete callback. Previously, that happened in
    PyThreadState_Delete().

    The release_sentinel() function is now called when the C API is still
    fully working.
```

---
msg364308 - Author: Pablo Galindo Salgado (pablogsal) - Date: 2020-03-16 11:04

Can you try applying 4f384af067d05b16a554bfd976934fca9f87a1cf and 4d96b4635aeff1b8ad41d41422ce808ce0b971c8 together?

---
msg364309 - Author: Pablo Galindo Salgado (pablogsal) - Date: 2020-03-16 11:04

Oh, sorry, I missed your last message.

---
msg364310 - Author: Pablo Galindo Salgado (pablogsal) - Date: 2020-03-16 11:07

> So 3.8 branch + 9ad58acbe8b90b4d0f2d2e139e38bb5aa32b7fb6 + 4d96b4635aeff1b8ad41d41422ce808ce0b971c8 is working for me.

I can confirm that these commits together fix the problem. Victor, are you OK if we backport both changes to 3.8?

---
msg364692 - Author: STINNER Victor (vstinner) - Date: 2020-03-20 17:42

> Victor, are you OK if we backport both changes to 3.8?

Let me look at commit 9ad58acbe8b90b4d0f2d2e139e38bb5aa32b7fb6: "bpo-19466: Py_Finalize() clears daemon threads earlier (GH-18848)".

Calling _PyThreadState_DeleteExcept() in Py_FinalizeEx() is really dangerous. It frees the PyThreadState memory of daemon threads, and daemon threads continue to run while Py_FinalizeEx() is running (which takes an unknown amount of time; we only know that it's larger than 0 seconds). When a daemon thread attempts to acquire the GIL, it will likely crash if its PyThreadState memory has been freed: the memory can be overridden by another memory allocation, or dereferencing the pointer can trigger a segmentation fault.

This change caused multiple regressions in the master branch. I had a hard time fixing all the crashes: I modified take_gil() 4 times, and I'm still not sure that my fix is correct. I had to modify the take_gil() function, which acquires the GIL: this function is really fragile and I would prefer not to touch it in a stable branch. See the bpo-39877 changes to get an idea of the complexity of the problem.

Python finalization is really fragile: https://pythondev.readthedocs.io/finalization.html
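For readers unfamiliar with the hazard, here is a small, hedged illustration in plain Python (not the C-level code being discussed): a daemon thread is still executing when the main thread returns, so interpreter finalization runs while that thread keeps waking up and asking for the GIL.

```python
import threading
import time

def chatter():
    # Never exits on its own; after every sleep it must re-acquire the GIL
    # to keep running Python bytecode.
    while True:
        time.sleep(0.01)

t = threading.Thread(target=chatter, daemon=True)
t.start()
# The main thread falls off the end here, so Py_FinalizeEx() starts while
# `chatter` is still alive; how the interpreter handles that thread's state
# at this point is exactly what the commits above change.
```

---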
msg364695 - Author: STINNER Victor (vstinner) - Date: 2020-03-20 17:47

Commit 4d96b4635aeff1b8ad41d41422ce808ce0b971c8, "bpo-39511: PyThreadState_Clear() calls on_delete (GH-18296)": "(...) The release_sentinel() function is now called when the C API is still fully working."

This change is "safer" than the other one, but there is still a risk of introducing a regression. The problem is that this commit is part of a larger effort to make the Python finalization more reliable. The commit makes sense in the series of commits pushed into the master branch, but I'm not sure that it makes sense if it's taken alone.

I don't have the bandwidth right now to investigate this issue, identify the root issue, and suggest which commits should be backported or not.

---
msg364696 - Author: Pablo Galindo Salgado (pablogsal) - Date: 2020-03-20 17:49

Ok, so sadly it seems that we need to close this issue, as it is fixed in master but the backport is risky.

---
msg364698 - Author: STINNER Victor (vstinner) - Date: 2020-03-20 17:58

If you want to get reliable behavior, don't rely on destructors: Python finalization is not deterministic, and destructors can be called while Python is no longer fully functional. Release resources explicitly. For example, use multiprocessing.Pool with a context manager, or even call the close() and join() methods explicitly.
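A minimal sketch of both explicit-cleanup styles suggested above, using the standard multiprocessing API (the `square` work function is made up for illustration):

```python
from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == "__main__":
    # Option 1: context manager; the pool is terminated when the block exits.
    with Pool(processes=2) as pool:
        print(pool.map(square, range(10)))

    # Option 2: explicit close()/join(), done before interpreter shutdown.
    pool = Pool(processes=2)
    try:
        print(pool.map(square, range(10)))
    finally:
        pool.close()   # no more tasks will be submitted
        pool.join()    # wait for the workers to exit
```

Either way, the pool is shut down while the interpreter is still fully functional, instead of from __del__ during finalization.

---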
msg364700 - Author: Arkadiusz Miśkiewicz (arekm) - Date: 2020-03-20 18:11

So only this test (from the first commit) should be added on master to check for further regressions in the area, right?
History

Date | User | Action | Args
---|---|---|---
2022-04-11 14:59:25 | admin | set | github: 83541
2020-03-20 18:11:24 | arekm | set | messages: + msg364700
2020-03-20 17:58:16 | vstinner | set | messages: + msg364698
2020-03-20 17:49:26 | pablogsal | set | resolution: fixed
2020-03-20 17:49:21 | pablogsal | set | status: open -> closed; messages: + msg364696; stage: patch review -> resolved
2020-03-20 17:47:10 | vstinner | set | messages: + msg364695
2020-03-20 17:42:46 | vstinner | set | messages: + msg364692
2020-03-16 11:07:24 | pablogsal | set | messages: + msg364310
2020-03-16 11:04:57 | pablogsal | set | messages: + msg364309
2020-03-16 11:04:15 | pablogsal | set | messages: + msg364308
2020-03-16 10:43:15 | arekm | set | messages: + msg364307
2020-03-16 09:40:25 | arekm | set | messages: + msg364300
2020-03-15 23:29:19 | arekm | set | messages: + msg364270
2020-03-15 23:17:57 | pablogsal | set | messages: + msg364269
2020-03-15 23:16:38 | pablogsal | set | messages: + msg364268
2020-03-15 23:06:06 | pablogsal | set | messages: + msg364265
2020-03-15 21:44:40 | arekm | set | messages: + msg364263
2020-03-15 21:26:55 | miss-islington | set | nosy: + miss-islington; messages: + msg364262
2020-03-15 21:26:28 | pablogsal | set | messages: + msg364261
2020-03-15 20:48:55 | arekm | set | messages: + msg364259
2020-03-15 19:52:05 | pablogsal | set | messages: + msg364254
2020-03-15 19:47:55 | pablogsal | set | nosy: - miss-islington; messages: + msg364253
2020-03-15 19:46:08 | miss-islington | set | nosy: + miss-islington; pull_requests: + pull_request18372
2020-03-15 19:46:03 | pablogsal | set | messages: + msg364252
2020-03-15 19:44:47 | pablogsal | set | messages: + msg364251
2020-03-15 10:07:17 | arekm | set | messages: + msg364226
2020-03-15 09:40:36 | arekm | set | messages: + msg364225
2020-03-15 03:28:36 | pablogsal | set | messages: + msg364215
2020-03-15 02:47:50 | BTaskaya | set | keywords: + patch; nosy: + BTaskaya; pull_requests: + pull_request18354; stage: patch review
2020-03-14 16:03:23 | arekm | set | nosy: + arekm
2020-01-16 17:35:03 | pablogsal | set | messages: + msg360133
2020-01-16 17:08:15 | gaborjbernat | set | messages: + msg360131
2020-01-16 16:48:05 | vstinner | set | nosy: + vstinner; messages: + msg360130
2020-01-16 16:30:13 | gaborjbernat | set | messages: + msg360128
2020-01-16 13:42:41 | vstinner | set | nosy: - vstinner
2020-01-16 13:24:17 | gaborjbernat | set | nosy: + pablogsal
2020-01-16 13:23:44 | gaborjbernat | create |