
multiprocessing.Pool and ThreadPool leak resources after being deleted #78353

Closed
tzickel mannequin opened this issue Jul 20, 2018 · 52 comments
Labels
3.8 (only security fixes), docs (Documentation in the Doc dir), stdlib (Python modules in the Lib dir), topic-multiprocessing, type-bug (An unexpected behavior, bug, or error)

Comments

@tzickel
Mannequin

tzickel mannequin commented Jul 20, 2018

BPO 34172
Nosy @pitrou, @vstinner, @benjaminp, @ned-deily, @zware, @mattip, @MojoVampire, @applio, @tzickel, @pablogsal, @Windsooon
PRs
  • bpo-34172: multiprocessing.Pool leaks resources after being deleted #8450
  • [3.7] bpo-34172: multiprocessing.Pool leaks resources after being deleted (GH-8450) #9676
  • [3.6] bpo-34172: multiprocessing.Pool leaks resources after being deleted (GH-8450) #9677
  • [2.7] bpo-34172: multiprocessing.Pool leaks resources after being deleted (GH-8450) #9686
  • [3.7] Revert "bpo-34172: multiprocessing.Pool leaks resources after being deleted (GH-8450) (GH-9676)" #10968
  • [3.6] Revert "bpo-34172: multiprocessing.Pool leaks resources after being deleted (GH-8450) (GH-9677)" #10969
  • [2.7] Revert "[2.7] bpo-34172: multiprocessing.Pool leaks resources after being deleted (GH-9686)" #10970
  • Revert "bpo-34172: multiprocessing.Pool leaks resources after being deleted (GH-8450)" #10971
    Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2018-07-20.16:45:29.720>
    labels = ['3.8', 'type-bug', 'library', 'docs']
    title = 'multiprocessing.Pool and ThreadPool leak resources after being deleted'
    updated_at = <Date 2019-10-23.18:45:00.937>
    user = 'https://github.com/tzickel'

    bugs.python.org fields:

    activity = <Date 2019-10-23.18:45:00.937>
    actor = 'josh.r'
    assignee = 'docs@python'
    closed = False
    closed_date = None
    closer = None
    components = ['Documentation', 'Library (Lib)']
    creation = <Date 2018-07-20.16:45:29.720>
    creator = 'tzickel'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 34172
    keywords = ['patch']
    message_count = 52.0
    messages = ['322028', '322076', '322090', '322129', '322147', '322150', '322172', '322190', '322194', '322230', '322231', '323103', '323104', '326904', '326909', '326913', '326914', '326915', '326948', '326964', '330864', '330865', '330866', '330868', '330869', '330891', '330954', '330955', '330960', '330962', '330988', '330996', '330997', '330999', '331009', '331026', '331087', '331190', '331198', '331199', '331200', '331210', '331216', '331218', '331221', '331629', '331703', '332294', '335357', '355229', '355247', '355248']
    nosy_count = 13.0
    nosy_names = ['pitrou', 'vstinner', 'benjamin.peterson', 'ned.deily', 'docs@python', 'zach.ware', 'mattip', 'josh.r', 'davin', 'tzickel', 'pablogsal', 'Windson Yang', 'Vy Nguyen']
    pr_nums = ['8450', '9676', '9677', '9686', '10968', '10969', '10970', '10971']
    priority = None
    resolution = None
    stage = None
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue34172'
    versions = ['Python 3.8']

    @tzickel
    Mannequin Author

    tzickel mannequin commented Jul 20, 2018

    In the multiprocessing.Pool documentation it is written that "When the pool object is garbage collected terminate() will be called immediately.":

    https://docs.python.org/3.7/library/multiprocessing.html#multiprocessing.pool.Pool.terminate

    A. This does not happen: creating a Pool, deleting it, and collecting the garbage does not call terminate().

    B. The documentation for Pool itself does not specify it has a context manager (but the examples show it).

    C. This bug exists in both Python 2 and 3.

    @tzickel tzickel mannequin added 3.7 (EOL) end of life stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Jul 20, 2018
    @Windsooon
    Mannequin

    Windsooon mannequin commented Jul 21, 2018

    A. This does not happen, creating a Pool, deleting it and collecting the garbage, does not call terminate.

    Could you give me an example of how you delete the Pool and collect the garbage? If you use a context manager, it will call the terminate() function.

    B. The documentation for Pool itself does not specify it has a context manager (but the examples show it).

    You can find this info on the same page:

    New in version 3.3: Pool objects now support the context management protocol – see Context Manager Types. __enter__() returns the pool object, and __exit__() calls terminate().
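
    A minimal sketch of that context-manager usage (not from this issue's reproducer; the pool is terminated when the block exits, which also releases the worker processes):

        from multiprocessing import Pool

        def main():
            # __enter__() returns the pool; __exit__() calls terminate(),
            # so the worker processes are torn down when the block ends.
            with Pool(4) as pool:
                print(pool.map(int, ["1", "2", "3"]))  # [1, 2, 3]

        if __name__ == "__main__":
            main()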

    @tzickel
    Mannequin Author

    tzickel mannequin commented Jul 21, 2018

    >>> from multiprocessing import Pool
    >>> import gc
    >>> a = Pool(10)
    >>> del a
    >>> gc.collect()
    0
    >>>

    After this, the worker Process objects (for Pool) or Dummy threads (for ThreadPool) are still left behind, along with the big _cache data (if you did anything with the pool), and they linger until the process dies.
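
    One way to observe the leftover workers is multiprocessing.active_children(); a rough sketch (the counts assume an interpreter affected by this bug):

        import gc
        import multiprocessing

        def main():
            pool = multiprocessing.Pool(10)
            print(len(multiprocessing.active_children()))  # 10 worker processes
            del pool
            gc.collect()
            # No user-visible reference to the pool remains, yet the workers
            # are still running and still show up as children of this process.
            print(len(multiprocessing.active_children()))

        if __name__ == "__main__":
            main()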

    You are correct on the other point (I was reading the Python 2 documentation, which does not mention that...).

    @Windsooon
    Mannequin

    Windsooon mannequin commented Jul 22, 2018

    A patch would just add

        def __del__(self):
            self.terminate()

    to the Pool class.

    @tzickel
    Mannequin Author

    tzickel mannequin commented Jul 22, 2018

    But alas that does not work...

    @Windsooon
    Mannequin

    Windsooon mannequin commented Jul 22, 2018

    Adding a __del__ method to the Pool class should work, but I'm not sure we should do this.

    @mattip
    Contributor

    mattip commented Jul 23, 2018

    It would be sufficient to modify the documentation to reflect the code.

    There are other objects, like file objects, whose documentation recommends[0] calling a method to release resources instead of depending on implementation-specific details like garbage collection.

    [0] https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files

    @pitrou
    Member

    pitrou commented Jul 23, 2018

    Indeed, I think this simply needs a documentation fix.

    @pitrou pitrou added docs Documentation in the Doc dir 3.8 only security fixes labels Jul 23, 2018
    @tzickel
    Mannequin Author

    tzickel mannequin commented Jul 23, 2018

    What other object in the standard library leaks resources when deleted in CPython? Even that documentation says the garbage collector will eventually destroy the object, just like here... I think there is an implementation bug.

    @tzickel
    Mannequin Author

    tzickel mannequin commented Jul 23, 2018

    I think I've found the code bug causing the leak:

    args=(self, )

    There is a circular reference between the Pool object and the self._worker_handler Thread object (the Pool is also saved in the frame locals of that thread, which prevents it from being garbage collected).
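
    A standalone sketch of that pattern (the names are illustrative, not the actual Pool internals): an object that passes itself to a long-running thread stays reachable through the thread's args/frame locals, so it is never collected.

        import gc
        import threading
        import time
        import weakref

        class Owner:
            def __init__(self):
                # Passing `self` to the running thread mirrors Pool handing
                # itself to its _worker_handler thread via args=(self,).
                self._handler = threading.Thread(target=self._run, args=(self,),
                                                 daemon=True)
                self._handler.start()

            @staticmethod
            def _run(owner):
                time.sleep(1000)  # stand-in for the worker handler loop

        o = Owner()
        wr = weakref.ref(o)
        del o
        gc.collect()
        print(wr())  # still alive: the running thread keeps the Owner reachable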

    @pitrou
    Member

    pitrou commented Jul 23, 2018

    What other object in the standard library leaks resources when deleted in CPython?

    Threads come to mind, for example:

    >>> import time, threading, weakref
    >>> t = threading.Thread(target=time.sleep, args=(100000,))
    >>> t.start()
    >>> wr = weakref.ref(t)
    >>> del t
    >>> wr()
    <Thread(Thread-1, started 139937234327296)>

    Note that I'm not against fixing this issue, just saying it's not that surprising for the Pool to keep lingering around when you have lost any user-visible reference to it.

    @tzickel
    Mannequin Author

    tzickel mannequin commented Aug 4, 2018

    It actually makes sense that while the thread is running, the object representing it stays alive. After the thread finishes its work, the object dies.

    >>> import time, threading, weakref, gc
    >>> t = threading.Thread(target=time.sleep, args=(10,))
    >>> wr = weakref.ref(t)
    >>> t.start()
    >>> del t
    >>> gc.collect()
    >>> wr()
    <Thread(Thread-1, started 139937234327296)>
    Wait 10 seconds...
    >>> gc.collect()
    >>> wr()

    The thread is gone (which doesn't happen with the pool).

    Anyhow, I've submitted a patch on GH to fix the bug that was introduced 9 years ago; feel free to check it.

    @pitrou
    Member

    pitrou commented Aug 4, 2018

    Thanks a lot tzickle, I'll take a look.

    @pitrou
    Member

    pitrou commented Oct 2, 2018

    New changeset 97bfe8d by Antoine Pitrou (tzickel) in branch 'master':
    bpo-34172: multiprocessing.Pool leaks resources after being deleted (GH-8450)
    97bfe8d

    @pitrou
    Member

    pitrou commented Oct 2, 2018

    New changeset 97f998a by Antoine Pitrou (Miss Islington (bot)) in branch '3.7':
    bpo-34172: multiprocessing.Pool leaks resources after being deleted (GH-8450) (GH-9676)
    97f998a

    @pitrou
    Member

    pitrou commented Oct 2, 2018

    New changeset 07b96a9 by Antoine Pitrou (Miss Islington (bot)) in branch '3.6':
    bpo-34172: multiprocessing.Pool leaks resources after being deleted (GH-8450) (GH-9677)
    07b96a9

    @pitrou
    Member

    pitrou commented Oct 2, 2018

    Thanks tzickler for the report and pull request, and sorry for the delay.

    This is now fixed in all 3.x branches. I will close this now as multiprocessing in 2.7 diverges quite a bit from 3.x. If you want to fix the issue in 2.7 as well, please say so and I'll reopen.

    @pitrou pitrou closed this as completed Oct 2, 2018
    @pitrou
    Member

    pitrou commented Oct 2, 2018

    (tzickel, sorry for mistyping your handle :-/)

    @tzickel
    Mannequin Author

    tzickel mannequin commented Oct 3, 2018

    It's ok, you only did it twice :) I've submitted a manual 2.7 fix on GH.

    @pitrou pitrou reopened this Oct 3, 2018
    @pitrou
    Member

    pitrou commented Oct 3, 2018

    New changeset 4a7dd30 by Antoine Pitrou (tzickel) in branch '2.7':
    [2.7] bpo-34172: multiprocessing.Pool leaks resources after being deleted (GH-9686)
    4a7dd30

    @pitrou pitrou closed this as completed Oct 3, 2018
    @pablogsal
    Member

    multiprocessing.Pool.imap hangs on macOS after applying this commit:

    import multiprocessing
    
    def the_test():
        print("Begin")
        for x in multiprocessing.Pool().imap(int,
                ["4", "3"]):
            print(x)
        print("End")
    
    the_test()

    This also happens in the backported branches.

    @vstinner
    Member

    vstinner commented Dec 3, 2018

    """
    def the_test():
    print("Begin")
    for x in multiprocessing.Pool().imap(int,
    ["4", "3"]):
    print(x)
    print("End")
    """

    Side note: Is it correct to use a pool without calling terminate() or close()? Should we start to emit a ResourceWarning if a pool is not closed explicitly, as we did for files, sockets, asyncio event loops and subprocess.Popen objects?
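
    For comparison, a sketch of the explicit-release form of that example, which does not rely on the destructor at all:

        import multiprocessing

        def the_test():
            print("Begin")
            pool = multiprocessing.Pool()
            try:
                for x in pool.imap(int, ["4", "3"]):
                    print(x)
            finally:
                pool.close()  # stop accepting new work
                pool.join()   # wait for the worker processes to exit
            print("End")

        if __name__ == "__main__":
            the_test()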

    @pablogsal
    Member

    Víctor, I have a PR fixing this in:

    bpo-35378

    Even if it is not correct to skip calling close() or join() on the Pool, this behaviour was working before, while now it hangs. The fix is really simple, so I think we should fix it and not revert the change in this case.

    @tzickel
    Mannequin Author

    tzickel mannequin commented Dec 3, 2018

    The previous posts here touch on all of these subjects:
    A. The documentation explicitly says: "When the pool object is garbage collected terminate() will be called immediately." (This held until a code refactor 9 years ago introduced this bug.)

    B. A large amount of code was developed for this technique:
    https://github.com/python/cpython/blob/master/Lib/multiprocessing/util.py#L147 (almost till the end of the file)

    C. The reason I opened this bug is that I was called in to see why a long-running process crashed after a while, and found out it leaked tons of subprocesses and pool._cache memory.

    D. The quoted code will currently simply leak lots of subprocesses on each invocation...

    I, too, think we should push for the said fix.

    @vstinner
    Member

    vstinner commented Dec 3, 2018

    tzickel:

    A. The documentation explicitly says: "When the pool object is garbage collected terminate() will be called immediately." (Happened till a code refactor 9 years ago introduced this bug).

    It is a *very bad* practice to rely on __del__. Don't do that. That's why we introduced ResourceWarning.

    tzickel:

    https://github.com/python/cpython/blob/master/Lib/multiprocessing/util.py#L147 (Till the end of the file almost)

    Is this API *incompatible* with pool.close()? Explicitly release resources?

    Pablo:

    Víctor, I have a PR fixing this in: (...) The fix is really simple, so I think we should fix it and not revert the change on this case.

    I'm not comfortable with the fix. I cannot explain why but I feel like adding a strong dependency from a child to its parent is going to lead to more bugs, not less. It sounds like a recipe for reference cycles. Maybe I'm just plain wrong.

    At this point, I would like someone to explain to me what the problem is. #10852 is a solution, ok, but what is the problem? Why does the code hang, whereas previously it was fine? Is the example code really correct? Do we want to support such usage?

    I understand that msg330864 relies on black magic to expect that it's going to be fine. The lifetime of the pool is implicit, and it sounds like a bad design. I don't want to endorse that.

    @pablogsal
    Member

    I'm not comfortable with the fix. I cannot explain why but I feel like adding a strong dependency from a child to its parent is going to lead to more bugs, not less. It sounds like a recipe for reference cycles. Maybe I'm just plain wrong.

    The pool child objects (imap iterators, async results, etc.) need to keep a reference to the parent because if they don't, the caller is in charge of doing so, and that may lead to bugs. It is the same scenario as getting a dictionary iterator and then destroying the reference to the dictionary: if the iterator did not keep a reference to its parent (the dictionary), it could not be used in the future. Indeed, we can see that this is what happens:

    >>> x = {1:2}
    >>> y = iter(x)
    >>> gc.get_referrers(x)
    [<dict_keyiterator object at 0x0000024447D6D598>, 
    {'__name__': '__main__', '__doc__': None, '__package__': None, '__loader__': <class '_frozen_importlib.BuiltinImporter'>, '__spec__': None, '__annotations__': {}, '__builtins__': <module 'builtins' (built-in)>, 'y': <dict_keyiterator object at 0x0000024447D6D598>, 'gc': <module 'gc' (built-in)>, 'x': {1: 2}}
    ]

    We can see that the dict_keyiterator refers to the dictionary, keeping it alive.

    Here we have the same situation: if we do not keep the pool alive, the iterator will hang when iterating because the jobs won't get finished.

    At this point, I would like someone to explain to me what the problem is. #10852 is a solution, ok, but what is the problem? Why does the code hang, whereas previously it was fine? Is the example code really correct? Do we want to support such usage?

    The code did not hang before because the pool was not actually being destroyed, due to the reference cycle between the pool and an internal object (a Thread). Now it hangs because the worker thread is destroyed together with the pool, as no references are kept to it, and the job that the iterator is waiting on is never finished.

    I understand that msg330864 relies on black magic to expect that it's going to be fine. The lifetime of the pool is implicit and it sounds like a bad design. I don't want to endorse that.

    I found the weird code from the example in several projects. I have corrected it to use the pool as a context manager or to call close(), but this means that users are writing code like this, and it used to work and now it does not: technically, this is a regression.
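
    For illustration, a sketch of the corrected form of the example (the kind of fix applied in those projects):

        import multiprocessing

        def the_test():
            print("Begin")
            # The pool is released deterministically when the block exits,
            # so nothing depends on garbage collection.
            with multiprocessing.Pool() as pool:
                for x in pool.imap(int, ["4", "3"]):
                    print(x)
            print("End")

        if __name__ == "__main__":
            the_test()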

    @vstinner
    Member

    vstinner commented Dec 3, 2018

    I found the weird code from the example in several projects. I have corrected it to use the pool as a context manager or to call close(), but this means that users are writing code like this, and it used to work and now it does not: technically, this is a regression.

    That's why I'm asking for a revert :-) I suggest reverting this change immediately in 2.7, 3.6 and 3.7, and later seeing what can be done for master.

    Even if we design an API carefully, there will always be someone who misuses it :-) I would prefer to stop promoting such bad code and find a transition to more correct code.

    I disagree that a child should keep its parent alive.

    I would be ok to use a *weak reference* from the child to the parent to detect when the parent goes away, and maybe trigger an action in that case. For example, raise an exception or log a warning.
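
    A rough sketch of what that could look like (hypothetical class, not the real multiprocessing API):

        import weakref

        class ResultIterator:
            """Child object that keeps only a weak reference to its pool."""

            def __init__(self, pool, results):
                self._pool_ref = weakref.ref(pool)
                self._results = iter(results)

            def __iter__(self):
                return self

            def __next__(self):
                if self._pool_ref() is None:
                    # The parent pool was collected: fail loudly instead of
                    # hanging while waiting for results that will never come.
                    raise RuntimeError("the Pool backing this iterator is gone")
                return next(self._results)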

    @pablogsal
    Member

    I disagree that a child should keep its parent alive.

    But this is normal across the standard library. For example, here is how a deque iterator keeps the deque alive:

    >>> from collections import deque
    >>> import gc, weakref
    >>> x = deque([1,2,3])
    >>> deque_iter = iter(x)
    >>> deque_weakref = weakref.ref(x)
    >>> del x
    >>> gc.collect()
    >>> gc.get_referrers(deque_weakref())
    [<_collections._deque_iterator object at 0x0000024447ED6EA8>]

    Here, the deque iterator is the *only* reference to the deque. When we destroy it, the deque is destroyed:

    >>> del deque_iter
    >>> gc.collect()
    >>> deque_weakref()
    None

    @tzickel
    Mannequin Author

    tzickel mannequin commented Dec 4, 2018

    Reverting the code will cause another class of problems, like the one that made me fix it in the first place. Programs written like the example Pablo gave (and like the ones I've seen) will quietly leak child processes, file descriptors (for the pipes) and memory to a varying degree; this might not be detected, or might only be detected in the end by a big error or crash.

    Also, in some ResourceWarning cases (if not all), the resources are closed in the end (as with sockets); here, without this code patch, you cannot implicitly reclaim the resources (because there is a Thread involved), which I think is a high bar for the user to think about.

    You can also enable multiprocessing's debug logging to see how the code behaves with and without the fix:
    https://stackoverflow.com/a/1353037

    I also agree with Pablo that there is code in the stdlib that holds references between child and parent. There is also code that has circular references (for example Python 2's OrderedDict), and that is ok as well (not that this is the situation here).

    @pablogsal
    Member

    that there is code in the stdlib that holds references between child and parent

    Just to clarify: it is not just that there is code in the stdlib that keeps a reference between child and parent. The examples I have given are the exact same situation that we have here: an iterator object associated with another object needs to keep its parent alive to work correctly.

    @vstinner
    Member

    vstinner commented Dec 5, 2018

    multiprocessing should help the developer detect when the API is misused. For example, it could emit a ResourceWarning if a pool is not released explicitly. It might help to detect such bugs.
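
    An illustrative sketch of such a check (simplified and hypothetical, not the actual Pool implementation):

        import warnings

        class _PoolSketch:
            def __init__(self):
                self._state = "RUN"

            def close(self):
                self._state = "CLOSE"

            def terminate(self):
                self._state = "TERMINATE"

            def __del__(self):
                if self._state == "RUN":
                    # Neither close() nor terminate() was called before the
                    # object became unreachable: warn so the misuse can be
                    # detected (e.g. with -W always or -X dev).
                    warnings.warn("unclosed pool", ResourceWarning, source=self)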

    @vstinner
    Member

    vstinner commented Dec 5, 2018

    Another example of a complex issue related to object lifetime, resources (file descriptors) and multiprocessing: bpo-30966, adding SimpleQueue.close().

    @vstinner
    Member

    vstinner commented Dec 6, 2018

    New changeset 3c6b0d9 by Victor Stinner in branch '3.7':
    [3.7] Revert "bpo-34172: multiprocessing.Pool leaks resources after being deleted (GH-8450) (GH-9676)" (bpo-10968)
    3c6b0d9

    @vstinner
    Member

    vstinner commented Dec 6, 2018

    New changeset eb38ee0 by Victor Stinner in branch '3.6':
    [3.6] Revert "bpo-34172: multiprocessing.Pool leaks resources after being deleted (GH-8450) (GH-9677)" (GH-10969)
    eb38ee0

    @vstinner
    Member

    vstinner commented Dec 6, 2018

    New changeset 358fc87 by Victor Stinner in branch '2.7':
    Revert "[2.7] bpo-34172: multiprocessing.Pool leaks resources after being deleted (GH-9686)" (GH-10970)
    358fc87

    @vstinner
    Member

    vstinner commented Dec 6, 2018

    New changeset 9dfc754 by Victor Stinner in branch 'master':
    Revert "bpo-34172: multiprocessing.Pool leaks resources after being deleted (GH-8450)" (GH-10971)
    9dfc754

    @vstinner
    Member

    vstinner commented Dec 6, 2018

    I reverted the change in 2.7, 3.6, 3.7 and master branches because it introduces a regression and we are very close to a release:
    https://mail.python.org/pipermail/python-dev/2018-December/155920.html

    I don't want to have the pressure to push a quick fix. I would like to make sure that we have enough time to design a proper fix. I'm not saying that Pablo's fix is not correct, it's just bad timing.

    This bug has likely been here for a long time, so I think that it's ok to still have it in the next 3.6 and 3.7 bugfix releases.

    I suggest opening a discussion on the python-dev mailing list about multiprocessing relying on the garbage collector and the lifetime of multiprocessing objects (Pool, Process, result, etc.). It seems like I disagree with Pablo and tzickel, whereas Armin Rigo (PyPy, which has a different GC) is more on my side (release resources explicitly) :-) I would prefer to move towards explicit resource management instead of relying on destructors and the garbage collector. For example, it's a bad practice to rely on these when using PyPy.

    See my previous comments about issues related to multiprocessing objects lifetime.

    @pitrou
    Member

    pitrou commented Dec 6, 2018

    I agree that reverting in bugfix branches was the right thing to do. I think the fix should have remained in master, though.

    @pitrou pitrou removed the 3.7 (EOL) end of life label Dec 6, 2018
    @vstinner
    Member

    vstinner commented Dec 6, 2018

    See also bpo-35424: "multiprocessing.Pool: emit ResourceWarning".

    I wrote 10986 to fix 2 tests which leak resources.

    I have a question. Why do tests have to call "pool.join()" after "with pool:"? When I use a file, I know that the resources are released after "with file:".

    Should Pool.__exit__() call Pool.join()?
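
    A sketch of a wrapper whose exit also waits for the workers (a hypothetical helper, just to make the question concrete):

        import contextlib
        import multiprocessing

        @contextlib.contextmanager
        def joined_pool(*args, **kwargs):
            # Unlike Pool.__exit__(), which only calls terminate(), this also
            # closes and joins, so no worker outlives the with-block.
            pool = multiprocessing.Pool(*args, **kwargs)
            try:
                yield pool
            finally:
                pool.close()
                pool.join()

        def main():
            with joined_pool(2) as pool:
                print(pool.map(int, ["1", "2"]))

        if __name__ == "__main__":
            main()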

    This question reminds me of my fix in socketserver (bpo-31151 and bpo-31233), which leaked processes and threads, and of my bug bpo-34037 (asyncio leaks threads).

    @vstinner
    Member

    I started a thread on python-dev to discuss these issues:
    https://mail.python.org/pipermail/python-dev/2018-December/155946.html

    @vstinner
    Member

    The new test_del_pool() test of the fix failed on a buildbot: bpo-35413 "test_multiprocessing_fork: test_del_pool() leaks dangling threads and processes on AMD64 FreeBSD CURRENT Shared 3.x".

    @vstinner
    Member

    The multiprocessing.Pool destructor now emits a ResourceWarning if the pool is still running, i.e. if neither .close() nor .terminate() has been called; see bpo-35424.

    It is a first alarm that the problematic example is wrong.

    Should we reconsider fixing this bug in the master branch? If yes, we should carefully document this backward-incompatible change.

    @vstinner
    Member

    Pablo fixed bpo-35378 with:

    New changeset 3766f18 by Pablo Galindo in branch 'master':
    bpo-35378: Fix multiprocessing.Pool references (GH-11627)
    3766f18

    Does this change also fix this issue? If not, can we attempt to fix this issue again?

    Moreover, should we do something in Python 3.7? Sadly, I don't think that we can do anything for 3.7 and 2.7.

    @vstinner
    Member

    What's the status of this issue?

    @MojoVampire
    Mannequin

    MojoVampire mannequin commented Oct 23, 2019

    Pablo's fix looks like a superset of the original fix applied here, so I'm assuming it fixes this issue as well.

    @MojoVampire
    Mannequin

    MojoVampire mannequin commented Oct 23, 2019

    It should probably be backported to all supported 3.x branches though, so people aren't required to move to 3.8 to benefit from it.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022