classification
Title: Iterating on a zip keeps objects alive longer than expected (test_itertools leaks sometimes references)
Type: Stage: resolved
Components: Tests Versions: Python 3.10
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: brandtbucher, pablogsal, rhettinger, serhiy.storchaka, tim.peters, vstinner
Priority: normal Keywords: patch

Created on 2020-12-02 11:48 by vstinner, last changed 2020-12-07 20:08 by pablogsal. This issue is now closed.

Files
File name Uploaded Description Edit
test_zip_leak.py vstinner, 2020-12-02 23:43
test_zip_leak2.py vstinner, 2020-12-03 09:16
Pull Requests
URL Status Linked Edit
PR 23623 merged brandtbucher, 2020-12-03 04:32
PR 23651 merged brandtbucher, 2020-12-05 04:55
PR 23652 merged brandtbucher, 2020-12-05 04:57
Messages (39)
msg382304 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-12-02 11:48
12:07:06 vstinner@apu$ ./python -m test test_itertools -R 3:3 -F -j4
0:00:00 load avg: 0.44 Run tests in parallel using 4 child processes
(...)
0:02:03 load avg: 4.24 [  1] test_itertools passed (2 min 2 sec) -- running: test_itertools (2 min 3 sec), test_itertools (2 min 3 sec), test_itertools (2 min 3 sec)
beginning 6 repetitions
123456
......

0:02:03 load avg: 4.24 [  2/1] test_itertools failed (2 min 3 sec) -- running: test_itertools (2 min 3 sec), test_itertools (2 min 3 sec)
beginning 6 repetitions
123456
......

test_itertools leaked [32, 32, 32] references, sum=96
test_itertools leaked [24, 22, 22] memory blocks, sum=68

Kill <TestWorkerProcess #1 running test=test_itertools pid=159366 time=2 min 3 sec> process group
Kill <TestWorkerProcess #2 running test=test_itertools pid=159586 time=164 ms> process group
Kill <TestWorkerProcess #3 running test=test_itertools pid=159371 time=2 min 3 sec> process group

== Tests result: FAILURE ==

1 test OK.

1 test failed:
    test_itertools

Total duration: 2 min 3 sec
Tests result: FAILURE
msg382305 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-12-02 11:49
Leak seen on AMD64 RHEL7 Refleaks 3.x:
https://buildbot.python.org/all/#builders/562/builds/109

Recent itertools change, I have no idea if it's related:

commit cc061d0e6fb2569efa91531686f75b89e94ec865
Author: Raymond Hettinger <rhettinger@users.noreply.github.com>
Date:   Mon Nov 30 20:42:54 2020 -0800

    bpo-38200: Add itertools.pairwise() (GH-23549)
msg382340 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-12-02 21:06
> nosy: + rhettinger, serhiy.storchaka

I wanted to bisect this test before annoying too many people with this random issue. So far, I managed to simplified test_itertools to:

def test_main(verbose=None):
    test_classes = (
        TestBasicOps,
        TestVariousIteratorArgs,
        TestGC,
    )
    support.run_unittest(*test_classes)
msg382342 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-12-02 21:17
You can get a list of all tests with --list-cases and then run them in separate processes.
msg382344 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2020-12-02 21:32
Thanks.  I'll work on it this evening.  If this is new, pairwise() is the most likely cause.
msg382356 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-12-02 23:43
I reproduce the leak with attached test_zip.py by running:

$ ./python -m test -R 3:3 test_zip_leak -j4 -m test.test_zip_leak.TestGC.test_permutations -m test.test_zip_leak.TestGC.test_zip

The problem is that the builtin zip class keeps an internal tuple alive. See zip_next) and zipobject.result member (tuple).
msg382357 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-12-02 23:50
Raymond Hettinger:
> If this is new, pairwise() is the most likely cause.

I reproduced the issue without involving the itertools module. It seems to be a zip_next() micro-optimization which keeps an object alive longer than epxected. This micro-optimization is not new. It's just really hard to trigger the issue.

It seems to depend if the zip object is deleted or not. The builtin zip type implements a traverse function which visits Py_VISIT(lz->result). The GC is supposed to be able to break such cycle, no?
msg382358 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-12-02 23:53
> The GC is supposed to be able to break such cycle, no?

In zip_new(), lz->result tuple is *not* always tracked by the GC.
msg382360 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2020-12-03 00:13
Adding Brandt because he recently worked on the zip() code.
msg382361 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2020-12-03 00:59
It looks like the GC untracks the None-filled result tuple in untrack_tuples, and it's never re-tracked again. This can also happen if it's filled with atomic values on an early iteration and the GC visits it.

Perhaps a simple fix is to call _PyObject_GC_TRACK on the result tuple inside of each zip_next call if it's not tracked anymore?

if (!_PyObject_GC_IS_TRACKED(result)) {
    _PyObject_GC_TRACK(result);
}
_PyTuple_MaybeUntrack(result);  // Worth it?
return result;

Although I'm not sure how much of a win we're getting from the result tuple reuse here - maybe it's easier to just not do that.
msg382362 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2020-12-03 01:14
Simple demo:

>>> import gc
>>> gc.disable()
>>> z = zip([[]])
>>> gc.is_tracked(next(z))
True
>>> z = zip([[]])
>>> gc.collect()
0
>>> gc.is_tracked(next(z))
False
msg382365 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2020-12-03 02:02
Some quick benchmarks on a normal build with CPU isolation, tuned with pyperf. No PGO/LTO.

$ ./python -m pyperf timeit --rigorous --setup 'i = (None,) * 10_000_000' 'for _, _ in zip(i, i): pass'  # Current master.
.........................................
Mean +- std dev: 279 ms +- 11 ms

$ ./python -m pyperf timeit --rigorous --setup 'i = (None,) * 10_000_000' 'for _, _ in zip(i, i): pass'  # With above fix.
.........................................
Mean +- std dev: 369 ms +- 20 ms

$ ./python -m pyperf timeit --rigorous --setup 'i = (None,) * 10_000_000' 'for _, _ in zip(i, i): pass'  # With above fix (no _PyTuple_MaybeUntrack).
.........................................
Mean +- std dev: 284 ms +- 17 ms

$ ./python -m pyperf timeit --rigorous --setup 'i = (None,) * 10_000_000' 'for _, _ in zip(i, i): pass'  # With no tuple reuse.
.........................................
Mean +- std dev: 526 ms +- 51 ms

Note that this example reuses the result tuple for *every* iteration.
msg382374 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2020-12-03 05:03
It looks like Victor's original issue is unrelated to zip, though. That test run is clean after adding the same fix to:

- itertools.product
- itertools.combinations
- itertools.combinations_with_replacement
- itertools.permutations
- itertools.zip_longest

...all of which use the same tuple-recycling speed trick.

If my zip PR looks good, I can create another one fixing these as well.
msg382375 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2020-12-03 05:06
(By the way, I didn't know that -F runs the tests forever... so I was waiting *almost* forever for it to finish!)
msg382376 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2020-12-03 05:10
Also, it appears enumerate is affected as well.
msg382387 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-12-03 08:57
Why there are cycles at all? In normal case there should not be cycle and the result tuple should be destroyed when the iteration ends.
msg382388 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-12-03 08:59
Extract of Brandt's PR:

// The GC may have untracked this result tuple if its elements were all
// untracked. Since we're recycling it, make sure it's tracked again:
if (!_PyObject_GC_IS_TRACKED(result)) {
    _PyObject_GC_TRACK(result);
}

I would like to understand why the tuple is no longer tracked, whereas PyTuple_New() creates a newly created tuple which is tracked.

Using gdb, I found that gc_collect_main() calls untrack_tuples(young) which untracks all tuples of the young generation.

I understand that (when the issue happens):

* a zip() object is created with lz->result = (None, None)
* A GC collection happens
* The GC untracks (None, None) tuple
* next(zip) is called: lz->result has a reference count of 1 and so can be reused.

Problem: the tuple is no longer tracked, whereas its content changed and so the newly filled tuple might be part of a reference cycle. Since the tuple is not tracked, the GC can no longer break the reference cycle involving the zip object internal tuple.

Example of code where the zip tuple is untracked before zip_next() is called on the zip object:

    def test_product(self):
        gc.set_threshold(5)
        pools = [(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
                 ('a', 'b', 'c'),
                 (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
                 (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)]
        indices = [0, 2, 10, 11]
        print(indices)
        print(pools)
        list(None for a, b in zip(pools, indices))
msg382391 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-12-03 09:16
test_zip_leak2.py: simplified example to trigger the GC issue.

Copy it to Lib/test/ and reproduce the leak with:

    ./python -m test -R 3:3 test_zip_leak2

Extract of the code:
---
container = []
iterator = zip([container], [container])

# untrack the internal zip->result=(None, None)
gc.collect()

# create a reference cycle: container -> iterator -> container
container.append(iterator)

next(iterator)
# zip->result=(container, container)

del container, iterator

# Try to break the reference cycle
gc.collect()
---
msg382392 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-12-03 09:22
Serhiy Storchaka:
> Why there are cycles at all? In normal case there should not be cycle and the result tuple should be destroyed when the iteration ends.

test_itertools creates a reference cycle *on purpose*. See test_zip_leak2.py  simplified code, it comes from test_itertools.
msg382395 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-12-03 09:23
I add Pablo and Tim who love GC puzzles.
msg382396 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-12-03 09:34
The zip bug is quite old and was already tested by test_itertools for a very long time. Suddenly, it started to fail on more and more Refleaks buildbots. It may be because the GC must be triggered at a very specific point, and the default GC threshold triggers the issue with latest test_itertools changes.
msg382418 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2020-12-03 15:30
> I add Pablo and Tim who love GC puzzles.

Well, I don’t think the GC is really doing anything wrong here... untracking these sort of tuples likely results in a noticeable reduction in collection times. This code is definitely testing the limits of what is “okay” for us to do with a tuple, and it’s breaking the GC’s (reasonable) assumptions.

I kept digging around, and it appears dict.items() is affected, too (which is a bit scarier). The same fix still works, though.
msg382424 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2020-12-03 16:16
functools.reduce looks affected, too.
msg382448 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-12-03 20:19
I added some comments in the PR regarding the possibility of forcing the tracking on the visiting function to redirect the cost to gc passes instead of when calling next() but there is another option: not untracking tuples that have refcount == 1.
msg382450 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-12-03 20:24
> not untracking tuples that have refcount == 1.

Actually, this may not be a good idea as well: running test_list shows that there are 14786 tuples with refcount == 1 that would have been untracked. This is a non-trivial workload to the gc passes that we will lose if we do this.

Performance-wise, I think the best alternative is to force the tracking before visiting in tp_traverse.
msg382451 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2020-12-03 20:27
> I added some comments in the PR regarding the possibility of forcing the tracking on the visiting function to redirect the cost to gc passes instead of when calling next()...

Yep, I agree that this approach is better.

> ...but there is another option: not untracking tuples that have refcount == 1.

I do like the idea of not having to fix this in ~10 different places, but I'm not sure this is enough. The GC visit may happen when the refcount of the result tuple is > 1, but it may drop back to 1 before __next__ called again

z = zip([42, []])
item = next(zip)
gc.collect()  # Refcount of item == 2, untracked by GC.
del item  # Refcount of item == 1, ready to be recycled.
next(zip)  # This should be tracked, but won't be.
msg382452 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-12-03 20:34
> The GC visit may happen when the refcount of the result tuple is > 1, but it may drop back to 1 before __next__ called again

Good point. Also, is it possible that a more complex cycle ends having more than one reference to that result, so indeed that is not enough.
msg382455 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2020-12-03 22:13
> I added some comments in the PR regarding the possibility of forcing the tracking on the visiting function

Thinking about this more, I'm a bit hesitant to put the re-tracking code in the traverse function (mostly stemming from my lack of knowledge about our GC algorithm). I have no idea how problematic tracking and untracking objects *during* collections can be. Consider the following reference chain:

-> z -> r -> o -> r

Here, z is a zip iterator, r is its result tuple, and o is some object. Let's say r was untracked in the previous collection (I believe untrack_tuples happens *after* all of the traverse functions are called).

I'm worried that the GC may not be able to detect the cycle if it visits o, then z (which *then* starts tracking r), then r.

Or, if this group is determined to be unreachable, untrack_tuples untracks r *before* the cycle is cleaned up. That seems like it could be a problem.

Or, if z's refcount drops to zero and it's cleaned up, its traverse function may *never* be called, which leaves the untracked r -> o -> r cycle.

I guess I'm just not sure how resilient the GC is to these sorts of things. Re-tracking the result tuple in __next__ sidesteps all of these potential issues, with no significant performance impact. So I think I might slightly prefer that.
msg382457 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-12-03 23:32
> Or, if z's refcount drops to zero and it's cleaned up, its traverse function may *never* be called, which leaves the untracked r -> o -> r cycle.

This is a real problem, indeed. We would need to add the tracking to the tp_dealloc of the zip object as well.

> I have no idea how problematic tracking and untracking objects *during* collections can be.

It is indeed tricky: the most problematic part of the "surprise tracking" is the validity of the pointers, but the key here is that the traverse function will be called BEFORE (update_refs()) the pointers start being invalid (because of the bit reuse).

> Or, if this group is determined to be unreachable, untrack_tuples untracks r *before* the cycle is cleaned up. That seems like it could be a problem.

It is not: untrack_tuples is called over the reachable set after the gc knows what part of the linked list is isolated garbage. If the group is in the unreachable set, then untrack_tuples won't touch it. 

> I'm worried that the GC may not be able to detect the cycle if it visits o, then z (which *then* starts tracking r), then r.

This is not a problem IMO: if the gc does not see the cycle in that collection, it will in the next ones. Wat will happen is that at the beginning the gc does not see r so it iterates over the linked list of the generation being collected, decrementing the gc refcounts. When it reaches z, it starts decrementing the refcounts of z and that calls the traverse function of z, that tracks r AND decrements its gc refcounts but also it adds "r" to the end of the linked list of the young generation. This means that (assuming we are collecting the middle generation) in this collection the cycle will not be seen, but in a future collection where the young generation is collected then "r" will be visited normally (notice that the gc can see "r" without needing to visit "z" before). Also, notice that untrack_tuples() is called over the reachable set, but "r" is in the young generation so it won't be untracked. Also, if we are collecting the younger generation, then the cycle will be found and it will be cleaned nicely.
msg382458 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-12-03 23:34
I forgot to mention the conclusion: my conclusion is that although the GC is resilient to these shenanigans (it was also extensive validation along with the algorithm), I still prefer to track it on the next() call because is technically the right thing to do: A tuple has been mutated so it must be (potentially) tracked because it needs to behave "AS IF" the tuple was being created from scratch.
msg382459 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2020-12-03 23:51
Thanks for that detailed explanation, Pablo.

If nobody objects to the current zip fix, I'll merge it and move on to similar PRs for the other affected stuff:

- dict.items
- enumerate
- functools.reduce
- itertools.product
- itertools.combinations
- itertools.combinations_with_replacement
- itertools.permutations
- itertools.zip_longest
msg382460 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-12-03 23:54
>  I'll merge it and move on to similar PRs for the other affected stuff:

Not sure if is more ergonomic, but I would prefer to fix this in one PR so the buildbots don't test the partial fixes. Also is easier to revert if something goes wrong with the approach as that would affect all. But being said that, if you prefer multiple ones, I am also fine :)
msg382461 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2020-12-04 00:18
Yeah, I'm fine with that.
msg382462 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2020-12-04 00:44
Adding:

- collections.OrderedDict.items
msg382476 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-12-04 11:13
Would it be possible to push a quick & dirty workaround for test_itertools? I'm getting more and more buildbot Refleaks failures.

For example, always skip TestGC.test_zip() for now.

This issue seems quite complex and I would prefer to have time to think about it to design a proper fix, rather than having to work under pressure.
msg382478 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-12-04 11:30
> Would it be possible to push a quick & dirty workaround for test_itertools? I'm getting more and more buildbot Refleaks failures.

Oh, I looked at PR 23623 and it got many approvals. I'm fine with waiting for a few days until PR 23623 is merged.
msg382552 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2020-12-05 03:46
New changeset 226a012d1cd61f42ecd3056c554922f359a1a35d by Brandt Bucher in branch 'master':
bpo-42536: GC track recycled tuples (GH-23623)
https://github.com/python/cpython/commit/226a012d1cd61f42ecd3056c554922f359a1a35d
msg382678 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-12-07 20:07
New changeset 60463e8e4f79e5b5e96dc43fb83ded373b489e33 by Brandt Bucher in branch '3.9':
bpo-42536: GC track recycled tuples (GH-23623) (GH-23651)
https://github.com/python/cpython/commit/60463e8e4f79e5b5e96dc43fb83ded373b489e33
msg382679 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-12-07 20:08
New changeset 7c797982383ebfd9cca39c480dcf6132979dd06f by Brandt Bucher in branch '3.8':
bpo-42536: GC track recycled tuples (GH-23623) (GH-23652)
https://github.com/python/cpython/commit/7c797982383ebfd9cca39c480dcf6132979dd06f
History
Date User Action Args
2020-12-07 20:08:37pablogsalsetstatus: open -> closed
resolution: fixed
stage: patch review -> resolved
2020-12-07 20:08:28pablogsalsetmessages: + msg382679
2020-12-07 20:07:52pablogsalsetmessages: + msg382678
2020-12-05 04:57:03brandtbuchersetpull_requests: + pull_request22520
2020-12-05 04:55:18brandtbuchersetpull_requests: + pull_request22519
2020-12-05 03:46:00brandtbuchersetmessages: + msg382552
2020-12-04 11:30:27vstinnersetmessages: + msg382478
2020-12-04 11:13:41vstinnersetmessages: + msg382476
2020-12-04 00:44:15brandtbuchersetmessages: + msg382462
2020-12-04 00:18:22brandtbuchersetmessages: + msg382461
2020-12-03 23:54:49pablogsalsetmessages: + msg382460
2020-12-03 23:51:36brandtbuchersetmessages: + msg382459
2020-12-03 23:34:18pablogsalsetmessages: + msg382458
2020-12-03 23:32:26pablogsalsetmessages: + msg382457
2020-12-03 22:13:17brandtbuchersetmessages: + msg382455
2020-12-03 20:34:37pablogsalsetmessages: + msg382452
2020-12-03 20:27:56brandtbuchersetmessages: + msg382451
2020-12-03 20:24:22pablogsalsetmessages: + msg382450
2020-12-03 20:19:31pablogsalsetmessages: + msg382448
2020-12-03 16:16:33brandtbuchersetmessages: + msg382424
2020-12-03 15:30:57brandtbuchersetmessages: + msg382418
2020-12-03 09:34:11vstinnersetmessages: + msg382396
2020-12-03 09:23:59vstinnersetnosy: + tim.peters, pablogsal
messages: + msg382395
2020-12-03 09:22:50vstinnersetmessages: + msg382392
2020-12-03 09:16:40vstinnersetfiles: + test_zip_leak2.py

messages: + msg382391
2020-12-03 08:59:50vstinnersetmessages: + msg382388
2020-12-03 08:57:06serhiy.storchakasetmessages: + msg382387
2020-12-03 05:10:36brandtbuchersetmessages: + msg382376
2020-12-03 05:06:38brandtbuchersetmessages: + msg382375
2020-12-03 05:03:44brandtbuchersetmessages: + msg382374
2020-12-03 04:32:59brandtbuchersetkeywords: + patch
stage: patch review
pull_requests: + pull_request22491
2020-12-03 02:02:03brandtbuchersetmessages: + msg382365
2020-12-03 01:14:54brandtbuchersetmessages: + msg382362
2020-12-03 00:59:11brandtbuchersetmessages: + msg382361
2020-12-03 00:20:29rhettingersetassignee: rhettinger ->
2020-12-03 00:13:57rhettingersetnosy: + brandtbucher
messages: + msg382360
2020-12-02 23:53:43vstinnersetmessages: + msg382358
2020-12-02 23:50:25vstinnersetmessages: + msg382357
2020-12-02 23:45:17vstinnersettitle: test_itertools leaks sometimes references -> Iterating on a zip keeps objects alive longer than expected (test_itertools leaks sometimes references)
2020-12-02 23:43:56vstinnersetfiles: + test_zip_leak.py

messages: + msg382356
2020-12-02 21:32:03rhettingersetassignee: rhettinger
messages: + msg382344
2020-12-02 21:17:39serhiy.storchakasetmessages: + msg382342
2020-12-02 21:06:04vstinnersetmessages: + msg382340
2020-12-02 12:49:40serhiy.storchakasetnosy: + rhettinger, serhiy.storchaka
2020-12-02 11:49:26vstinnersetmessages: + msg382305
2020-12-02 11:48:35vstinnercreate