refleak in typing.py #72835

1st1 · 2016-11-09T16:00:09Z

BPO	28649
Nosy	@gvanrossum, @ned-deily, @serhiy-storchaka, @1st1, @MojoVampire, @ilevkivskyi
Files	slots_gc.patch typing-clear-caches.patch fix-typing-test.diff fix-typing-test-v2.diff

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2016-11-10.16:32:47.208>
created_at = <Date 2016-11-09.16:00:08.573>
labels = ['3.7', 'type-bug', 'library', 'release-blocker']
title = 'refleak in typing.py'
updated_at = <Date 2016-11-10.16:45:07.923>
user = 'https://github.com/1st1'

bugs.python.org fields:

activity = <Date 2016-11-10.16:45:07.923>
actor = 'levkivskyi'
assignee = 'none'
closed = True
closed_date = <Date 2016-11-10.16:32:47.208>
closer = 'yselivanov'
components = ['Library (Lib)']
creation = <Date 2016-11-09.16:00:08.573>
creator = 'yselivanov'
dependencies = []
files = ['45410', '45414', '45415', '45416']
hgrepos = []
issue_num = 28649
keywords = ['patch']
message_count = 61.0
messages = ['280407', '280411', '280412', '280413', '280414', '280415', '280417', '280420', '280421', '280424', '280425', '280426', '280428', '280430', '280431', '280434', '280436', '280437', '280438', '280439', '280440', '280441', '280442', '280443', '280444', '280446', '280447', '280454', '280455', '280457', '280458', '280459', '280460', '280461', '280462', '280463', '280464', '280465', '280466', '280467', '280469', '280481', '280482', '280483', '280484', '280485', '280486', '280487', '280488', '280489', '280493', '280496', '280501', '280513', '280514', '280518', '280519', '280520', '280521', '280523', '280524']
nosy_count = 7.0
nosy_names = ['gvanrossum', 'ned.deily', 'python-dev', 'serhiy.storchaka', 'yselivanov', 'josh.r', 'levkivskyi']
pr_nums = []
priority = 'release blocker'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue28649'
versions = ['Python 3.6', 'Python 3.7']

1st1 · 2016-11-09T16:00:08Z

Looks like we have a refleak somewhere in the system that typing.py is exposing.

The following code exposes the refleak:

    def test_refleak(self):
        T = typing.TypeVar('T')
        class C(typing.Generic[T], typing.Mapping[int, str]): ...

The change that first triggered the leak is 8f0df4db2b06.

I'm not an expert in typing.py, but I suspect that the real bug must be somewhere in CPython core.

Guido, do you have any idea how 8f0df4db2b06 could trigger this?

serhiy-storchaka · 2016-11-09T16:08:05Z

Could it be related to leaks in objects with empty __slots__? See bpo-24379 for details.

1st1 · 2016-11-09T16:10:49Z

Could it be related to leaks in objects with empty __slots__? See bpo-24379 for details.

Not sure... Do we have a test/patch for objects-with-empty-slots bug?

gvanrossum · 2016-11-09T16:15:06Z

Ivan, this was an early typing.py update (merged into CPython on Sept. 27). But the refleak is still with us. Do you have any intuition on what could be going on here?

My own intuition here is lacking, other than thinking there's a lot of metaclass magic going on in this code...

serhiy-storchaka · 2016-11-09T16:21:20Z

>>> import gc
>>> gc.collect()
23
>>> gc.collect()
0
>>> class A:
...     __slots__ = ()
... 
>>> del A
>>> gc.collect()
2
^ leak

serhiy-storchaka · 2016-11-09T16:24:10Z

See also https://mail.python.org/pipermail/python-dev/2015-October/141993.html.

gvanrossum · 2016-11-09T17:13:28Z

Serhiy, why is gc.collect() returning 2 after del A proof of a leak?
Yes, there's a cycle, but it's being GC'ed -- isn't that fine? Without the
slots the collect() call returns a larger number.

1st1 · 2016-11-09T17:50:58Z

Attached patch (slots_gc.patch) makes objects with empty slots tracked by GC:

   class A: __slots__ = ()
   gc.is_tracked(A()) == False  # before
   gc.is_tracked(A()) == True   # with the patch

This doesn't fix the refleak though.

1st1 · 2016-11-09T17:58:17Z

I removed all slots from typing.py, the refleak is still there.

1st1 · 2016-11-09T18:11:16Z

Wow. In default CPython branch, with no modifications, test_typing.py simply crashes in debug mode:

yury@ysmbp ~/dev/py/cpython $ ./python.exe -m test -R3:3 test_typing
Run tests sequentially
0:00:00 [1/1] test_typing
beginning 6 repetitions
123456
.test test_typing failed -- Traceback (most recent call last):
  File "/Users/yury/dev/py/cpython/Lib/test/test_typing.py", line 739, in test_generic_forvard_ref
    self.assertEqual(get_type_hints(foobar, globals(), locals()), {'x': List[List[T]]})
AssertionError: {'x': typing.List[typing.List[~T]]} != {'x': typing.List[typing.List[~T]]}
  {'x': typing.List[typing.List[~T]]}

test_typing failed

1 test failed:
test_typing

Total duration: 1 sec
Tests result: FAILURE

1st1 · 2016-11-09T18:13:39Z

The problem is that functools.lru_cache is used in _tp_cache. If I remove type caching, the original "refleak" is fixed.

MojoVampire · 2016-11-09T18:29:25Z

Serhiy: I think you forgot to make a global instance of the type for your demonstration. You mentioned in msg253899 that the loop is:

global instance -> type -> method -> module globals -> global instance

The example you gave doesn't instantiate class A, and it's the instance of A that doesn't participate in GC and causes the problem, right?

1st1 · 2016-11-09T18:33:26Z

Josh, Serhiy,

I've created an issue for tracking empty slots types: bpo-28651.

This issue is about typing.py using functools.lru_cache for internal cache That is what causing the refleaks, and inability of test_typing.py to be run in refleak hunting mode.

gvanrossum · 2016-11-09T18:37:56Z

I could use a hint on a complete program that establishes whether there is
or is not a refleak.

gvanrossum · 2016-11-09T18:48:07Z

Good sleuthing! So the bug is in _lru_cache?

On Wed, Nov 9, 2016 at 10:37 AM, Guido van Rossum <report@bugs.python.org>
wrote:

Guido van Rossum added the comment:

I could use a hint on a complete program that establishes whether there is
or is not a refleak.

----------

Python tracker <report@bugs.python.org>
<http://bugs.python.org/issue28649\>

serhiy-storchaka · 2016-11-09T18:52:57Z

Serhiy, why is gc.collect() returning 2 after del A proof of a leak?

Sorry, it was bad example. I don't remember all details.

The problem is that functools.lru_cache is used in _tp_cache. If I remove type caching, the original "refleak" is fixed.

Proposed patch clears caches when search for reference leaks. This decreases the number of leaks.

Unpatched:
test_typing leaked [3125, 3089, 2897] references, sum=9111
test_typing leaked [1189, 1179, 1103] memory blocks, sum=3471

Patched:
test_typing leaked [125, 125, 125] references, sum=375
test_typing leaked [49, 51, 51] memory blocks, sum=151

See also bpo-23839.

gvanrossum · 2016-11-09T19:03:27Z

I have no objections against typing-clear-caches.patch (though we'll have to make sure to apply it to the GitHub upstream as well -- I can take care of that once you merge it to CPython).

Does that patch fix the refleak completely or only partially?

I have no opinion on empty slots.

1st1 · 2016-11-09T19:06:31Z

Good sleuthing! So the bug is in _lru_cache?

I don't think it's a bug of lru_cache. It stores strong references for arguments and return value of the decorated function (which btw isn't documented). I don't think it's possible to write efficient generic lru_cache that would just use weakrefs.

The problem is that because _tp_cache helper uses it, the lifetime of some typing classes becomes indefinite. Which shouldn't be a problem for a typical user of typing, since most of the time it is used on the module level.

We have two potential solutions to this problem:

Add a weak-ref type cache and refactor _tp_cache to use it;
Cleanup lru-cache before/adter running each typing test.

[1] is the ideal solution, [2] should work good enough.

ilevkivskyi · 2016-11-09T19:06:53Z

Yury, Guido, I think I understand why test_typing crashes in refleak hunting mode. The test relies on caching, namely type variables are compared by id, so that if cache is cleared between those two lines, then test fails. I think we should just update those test to use a usual class in forward reference. The patch is attached.

(Patch fixes the crash in refleak hunting mode, but the refleak is still there)

1st1 · 2016-11-09T19:10:29Z

Does that patch fix the refleak completely or only partially?

I tried to empty the cache in test_typing.BaseTestCase.setUp and tearDown; this doesn't fix all refleaks.

Do we have some other caches in typing.py?

serhiy-storchaka · 2016-11-09T19:14:31Z

Does that patch fix the refleak completely or only partially?

It fixes 96% of refleaks.

ilevkivskyi · 2016-11-09T19:15:16Z

Yury, _tp_cache should be the only one (of course most generic classes have _abc_cache since GenericMeta inherits from ABCMeta)

gvanrossum · 2016-11-09T19:18:26Z

FYI I've opened python/typing#323 to track this upstream.

gvanrossum · 2016-11-09T19:19:04Z

It fixes 96% of refleaks.

In typing.py? Or generally?

1st1 · 2016-11-09T19:20:50Z

Yury, _tp_cache should be the only one (of course most generic classes have _abc_cache since GenericMeta inherits from ABCMeta)

Then there is at least one more bug not related to lru_cache (or any kind of cache in typing).

And it's not classes-with-empty-slots bug either.

1st1 · 2016-11-09T19:28:28Z

Serhiy's patch typing-clear-caches.patch looks good, we should apply it (fixing the code style a little bit).

1st1 · 2016-11-09T23:03:17Z

Oh. This is a bug in C implementation of functools.lru_cache. I'll take a look.

ilevkivskyi · 2016-11-09T23:04:57Z

My impression is like something wrong happens when TypeError is raised by a cached function.

ilevkivskyi · 2016-11-09T23:23:11Z

A very simple repro:

    with self.assertRaises(TypeError):
        Union[[]]

It looks like the problem happens when a non-hashable argument is passed to cached function

1st1 · 2016-11-09T23:37:27Z

reflesk with only functools:

    def test_lru_type_error(self):
        @functools.lru_cache(maxsize=None)
        def foo(o):
            raise TypeError

        with self.assertRaises(TypeError):
            foo([])

1st1 · 2016-11-09T23:44:58Z

Found it, will open a separate issue.

1st1 · 2016-11-09T23:47:57Z

See http://bugs.python.org/issue28653 for details.

1st1 · 2016-11-10T01:20:09Z

So Ivan made an interesting observation: if we use Python version of functools.lru_cache, typing tests start to leak like crazy:

beginning 6 repetitions
123456
......
test_typing leaked [24980, 24980, 24980] references, sum=74940

I experimented a bit, and I *think* I know what's happening:

typing uses an lru cache for types.
It looks like that some types in the cache are being *reused* by different tests.
Because typing doesn't set the size of the lru cache, it's set to 128 (default).
When many typing tests execute, the cache invalidates some entries, but because types in test_typing are so complex, I believe that the resulting graph of objects it too complex for the GC to go through (10s of thousands of cross-linked objects).

A simple fix that removes fixes the refleak with Python version of lru_cache is to add a 'tearDown' method to 'test_typing.BaseTestCase':

    class BaseTestCase(TestCase):
        def tearDown(self):
            for f in typing._cleanups:
                f()

Now this all of this is just my guess of what's going on here. It might be something that has to be fixed in typing, or maybe we should just add tearDown.

Guido, Ivan, what do you think?

gvanrossum · 2016-11-10T01:24:30Z

But why is this so different from the C implementation of lru_cache?

1st1 · 2016-11-10T01:27:22Z

But why is this so different from the C implementation of lru_cache?

I don't know. Maybe C version of lru cache creates a bit less pressure on GC.

BTW, if I set maxsize=100000 for typing lru_cache, test_generic_forward_ref crashes in refleak-test mode. Maybe this is another bug...

1st1 · 2016-11-10T01:35:58Z

> But why is this so different from the C implementation of lru_cache?

I don't know. Maybe C version of lru cache creates a bit less pressure on GC.

Actually test/refleak.py now cleans up typing's lru cache & calls "gc.collect()". It seems that the typing cycles aren't collectable by the GC when Python lru cache is used.

ilevkivskyi · 2016-11-10T01:38:25Z

It looks like that some types in the cache are being *reused* by different tests.

I also noticed this some time ago, ideally we need to make all tests more isolated between each other.

Now this all of this is just my guess of what's going on here. It might be something that has to be fixed in typing, or maybe we should just add tearDown.

adding tearDown is a good intermediate solution, while I could go through tests and try to make them more isolated from each other where necessary.

BTW, if I set maxsize=100000 for typing lru_cache, test_generic_forward_ref crashes in refleak-test mode. Maybe this is another bug...

I also noticed this today, and this is something that I don't understand yet. This test should not depend on caching, but it looks like it does. I will think more about this.

gvanrossum · 2016-11-10T01:39:04Z

Segfault or traceback?

Anyway, it implements a doubly-linked list where each node is implemented
as a list of 4 items... Maybe there's something in the GC code that
recurses down at the C level??? I'm not sure a crash the Python version of
lru_cache with maxsize=100000 is high on my list of worries.

1st1 · 2016-11-10T01:46:29Z

Segfault or traceback?

Traceback. I should have said failed, not crashed :(

I'm not sure a crash the Python version of
lru_cache with maxsize=100000 is high on my list of worries.

The problem is that the test fails depending on cache size, and caching code shouldn't really do that ever.

ilevkivskyi · 2016-11-10T01:51:04Z

It looks like the tearDown() works really well. It also somehow fixes the problem with test_generic_forward_ref

ilevkivskyi · 2016-11-10T01:56:24Z

I think Yury is right here. The good plan would be to add the tearDown(), but also fix the problem with test_generic_forward_ref. I could work on it tomorrow (really don't have time today, sorry).

ilevkivskyi · 2016-11-10T08:13:43Z

Guido, Yury, it looks like I solved the puzzle. All the remaining problems are because of forward references. In particular, _ForwardRef keeps a reference to the frame where it was defined:

        typing_globals = globals()
        frame = sys._getframe(1)
        while frame is not None and frame.f_globals is typing_globals:
            frame = frame.f_back
        assert frame is not None
        self.__forward_frame__ = frame

This is old code from 2015 introduced to support __isinstance__ for forward refs. The latter is no more supported anyway, so that I believe this code should be removed.

I will make a PR upstream soon.

ilevkivskyi · 2016-11-10T08:43:50Z

BTW, if I set maxsize=100000 for typing lru_cache, test_generic_forward_ref crashes in refleak-test mode. Maybe this is another bug...

This one is also related to the mentioned in my previous message. Namely, forward references are only evaluated once. However, in refleak hunting mode tests are repeated multiple times, and every time test_generic_forward_ref is called, this code:

    class CC: ...

creates a new local class, while forward ref is still pointing to an old one.

There are two options:

keep this behaviour and update the test to use a globally defined class;
change this and re-evaluate forward refs every times.

However I would vote for a compromise solution: if the locals argument is not given, just use the evaluated value. But if a user gives non-None locals argument then that user most probably wants to re-evaluate the forward ref locally.

ilevkivskyi · 2016-11-10T10:31:12Z

OK, here are the PRs:

python/typing#327
python/typing#328

Those two seem to fix everything, I tried various cache sizes with both C and Python versions of lru_cache and found no refleaks.

1st1 · 2016-11-10T15:59:09Z

Good work, Ivan!

Those two seem to fix everything, I tried various cache sizes with both C and Python versions of lru_cache and found no refleaks.

Speaking of lru_cache: typing currently doesn't configure maxsize, which means that the cache is limited to 128 items. Is that OK? Should maxsize be set to None to make the cache unlimited, or, maybe, a higher number like 1024 is more appropriate?

gvanrossum · 2016-11-10T16:03:15Z

We discussed cache size previously and 128 is fine.

--Guido (mobile)

python-dev · 2016-11-10T16:28:20Z

New changeset 0da2e381ad71 by Guido van Rossum in branch '3.5':
Issue bpo-28649: fix first issue with _ForwardRef (#327)
https://hg.python.org/cpython/rev/0da2e381ad71

New changeset 249a1f0b2857 by Guido van Rossum in branch '3.6':
Issue bpo-28649: fix first issue with _ForwardRef (#327) (3.5->3.6)
https://hg.python.org/cpython/rev/249a1f0b2857

New changeset 304f017462e4 by Guido van Rossum in branch 'default':
Issue bpo-28649: fix first issue with _ForwardRef (#327) (3.6->3.7)
https://hg.python.org/cpython/rev/304f017462e4

python-dev · 2016-11-10T16:29:51Z

New changeset 555f0ca31587 by Guido van Rossum in branch '3.5':
Issue bpo-28649: fix second issue with _ForwardRef (#328)
https://hg.python.org/cpython/rev/555f0ca31587

New changeset 0f863906cf2e by Guido van Rossum in branch '3.6':
Issue bpo-28649: fix second issue with _ForwardRef (#328) (3.5->3.6)
https://hg.python.org/cpython/rev/0f863906cf2e

New changeset 84b4ac243508 by Guido van Rossum in branch 'default':
Issue bpo-28649: fix second issue with _ForwardRef (#328) (3.6->3.7)
https://hg.python.org/cpython/rev/84b4ac243508

gvanrossum · 2016-11-10T16:31:02Z

Thanks a bundle, Ivan! I've merged those two into CPython 3.5, 3.6, 3.7. We can now watch the build bots for a while and double-check the refleaks.

Yury, when you're happy with the situation can you close this issue?

1st1 · 2016-11-10T16:32:47Z

Thank you Guido and Ivan!

Yury, when you're happy with the situation can you close this issue?

I'll close it now, will reopen if something comes up.

gvanrossum · 2016-11-10T16:35:38Z

BTW, do we still want the tearDown()? Or is that no longer necessary?

ilevkivskyi · 2016-11-10T16:45:08Z

I think tearDown() is not necessary. But on the other hand it could be nice to have a method in tests to force cache clear. I would propose it not as a default, but as an opt-in to check that cache works well, or that a certain tested feature is "cache independent", etc.

1st1 added 3.7 (EOL) end of life release-blocker stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Nov 9, 2016

1st1 closed this as completed Nov 9, 2016

1st1 reopened this Nov 10, 2016

1st1 closed this as completed Nov 10, 2016

ezio-melotti transferred this issue from another repository Apr 10, 2022

refleak in typing.py #72835

refleak in typing.py #72835

Comments

1st1 commented Nov 9, 2016

1st1 commented Nov 9, 2016

serhiy-storchaka commented Nov 9, 2016

1st1 commented Nov 9, 2016

gvanrossum commented Nov 9, 2016

serhiy-storchaka commented Nov 9, 2016

serhiy-storchaka commented Nov 9, 2016

gvanrossum commented Nov 9, 2016

1st1 commented Nov 9, 2016

1st1 commented Nov 9, 2016

1st1 commented Nov 9, 2016

1st1 commented Nov 9, 2016

MojoVampire mannequin commented Nov 9, 2016

1st1 commented Nov 9, 2016

gvanrossum commented Nov 9, 2016

gvanrossum commented Nov 9, 2016

serhiy-storchaka commented Nov 9, 2016

gvanrossum commented Nov 9, 2016

1st1 commented Nov 9, 2016

ilevkivskyi commented Nov 9, 2016

1st1 commented Nov 9, 2016

serhiy-storchaka commented Nov 9, 2016

ilevkivskyi commented Nov 9, 2016

gvanrossum commented Nov 9, 2016

gvanrossum commented Nov 9, 2016

1st1 commented Nov 9, 2016

1st1 commented Nov 9, 2016

1st1 commented Nov 9, 2016

ilevkivskyi commented Nov 9, 2016

ilevkivskyi commented Nov 9, 2016

1st1 commented Nov 9, 2016

1st1 commented Nov 9, 2016

1st1 commented Nov 9, 2016

1st1 commented Nov 10, 2016

gvanrossum commented Nov 10, 2016

1st1 commented Nov 10, 2016

1st1 commented Nov 10, 2016

ilevkivskyi commented Nov 10, 2016

gvanrossum commented Nov 10, 2016

1st1 commented Nov 10, 2016

ilevkivskyi commented Nov 10, 2016

ilevkivskyi commented Nov 10, 2016

ilevkivskyi commented Nov 10, 2016

ilevkivskyi commented Nov 10, 2016

ilevkivskyi commented Nov 10, 2016

1st1 commented Nov 10, 2016

gvanrossum commented Nov 10, 2016

python-dev mannequin commented Nov 10, 2016

python-dev mannequin commented Nov 10, 2016

gvanrossum commented Nov 10, 2016

1st1 commented Nov 10, 2016

gvanrossum commented Nov 10, 2016

ilevkivskyi commented Nov 10, 2016