Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refleak in typing.py #72835

Closed
1st1 opened this issue Nov 9, 2016 · 61 comments
Closed

refleak in typing.py #72835

1st1 opened this issue Nov 9, 2016 · 61 comments
Labels
3.7 (EOL) end of life release-blocker stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@1st1
Copy link
Member

1st1 commented Nov 9, 2016

BPO 28649
Nosy @gvanrossum, @ned-deily, @serhiy-storchaka, @1st1, @MojoVampire, @ilevkivskyi
Files
  • slots_gc.patch
  • typing-clear-caches.patch
  • fix-typing-test.diff
  • fix-typing-test-v2.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2016-11-10.16:32:47.208>
    created_at = <Date 2016-11-09.16:00:08.573>
    labels = ['3.7', 'type-bug', 'library', 'release-blocker']
    title = 'refleak in typing.py'
    updated_at = <Date 2016-11-10.16:45:07.923>
    user = 'https://github.com/1st1'

    bugs.python.org fields:

    activity = <Date 2016-11-10.16:45:07.923>
    actor = 'levkivskyi'
    assignee = 'none'
    closed = True
    closed_date = <Date 2016-11-10.16:32:47.208>
    closer = 'yselivanov'
    components = ['Library (Lib)']
    creation = <Date 2016-11-09.16:00:08.573>
    creator = 'yselivanov'
    dependencies = []
    files = ['45410', '45414', '45415', '45416']
    hgrepos = []
    issue_num = 28649
    keywords = ['patch']
    message_count = 61.0
    messages = ['280407', '280411', '280412', '280413', '280414', '280415', '280417', '280420', '280421', '280424', '280425', '280426', '280428', '280430', '280431', '280434', '280436', '280437', '280438', '280439', '280440', '280441', '280442', '280443', '280444', '280446', '280447', '280454', '280455', '280457', '280458', '280459', '280460', '280461', '280462', '280463', '280464', '280465', '280466', '280467', '280469', '280481', '280482', '280483', '280484', '280485', '280486', '280487', '280488', '280489', '280493', '280496', '280501', '280513', '280514', '280518', '280519', '280520', '280521', '280523', '280524']
    nosy_count = 7.0
    nosy_names = ['gvanrossum', 'ned.deily', 'python-dev', 'serhiy.storchaka', 'yselivanov', 'josh.r', 'levkivskyi']
    pr_nums = []
    priority = 'release blocker'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue28649'
    versions = ['Python 3.6', 'Python 3.7']

    @1st1
    Copy link
    Member Author

    1st1 commented Nov 9, 2016

    Looks like we have a refleak somewhere in the system that typing.py is exposing.

    The following code exposes the refleak:

        def test_refleak(self):
            T = typing.TypeVar('T')
            class C(typing.Generic[T], typing.Mapping[int, str]): ...

    The change that first triggered the leak is 8f0df4db2b06.

    I'm not an expert in typing.py, but I suspect that the real bug must be somewhere in CPython core.

    Guido, do you have any idea how 8f0df4db2b06 could trigger this?

    @1st1 1st1 added 3.7 (EOL) end of life release-blocker stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Nov 9, 2016
    @serhiy-storchaka
    Copy link
    Member

    Could it be related to leaks in objects with empty __slots__? See bpo-24379 for details.

    @1st1
    Copy link
    Member Author

    1st1 commented Nov 9, 2016

    Could it be related to leaks in objects with empty __slots__? See bpo-24379 for details.

    Not sure... Do we have a test/patch for objects-with-empty-slots bug?

    @gvanrossum
    Copy link
    Member

    Ivan, this was an early typing.py update (merged into CPython on Sept. 27). But the refleak is still with us. Do you have any intuition on what could be going on here?

    My own intuition here is lacking, other than thinking there's a lot of metaclass magic going on in this code...

    @serhiy-storchaka
    Copy link
    Member

    >>> import gc
    >>> gc.collect()
    23
    >>> gc.collect()
    0
    >>> class A:
    ...     __slots__ = ()
    ... 
    >>> del A
    >>> gc.collect()
    2
    ^ leak

    @serhiy-storchaka
    Copy link
    Member

    @gvanrossum
    Copy link
    Member

    Serhiy, why is gc.collect() returning 2 after del A proof of a leak?
    Yes, there's a cycle, but it's being GC'ed -- isn't that fine? Without the
    slots the collect() call returns a larger number.

    @1st1
    Copy link
    Member Author

    1st1 commented Nov 9, 2016

    Attached patch (slots_gc.patch) makes objects with empty slots tracked by GC:

       class A: __slots__ = ()
       gc.is_tracked(A()) == False  # before
       gc.is_tracked(A()) == True   # with the patch

    This doesn't fix the refleak though.

    @1st1
    Copy link
    Member Author

    1st1 commented Nov 9, 2016

    I removed all slots from typing.py, the refleak is still there.

    @1st1
    Copy link
    Member Author

    1st1 commented Nov 9, 2016

    Wow. In default CPython branch, with no modifications, test_typing.py simply crashes in debug mode:

    yury@ysmbp ~/dev/py/cpython $ ./python.exe -m test -R3:3 test_typing
    Run tests sequentially
    0:00:00 [1/1] test_typing
    beginning 6 repetitions
    123456
    .test test_typing failed -- Traceback (most recent call last):
      File "/Users/yury/dev/py/cpython/Lib/test/test_typing.py", line 739, in test_generic_forvard_ref
        self.assertEqual(get_type_hints(foobar, globals(), locals()), {'x': List[List[T]]})
    AssertionError: {'x': typing.List[typing.List[~T]]} != {'x': typing.List[typing.List[~T]]}
      {'x': typing.List[typing.List[~T]]}

    test_typing failed

    1 test failed:
    test_typing

    Total duration: 1 sec
    Tests result: FAILURE

    @1st1
    Copy link
    Member Author

    1st1 commented Nov 9, 2016

    The problem is that functools.lru_cache is used in _tp_cache. If I remove type caching, the original "refleak" is fixed.

    @MojoVampire
    Copy link
    Mannequin

    MojoVampire mannequin commented Nov 9, 2016

    Serhiy: I think you forgot to make a global instance of the type for your demonstration. You mentioned in msg253899 that the loop is:

    global instance -> type -> method -> module globals -> global instance

    The example you gave doesn't instantiate class A, and it's the instance of A that doesn't participate in GC and causes the problem, right?

    @1st1
    Copy link
    Member Author

    1st1 commented Nov 9, 2016

    Josh, Serhiy,

    I've created an issue for tracking empty slots types: bpo-28651.

    This issue is about typing.py using functools.lru_cache for internal cache That is what causing the refleaks, and inability of test_typing.py to be run in refleak hunting mode.

    @gvanrossum
    Copy link
    Member

    I could use a hint on a complete program that establishes whether there is
    or is not a refleak.

    @gvanrossum
    Copy link
    Member

    Good sleuthing! So the bug is in _lru_cache?

    On Wed, Nov 9, 2016 at 10:37 AM, Guido van Rossum <report@bugs.python.org>
    wrote:

    Guido van Rossum added the comment:

    I could use a hint on a complete program that establishes whether there is
    or is not a refleak.

    ----------


    Python tracker <report@bugs.python.org>
    <http://bugs.python.org/issue28649\>


    @serhiy-storchaka
    Copy link
    Member

    Serhiy, why is gc.collect() returning 2 after del A proof of a leak?

    Sorry, it was bad example. I don't remember all details.

    The problem is that functools.lru_cache is used in _tp_cache. If I remove type caching, the original "refleak" is fixed.

    Proposed patch clears caches when search for reference leaks. This decreases the number of leaks.

    Unpatched:
    test_typing leaked [3125, 3089, 2897] references, sum=9111
    test_typing leaked [1189, 1179, 1103] memory blocks, sum=3471

    Patched:
    test_typing leaked [125, 125, 125] references, sum=375
    test_typing leaked [49, 51, 51] memory blocks, sum=151

    See also bpo-23839.

    @gvanrossum
    Copy link
    Member

    I have no objections against typing-clear-caches.patch (though we'll have to make sure to apply it to the GitHub upstream as well -- I can take care of that once you merge it to CPython).

    Does that patch fix the refleak completely or only partially?

    I have no opinion on empty slots.

    @1st1
    Copy link
    Member Author

    1st1 commented Nov 9, 2016

    Good sleuthing! So the bug is in _lru_cache?

    I don't think it's a bug of lru_cache. It stores strong references for arguments and return value of the decorated function (which btw isn't documented). I don't think it's possible to write efficient generic lru_cache that would just use weakrefs.

    The problem is that because _tp_cache helper uses it, the lifetime of some typing classes becomes indefinite. Which shouldn't be a problem for a typical user of typing, since most of the time it is used on the module level.

    We have two potential solutions to this problem:

    1. Add a weak-ref type cache and refactor _tp_cache to use it;

    2. Cleanup lru-cache before/adter running each typing test.

    [1] is the ideal solution, [2] should work good enough.

    @ilevkivskyi
    Copy link
    Member

    Yury, Guido, I think I understand why test_typing crashes in refleak hunting mode. The test relies on caching, namely type variables are compared by id, so that if cache is cleared between those two lines, then test fails. I think we should just update those test to use a usual class in forward reference. The patch is attached.

    (Patch fixes the crash in refleak hunting mode, but the refleak is still there)

    @1st1
    Copy link
    Member Author

    1st1 commented Nov 9, 2016

    Does that patch fix the refleak completely or only partially?

    I tried to empty the cache in test_typing.BaseTestCase.setUp and tearDown; this doesn't fix all refleaks.

    Do we have some other caches in typing.py?

    @serhiy-storchaka
    Copy link
    Member

    Does that patch fix the refleak completely or only partially?

    It fixes 96% of refleaks.

    @ilevkivskyi
    Copy link
    Member

    Yury, _tp_cache should be the only one (of course most generic classes have _abc_cache since GenericMeta inherits from ABCMeta)

    @gvanrossum
    Copy link
    Member

    FYI I've opened python/typing#323 to track this upstream.

    @gvanrossum
    Copy link
    Member

    It fixes 96% of refleaks.

    In typing.py? Or generally?

    @1st1
    Copy link
    Member Author

    1st1 commented Nov 9, 2016

    Yury, _tp_cache should be the only one (of course most generic classes have _abc_cache since GenericMeta inherits from ABCMeta)

    Then there is at least one more bug not related to lru_cache (or any kind of cache in typing).

    And it's not classes-with-empty-slots bug either.

    @1st1
    Copy link
    Member Author

    1st1 commented Nov 9, 2016

    Serhiy's patch typing-clear-caches.patch looks good, we should apply it (fixing the code style a little bit).

    @1st1
    Copy link
    Member Author

    1st1 commented Nov 9, 2016

    Oh. This is a bug in C implementation of functools.lru_cache. I'll take a look.

    @ilevkivskyi
    Copy link
    Member

    My impression is like something wrong happens when TypeError is raised by a cached function.

    @ilevkivskyi
    Copy link
    Member

    A very simple repro:

        with self.assertRaises(TypeError):
            Union[[]]

    It looks like the problem happens when a non-hashable argument is passed to cached function

    @1st1
    Copy link
    Member Author

    1st1 commented Nov 9, 2016

    reflesk with only functools:

        def test_lru_type_error(self):
            @functools.lru_cache(maxsize=None)
            def foo(o):
                raise TypeError
    
            with self.assertRaises(TypeError):
                foo([])

    @1st1
    Copy link
    Member Author

    1st1 commented Nov 9, 2016

    Found it, will open a separate issue.

    @1st1 1st1 closed this as completed Nov 9, 2016
    @1st1
    Copy link
    Member Author

    1st1 commented Nov 9, 2016

    See http://bugs.python.org/issue28653 for details.

    @1st1
    Copy link
    Member Author

    1st1 commented Nov 10, 2016

    So Ivan made an interesting observation: if we use Python version of functools.lru_cache, typing tests start to leak like crazy:

    beginning 6 repetitions
    123456
    ......
    test_typing leaked [24980, 24980, 24980] references, sum=74940
    

    I experimented a bit, and I *think* I know what's happening:

    • typing uses an lru cache for types.

    • It looks like that some types in the cache are being *reused* by different tests.

    • Because typing doesn't set the size of the lru cache, it's set to 128 (default).

    • When many typing tests execute, the cache invalidates some entries, but because types in test_typing are so complex, I believe that the resulting graph of objects it too complex for the GC to go through (10s of thousands of cross-linked objects).

    A simple fix that removes fixes the refleak with Python version of lru_cache is to add a 'tearDown' method to 'test_typing.BaseTestCase':

        class BaseTestCase(TestCase):
            def tearDown(self):
                for f in typing._cleanups:
                    f()

    Now this all of this is just my guess of what's going on here. It might be something that has to be fixed in typing, or maybe we should just add tearDown.

    Guido, Ivan, what do you think?

    @1st1 1st1 reopened this Nov 10, 2016
    @gvanrossum
    Copy link
    Member

    But why is this so different from the C implementation of lru_cache?

    @1st1
    Copy link
    Member Author

    1st1 commented Nov 10, 2016

    But why is this so different from the C implementation of lru_cache?

    I don't know. Maybe C version of lru cache creates a bit less pressure on GC.

    BTW, if I set maxsize=100000 for typing lru_cache, test_generic_forward_ref crashes in refleak-test mode. Maybe this is another bug...

    @1st1
    Copy link
    Member Author

    1st1 commented Nov 10, 2016

    > But why is this so different from the C implementation of lru_cache?

    I don't know. Maybe C version of lru cache creates a bit less pressure on GC.

    Actually test/refleak.py now cleans up typing's lru cache & calls "gc.collect()". It seems that the typing cycles aren't collectable by the GC when Python lru cache is used.

    @ilevkivskyi
    Copy link
    Member

    • It looks like that some types in the cache are being *reused* by different tests.

    I also noticed this some time ago, ideally we need to make all tests more isolated between each other.

    Now this all of this is just my guess of what's going on here. It might be something that has to be fixed in typing, or maybe we should just add tearDown.

    adding tearDown is a good intermediate solution, while I could go through tests and try to make them more isolated from each other where necessary.

    BTW, if I set maxsize=100000 for typing lru_cache, test_generic_forward_ref crashes in refleak-test mode. Maybe this is another bug...

    I also noticed this today, and this is something that I don't understand yet. This test should not depend on caching, but it looks like it does. I will think more about this.

    @gvanrossum
    Copy link
    Member

    Segfault or traceback?

    Anyway, it implements a doubly-linked list where each node is implemented
    as a list of 4 items... Maybe there's something in the GC code that
    recurses down at the C level??? I'm not sure a crash the Python version of
    lru_cache with maxsize=100000 is high on my list of worries.

    @1st1
    Copy link
    Member Author

    1st1 commented Nov 10, 2016

    Segfault or traceback?

    Traceback. I should have said failed, not crashed :(

    I'm not sure a crash the Python version of
    lru_cache with maxsize=100000 is high on my list of worries.

    The problem is that the test fails depending on cache size, and caching code shouldn't really do that ever.

    @ilevkivskyi
    Copy link
    Member

    It looks like the tearDown() works really well. It also somehow fixes the problem with test_generic_forward_ref

    @ilevkivskyi
    Copy link
    Member

    I think Yury is right here. The good plan would be to add the tearDown(), but also fix the problem with test_generic_forward_ref. I could work on it tomorrow (really don't have time today, sorry).

    @ilevkivskyi
    Copy link
    Member

    Guido, Yury, it looks like I solved the puzzle. All the remaining problems are because of forward references. In particular, _ForwardRef keeps a reference to the frame where it was defined:

            typing_globals = globals()
            frame = sys._getframe(1)
            while frame is not None and frame.f_globals is typing_globals:
                frame = frame.f_back
            assert frame is not None
            self.__forward_frame__ = frame

    This is old code from 2015 introduced to support __isinstance__ for forward refs. The latter is no more supported anyway, so that I believe this code should be removed.

    I will make a PR upstream soon.

    @ilevkivskyi
    Copy link
    Member

    BTW, if I set maxsize=100000 for typing lru_cache, test_generic_forward_ref crashes in refleak-test mode. Maybe this is another bug...

    This one is also related to the mentioned in my previous message. Namely, forward references are only evaluated once. However, in refleak hunting mode tests are repeated multiple times, and every time test_generic_forward_ref is called, this code:

        class CC: ...

    creates a new local class, while forward ref is still pointing to an old one.

    There are two options:

    • keep this behaviour and update the test to use a globally defined class;
    • change this and re-evaluate forward refs every times.

    However I would vote for a compromise solution: if the locals argument is not given, just use the evaluated value. But if a user gives non-None locals argument then that user most probably wants to re-evaluate the forward ref locally.

    @ilevkivskyi
    Copy link
    Member

    OK, here are the PRs:

    python/typing#327
    python/typing#328

    Those two seem to fix everything, I tried various cache sizes with both C and Python versions of lru_cache and found no refleaks.

    @1st1
    Copy link
    Member Author

    1st1 commented Nov 10, 2016

    Good work, Ivan!

    Those two seem to fix everything, I tried various cache sizes with both C and Python versions of lru_cache and found no refleaks.

    Speaking of lru_cache: typing currently doesn't configure maxsize, which means that the cache is limited to 128 items. Is that OK? Should maxsize be set to None to make the cache unlimited, or, maybe, a higher number like 1024 is more appropriate?

    @gvanrossum
    Copy link
    Member

    We discussed cache size previously and 128 is fine.

    --Guido (mobile)

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Nov 10, 2016

    New changeset 0da2e381ad71 by Guido van Rossum in branch '3.5':
    Issue bpo-28649: fix first issue with _ForwardRef (#327)
    https://hg.python.org/cpython/rev/0da2e381ad71

    New changeset 249a1f0b2857 by Guido van Rossum in branch '3.6':
    Issue bpo-28649: fix first issue with _ForwardRef (#327) (3.5->3.6)
    https://hg.python.org/cpython/rev/249a1f0b2857

    New changeset 304f017462e4 by Guido van Rossum in branch 'default':
    Issue bpo-28649: fix first issue with _ForwardRef (#327) (3.6->3.7)
    https://hg.python.org/cpython/rev/304f017462e4

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Nov 10, 2016

    New changeset 555f0ca31587 by Guido van Rossum in branch '3.5':
    Issue bpo-28649: fix second issue with _ForwardRef (#328)
    https://hg.python.org/cpython/rev/555f0ca31587

    New changeset 0f863906cf2e by Guido van Rossum in branch '3.6':
    Issue bpo-28649: fix second issue with _ForwardRef (#328) (3.5->3.6)
    https://hg.python.org/cpython/rev/0f863906cf2e

    New changeset 84b4ac243508 by Guido van Rossum in branch 'default':
    Issue bpo-28649: fix second issue with _ForwardRef (#328) (3.6->3.7)
    https://hg.python.org/cpython/rev/84b4ac243508

    @gvanrossum
    Copy link
    Member

    Thanks a bundle, Ivan! I've merged those two into CPython 3.5, 3.6, 3.7. We can now watch the build bots for a while and double-check the refleaks.

    Yury, when you're happy with the situation can you close this issue?

    @1st1
    Copy link
    Member Author

    1st1 commented Nov 10, 2016

    Thank you Guido and Ivan!

    Yury, when you're happy with the situation can you close this issue?

    I'll close it now, will reopen if something comes up.

    @1st1 1st1 closed this as completed Nov 10, 2016
    @gvanrossum
    Copy link
    Member

    BTW, do we still want the tearDown()? Or is that no longer necessary?

    @ilevkivskyi
    Copy link
    Member

    I think tearDown() is not necessary. But on the other hand it could be nice to have a method in tests to force cache clear. I would propose it not as a default, but as an opt-in to check that cache works well, or that a certain tested feature is "cache independent", etc.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life release-blocker stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    4 participants