Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refleak test failure in test_exceptions #89058

Closed
iritkatriel opened this issue Aug 12, 2021 · 73 comments
Closed

refleak test failure in test_exceptions #89058

iritkatriel opened this issue Aug 12, 2021 · 73 comments
Labels
3.10 only security fixes 3.11 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs)

Comments

@iritkatriel
Copy link
Member

iritkatriel commented Aug 12, 2021

BPO 44895
Nosy @gvanrossum, @vstinner, @cjerdonek, @ambv, @markshannon, @corona10, @pablogsal, @miss-islington, @iritkatriel
PRs
  • bpo-44895: Temporarily add an extra gc.collect() call #27746
  • [3.10] bpo-44895: Temporarily add an extra gc.collect() call (GH-27746) #27753
  • bpo-44895: skip test_no_hang_on_context_chain_cycle2 until the reflea… #27761
  • bpo-44895: Introduce PYTHONDUMPREFSFILE variable for refcount dumping #27767
  • [3.10] bpo-44895: skip test_no_hang_on_context_chain_cycle2 until the refleak is fixed (GH-27761) #27778
  • bpo-44895: libregrtest: refleak check clears types later #28113
  • Files
  • cpython-1628938551.dump.gz
  • refleak.py
  • strict_refleak_check.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2021-08-12.00:13:38.919>
    labels = ['interpreter-core', '3.10', '3.11']
    title = 'refleak test failure in test_exceptions'
    updated_at = <Date 2021-09-01.16:43:27.187>
    user = 'https://github.com/iritkatriel'

    bugs.python.org fields:

    activity = <Date 2021-09-01.16:43:27.187>
    actor = 'vstinner'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Interpreter Core']
    creation = <Date 2021-08-12.00:13:38.919>
    creator = 'iritkatriel'
    dependencies = []
    files = ['50216', '50224', '50251']
    hgrepos = []
    issue_num = 44895
    keywords = ['patch']
    message_count = 71.0
    messages = ['399424', '399449', '399450', '399451', '399452', '399453', '399456', '399457', '399459', '399463', '399472', '399485', '399492', '399507', '399510', '399513', '399516', '399517', '399533', '399537', '399539', '399540', '399541', '399542', '399543', '399546', '399554', '399555', '399556', '399557', '399560', '399578', '399588', '399589', '399601', '399636', '399639', '399676', '399681', '399682', '399696', '399697', '399700', '399703', '399739', '399740', '399762', '399764', '399781', '399785', '399786', '399788', '399789', '399792', '399794', '399795', '399796', '399800', '399855', '399856', '399857', '400645', '400741', '400757', '400779', '400794', '400840', '400844', '400848', '400853', '400854']
    nosy_count = 9.0
    nosy_names = ['gvanrossum', 'vstinner', 'chris.jerdonek', 'lukasz.langa', 'Mark.Shannon', 'corona10', 'pablogsal', 'miss-islington', 'iritkatriel']
    pr_nums = ['27746', '27753', '27761', '27767', '27778', '28113']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue44895'
    versions = ['Python 3.10', 'Python 3.11']

    Linked PRs

    @iritkatriel
    Copy link
    Member Author

    For background see https://bugs.python.org/issue33930#msg399403

    iritkatriel@Irits-MBP cpython % repeat 10 ./python.exe -m test -R 3:3 test_exceptions -m test_no_hang_on_context_chain_cycle2 -m test_recursion_normalizing_infinite_exception -m test_recursion_in_except_handler -m test_recursion_normalizing_with_no_memory
    0:00:00 load avg: 2.32 Run tests sequentially
    0:00:00 load avg: 2.32 [1/1] test_exceptions
    beginning 6 repetitions
    123456
    ......

    == Tests result: SUCCESS ==

    1 test OK.

    Total duration: 5.9 sec
    Tests result: SUCCESS
    0:00:00 load avg: 2.22 Run tests sequentially
    0:00:00 load avg: 2.22 [1/1] test_exceptions
    beginning 6 repetitions
    123456
    ......

    == Tests result: SUCCESS ==

    1 test OK.

    Total duration: 5.8 sec
    Tests result: SUCCESS
    0:00:00 load avg: 2.20 Run tests sequentially
    0:00:00 load avg: 2.20 [1/1] test_exceptions
    beginning 6 repetitions
    123456
    ......

    == Tests result: SUCCESS ==

    1 test OK.

    Total duration: 5.8 sec
    Tests result: SUCCESS
    0:00:00 load avg: 2.17 Run tests sequentially
    0:00:00 load avg: 2.17 [1/1] test_exceptions
    beginning 6 repetitions
    123456
    ......
    test_exceptions leaked [6, 6, 4] references, sum=16
    test_exceptions leaked [6, 6, 4] memory blocks, sum=16
    test_exceptions failed (reference leak)

    == Tests result: FAILURE ==

    1 test failed:
    test_exceptions

    1 re-run test:
    test_exceptions

    Total duration: 5.9 sec
    Tests result: FAILURE
    0:00:00 load avg: 2.08 Run tests sequentially
    0:00:00 load avg: 2.08 [1/1] test_exceptions
    beginning 6 repetitions
    123456
    ......
    test_exceptions leaked [6, 6, 6] references, sum=18
    test_exceptions leaked [6, 6, 6] memory blocks, sum=18
    test_exceptions failed (reference leak)

    == Tests result: FAILURE ==

    1 test failed:
    test_exceptions

    1 re-run test:
    test_exceptions

    Total duration: 5.8 sec
    Tests result: FAILURE
    0:00:00 load avg: 2.39 Run tests sequentially
    0:00:00 load avg: 2.39 [1/1] test_exceptions
    beginning 6 repetitions
    123456
    ......
    test_exceptions leaked [6, 6, 6] references, sum=18
    test_exceptions leaked [6, 6, 6] memory blocks, sum=18
    test_exceptions failed (reference leak)

    == Tests result: FAILURE ==

    1 test failed:
    test_exceptions

    1 re-run test:
    test_exceptions

    Total duration: 6.0 sec
    Tests result: FAILURE
    0:00:00 load avg: 2.36 Run tests sequentially
    0:00:00 load avg: 2.36 [1/1] test_exceptions
    beginning 6 repetitions
    123456
    ......
    test_exceptions leaked [6, 6, 6] references, sum=18
    test_exceptions leaked [6, 6, 6] memory blocks, sum=18
    test_exceptions failed (reference leak)

    == Tests result: FAILURE ==

    1 test failed:
    test_exceptions

    1 re-run test:
    test_exceptions

    Total duration: 6.0 sec
    Tests result: FAILURE
    0:00:00 load avg: 2.31 Run tests sequentially
    0:00:00 load avg: 2.31 [1/1] test_exceptions
    beginning 6 repetitions
    123456
    ......

    == Tests result: SUCCESS ==

    1 test OK.

    Total duration: 6.3 sec
    Tests result: SUCCESS
    0:00:00 load avg: 2.20 Run tests sequentially
    0:00:00 load avg: 2.20 [1/1] test_exceptions
    beginning 6 repetitions
    123456
    ......
    test_exceptions leaked [6, 6, 6] references, sum=18
    test_exceptions leaked [6, 6, 6] memory blocks, sum=18
    test_exceptions failed (reference leak)

    == Tests result: FAILURE ==

    1 test failed:
    test_exceptions

    1 re-run test:
    test_exceptions

    Total duration: 6.1 sec
    Tests result: FAILURE
    0:00:00 load avg: 2.35 Run tests sequentially
    0:00:00 load avg: 2.35 [1/1] test_exceptions

    @iritkatriel iritkatriel added 3.11 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) labels Aug 12, 2021
    @vstinner
    Copy link
    Member

    I fail to reproduce the issue on Linux on the 3.10 branch. I interrupted the job after 20 minutes:

    $ ./python -m test -R 3:3 test_exceptions -m test_no_hang_on_context_chain_cycle2 -m test_recursion_normalizing_infinite_exception -m test_recursion_in_except_handler -m test_recursion_normalizing_with_no_memory -F -j1
    (...)
    0:21:38 load avg: 2.01 [134] test_exceptions passed
    beginning 6 repetitions
    123456
    ......

    I tested the commit 5d44443.

    @iritkatriel
    Copy link
    Member Author

    I see it on Mac but not windows.

    Running the test with -v or adding prints makes the leak go away. I’m suspecting that one of the recursion tests fails to clear an exception or something like that.

    Removing any one of the tests makes it stop failing.

    @iritkatriel
    Copy link
    Member Author

    Victor, if the …cycle2 test is hanging then you need the change in PR27626.

    @vstinner
    Copy link
    Member

    I am not more lucky in the main branch. I ran it for 30 minutes and it didn't fail.

    $ ./python -m test -R 3:3 test_exceptions -m test_no_hang_on_context_chain_cycle2 -m test_recursion_normalizing_infinite_exception -m test_recursion_in_except_handler -m test_recursion_normalizing_with_no_memory -F -j1
    (...)
    0:31:15 load avg: 1.88 [184] test_exceptions passed
    beginning 6 repetitions
    123456
    ......
    ^C
    Kill <TestWorkerProcess #1 running test=test_exceptions pid=173796 time=4.2 sec> process group

    == Tests result: INTERRUPTED ==
    Test suite interrupted by signal SIGINT.

    184 tests OK.

    Total duration: 31 min 19 sec
    Tests result: INTERRUPTED

    I tested commit 8ac0886.

    @vstinner
    Copy link
    Member

    Victor, if the …cycle2 test is hanging then you need the change in PR27626.

    That's the commit d5c2174. My main branch is up to date (commit 8ac0886), it includes the commit d5c2174.

    @iritkatriel
    Copy link
    Member Author

    I don't know why you need to kill it. It's not a long test.

    @vstinner
    Copy link
    Member

    I see it on Mac but not windows.

    Aha, maybe there is something specific on Mac which triggers an exact code path to trigger to bug. I already saw bugs like that.

    You can try to play with gc.set_threshold() to change how garbage collection is triggered.

    @iritkatriel
    Copy link
    Member Author

    I don't know why you need to kill it. It's not a long test.

    Ah I see, -F.

    @iritkatriel
    Copy link
    Member Author

    The problem disappears if I add a gc.collect() loop at the beginning of the new test:

    diff --git a/Lib/test/test_exceptions.py b/Lib/test/test_exceptions.py
    index 79798ecf05..e0aeac9d10 100644
    --- a/Lib/test/test_exceptions.py
    +++ b/Lib/test/test_exceptions.py
    @@ -1014,6 +1014,7 @@ def cycle():
     
         def test_no_hang_on_context_chain_cycle2(self):
             # See issue 25782. Cycle at head of context chain.
    +        while gc.collect(): pass
     
             class A(Exception):
                 pass

    @iritkatriel
    Copy link
    Member Author

    This fails (test_recursion_normalizing_with_no_memory is not needed):

    ./python.exe -m test -R 3:3 test_exceptions -m test_no_hang_on_context_chain_cycle2 -m test_recursion_normalizing_infinite_exception -m test_recursion_in_except_handler

    @iritkatriel
    Copy link
    Member Author

    My smallest failing example so far:

        def test_recursion_in_except_handler(self):
    
            self.test_no_hang_on_context_chain_cycle2()
    
            def set_relative_recursion_limit(n):
                depth = 1
                while True:
                    try:
                        sys.setrecursionlimit(depth)
                    except RecursionError:
                        depth += 1
                    else:
                        break
                sys.setrecursionlimit(depth+n)
    
            def recurse_in_body_and_except():
                try:
                    recurse_in_body_and_except()
                except:
                    recurse_in_body_and_except()
    
            recursionlimit = sys.getrecursionlimit()
            try:
                set_relative_recursion_limit(10)
                try:
                    recurse_in_body_and_except()
                except RecursionError:
                    pass
                else:
                    self.fail("Should have raised a RecursionError")
            finally:
                sys.setrecursionlimit(recursionlimit)

    @cjerdonek
    Copy link
    Member

    Out of curiosity, is the failure deterministic in environments where it fails? If not, what is the source of the indeterminism -- some kind of race condition or something else?

    @iritkatriel
    Copy link
    Member Author

    It’s not deterministic.

    @ambv
    Copy link
    Contributor

    ambv commented Aug 13, 2021

    New changeset 7bf28cb by Irit Katriel in branch 'main':
    bpo-44895: Temporarily add an extra gc.collect() call (GH-27746)
    7bf28cb

    @iritkatriel
    Copy link
    Member Author

    what is the source of the indeterminism -- some kind of race condition or something else?

    My hunch is it could be timing of GC. But it's hard to say at this point.

    @miss-islington
    Copy link
    Contributor

    New changeset ebc5926 by Miss Islington (bot) in branch '3.10':
    bpo-44895: Temporarily add an extra gc.collect() call (GH-27746)
    ebc5926

    @ambv
    Copy link
    Contributor

    ambv commented Aug 13, 2021

    I merged the workaround to get refleak buildbots unstuck for the weekend. I bumped the issue stage back to "needs patch" as the available one isn't a solution.

    @vstinner
    Copy link
    Member

    Chris:

    Out of curiosity, is the failure deterministic in environments where it fails? If not, what is the source of the indeterminism -- some kind of race condition or something else?

    The GC uses counters and thresholds to decide which collection and when they should be collected. I tried to play with gc.set_threshold(), but I failed to reproduce the issue on Linux.

    Maybe it's a very precise threshold which triggers the issue. Between Linux and macOS, the site module executes different code paths which produce different GC counters.

    Sometimes, the GC must happen in a very precise line, one line later is too late. See bpo-44422 for a case of threading.enumerate() (different kind of bug, but it's related to the GC). See bpo-44184 for another example of GC bug which only occurred on Windows and only if you type an exact command line.

    @pablogsal
    Copy link
    Member

    I have tried to reproduce this on my Linux and Mac machines and I have been unable to do it for the time being in either of them.

    @iritkatriel
    Copy link
    Member Author

    The script below hangs on my mac (it's an extract from test_recursion_in_except_handler).

    -----------

    import sys
    
    count = 0
    def main():
    
      def f():
        global count
        count += 1
        try:
            f()
        except RecursionError:
            f()
    
      sys.setrecursionlimit(30)

    try:
    f()
    except RecursionError:
    pass

    main()
    print(count)

    When I kill it the traceback shows it alternating between the two recursive calls, but not in a regular pattern:

    ... [snipped a lot]

    File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
    f()
    ^^^
    File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
    f()
    ^^^
    [Previous line repeated 2 more times]
    RecursionError: maximum recursion depth exceeded

    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
        f()
        ^^^
    RecursionError: maximum recursion depth exceeded
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
        f()
        ^^^
      File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
        f()
        ^^^
    RecursionError: maximum recursion depth exceeded
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
        f()
        ^^^
    RecursionError: maximum recursion depth exceeded
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
        f()
        ^^^
      File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
        f()
        ^^^
      File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
        f()
        ^^^
    RecursionError: maximum recursion depth exceeded
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
        f()
        ^^^
    RecursionError: maximum recursion depth exceeded
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
        f()
        ^^^
      File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
        f()
        ^^^
    RecursionError: maximum recursion depth exceeded
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
        f()
        ^^^
    RecursionError: maximum recursion depth exceeded
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
        f()
        ^^^
      File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
        f()
        ^^^
      File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
        f()
        ^^^
      File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
        f()
        ^^^
    RecursionError: maximum recursion depth exceeded
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
        f()
        ^^^
    RecursionError: maximum recursion depth exceeded
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
        f()
        ^^^
      File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
        f()
        ^^^
    RecursionError: maximum recursion depth exceeded
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
        f()
        ^^^
    RecursionError: maximum recursion depth exceeded
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
        f()
        ^^^
      File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
        f()
        ^^^
      File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
        f()
        ^^^
    RecursionError: maximum recursion depth exceeded
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
        f()
        ^^^
    RecursionError: maximum recursion depth exceeded
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
        f()
        ^^^
      File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
        f()
        ^^^
    RecursionError: maximum recursion depth exceeded
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
        f()
        ^^^
    RecursionError: maximum recursion depth exceeded
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
        f()
        ^^^
      File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
        f()
        ^^^
      File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
        f()
        ^^^
      File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
        f()
        ^^^
      [Previous line repeated 1 more time]
    RecursionError: maximum recursion depth exceeded
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/Users/iritkatriel/src/cpython/tt.py", line 22, in <module>
        main()
        ^^^^^^
      File "/Users/iritkatriel/src/cpython/tt.py", line 18, in main
        f()
        ^^^
      File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
        f()
        ^^^
      File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
        f()
        ^^^
      File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
        f()
        ^^^
      [Previous line repeated 10 more times]
      File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
        f()
        ^^^
      File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
        f()
        ^^^
      File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
        f()
        ^^^
      File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
        f()
        ^^^
      File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
        f()
        ^^^
      File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
        f()
        ^^^
      File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
        f()
        ^^^
      File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
        f()
        ^^^
      File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
        f()
        ^^^
      File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
        f()
        ^^^
      File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
        f()
        ^^^
      File "/Users/iritkatriel/src/cpython/tt.py", line 7, in f
        def f():
    KeyboardInterrupt

    @iritkatriel
    Copy link
    Member Author

    Adding Mark since he worked on recursion recently.

    @corona10
    Copy link
    Member

    From my macOS, I was not able to reproduce the issue also as like as Victor and Pablo.

    repeat 10 ./python.exe -m test -R 3:3 test_exceptions -m test_no_hang_on_context_chain_cycle2 -m test_recursion_normalizing_infinite_exception -m test_recursion_in_except_handler -m test_recursion_normalizing_with_no_memory

    However hang was able to reproduce.

    @corona10
    Copy link
    Member

    However hang was able to reproduce.

    I meant msg399539

    @corona10
    Copy link
    Member

    Oops sorry.

    Without PR 27746, I was able to reproduce on my macOS environment.

    test_exceptions leaked [6, 6, 6] references, sum=18
    test_exceptions leaked [6, 6, 6] memory blocks, sum=18
    test_exceptions failed (reference leak)

    @iritkatriel
    Copy link
    Member Author

    Regarding the hang in msg399539, do we need some special handling of the case were a RecursionError is being raised and its context is another RecursionError? Like a FatalError?

    @iritkatriel
    Copy link
    Member Author

    But this makes it not leak again, so now we know it's the traceback:

        try:
            raise B()
        except B:
            b = sys.exc_info()[1]
            b.__traceback__ = None
    

    @pablogsal
    Copy link
    Member

    I am still amazed that I cannot reproduce the leaks on my machine. Maybe I am missing some steps.

    Could you outline how are you reproducing the leaks?

    @pablogsal
    Copy link
    Member

    But this makes it not leak again, so now we know it's the traceback:

    The traceback is a very heavy gc object because it contains the frames, and in the locals of the frame is the traceback itself via the exception. This is a well known heavy cycle, but the gc should take care of it if the cycle becomes isolated

    @iritkatriel
    Copy link
    Member Author

    This doesn't leak:

        try:
            raise B()
        except B:
            tb = sys.exc_info()[1].__traceback__
    

    So this isn't about the traceback being part of the cycle. The B() needs to be there too.

    @markshannon
    Copy link
    Member

    refleak.py seems to run forever for me. Got bored after 6000 iterations.

    Which O/S and configure options are needed for refleak.py to fail?

    @iritkatriel
    Copy link
    Member Author

    I'm using a mac. It typically fails within at most 10-20 iterations.

    @iritkatriel
    Copy link
    Member Author

    Could you outline how are you reproducing the leaks?

    Sorry I missed this. Just run the script:

    iritkatriel@Irits-MBP cpython-1 % ./python.exe refleak.py
    -- 1 --
    beginning 6 repetitions
    123456
    ......
    -- 2 --
    beginning 6 repetitions
    123456
    ......
    -- 3 --
    beginning 6 repetitions
    123456
    ......
    test leaked [2, 4, 2] references, sum=8
    test leaked [2, 4, 2] memory blocks, sum=8

    @pablogsal
    Copy link
    Member

    There is some timing-related race condition at hand because I tried in a Linux machine and a MacBook Pro (intel) laptop and I wasn't able to reproduce after 20 min of the script running :(

    @vstinner
    Copy link
    Member

    Irit: do you still reproduce the issue using gc.disable()? Or you can try different values to call gc.set_threshold(). You may also try different values in ns.huntrleaks, like: ns.huntrleaks = (3, 20, 'tt_out.py'). With 20 iterations, do you still reproduce the leak?

    To make the issue more likely, you can change check_rc_deltas() in Lib/test/libregrtest/refleak.py. Try:

        def check_rc_deltas(deltas):
            return any(delta >= 1 for delta in deltas)

    With this change, the script stops immediately:

    test leaked [-4, 4, -4] references, sum=-4
    test leaked [-4, 4, -4] memory blocks, sum=-4

    Using 20 iterations, I get:

    test leaked [-4, 4, -4, 4, -4, 4, -4, 4, -4, 4, -4, 4, -4, 4, -4, 4, -4, 4, -4, 4] references, sum=0
    test leaked [-4, 4, -4, 4, -4, 4, -4, 4, -4, 4, -4, 4, -4, 4, -4, 4, -4, 4, -4, 4] memory blocks, sum=0

    @vstinner
    Copy link
    Member

    test leaked [-4, 4, -4] references, sum=-4

    If I disable the GC, the script no longer detects "leaks". Same using a GC threshold of 800.

    Using a GC threshold between 100 and 700, I reproduce the [-4, 4, -4] references leak.

    Funny bug.

    @vstinner
    Copy link
    Member

    You can also play with the clear() method of frame objects and traceback.clear_frames() (which expects a traceback object).

    @iritkatriel
    Copy link
    Member Author

    I think the leak happens when we invoke GC while the recursion limit is exceeded. It goes way if make this change:

        def recurse_in_body_and_except():
            try:
                recurse_in_body_and_except()
            except RecursionError as e:
                gc.disable()
                recurse_in_body_and_except()
                gc.enable()

    I also added a __del__ to B and saw it being called when recursion limit is small (though adding the __del__ makes the leak go away. Actually just adding "def f(): pass" is enough to make it go away).

    @vstinner
    Copy link
    Member

    A variant of this problem: bpo-36560 with a leak 1 memory block per iteration.

    @iritkatriel
    Copy link
    Member Author

    Victor - I am also getting a feeling that this may not be a real leak but rather a bug in the refleaks test bookkeeping.

    It could be a problem with the way _Py_RefTotal is incremented/decremented.

    @vstinner
    Copy link
    Member

    It could be a problem with the way _Py_RefTotal is incremented/decremented.

    It can be a cache which is kept alive and then clear the following iteration. See my previous comment:

    test leaked [-4, 4, -4, 4, -4, 4, -4, 4, -4, 4, -4, 4, -4, 4, -4, 4, -4, 4, -4, 4] references, sum=0

    libregrtest/refleak.py does its best to clear *all* caches, type caches, free lists, etc. A single cache which is not cleared can confused libregrtest.

    @iritkatriel
    Copy link
    Member Author

    Oh, I see now what you meant.

    I think you're right that it's cleanup. I changed dash_R_cleanup to put the sys._clear_type_cache() after clear_caches() and the leak is gone:

    @@ -166,9 +166,6 @@ def dash_R_cleanup(fs, ps, pic, zdc, abcs):
             zipimport._zip_directory_cache.clear()
             zipimport._zip_directory_cache.update(zdc)
     
    -    # clear type cache
    -    sys._clear_type_cache()
    -
         # Clear ABC registries, restoring previously saved ABC registries.
         abs_classes = [getattr(collections.abc, a) for a in collections.abc.__all__]
         abs_classes = filter(isabstract, abs_classes)
    @@ -181,6 +178,12 @@ def dash_R_cleanup(fs, ps, pic, zdc, abcs):
     
         clear_caches()
     
    +    # clear type cache
    +    sys._clear_type_cache()
    +
    +    support.gc_collect()
    +

    I did this because I suspected it's something related to types, because it only happen when we define the exception classes (A and B) inside the test function.

    So maybe the type objects of these exceptions are not cleared from the type cache because they have refcount > 0 but by the time clear_caches() and gc.collect() finish they are cleared?

    Note this doesn't mean the change I made is always right - maybe the whole body of dash_R_cleanup should be in a loop until nothing changes?

    @vstinner
    Copy link
    Member

    vstinner commented Sep 1, 2021

    I changed dash_R_cleanup to put the sys._clear_type_cache() after clear_caches() and the leak is gone

    Aha! The order of function calls in dash_R_cleanup() matters.

    Calling sys._clear_type_cache() at the end of dash_R_cleanup() perfectly makes sense. If you do that, please add a comment to explain that the order matters. For example, the code about ABC classes is likely to modify classes.

    About the gc_collect() call, I dislike calling it in clear_caches() and then again in dash_R_cleanup(). I would even prefer to *remove* it from both functions and call it explicitly at the call site: in dash_R() and in runtest.py.

    @vstinner
    Copy link
    Member

    vstinner commented Sep 1, 2021

    New changeset 679cb47 by Victor Stinner in branch 'main':
    bpo-44895: libregrtest: refleak check clears types later (GH-28113)
    679cb47

    @iritkatriel
    Copy link
    Member Author

    With your patch I don't see a leak for refleak.py, but I still see one for

    ./python.exe -m test -R 3:3 test_exceptions -m test_no_hang_on_context_chain_cycle2 -m test_recursion_normalizing_infinite_exception -m test_recursion_in_except_handler -m test_recursion_normalizing_with_no_memory -F -j1

    (once I remove the @Skip on test_no_hang_on_context_chain_cycle2).

    However, if I make this change it looks better:

    --- a/Lib/test/libregrtest/refleak.py
    +++ b/Lib/test/libregrtest/refleak.py
    @@ -90,8 +90,9 @@ def get_pooled_int(value):
         for i in rep_range:
             test_func()
     
    -        dash_R_cleanup(fs, ps, pic, zdc, abcs)
    -        support.gc_collect()
    +        for i in range(3):
    +            dash_R_cleanup(fs, ps, pic, zdc, abcs)
    +            support.gc_collect()

    @vstinner
    Copy link
    Member

    vstinner commented Sep 1, 2021

    (once I remove the @Skip on test_no_hang_on_context_chain_cycle2).

    Oh. I forgot that the test is still skipped!

    With your patch I don't see a leak for refleak.py, but I still see one for
    ./python.exe -m test -R 3:3 test_exceptions -m test_no_hang_on_context_chain_cycle2 -m test_recursion_normalizing_infinite_exception -m test_recursion_in_except_handler -m test_recursion_normalizing_with_no_memory -F -j1

    If I apply attached strict_refleak_check.patch and I remove @unittest.skip("See bpo-44895") of test_no_hang_on_context_chain_cycle2(), I get:

    $ ./python -m test -R 3:3 test_exceptions -m test_no_hang_on_context_chain_cycle2 -m test_recursion_normalizing_infinite_exception -m test_recursion_in_except_handler -m test_recursion_normalizing_with_no_memory

    (...)
    test_exceptions leaked [6, -6, 6] references, sum=6
    (...)

    @iritkatriel
    Copy link
    Member Author

    I changed it to

    + for i in range(2):
    + dash_R_cleanup(fs, ps, pic, zdc, abcs)
    + support.gc_collect()

    (loop twice instead of 3 times) and that seems fine too.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @iritkatriel
    Copy link
    Member Author

    I was unable to reproduce this locally now. Making a PR to unskip the test, we'll see what the buildbots say.

    @iritkatriel
    Copy link
    Member Author

    Test has been re-enabled. Let's close this in the spirit of optimism - if the refleak returns we can reopen.

    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.10 only security fixes 3.11 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs)
    Projects
    None yet
    Development

    No branches or pull requests

    9 participants