This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: refleak test failure in test_exceptions
Type: Stage: patch review
Components: Interpreter Core Versions: Python 3.11, Python 3.10
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Mark.Shannon, chris.jerdonek, corona10, gvanrossum, iritkatriel, lukasz.langa, miss-islington, pablogsal, vstinner
Priority: normal Keywords: patch

Created on 2021-08-12 00:13 by iritkatriel, last changed 2022-04-11 14:59 by admin.

Files
File name Uploaded Description Edit
cpython-1628938551.dump.gz corona10, 2021-08-14 10:59
refleak.py iritkatriel, 2021-08-16 20:13
strict_refleak_check.patch vstinner, 2021-09-01 16:43
Pull Requests
URL Status Linked Edit
PR 27746 merged iritkatriel, 2021-08-12 16:59
PR 27753 merged miss-islington, 2021-08-13 09:41
PR 27761 merged iritkatriel, 2021-08-13 15:24
PR 27767 merged corona10, 2021-08-14 11:01
PR 27778 merged miss-islington, 2021-08-16 08:36
PR 28113 merged vstinner, 2021-09-01 15:26
Messages (71)
msg399424 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2021-08-12 00:13
For background see https://bugs.python.org/issue33930#msg399403


iritkatriel@Irits-MBP cpython % repeat 10 ./python.exe -m test -R 3:3 test_exceptions -m test_no_hang_on_context_chain_cycle2 -m test_recursion_normalizing_infinite_exception -m test_recursion_in_except_handler -m test_recursion_normalizing_with_no_memory
0:00:00 load avg: 2.32 Run tests sequentially
0:00:00 load avg: 2.32 [1/1] test_exceptions
beginning 6 repetitions
123456
......

== Tests result: SUCCESS ==

1 test OK.

Total duration: 5.9 sec
Tests result: SUCCESS
0:00:00 load avg: 2.22 Run tests sequentially
0:00:00 load avg: 2.22 [1/1] test_exceptions
beginning 6 repetitions
123456
......

== Tests result: SUCCESS ==

1 test OK.

Total duration: 5.8 sec
Tests result: SUCCESS
0:00:00 load avg: 2.20 Run tests sequentially
0:00:00 load avg: 2.20 [1/1] test_exceptions
beginning 6 repetitions
123456
......

== Tests result: SUCCESS ==

1 test OK.

Total duration: 5.8 sec
Tests result: SUCCESS
0:00:00 load avg: 2.17 Run tests sequentially
0:00:00 load avg: 2.17 [1/1] test_exceptions
beginning 6 repetitions
123456
......
test_exceptions leaked [6, 6, 4] references, sum=16
test_exceptions leaked [6, 6, 4] memory blocks, sum=16
test_exceptions failed (reference leak)

== Tests result: FAILURE ==

1 test failed:
    test_exceptions

1 re-run test:
    test_exceptions

Total duration: 5.9 sec
Tests result: FAILURE
0:00:00 load avg: 2.08 Run tests sequentially
0:00:00 load avg: 2.08 [1/1] test_exceptions
beginning 6 repetitions
123456
......
test_exceptions leaked [6, 6, 6] references, sum=18
test_exceptions leaked [6, 6, 6] memory blocks, sum=18
test_exceptions failed (reference leak)

== Tests result: FAILURE ==

1 test failed:
    test_exceptions

1 re-run test:
    test_exceptions

Total duration: 5.8 sec
Tests result: FAILURE
0:00:00 load avg: 2.39 Run tests sequentially
0:00:00 load avg: 2.39 [1/1] test_exceptions
beginning 6 repetitions
123456
......
test_exceptions leaked [6, 6, 6] references, sum=18
test_exceptions leaked [6, 6, 6] memory blocks, sum=18
test_exceptions failed (reference leak)

== Tests result: FAILURE ==

1 test failed:
    test_exceptions

1 re-run test:
    test_exceptions

Total duration: 6.0 sec
Tests result: FAILURE
0:00:00 load avg: 2.36 Run tests sequentially
0:00:00 load avg: 2.36 [1/1] test_exceptions
beginning 6 repetitions
123456
......
test_exceptions leaked [6, 6, 6] references, sum=18
test_exceptions leaked [6, 6, 6] memory blocks, sum=18
test_exceptions failed (reference leak)

== Tests result: FAILURE ==

1 test failed:
    test_exceptions

1 re-run test:
    test_exceptions

Total duration: 6.0 sec
Tests result: FAILURE
0:00:00 load avg: 2.31 Run tests sequentially
0:00:00 load avg: 2.31 [1/1] test_exceptions
beginning 6 repetitions
123456
......

== Tests result: SUCCESS ==

1 test OK.

Total duration: 6.3 sec
Tests result: SUCCESS
0:00:00 load avg: 2.20 Run tests sequentially
0:00:00 load avg: 2.20 [1/1] test_exceptions
beginning 6 repetitions
123456
......
test_exceptions leaked [6, 6, 6] references, sum=18
test_exceptions leaked [6, 6, 6] memory blocks, sum=18
test_exceptions failed (reference leak)

== Tests result: FAILURE ==

1 test failed:
    test_exceptions

1 re-run test:
    test_exceptions

Total duration: 6.1 sec
Tests result: FAILURE
0:00:00 load avg: 2.35 Run tests sequentially
0:00:00 load avg: 2.35 [1/1] test_exceptions
msg399449 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-08-12 13:44
I fail to reproduce the issue on Linux on the 3.10 branch. I interrupted the job after 20 minutes:

$ ./python -m test -R 3:3 test_exceptions -m test_no_hang_on_context_chain_cycle2 -m test_recursion_normalizing_infinite_exception -m test_recursion_in_except_handler -m test_recursion_normalizing_with_no_memory -F -j1
(...)
0:21:38 load avg: 2.01 [134] test_exceptions passed
beginning 6 repetitions
123456
......

I tested the commit 5d444434ad4e1943a88c9d3aadd300fd0f05dab7.
msg399450 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2021-08-12 13:59
I see it on Mac but not windows.

Running the test with -v or adding prints makes the leak go away. I’m suspecting that one of the recursion tests fails to clear an exception or something like that.

Removing any one of the tests makes it stop failing.
msg399451 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2021-08-12 14:02
Victor, if the …cycle2 test is hanging then you need the change in PR27626.
msg399452 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-08-12 14:17
I am not more lucky in the main branch. I ran it for 30 minutes and it didn't fail.

$ ./python -m test -R 3:3 test_exceptions -m test_no_hang_on_context_chain_cycle2 -m test_recursion_normalizing_infinite_exception -m test_recursion_in_except_handler -m test_recursion_normalizing_with_no_memory -F -j1
(...)
0:31:15 load avg: 1.88 [184] test_exceptions passed
beginning 6 repetitions
123456
......
^C
Kill <TestWorkerProcess #1 running test=test_exceptions pid=173796 time=4.2 sec> process group

== Tests result: INTERRUPTED ==
Test suite interrupted by signal SIGINT.

184 tests OK.

Total duration: 31 min 19 sec
Tests result: INTERRUPTED


I tested commit 8ac0886091c27bf4b6bb0a9b571e174b554d31a4.
msg399453 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-08-12 14:18
> Victor, if the …cycle2 test is hanging then you need the change in PR27626.

That's the commit d5c217475c4957a8084ac3f92ae012ece5edc7cb. My main branch is up to date (commit 8ac0886091c27bf4b6bb0a9b571e174b554d31a4), it includes the commit d5c217475c4957a8084ac3f92ae012ece5edc7cb.
msg399456 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2021-08-12 14:21
I don't know why you need to kill it. It's not a long test.
msg399457 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-08-12 14:23
> I see it on Mac but not windows.

Aha, maybe there is something specific on Mac which triggers an exact code path to trigger to bug. I already saw bugs like that.

You can try to play with gc.set_threshold() to change how garbage collection is triggered.
msg399459 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2021-08-12 14:39
> I don't know why you need to kill it. It's not a long test.

Ah I see, -F.
msg399463 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2021-08-12 14:52
The problem disappears if I add a gc.collect() loop at the beginning of the new test:

diff --git a/Lib/test/test_exceptions.py b/Lib/test/test_exceptions.py
index 79798ecf05..e0aeac9d10 100644
--- a/Lib/test/test_exceptions.py
+++ b/Lib/test/test_exceptions.py
@@ -1014,6 +1014,7 @@ def cycle():
 
     def test_no_hang_on_context_chain_cycle2(self):
         # See issue 25782. Cycle at head of context chain.
+        while gc.collect(): pass
 
         class A(Exception):
             pass
msg399472 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2021-08-12 17:37
This fails (test_recursion_normalizing_with_no_memory is not needed):

./python.exe -m test -R 3:3 test_exceptions -m test_no_hang_on_context_chain_cycle2 -m test_recursion_normalizing_infinite_exception -m test_recursion_in_except_handler
msg399485 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2021-08-12 22:41
My smallest failing example so far:


    def test_recursion_in_except_handler(self):

        self.test_no_hang_on_context_chain_cycle2()

        def set_relative_recursion_limit(n):
            depth = 1
            while True:
                try:
                    sys.setrecursionlimit(depth)
                except RecursionError:
                    depth += 1
                else:
                    break
            sys.setrecursionlimit(depth+n)

        def recurse_in_body_and_except():
            try:
                recurse_in_body_and_except()
            except:
                recurse_in_body_and_except()

        recursionlimit = sys.getrecursionlimit()
        try:
            set_relative_recursion_limit(10)
            try:
                recurse_in_body_and_except()
            except RecursionError:
                pass
            else:
                self.fail("Should have raised a RecursionError")
        finally:
            sys.setrecursionlimit(recursionlimit)
msg399492 - (view) Author: Chris Jerdonek (chris.jerdonek) * (Python committer) Date: 2021-08-13 05:50
Out of curiosity, is the failure deterministic in environments where it fails? If not, what is the source of the indeterminism -- some kind of race condition or something else?
msg399507 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2021-08-13 08:48
It’s not deterministic.
msg399510 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2021-08-13 09:41
New changeset 7bf28cbb4bf37fa6bdfc2d3f8a3939066b3f8f22 by Irit Katriel in branch 'main':
bpo-44895: Temporarily add an extra gc.collect() call (GH-27746)
https://github.com/python/cpython/commit/7bf28cbb4bf37fa6bdfc2d3f8a3939066b3f8f22
msg399513 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2021-08-13 09:56
> what is the source of the indeterminism -- some kind of race condition or something else?


My hunch is it could be timing of GC. But it's hard to say at this point.
msg399516 - (view) Author: miss-islington (miss-islington) Date: 2021-08-13 10:21
New changeset ebc59262349d6901b825ed9101d604e826757262 by Miss Islington (bot) in branch '3.10':
bpo-44895: Temporarily add an extra gc.collect() call (GH-27746)
https://github.com/python/cpython/commit/ebc59262349d6901b825ed9101d604e826757262
msg399517 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2021-08-13 10:22
I merged the workaround to get refleak buildbots unstuck for the weekend. I bumped the issue stage back to "needs patch" as the available one isn't a solution.
msg399533 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-08-13 12:59
Chris:
> Out of curiosity, is the failure deterministic in environments where it fails? If not, what is the source of the indeterminism -- some kind of race condition or something else?

The GC uses counters and thresholds to decide which collection and when they should be collected. I tried to play with gc.set_threshold(), but I failed to reproduce the issue on Linux.

Maybe it's a very precise threshold which triggers the issue. Between Linux and macOS, the site module executes different code paths which produce different GC counters.

Sometimes, the GC must happen in a very precise line, one line later is too late. See bpo-44422 for a case of threading.enumerate() (different kind of bug, but it's related to the GC). See bpo-44184 for another example of GC bug which only occurred on Windows and only if you type an exact command line.
msg399537 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021-08-13 13:59
I have tried to reproduce this on my Linux and Mac machines and I have been unable to do it for the time being in either of them.
msg399539 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2021-08-13 15:10
The script below hangs on my mac (it's an extract from test_recursion_in_except_handler).

-----------
import sys

count = 0
def main():

  def f():
    global count
    count += 1
    try:
        f()
    except RecursionError:
        f()

  sys.setrecursionlimit(30)

  try:
    f()
  except RecursionError:
    pass

main()
print(count)
-----------




When I kill it the traceback shows it alternating between the two recursive calls, but not in a regular pattern:


... [snipped a lot]

  File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
    f()
    ^^^
  File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
    f()
    ^^^
  [Previous line repeated 2 more times]
RecursionError: maximum recursion depth exceeded

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
    f()
    ^^^
RecursionError: maximum recursion depth exceeded

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
    f()
    ^^^
  File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
    f()
    ^^^
RecursionError: maximum recursion depth exceeded

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
    f()
    ^^^
RecursionError: maximum recursion depth exceeded

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
    f()
    ^^^
  File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
    f()
    ^^^
  File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
    f()
    ^^^
RecursionError: maximum recursion depth exceeded

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
    f()
    ^^^
RecursionError: maximum recursion depth exceeded

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
    f()
    ^^^
  File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
    f()
    ^^^
RecursionError: maximum recursion depth exceeded

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
    f()
    ^^^
RecursionError: maximum recursion depth exceeded

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
    f()
    ^^^
  File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
    f()
    ^^^
  File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
    f()
    ^^^
  File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
    f()
    ^^^
RecursionError: maximum recursion depth exceeded

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
    f()
    ^^^
RecursionError: maximum recursion depth exceeded

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
    f()
    ^^^
  File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
    f()
    ^^^
RecursionError: maximum recursion depth exceeded

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
    f()
    ^^^
RecursionError: maximum recursion depth exceeded

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
    f()
    ^^^
  File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
    f()
    ^^^
  File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
    f()
    ^^^
RecursionError: maximum recursion depth exceeded

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
    f()
    ^^^
RecursionError: maximum recursion depth exceeded

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
    f()
    ^^^
  File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
    f()
    ^^^
RecursionError: maximum recursion depth exceeded

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
    f()
    ^^^
RecursionError: maximum recursion depth exceeded

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
    f()
    ^^^
  File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
    f()
    ^^^
  File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
    f()
    ^^^
  File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
    f()
    ^^^
  [Previous line repeated 1 more time]
RecursionError: maximum recursion depth exceeded

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/iritkatriel/src/cpython/tt.py", line 22, in <module>
    main()
    ^^^^^^
  File "/Users/iritkatriel/src/cpython/tt.py", line 18, in main
    f()
    ^^^
  File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
    f()
    ^^^
  File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
    f()
    ^^^
  File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
    f()
    ^^^
  [Previous line repeated 10 more times]
  File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
    f()
    ^^^
  File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
    f()
    ^^^
  File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
    f()
    ^^^
  File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
    f()
    ^^^
  File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
    f()
    ^^^
  File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
    f()
    ^^^
  File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
    f()
    ^^^
  File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
    f()
    ^^^
  File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
    f()
    ^^^
  File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
    f()
    ^^^
  File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
    f()
    ^^^
  File "/Users/iritkatriel/src/cpython/tt.py", line 7, in f
    def f():
KeyboardInterrupt
msg399540 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2021-08-13 15:11
Adding Mark since he worked on recursion recently.
msg399541 - (view) Author: Dong-hee Na (corona10) * (Python committer) Date: 2021-08-13 15:21
From my macOS, I was not able to reproduce the issue also as like as Victor and Pablo.

repeat 10 ./python.exe -m test -R 3:3 test_exceptions -m test_no_hang_on_context_chain_cycle2 -m test_recursion_normalizing_infinite_exception -m test_recursion_in_except_handler -m test_recursion_normalizing_with_no_memory

However hang was able to reproduce.
msg399542 - (view) Author: Dong-hee Na (corona10) * (Python committer) Date: 2021-08-13 15:21
> However hang was able to reproduce.

I meant msg399539
msg399543 - (view) Author: Dong-hee Na (corona10) * (Python committer) Date: 2021-08-13 15:30
Oops sorry. 

Without PR 27746, I was able to reproduce on my macOS environment.

test_exceptions leaked [6, 6, 6] references, sum=18
test_exceptions leaked [6, 6, 6] memory blocks, sum=18
test_exceptions failed (reference leak)
msg399546 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2021-08-13 17:16
Regarding the hang in msg399539, do we need some special handling of the case were a RecursionError is being raised and its context is another RecursionError?  Like a FatalError?
msg399554 - (view) Author: Chris Jerdonek (chris.jerdonek) * (Python committer) Date: 2021-08-13 19:39
> Maybe it's a very precise threshold which triggers the issue. Between Linux and macOS, the site module executes different code paths which produce different GC counters.
> Sometimes, the GC must happen in a very precise line, one line later is too late.

How does this explain it not being non-deterministic on, say, macOS since the same lines of code will be executing each time? Is there a way to force things to happen in a deterministic fashion? (I interpreted the responses to mean that it's non-deterministic even when run on the same machine, so I'm not talking about, say, the differences from running macOS on different machines.)
msg399555 - (view) Author: Chris Jerdonek (chris.jerdonek) * (Python committer) Date: 2021-08-13 19:43
"How does this explain it not being non-deterministic on" -> "How does this explain it not being deterministic on"
msg399556 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2021-08-13 19:53
For instance, if memory addresses of objects are different, so objects sort differently in different runs on the same system.
msg399557 - (view) Author: Chris Jerdonek (chris.jerdonek) * (Python committer) Date: 2021-08-13 20:26
Last question: might trying different values of PYTHONHASHSEED help?
msg399560 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021-08-13 20:48
If someone can reproduce more or less reliably, it would be great to know where are the leaked objects. For that, compile CPython with "-with-trace-refs", surround the suspicious test in a loop (test_no_hang_on_context_chain_cycle2 for example) of a 1000 or more iterations and run the test with the PYTHONDUMPREFS=1 environment variable. You will see the leaked objects at shutdown hopefully.

Hopefully the leaked objects would be a gigantic set by then so we the number of references will be noticeable.
msg399578 - (view) Author: Chris Jerdonek (chris.jerdonek) * (Python committer) Date: 2021-08-14 00:57
FYI, I tried myself, and setting PYTHONHASHSEED didn't make the failures deterministic.
msg399588 - (view) Author: Dong-hee Na (corona10) * (Python committer) Date: 2021-08-14 10:59
Here is the dump I extracted.

0:00:00 load avg: 2.89 Run tests sequentially
0:00:00 load avg: 2.89 [1/1] test_exceptions
beginning 6 repetitions
123456
......
test_exceptions leaked [6, 6, 6] references, sum=18
test_exceptions leaked [6, 6, 6] memory blocks, sum=18
test_exceptions failed (reference leak)

== Tests result: FAILURE ==

1 test failed:
    test_exceptions

1 re-run test:
    test_exceptions

Total duration: 5.7 sec
Tests result: FAILURE
msg399589 - (view) Author: Dong-hee Na (corona10) * (Python committer) Date: 2021-08-14 11:03
GH-27767 is a tooling patch for extract dump easily.
See cpython-1628938551.dump.gz for msg399588.
msg399601 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2021-08-14 22:20
I created issue44917 for the recursion hang because it seems like it's really another bug.
msg399636 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2021-08-16 08:36
New changeset 62bc716fde20fcb7b47416c7959be9e66df93212 by Irit Katriel in branch 'main':
bpo-44895: skip test_no_hang_on_context_chain_cycle2 until the refleak is fixed (GH-27761)
https://github.com/python/cpython/commit/62bc716fde20fcb7b47416c7959be9e66df93212
msg399639 - (view) Author: miss-islington (miss-islington) Date: 2021-08-16 09:01
New changeset 1960409a6dcbd1e3527f02b523bd27df9086dd77 by Miss Islington (bot) in branch '3.10':
bpo-44895: skip test_no_hang_on_context_chain_cycle2 until the refleak is fixed (GH-27761)
https://github.com/python/cpython/commit/1960409a6dcbd1e3527f02b523bd27df9086dd77
msg399676 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2021-08-16 20:13
I've attached an even shorter script (refleak.py) that reproduces the leak.
msg399681 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2021-08-16 21:05
The leak seems to be due to the nested definitions of the Exception types A and B. It goes away if I move them out of the function so that it's like:


class A(Exception):
    pass
class B(Exception):
    pass

def test_no_hang_on_context_chain_cycle2():
    try:
        try:
            raise A()
        except A as _a:
            a = _a
            try:
                raise B()
            except B as _b:
                b = _b
        raise b
    except B:
        pass
msg399682 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2021-08-16 21:11
(Pdb) test_no_hang_on_context_chain_cycle2.__code__.co_consts[1]
<code object A at 0x10eecc840, file "/Users/iritkatriel/src/cpython-1/tt.py", line 37>
(Pdb) sys.getrefcount(test_no_hang_on_context_chain_cycle2.__code__.co_consts[1])
2
(Pdb) gc.get_referrers(test_no_hang_on_context_chain_cycle2.__code__.co_consts[1])
[]
msg399696 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2021-08-16 22:37
I've reproduced the leak in 3.10 (with the reflinks.py script).

In 3.9 the double-recursion segfaults:

Fatal Python error: _Py_CheckRecursiveCall: Cannot recover from stack overflow.
Python runtime state: initialized
msg399697 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2021-08-16 22:42
I meant refleaks.py, not reflinks.py.
msg399700 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2021-08-16 23:03
It's midnight, so just one last weirdness before I turn into a pumpkin:

If I add this assignment to x the issue seems to go away (or become less frequent so I don't see it?)   But if I assign x to a small constant it still leaks. 


def test_no_hang_on_context_chain_cycle2():
    x = list(range(100))
    class A(Exception):
        pass
    class B(Exception):
        pass
msg399703 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021-08-16 23:09
>> x = list(range(100))

Is creating a lot of objects, and the gc has a chance to run on every object creation so maybe what's happening there is that you are inadvertently triggering the gc
msg399739 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2021-08-17 10:50
However, if I replace the assignment with a gc.collect() call, it still leaks:

def test_no_hang_on_context_chain_cycle2():
    while gc.collect():
        pass
    class A(Exception):
        pass
    class B(Exception):
        pass
    ...


I also changed gc_collect in Lib/test/support/__init__.py (which is called by the refleak check) to loop until gc.collect() returns 0 (currently it calls gc.collect() 3 times).  Still leaks.
msg399740 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2021-08-17 10:58
But if I also add a gc loop at the beginning of test_recursion_in_except_handler, the leak goes away.

def test_recursion_in_except_handler():
    while gc.collect(): 
       pass
    def set_relative_recursion_limit(n):
    ...


So maybe test_recursion_in_except_handler is resurrecting something that test_no_hang_on_context_chain_cycle2 has let go of?
msg399762 - (view) Author: Dong-hee Na (corona10) * (Python committer) Date: 2021-08-17 15:53
New changeset c2c857b40f226575d64e0d56a759cbd799f51e62 by Dong-hee Na in branch 'main':
bpo-44895: Introduce PYTHONDUMPREFSFILE variable for refcount dumping (GH-27767)
https://github.com/python/cpython/commit/c2c857b40f226575d64e0d56a759cbd799f51e62
msg399764 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2021-08-17 16:07
I'm pretty sure the frame locals is involved. 

This leaks (note that a is being re-raised, not b):

---------------------
def test_no_hang_on_context_chain_cycle2():
    class A(Exception): pass
    class B(Exception): pass

    try:
        try:
            raise A()
        except A as _a:
            a = _a
            try:
                raise B() 
            except B as _b:
                b = _b
        raise a 
    except A: 
        pass
---------------------

But this doesn't leak:

---------------------
def test_no_hang_on_context_chain_cycle2():
    class A(Exception): pass
    class B(Exception): pass

    try:
        try:
            raise A()
        except A as _a:
            a = _a
            try:
                raise B() 
            except B as _b:
                pass # b = _b  <== not saving b in a local
        raise a 
    except A: 
        pass
---------------------
msg399781 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2021-08-17 18:55
So there's code generated for an except clause with a variable that clears the variable at the end, so that this works:

try:
  ...
except Exception as err:
  ...
print(err)  # Should fail, since err was cleared when the except clause exited.

Maybe there's a bug here, either in the runtime or in the way that the code is generated?
msg399785 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2021-08-17 20:02
It's probably not that because this leaks too, and it skips the cleanup:

        try:
            raise B()
        except B:
            b = sys.exc_info()[1]
msg399786 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2021-08-17 20:04
But this makes it not leak again, so now we know it's the traceback:

        try:
            raise B()
        except B:
            b = sys.exc_info()[1]
            b.__traceback__ = None
msg399788 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021-08-17 20:20
I am still amazed that I cannot reproduce the leaks on my machine. Maybe I am missing some steps.

Could you outline how are you reproducing the leaks?
msg399789 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021-08-17 20:21
> But this makes it not leak again, so now we know it's the traceback:

The traceback is a very heavy gc object because it contains the frames, and in the locals of the frame is the traceback itself via the exception. This is a well known heavy cycle, but the gc should take care of it if the cycle becomes isolated
msg399792 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2021-08-17 20:59
This doesn't leak:

        try:
            raise B()
        except B:
            tb = sys.exc_info()[1].__traceback__

So this isn't about the traceback being part of the cycle. The B() needs to be there too.
msg399794 - (view) Author: Mark Shannon (Mark.Shannon) * (Python committer) Date: 2021-08-17 21:07
refleak.py seems to run forever for me. Got bored after 6000 iterations.

Which O/S and configure options are needed for refleak.py to fail?
msg399795 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2021-08-17 21:11
I'm using a mac. It typically fails within at most 10-20 iterations.
msg399796 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2021-08-17 21:16
> Could you outline how are you reproducing the leaks?

Sorry I missed this. Just run the script:


iritkatriel@Irits-MBP cpython-1 % ./python.exe refleak.py
-- 1 --
beginning 6 repetitions
123456
......
-- 2 --
beginning 6 repetitions
123456
......
-- 3 --
beginning 6 repetitions
123456
......
test leaked [2, 4, 2] references, sum=8
test leaked [2, 4, 2] memory blocks, sum=8
msg399800 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021-08-17 22:34
There is some timing-related race condition at hand because I tried in a Linux machine and a MacBook Pro (intel) laptop and I wasn't able to reproduce after 20 min of the script running :(
msg399855 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-08-18 16:43
Irit: do you still reproduce the issue using gc.disable()? Or you can try different values to call gc.set_threshold(). You may also try different values in ns.huntrleaks, like: ns.huntrleaks = (3, 20, 'tt_out.py'). With 20 iterations, do you still reproduce the leak?

To make the issue more likely, you can change check_rc_deltas() in Lib/test/libregrtest/refleak.py. Try:

    def check_rc_deltas(deltas):
        return any(delta >= 1 for delta in deltas)

With this change, the script stops immediately:

test leaked [-4, 4, -4] references, sum=-4
test leaked [-4, 4, -4] memory blocks, sum=-4

Using 20 iterations, I get:

test leaked [-4, 4, -4, 4, -4, 4, -4, 4, -4, 4, -4, 4, -4, 4, -4, 4, -4, 4, -4, 4] references, sum=0
test leaked [-4, 4, -4, 4, -4, 4, -4, 4, -4, 4, -4, 4, -4, 4, -4, 4, -4, 4, -4, 4] memory blocks, sum=0
msg399856 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-08-18 16:48
> test leaked [-4, 4, -4] references, sum=-4

If I disable the GC, the script no longer detects "leaks". Same using a GC threshold of 800.

Using a GC threshold between 100 and 700, I reproduce the [-4, 4, -4] references leak.

Funny bug.
msg399857 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-08-18 16:49
You can also play with the clear() method of frame objects and traceback.clear_frames() (which expects a traceback object).
msg400645 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2021-08-30 18:14
I think the leak happens when we invoke GC while the recursion limit is exceeded. It goes way if make this change:


    def recurse_in_body_and_except():
        try:
            recurse_in_body_and_except()
        except RecursionError as e:
            gc.disable()
            recurse_in_body_and_except()
            gc.enable()


I also added a __del__ to B and saw it being called when recursion limit is small (though adding the __del__ makes the leak go away. Actually just adding "def f(): pass" is enough to make it go away).
msg400741 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-08-31 15:04
A variant of this problem: bpo-36560 with a leak 1 memory block per iteration.
msg400757 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2021-08-31 16:36
Victor - I am also getting a feeling that this may not be a real leak but rather a bug in the refleaks test bookkeeping. 

It could be a problem with the way _Py_RefTotal is incremented/decremented.
msg400779 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-08-31 21:20
> It could be a problem with the way _Py_RefTotal is incremented/decremented.

It can be a cache which is kept alive and then clear the following iteration. See my previous comment:

test leaked [-4, 4, -4, 4, -4, 4, -4, 4, -4, 4, -4, 4, -4, 4, -4, 4, -4, 4, -4, 4] references, sum=0

libregrtest/refleak.py does its best to clear *all* caches, type caches, free lists, etc. A single cache which is not cleared can confused libregrtest.
msg400794 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2021-08-31 23:02
Oh, I see now what you meant. 

I think you're right that it's cleanup. I changed dash_R_cleanup to put the sys._clear_type_cache() after clear_caches() and the leak is gone:

@@ -166,9 +166,6 @@ def dash_R_cleanup(fs, ps, pic, zdc, abcs):
         zipimport._zip_directory_cache.clear()
         zipimport._zip_directory_cache.update(zdc)
 
-    # clear type cache
-    sys._clear_type_cache()
-
     # Clear ABC registries, restoring previously saved ABC registries.
     abs_classes = [getattr(collections.abc, a) for a in collections.abc.__all__]
     abs_classes = filter(isabstract, abs_classes)
@@ -181,6 +178,12 @@ def dash_R_cleanup(fs, ps, pic, zdc, abcs):
 
     clear_caches()
 
+    # clear type cache
+    sys._clear_type_cache()
+
+    support.gc_collect()
+



I did this because I suspected it's something related to types, because it only happen when we define the exception classes (A and B) inside the test function. 

So maybe the type objects of these exceptions are not cleared from the type cache because they have refcount > 0 but by the time      clear_caches() and gc.collect() finish they are cleared?


Note this doesn't mean the change I made is always right - maybe the whole body of dash_R_cleanup should be in a loop until nothing changes?
msg400840 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-09-01 15:06
>  I changed dash_R_cleanup to put the sys._clear_type_cache() after clear_caches() and the leak is gone

Aha! The order of function calls in dash_R_cleanup() matters.

Calling sys._clear_type_cache() at the end of dash_R_cleanup() perfectly makes sense. If you do that, please add a comment to explain that the order matters. For example, the code about ABC classes is likely to modify classes.

About the gc_collect() call, I dislike calling it in clear_caches() and then again in dash_R_cleanup(). I would even prefer to *remove* it from both functions and call it explicitly at the call site: in dash_R() and in runtest.py.
msg400844 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-09-01 15:45
New changeset 679cb4781ea370c3b3ce40d3334dc404d7e9d92b by Victor Stinner in branch 'main':
bpo-44895: libregrtest: refleak check clears types later (GH-28113)
https://github.com/python/cpython/commit/679cb4781ea370c3b3ce40d3334dc404d7e9d92b
msg400848 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2021-09-01 15:56
With your patch I don't see a leak for refleak.py, but I still see one for 

./python.exe -m test -R 3:3 test_exceptions -m test_no_hang_on_context_chain_cycle2 -m test_recursion_normalizing_infinite_exception -m test_recursion_in_except_handler -m test_recursion_normalizing_with_no_memory -F -j1

(once I remove the @skip on test_no_hang_on_context_chain_cycle2). 


However, if I make this change it looks better:


--- a/Lib/test/libregrtest/refleak.py
+++ b/Lib/test/libregrtest/refleak.py
@@ -90,8 +90,9 @@ def get_pooled_int(value):
     for i in rep_range:
         test_func()
 
-        dash_R_cleanup(fs, ps, pic, zdc, abcs)
-        support.gc_collect()
+        for i in range(3):
+            dash_R_cleanup(fs, ps, pic, zdc, abcs)
+            support.gc_collect()
msg400853 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-09-01 16:13
> (once I remove the @skip on test_no_hang_on_context_chain_cycle2). 

Oh. I forgot that the test is still skipped!


> With your patch I don't see a leak for refleak.py, but I still see one for 
> ./python.exe -m test -R 3:3 test_exceptions -m test_no_hang_on_context_chain_cycle2 -m test_recursion_normalizing_infinite_exception -m test_recursion_in_except_handler -m test_recursion_normalizing_with_no_memory -F -j1

If I apply attached strict_refleak_check.patch and I remove @unittest.skip("See issue 44895") of test_no_hang_on_context_chain_cycle2(), I get:

$ ./python -m test -R 3:3 test_exceptions -m test_no_hang_on_context_chain_cycle2 -m test_recursion_normalizing_infinite_exception -m test_recursion_in_except_handler -m test_recursion_normalizing_with_no_memory

(...)
test_exceptions leaked [6, -6, 6] references, sum=6
(...)
msg400854 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2021-09-01 16:18
I changed it to 

+        for i in range(2):
+            dash_R_cleanup(fs, ps, pic, zdc, abcs)
+            support.gc_collect()


(loop twice instead of 3 times) and that seems fine too.
History
Date User Action Args
2022-04-11 14:59:48adminsetgithub: 89058
2021-09-01 16:43:27vstinnersetfiles: + strict_refleak_check.patch
2021-09-01 16:18:25iritkatrielsetmessages: + msg400854
2021-09-01 16:13:43vstinnersetmessages: + msg400853
2021-09-01 15:56:22iritkatrielsetmessages: + msg400848
2021-09-01 15:45:35vstinnersetmessages: + msg400844
2021-09-01 15:26:25vstinnersetkeywords: + patch
pull_requests: + pull_request26554
2021-09-01 15:06:03vstinnersetmessages: + msg400840
2021-08-31 23:02:17iritkatrielsetmessages: + msg400794
2021-08-31 21:20:29vstinnersetmessages: + msg400779
2021-08-31 16:36:38iritkatrielsetmessages: + msg400757
2021-08-31 15:04:34vstinnersetmessages: + msg400741
2021-08-30 18:14:36iritkatrielsetmessages: + msg400645
2021-08-18 16:49:48vstinnersetmessages: + msg399857
2021-08-18 16:48:12vstinnersetmessages: + msg399856
2021-08-18 16:43:00vstinnersetmessages: + msg399855
2021-08-17 22:34:16pablogsalsetmessages: + msg399800
2021-08-17 21:16:31iritkatrielsetmessages: + msg399796
2021-08-17 21:11:21iritkatrielsetmessages: + msg399795
2021-08-17 21:07:14Mark.Shannonsetmessages: + msg399794
2021-08-17 20:59:10iritkatrielsetmessages: + msg399792
2021-08-17 20:21:45pablogsalsetmessages: + msg399789
2021-08-17 20:20:20pablogsalsetmessages: + msg399788
2021-08-17 20:04:47iritkatrielsetmessages: + msg399786
2021-08-17 20:02:35iritkatrielsetmessages: + msg399785
2021-08-17 18:55:09gvanrossumsetnosy: + gvanrossum
messages: + msg399781
2021-08-17 16:08:12iritkatrielsetmessages: - msg399763
2021-08-17 16:07:56iritkatrielsetmessages: + msg399764
2021-08-17 16:05:33iritkatrielsetmessages: + msg399763
2021-08-17 15:53:03corona10setmessages: + msg399762
2021-08-17 10:58:33iritkatrielsetmessages: + msg399740
2021-08-17 10:50:19iritkatrielsetmessages: + msg399739
2021-08-16 23:09:02pablogsalsetmessages: + msg399703
2021-08-16 23:03:04iritkatrielsetkeywords: - patch

messages: + msg399700
2021-08-16 22:42:27iritkatrielsetmessages: + msg399697
2021-08-16 22:37:09iritkatrielsetmessages: + msg399696
versions: + Python 3.10
2021-08-16 21:11:56iritkatrielsetmessages: + msg399682
2021-08-16 21:05:37iritkatrielsetmessages: + msg399681
2021-08-16 20:13:38iritkatrielsetfiles: + refleak.py

messages: + msg399676
2021-08-16 09:01:18miss-islingtonsetmessages: + msg399639
2021-08-16 08:36:57lukasz.langasetmessages: + msg399636
2021-08-16 08:36:56miss-islingtonsetpull_requests: + pull_request26250
2021-08-14 22:20:44iritkatrielsetmessages: + msg399601
2021-08-14 11:03:21corona10setmessages: + msg399589
2021-08-14 11:01:36corona10setpull_requests: + pull_request26243
2021-08-14 10:59:54corona10setfiles: + cpython-1628938551.dump.gz

messages: + msg399588
2021-08-14 00:57:18chris.jerdoneksetmessages: + msg399578
2021-08-13 20:48:28pablogsalsetmessages: + msg399560
2021-08-13 20:26:21chris.jerdoneksetmessages: + msg399557
2021-08-13 19:53:58iritkatrielsetmessages: + msg399556
2021-08-13 19:43:42chris.jerdoneksetmessages: + msg399555
2021-08-13 19:39:52chris.jerdoneksetmessages: + msg399554
2021-08-13 17:16:59iritkatrielsetmessages: + msg399546
2021-08-13 15:30:56corona10setmessages: + msg399543
2021-08-13 15:24:48iritkatrielsetstage: needs patch -> patch review
pull_requests: + pull_request26237
2021-08-13 15:21:45corona10setmessages: + msg399542
2021-08-13 15:21:11corona10setmessages: + msg399541
2021-08-13 15:11:28iritkatrielsetnosy: + Mark.Shannon
messages: + msg399540
2021-08-13 15:10:26iritkatrielsetmessages: + msg399539
2021-08-13 13:59:30pablogsalsetnosy: + pablogsal
messages: + msg399537
2021-08-13 12:59:45vstinnersetmessages: + msg399533
2021-08-13 10:22:36lukasz.langasetmessages: + msg399517
stage: patch review -> needs patch
2021-08-13 10:21:29miss-islingtonsetmessages: + msg399516
2021-08-13 09:56:28iritkatrielsetmessages: + msg399513
2021-08-13 09:41:53miss-islingtonsetnosy: + miss-islington
pull_requests: + pull_request26229
2021-08-13 09:41:39lukasz.langasetnosy: + lukasz.langa
messages: + msg399510
2021-08-13 08:48:16iritkatrielsetmessages: + msg399507
2021-08-13 05:50:28chris.jerdoneksetnosy: + chris.jerdonek
messages: + msg399492
2021-08-12 22:41:28iritkatrielsetmessages: + msg399485
2021-08-12 17:37:31iritkatrielsetmessages: + msg399472
2021-08-12 16:59:53iritkatrielsetkeywords: + patch
stage: patch review
pull_requests: + pull_request26224
2021-08-12 14:52:17iritkatrielsetmessages: + msg399463
2021-08-12 14:39:45iritkatrielsetmessages: + msg399459
2021-08-12 14:23:43vstinnersetmessages: + msg399457
2021-08-12 14:21:37iritkatrielsetmessages: + msg399456
2021-08-12 14:18:27vstinnersetmessages: + msg399453
2021-08-12 14:17:02vstinnersetmessages: + msg399452
2021-08-12 14:02:39iritkatrielsetmessages: + msg399451
2021-08-12 13:59:55iritkatrielsetmessages: + msg399450
2021-08-12 13:44:23vstinnersetmessages: + msg399449
2021-08-12 11:00:13vstinnersetnosy: + vstinner
2021-08-12 09:10:34corona10setnosy: + corona10
2021-08-12 00:13:38iritkatrielcreate