classification
Title: test_functools leaks randomly 1 memory block
Type: Stage: patch review
Components: Tests Versions: Python 3.8
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: inada.naoki, levkivskyi, serhiy.storchaka, vstinner
Priority: normal Keywords: patch

Created on 2019-04-08 14:22 by vstinner, last changed 2019-09-24 10:45 by vstinner.

Files
File name Uploaded Description Edit
test_functools.py vstinner, 2019-04-08 15:33
test_functools2.py vstinner, 2019-04-08 17:11
Pull Requests
URL Status Linked Edit
PR 12743 closed vstinner, 2019-04-09 10:55
PR 12744 merged vstinner, 2019-04-09 12:01
PR 12745 merged vstinner, 2019-04-09 12:42
PR 12747 merged vstinner, 2019-04-09 16:01
PR 12749 merged vstinner, 2019-04-09 16:31
Messages (16)
msg339643 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-04-08 14:22
Sometimes, each run of test_functools leaks exactly 1 memory block, even when the whole test is "re-run in verbose mode". Sometimes, it doesn't leak.

https://buildbot.python.org/all/#/builders/80/builds/550

test_functools leaked [1, 1, 1] memory blocks, sum=3
Re-running test 'test_functools' in verbose mode
test_functools leaked [1, 1, 1] memory blocks, sum=3

Maybe the problem comes from 

Example on Linux:

$ ./python -m test -F -r -j1 -R 3:3 test_functools  
Using random seed 3891892
Run tests in parallel using 1 child processes
0:00:01 load avg: 2.38 [  1] test_functools passed
beginning 6 repetitions
123456
......
(...)
0:00:06 load avg: 2.27 [  6] test_functools passed
beginning 6 repetitions
123456
......
0:00:07 load avg: 2.27 [  7/1] test_functools failed
beginning 6 repetitions
123456
......
test_functools leaked [1, 2, 1] memory blocks, sum=4
0:00:08 load avg: 2.27 [  8/1] test_functools passed
beginning 6 repetitions
123456
......

== Tests result: FAILURE ==

7 tests OK.

1 test failed:
    test_functools

Total duration: 8 sec 333 ms
Tests result: FAILURE
msg339651 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-04-08 15:33
Current bisection progress: I'm able to reproduce a failure with attached test_functools.py (211 lines) using:

./python -m test -F -j5 -R 3:3 test_functools  

It takes between 4 and 200 runs to reproduce the failure, I don't understand what triggers the bug.

I failed to identify if the bug is more likely when the system is idle or when the system is busy. It looks purely random.
msg339662 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-04-08 17:11
Some progress.

I'm now able to reproduce the issue with attached test_functools2.py (103 lines, a single test method):

$ cp (...)test_functools2.py Lib/test/
$ ./configure -C --with-pydebug CFLAGS=-O0
$ make
$ ./python -m test -F -j5 -R 3:3 test_functools2
(...)
0:00:02 load avg: 0.95 [ 19/1] test_functools failed
beginning 6 repetitions
123456
......
test_functools leaked [1, 1, 1] memory blocks, sum=3
(...)
Tests result: FAILURE

I disabled the C accelerator _functools.

I'm now trying to disable the C accelerator of abc, but I had a small issue: bpo-36565.
msg339663 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-04-08 17:14
The issue comes maybe from the complex code in Lib/test/libregrtest/refleak.py to handle the ABC cache:

    abcs = {}
    for abc in [getattr(collections.abc, a) for a in collections.abc.__all__]:
        if not isabstract(abc):
            continue
        for obj in abc.__subclasses__() + [abc]:
            abcs[obj] = _get_dump(obj)[0]

...

    # clear type cache
    sys._clear_type_cache()

    # Clear ABC registries, restoring previously saved ABC registries.
    abs_classes = [getattr(collections.abc, a) for a in collections.abc.__all__]
    abs_classes = filter(isabstract, abs_classes)
    for abc in abs_classes:
        for obj in abc.__subclasses__() + [abc]:
            for ref in abcs.get(obj, set()):
                if ref() is not None:
                    obj.register(ref())
            obj._abc_caches_clear()

    clear_caches()
msg339731 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-04-09 11:20
This issue isn't a real memory leak: if I use -R 3:10 instead of -R 3:3, the test doesn't fail anymore.

But the issue is still annoying since it makes Refleaks buildbot workers fail randomly :-/

This issue remembers me the unstable multiprocessing tests:

* bpo-33735: test_multiprocessing_spawn leaked [1, 2, 1] memory blocks on AMD64 Windows8.1 Refleaks 3.7
* bpo-33984: test_multiprocessing_forkserver leaked [1, 2, 1] memory blocks on x86 Gentoo Refleaks 3.x



Patch to always display memory allocations differences:

diff --git a/Lib/test/libregrtest/refleak.py b/Lib/test/libregrtest/refleak.py
index d68ea63b5b..997be819fa 100644
--- a/Lib/test/libregrtest/refleak.py
+++ b/Lib/test/libregrtest/refleak.py
@@ -118,6 +118,8 @@ def dash_R(the_module, test, indirect_test, huntrleaks):
                 print(msg, file=refrep)
                 refrep.flush()
             failed = True
+    if not failed:
+        print(alloc_deltas[nwarmup:])
     return failed
 

Truncated output with the patch:

vstinner@apu$ ./python -m test -F -r -j1 -R 3:10 test_functools
Using random seed 4308771
Run tests in parallel using 1 child processes
0:00:04 load avg: 0.91 [  1] test_functools passed
[0, 1, 2, 0, 0, 0, 0, 0, 0, 0]
...
0:00:13 load avg: 0.92 [  3] test_functools passed
[2, 1, 0, 0, 0, 0, 0, 0, 0, 0]
...
0:00:17 load avg: 0.93 [  4] test_functools passed
[0, 3, 0, 0, 0, 0, 0, 0, 0, 0]
...
0:00:21 load avg: 0.93 [  5] test_functools passed
[0, 1, 0, 0, 2, 0, 0, 0, 0, 0]
...
0:00:26 load avg: 0.93 [  6] test_functools passed
[0, 4, 0, 0, 0, 0, 0, 0, 0, 0]
...
0:00:34 load avg: 0.87 [  8] test_functools passed
[0, 1, 0, 2, 0, 0, 0, 0, 0, 0]
...
0:01:06 load avg: 1.15 [ 15] test_functools passed
[0, 1, 0, 2, 0, -1, 1, 0, 0, 0]
...
0:01:10 load avg: 1.46 [ 16] test_functools passed
[0, 4, 0, 0, 0, 0, 0, 0, -1, 1]
...

The maximum sum() of these list is around 5 on 10 runs: not every run leaks a memory block. It looks more like a internal cache which is "unstable" if you look at the number of allocated memory blocks.
msg339732 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-04-09 11:22
I wrote PR 12743 to modify _abc: use weakref.WeakSet rather than using directly a set of weak references. With this PR, calling _get_dump() should clear all broken weak references. I'm not sure if it's really useful for this specific issue, since dash_R_cleanup() calls obj._abc_caches_clear() on all abstract base classes (ABC): all caches must be removed before reading the total number of allocated memory blocks (sys.getallocatedblocks()).
msg339733 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-04-09 11:28
_abc was written for optimization. Your PR makes it using slower Python code. Could you please test how much this hits the performance of abc?
msg339735 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-04-09 12:05
If I modify libregrtest with the following patch:

diff --git a/Lib/test/libregrtest/refleak.py b/Lib/test/libregrtest/refleak.py
index 0bb8a0a2bf..f0225a9768 100644
--- a/Lib/test/libregrtest/refleak.py
+++ b/Lib/test/libregrtest/refleak.py
@@ -128,7 +128,7 @@ def dash_R(ns, the_module, test_name, test_func):
     failed = False
     for deltas, item_name, checker in [
         (rc_deltas, 'references', check_rc_deltas),
-        (alloc_deltas, 'memory blocks', check_rc_deltas),
+        (alloc_deltas, 'memory blocks', check_fd_deltas),
         (fd_deltas, 'file descriptors', check_fd_deltas)
     ]:
         # ignore warmup runs

And I add the following file Lib/test/test_noop.py:

import unittest

class NoopTests(unittest.TestCase):
    def test_noop(self):
        pass

regrtest detects a "leak":

$ ./python -m test -R 3:3 test_noop
Run tests sequentially
0:00:00 load avg: 0.55 [1/1] test_noop
beginning 6 repetitions
123456
......
test_noop leaked [0, 1, 0] memory blocks, sum=1
test_noop failed

== Tests result: FAILURE ==

1 test failed:
    test_noop

Total duration: 113 ms
Tests result: FAILURE


The issue comes from this look in Lib/test/libregrtest/refleak.py:

    for i in range(repcount):
        indirect_test()
        alloc_after, rc_after, fd_after = dash_R_cleanup(fs, ps, pic, zdc,
                                                         abcs)
        print('.', end='', file=sys.stderr, flush=True)
        if i >= nwarmup:
            rc_deltas[i] = get_pooled_int(rc_after - rc_before)
            alloc_deltas[i] = get_pooled_int(alloc_after - alloc_before)
            fd_deltas[i] = get_pooled_int(fd_after - fd_before)
        alloc_before = alloc_after
        rc_before = rc_after
        fd_before = fd_after

Because of "if i >= nwarmup:", get_pooled_int() isn't call during "warmup", whereas the purpose of the warmup is to warmup *everything*.

Maybe get_pooled_int() allocates one frame object and keeps it alive in its "zombi frame". Maybe something else is allocated and kept alive.

Anything, removing "if i >= nwarmup:" to always compute deltas fix this specific issue.

Attached PR 12744 fix this bug.
msg339738 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-04-09 12:23
New changeset 5aaac94eeb44697e92b0951385cd557bc27e0f6a by Victor Stinner in branch 'master':
bpo-36560: Fix reference leak hunting in regrtest (GH-12744)
https://github.com/python/cpython/commit/5aaac94eeb44697e92b0951385cd557bc27e0f6a
msg339768 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-04-09 16:01
New changeset 9c14061a2c2df9a9b84d0aab190a50c24a0d52f4 by Victor Stinner in branch '2.7':
bpo-36560: Fix reference leak hunting in regrtest (GH-12744) (GH-12745)
https://github.com/python/cpython/commit/9c14061a2c2df9a9b84d0aab190a50c24a0d52f4
msg339772 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-04-09 16:26
New changeset bb4447897a5f141eecf42987a1191a3330c5d7ed by Victor Stinner in branch 'master':
bpo-36560: regrtest: don't collect the GC twice (GH-12747)
https://github.com/python/cpython/commit/bb4447897a5f141eecf42987a1191a3330c5d7ed
msg339777 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-04-09 16:43
> _abc was written for optimization. Your PR makes it using slower Python code. Could you please test how much this hits the performance of abc?

I mostly wrote PR 12743 to debug this issue. Please see the discussion directly on the PR ;-)
msg339781 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-04-09 16:56
New changeset 86f0354fcb815312295b923c55e39364d85d0388 by Victor Stinner in branch '3.7':
[3.7] bpo-36560: regrtest: don't collect the GC twice (GH-12747) (GH-12749)
https://github.com/python/cpython/commit/86f0354fcb815312295b923c55e39364d85d0388
msg339965 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-04-11 11:21
Patch making check on memory block leaks stricter:

diff --git a/Lib/test/libregrtest/refleak.py b/Lib/test/libregrtest/refleak.py
index 235d6bfd3a..dfadabdef6 100644
--- a/Lib/test/libregrtest/refleak.py
+++ b/Lib/test/libregrtest/refleak.py
@@ -130,7 +130,7 @@ def dash_R(ns, the_module, test_name, test_func):
     failed = False
     for deltas, item_name, checker in [
         (rc_deltas, 'references', check_rc_deltas),
-        (alloc_deltas, 'memory blocks', check_rc_deltas),
+        (alloc_deltas, 'memory blocks', check_fd_deltas),
         (fd_deltas, 'file descriptors', check_fd_deltas)
     ]:
         # ignore warmup runs


Using this patch, at least the following tests fail:

* test_asyncio
* test_code
* test_collections
* test_contextlib
* test_contextlib_async
* test_ctypes
* test_functools
* test_multiprocessing_forkserver
* test_multiprocessing_spawn
* test_regrtest
* test_statistics
* test_typing
* test_xml_etree_c

I didn't analyze why yet. I guess that they are not real memory leaks, but more minor issue in the code checking for memory leaks. Sadly, it seems like such small glitch can cause a whole Refleak buildbot worker to fail :-(
msg340386 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-04-17 10:16
Another example:

AMD64 Fedora Rawhide Refleaks 3.6
https://buildbot.python.org/all/#/builders/193/builds/5
test_contextlib leaked [8, 1, 1] memory blocks, sum=10

I cannot reproduce the issue on my ("idle") laptop. It sounds like a random failure (race condition).
msg353071 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-24 10:45
Recent fail on AMD64 Fedora Rawhide Refleaks 3.x:
https://buildbot.python.org/all/#builders/189/builds/162
test_functools leaked [1, 2, 1] memory blocks, sum=4
History
Date User Action Args
2019-09-24 10:45:21vstinnersetmessages: + msg353071
2019-04-17 10:16:39vstinnersetmessages: + msg340386
2019-04-11 11:21:06vstinnersetmessages: + msg339965
2019-04-09 16:56:07vstinnersetmessages: + msg339781
2019-04-09 16:43:06vstinnersetmessages: + msg339777
2019-04-09 16:31:39vstinnersetpull_requests: + pull_request12676
2019-04-09 16:26:21vstinnersetmessages: + msg339772
2019-04-09 16:01:27vstinnersetpull_requests: + pull_request12674
2019-04-09 16:01:19vstinnersetmessages: + msg339768
2019-04-09 12:42:33vstinnersetpull_requests: + pull_request12668
2019-04-09 12:23:51vstinnersetmessages: + msg339738
2019-04-09 12:05:09vstinnersetmessages: + msg339735
2019-04-09 12:01:30vstinnersetpull_requests: + pull_request12667
2019-04-09 11:28:48serhiy.storchakasetnosy: + inada.naoki, levkivskyi
messages: + msg339733
2019-04-09 11:22:53vstinnersetmessages: + msg339732
2019-04-09 11:20:13vstinnersetmessages: + msg339731
2019-04-09 10:55:15vstinnersetkeywords: + patch
stage: patch review
pull_requests: + pull_request12666
2019-04-08 17:23:43serhiy.storchakasetnosy: + serhiy.storchaka
2019-04-08 17:14:54vstinnersetmessages: + msg339663
2019-04-08 17:11:59vstinnersetfiles: + test_functools2.py

messages: + msg339662
2019-04-08 15:33:24vstinnersetfiles: + test_functools.py

messages: + msg339651
2019-04-08 14:22:16vstinnercreate