Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test_code leaked [1, 1, 1] memory blocks on x86 Gentoo Refleaks 3.6/3.x #75400

Closed
vstinner opened this issue Aug 16, 2017 · 24 comments
Closed

test_code leaked [1, 1, 1] memory blocks on x86 Gentoo Refleaks 3.6/3.x #75400

vstinner opened this issue Aug 16, 2017 · 24 comments
Labels
3.7 (EOL) end of life performance Performance or resource usage tests Tests in the Lib/test dir

Comments

@vstinner
Copy link
Member

BPO 31217
Nosy @brettcannon, @pitrou, @vstinner, @zware
PRs
  • bpo-31217: Fix regrtest -R for small integer #3258
  • bpo-31217: Fix regrtest -R for small integer #3260
  • [3.6] bpo-31217: Fix regrtest -R for small integer (#3260) #3261
  • Files
  • test_leak.py
  • leak.py
  • test_haypoleak.py
  • leak2.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2017-09-01.13:13:53.099>
    created_at = <Date 2017-08-16.09:41:40.492>
    labels = ['3.7', 'tests', 'performance']
    title = 'test_code leaked [1, 1, 1] memory blocks on x86 Gentoo Refleaks 3.6/3.x'
    updated_at = <Date 2017-09-01.13:13:53.098>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2017-09-01.13:13:53.098>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2017-09-01.13:13:53.099>
    closer = 'vstinner'
    components = ['Tests']
    creation = <Date 2017-08-16.09:41:40.492>
    creator = 'vstinner'
    dependencies = []
    files = ['47110', '47112', '47113', '47114']
    hgrepos = []
    issue_num = 31217
    keywords = []
    message_count = 24.0
    messages = ['300344', '300442', '300443', '300689', '301049', '301051', '301054', '301055', '301059', '301069', '301070', '301072', '301073', '301074', '301075', '301076', '301077', '301079', '301081', '301082', '301090', '301094', '301100', '301101']
    nosy_count = 4.0
    nosy_names = ['brett.cannon', 'pitrou', 'vstinner', 'zach.ware']
    pr_nums = ['3258', '3260', '3261']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'resource usage'
    url = 'https://bugs.python.org/issue31217'
    versions = ['Python 3.6', 'Python 3.7']

    @vstinner
    Copy link
    Member Author

    http://buildbot.python.org/all/builders/x86%20Gentoo%20Refleaks%203.6/builds/58/steps/test/logs/stdio

    test_code leaked [1, 1, 1] memory blocks, sum=3
    (...)
    Re-running test 'test_code' in verbose mode
    (...)
    test_code leaked [1, 1, 1] memory blocks, sum=3

    I'm unable to reproduce the bug:

    haypo@selma$ ./python -m test -R 3:3 test_code
    (...)
    1 test OK.

    @vstinner vstinner added tests Tests in the Lib/test dir performance Performance or resource usage labels Aug 16, 2017
    @vstinner
    Copy link
    Member Author

    It's even worse, many tests fail with 1 to 3 memory blocks on x86 Gentoo Refleaks 3.x... but then pass when run again.

    I created bpo-31227 "regrtest: reseed random with the same seed before running a test file" which might help (or not?) this issue.

    http://buildbot.python.org/all/builders/x86%20Gentoo%20Refleaks%203.x/builds/62

    test_sys leaked [1, 1, 1] memory blocks, sum=3
    test_robotparser leaked [1, 1, 1] memory blocks, sum=3
    test_nntplib leaked [1, 1, 1] memory blocks, sum=3
    test_tools leaked [3, 2, 1] memory blocks, sum=6
    test_fork1 leaked [1, 1, 1] memory blocks, sum=3
    test_poll leaked [1, 1, 1] memory blocks, sum=3
    test_asyncio leaked [2, 1, 1] memory blocks, sum=4

    Re-running failed tests in verbose mode
    Re-running test 'test_sys' in verbose mode
    Re-running test 'test_robotparser' in verbose mode
    Re-running test 'test_nntplib' in verbose mode
    Re-running test 'test_tools' in verbose mode
    test_tools leaked [1, 3, 1] memory blocks, sum=5
    Re-running test 'test_fork1' in verbose mode
    Re-running test 'test_poll' in verbose mode
    Re-running test 'test_asyncio' in verbose mode

    @vstinner
    Copy link
    Member Author

    regrtest also has an issue: when tests are run again, they should be run in a fresh process, especially when tests are run in parallel (-jN).

    @vstinner vstinner changed the title test_code leaked [1, 1, 1] memory blocks on x86 Gentoo Refleaks 3.6 test_code leaked [1, 1, 1] memory blocks on x86 Gentoo Refleaks 3.6/3.x Aug 21, 2017
    @vstinner vstinner added the 3.7 (EOL) end of life label Aug 21, 2017
    @vstinner
    Copy link
    Member Author

    I succeeded to reproduce a leak on the x86 Gentoo Refleaks 3.x buildbot using the following command. I'm unable to reproduce the bug on my Fedora 25.

    pydev@stormageddon ~/cpython $ ./python -m test -R 3:3 -j1 -m test.test_code.CoExtra.test_free_different_thread test_code -F
    Run tests in parallel using 1 child processes
    0:00:00 load avg: 2.26 [ 1/1] test_code failed
    beginning 6 repetitions
    123456
    ......
    test_code leaked [1, 1, 1] memory blocks, sum=3
    0:00:01 load avg: 2.26 [ 2/2] test_code failed
    beginning 6 repetitions
    123456
    ......
    test_code leaked [1, 1, 1] memory blocks, sum=3

    2 tests failed:
    test_code test_code

    Total duration: 2 sec
    Tests result: FAILURE

    @vstinner
    Copy link
    Member Author

    Another command to reproduce a leak:

    ./python -m test -R 3:3 -j1 test_sys -F -m test.test_sys.SysModuleTest.test_current_frames

    @vstinner
    Copy link
    Member Author

    Current status of my analysis. I'm able to reproduce the bug using this modified Lib/test/test_sys.py:
    ---

    import unittest, test.support
    from test.support.script_helper import assert_python_ok
    import threading
    
    def f123():
        pass
    
    class SysModuleTest(unittest.TestCase):
        def test_current_frames(self):
            t = threading.Thread(target=f123)
            t.start()
            t.join()

    And this shell script:
    ---
    set -x
    while true; do PYTHONHASHSEED=$RANDOM ./python -m test -R 3:3 -j1 test_sys -m test_current_frames || break; done
    ---

    With this command, when the test fail, it's possible to reproduce the bug 100% of times using the written PYTHONHASHSEED. Example:
    ---
    pydev@stormageddon ~/cpython $ ./bug.sh
    ++ true
    ++ PYTHONHASHSEED=23016
    ++ ./python -m test -R 3:3 -j1 test_sys -m test_current_frames
    Run tests in parallel using 1 child processes
    0:00:00 load avg: 0.15 [1/1] test_sys passed
    beginning 6 repetitions
    123456
    ......
    1 test OK.

    Total duration: 982 ms
    Tests result: SUCCESS
    ++ true
    ++ PYTHONHASHSEED=9197
    ++ ./python -m test -R 3:3 -j1 test_sys -m test_current_frames
    Run tests in parallel using 1 child processes
    0:00:00 load avg: 0.22 [1/1/1] test_sys failed
    beginning 6 repetitions
    123456
    ......
    test_sys leaked [1, 1, 1] memory blocks, sum=3

    1 test failed:
    test_sys

    Total duration: 1000 ms
    Tests result: FAILURE
    ++ break

    pydev@stormageddon ~/cpython $ PYTHONHASHSEED=9197 ./python -m test -R 3:3 -j1 test_sys -m test_current_frames
    Run tests in parallel using 1 child processes
    0:00:00 load avg: 0.20 [1/1/1] test_sys failed
    beginning 6 repetitions
    123456
    ......
    test_sys leaked [1, 1, 1] memory blocks, sum=3

    1 test failed:
    test_sys

    Total duration: 987 ms
    Tests result: FAILURE
    ---

    @vstinner
    Copy link
    Member Author

    Ok, I succeeded to reproduce the issue on my amd64 laptop using gcc -m32. I'm now sure that the issue can be reproduced on 32-bit. Maybe the issue can be reproduced on 64-bit but is more unlikely?

    @vstinner
    Copy link
    Member Author

    Copy attached test_leak.py to Lib/test/ and run:

    set -x
    while true; do PYTHONHASHSEED=$RANDOM ./python -m test -R 3:30 -j1 test_leak || break; done

    @vstinner
    Copy link
    Member Author

    I created bpo-31317: Memory leak in dict with shared keys.

    @vstinner
    Copy link
    Member Author

    I'm still able to reproduce the bug using attached leak.py script which is now 100 lines long, but has to import logging, socket and pickle. Without these imports, the bug hides (maybe the hardcoded hash values must be recomputed, but I failed to find hash values in that case).

    @pitrou
    Copy link
    Member

    pitrou commented Aug 31, 2017

    Attached test script still fails even though the objects are pre-allocated, and without setting a hash seed.

    $ ./python -m test -R 3:6 test_haypoleak
    Run tests sequentially
    0:00:00 load avg: 0.23 [1/1] test_haypoleak
    beginning 9 repetitions
    123456789
    .........
    test_haypoleak leaked [1, 1, 1, 1, 1, 1] memory blocks, sum=6
    test_haypoleak failed

    @vstinner
    Copy link
    Member Author

    leak2.py: script based on my leak.py and Antoine's test_haypoleak.py.

    If you replace "if 0:" with "if 1:", it seems like the test doesn't anymore.

    It seems like the "leak" is the call to sys.getallocatedblocks() which creates a new integer, and the integer is kept alive between two loop iterations.

    Maybe I missed something.

    @pitrou
    Copy link
    Member

    pitrou commented Aug 31, 2017

    It seems like the "leak" is the call to sys.getallocatedblocks() which creates a new integer, and the integer is kept alive between two loop iterations.

    I doubt it. If that was the case, the reference count would increase as well.

    @pitrou
    Copy link
    Member

    pitrou commented Aug 31, 2017

    Actually, leak2.py doesn't attempt to cleanup anything between runs, so it can't be reliable for detecting leaks. You lack dash_R_cleanup() somewhere.

    @vstinner
    Copy link
    Member Author

    Antoine: "I doubt it. If that was the case, the reference count would increase as well."

    The bug is really weird :-)

    Antoine: "Actually, leak2.py doesn't attempt to cleanup anything between runs, so it can't be reliable for detecting leaks. You lack dash_R_cleanup() somewhere."

    I simplified dash_R_cleanup() and at the end, it was empty :-) I don't see how the test_current_frames() would need to clear any kind of cache.

    Maybe:

    • sys._clear_type_cache()
    • gc.collect()
    • C function clear_freelists() of gcmodule.c

    But I tried to call these functions, and it doesn't change anything.

    I don't see how calling set.add() and set.discard() would impact any cache, except of maybe of a free list?

    The strange thing is that calling dangling.clear() explicitly in test_current_frames() "fixes the leak" (hides the bug?). But any tiny change on this file also hides the bug. The script is very fragile.

    @pitrou
    Copy link
    Member

    pitrou commented Aug 31, 2017

    Ahah.
    Actually, it's quite simple :-) On 64-bit Python:

    >>> id(82914 - 82913) == id(1)
    True

    On 32-bit Python:

    >>> id(82914 - 82913) == id(1)
    False

    So the first non-zero alloc_delta really has a snowball effect, as it creates new memory block which will produce a non-zero alloc_delta on the next run, etc.

    @pitrou
    Copy link
    Member

    pitrou commented Sep 1, 2017

    This patch does the trick:

    diff --git a/Lib/test/libregrtest/refleak.py b/Lib/test/libregrtest/refleak.py
    index efe5210..68e490a 100644
    --- a/Lib/test/libregrtest/refleak.py
    +++ b/Lib/test/libregrtest/refleak.py
    @@ -48,6 +48,11 @@ def dash_R(the_module, test, indirect_test, huntrleaks):
         print("beginning", repcount, "repetitions", file=sys.stderr)
         print(("1234567890"*(repcount//10 + 1))[:repcount], file=sys.stderr,
               flush=True)
    +
    +    int_pool = {i: i for i in range(-1000, 1000)}
    +    def get_pooled_int(v):
    +        return int_pool.setdefault(v, v)
    +
         # initialize variables to make pyflakes quiet
         rc_before = alloc_before = fd_before = 0
         for i in range(repcount):
    @@ -56,9 +61,9 @@ def dash_R(the_module, test, indirect_test, huntrleaks):
                                                              abcs)
             print('.', end='', file=sys.stderr, flush=True)
             if i >= nwarmup:
    -            rc_deltas[i] = rc_after - rc_before
    -            alloc_deltas[i] = alloc_after - alloc_before
    -            fd_deltas[i] = fd_after - fd_before
    +            rc_deltas[i] = get_pooled_int(rc_after - rc_before)
    +            alloc_deltas[i] = get_pooled_int(alloc_after - alloc_before)
    +            fd_deltas[i] = get_pooled_int(fd_after - fd_before)
             alloc_before = alloc_after
             rc_before = rc_after
             fd_before = fd_after

    @vstinner
    Copy link
    Member Author

    vstinner commented Sep 1, 2017

    So the first non-zero alloc_delta really has a snowball effect, as it creates new memory block which will produce a non-zero alloc_delta on the next run, etc.

    Oh, I suspected an issue around this code but I was unable to explain it. I focused strongly on test_current_frames(), whereas this function is just fine... It's really strange that the bug only triggers on very specific conditions.

    """
    Actually, it's quite simple :-) On 64-bit Python:

    >>> id(82914 - 82913) == id(1)
    True

    On 32-bit Python:

    >>> id(82914 - 82913) == id(1)
    False
    """

    That's very strange.

    Another workaround:

    diff --git a/Lib/test/libregrtest/refleak.py b/Lib/test/libregrtest/refleak.py
    index efe52107cb..35d3f8e42d 100644
    --- a/Lib/test/libregrtest/refleak.py
    +++ b/Lib/test/libregrtest/refleak.py
    @@ -56,9 +56,10 @@ def dash_R(the_module, test, indirect_test, huntrleaks):
                                                              abcs)
             print('.', end='', file=sys.stderr, flush=True)
             if i >= nwarmup:
    -            rc_deltas[i] = rc_after - rc_before
    -            alloc_deltas[i] = alloc_after - alloc_before
    -            fd_deltas[i] = fd_after - fd_before
    +            def maybe_small_long(x): return int(str(x))
    +            rc_deltas[i] = maybe_small_long(rc_after - rc_before)
    +            alloc_deltas[i] = maybe_small_long(alloc_after - alloc_before)
    +            fd_deltas[i] = maybe_small_long(fd_after - fd_before)
             alloc_before = alloc_after
             rc_before = rc_after
             fd_before = fd_after

    @vstinner
    Copy link
    Member Author

    vstinner commented Sep 1, 2017

    Ok, I wrote a PR for my str(int(x)) workaround: #3258

    I tested manually on the buildbot that it fixes all bugs listed on this issue.

    @vstinner
    Copy link
    Member Author

    vstinner commented Sep 1, 2017

    + int_pool = {i: i for i in range(-1000, 1000)}
    + def get_pooled_int(v):
    + return int_pool.setdefault(v, v)

    I'm not sure that I understand this code. It makes sure that you get a single object in memory for the same integer, and not only for Python "small integer singletons"?

    About the -1000..1000 range: if a function leaks more than 1 memory block or more than 1 reference, there is already something wrong no?

    Maybe it's ok to only care of values -1, 0 and 1 :-) My code only cares of Python small integer singletons.

    I have no strong preference between my code or yours. I only care of fixing the buildbot :-) What do you prefer, Antoine?

    @pitrou
    Copy link
    Member

    pitrou commented Sep 1, 2017

    It makes sure that you get a single object in memory for the same integer, and not only for Python "small integer singletons"?

    Yes.

    About the -1000..1000 range: if a function leaks more than 1 memory block or more than 1 reference, there is already something wrong no?

    Well, the current code can still hide it, for example if a function leaks [1000, 0, 1000, 1000, 1000] references, the code says it's ok.

    @vstinner
    Copy link
    Member Author

    vstinner commented Sep 1, 2017

    New changeset 6c2feab by Victor Stinner in branch 'master':
    bpo-31217: Fix regrtest -R for small integer (bpo-3260)
    6c2feab

    @vstinner
    Copy link
    Member Author

    vstinner commented Sep 1, 2017

    New changeset 98c849a by Victor Stinner in branch '3.6':
    bpo-31217: Fix regrtest -R for small integer (bpo-3260) (bpo-3261)
    98c849a

    @vstinner
    Copy link
    Member Author

    vstinner commented Sep 1, 2017

    Ok, this issue was really insane. Thank you very much Antoine Pitrou for explaning me the root bug ("int - int") and for the fix. The bug is now fixed in Python 3.6 and master (3.7). I checked manually the applied fix on the Gentoo Refleak 3.x buildbot using test_code and test_sys run in a loop: no failure after 30 runs.

    *Hopefully*, it was a bug in the test, not in the code ;-)

    The fix should help to detect real memory leaks, and maybe we can now replace again all() with any() in libregrtest/refleak.py to detect more leaks: see bpo-30776.

    @vstinner vstinner closed this as completed Sep 1, 2017
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life performance Performance or resource usage tests Tests in the Lib/test dir
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants