Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test_exceptions.ExceptionTests.test_recursion_in_except_handler stack overflow on Windows debug builds #88514

Closed
Fidget-Spinner opened this issue Jun 8, 2021 · 17 comments
Labels
3.11 only security fixes OS-windows type-crash A hard crash of the interpreter, possibly with a core dump

Comments

@Fidget-Spinner
Copy link
Member

BPO 44348
Nosy @pfmoore, @orsenthil, @vstinner, @tjguk, @markshannon, @zware, @zooba, @pablogsal, @isidentical, @Fidget-Spinner
PRs
  • bpo-44348: Revert "bpo-39573: Py_TYPE becomes a static inline function (GH-26493)" #26596
  • bpo-44348: Move trace-info to thread-state #26623
  • bpo-44348: Use debug builds for windows CI #26627
  • bpo-44348: BaseException deallocator uses trashcan #28190
  • Files
  • bench.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2021-09-07.13:43:45.052>
    created_at = <Date 2021-06-08.09:47:57.526>
    labels = ['OS-windows', 'type-crash', '3.11']
    title = 'test_exceptions.ExceptionTests.test_recursion_in_except_handler stack overflow on Windows debug builds'
    updated_at = <Date 2021-09-07.16:45:25.006>
    user = 'https://github.com/Fidget-Spinner'

    bugs.python.org fields:

    activity = <Date 2021-09-07.16:45:25.006>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2021-09-07.13:43:45.052>
    closer = 'vstinner'
    components = ['Windows']
    creation = <Date 2021-06-08.09:47:57.526>
    creator = 'kj'
    dependencies = []
    files = ['50266']
    hgrepos = []
    issue_num = 44348
    keywords = ['patch']
    message_count = 17.0
    messages = ['395316', '395320', '395322', '395340', '395392', '395451', '395514', '400992', '401150', '401161', '401185', '401194', '401195', '401269', '401271', '401305', '401307']
    nosy_count = 10.0
    nosy_names = ['paul.moore', 'orsenthil', 'vstinner', 'tim.golden', 'Mark.Shannon', 'zach.ware', 'steve.dower', 'pablogsal', 'BTaskaya', 'kj']
    pr_nums = ['26596', '26623', '26627', '28190']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'crash'
    url = 'https://bugs.python.org/issue44348'
    versions = ['Python 3.11']

    @Fidget-Spinner
    Copy link
    Member Author

    Hi all, I created this issue after discussion in https://bugs.python.org/issue39573#msg395206:

    In bpo-39573 "[C API] Make PyObject an opaque structure in the limited C API" the commit f3fa63e ("Py_TYPE becomes a static inline function") may have caused test_recursion_in_except_handler to fail only on windows debug builds (according to git bisect anyways) due to higher stack consumption causing a stack overflow.

    That test was added in 4e7a69b (Dec 2020), and was passing until last week. Currently another test (bpo-11105, test_ast test_recursion_direct) is masking the broken test, but it's also a stack overflow in windows debug. When I reverted commit f3fa63e, test_recursion_in_except_handler passes but test_recursion_direct still fails.

    For test_recursion_direct, the overflow occurs after the Py_Enter/LeaveRecursiveCall guard, when calling some other function like _PyObject_LookupAttr.

    I can think of a few possible ways to fix this (in no particular order):

    1. Increase stack size for windows builds from 2MB (some valid points against this though, see msg395177).
    2. Decrease the global recursion limit only for windows debug.
    3. Custom recursion limits for each module depending on OS, which Batuhan has been working on for AST module at documentation minor issue : "Step back: WSGI" section from "HOWTO Use Python in the web" #70737.
    4. Skip the tests on windows debug, since most people on windows use release builds anyways which are unaffected.

    Thanks for your time! Which approach would you prefer?

    @Fidget-Spinner Fidget-Spinner added 3.9 only security fixes 3.10 only security fixes 3.11 only security fixes OS-windows type-crash A hard crash of the interpreter, possibly with a core dump labels Jun 8, 2021
    @pablogsal
    Copy link
    Member

    There are many buildbot failures today and this is getting a bit out of hand so I'm starting with a revert of f3fa63e until we have a better plan.

    @pablogsal
    Copy link
    Member

    New changeset 6d518bb by Pablo Galindo in branch 'main':
    bpo-44348: Revert "bpo-39573: Py_TYPE becomes a static inline function (GH-26493)" (GH-26596)
    6d518bb

    @zooba
    Copy link
    Member

    zooba commented Jun 8, 2021

    Decrease the global recursion limit only for windows debug.

    This is already what we do, so if someone has increased stack usage, they should also decrease the value here.

    Hopefully the non-debug value doesn't have to be reduced. In that case, it's worth looking at the stack allocations to reduce it (e.g. I once fixed this by moving a rarely used buffer into a dynamic allocation rather than a local, which saved about 1KB/frame).

    @vstinner
    Copy link
    Member

    vstinner commented Jun 9, 2021

    See also bpo-44360 "test_compile segfault on AMD64 Ubuntu 3.x".

    @Fidget-Spinner
    Copy link
    Member Author

    @Steve thanks for the info,

    This is already what we do, so if someone has increased stack usage, they should also decrease the value here.

    I see the recursion limit set in _PyEval_InitState but I don't see any special settings for windows. Is it somewhere else or am I missing something here :( ?

    BTW, I'm thinking about changing the CI checks for windows to test on debug builds. I find it strange that the macOS and ubuntu CIs test on debug builds while windows seemingly doesn't[1]. If we change it, similar errors may be caught earlier at the PR stage rather than when it's committed and causes the buildbots to fail. What do you think?

    [1] https://github.com/python/cpython/blob/main/.github/workflows/build.yml#L95

    @markshannon
    Copy link
    Member

    New changeset 54cb638 by Mark Shannon in branch 'main':
    bpo-44348: Move trace-info to thread-state (GH-26623)
    54cb638

    @vstinner
    Copy link
    Member

    vstinner commented Sep 3, 2021

    See bpo-45094: "Consider using __forceinline and __attribute__((always_inline)) on static inline functions (Py_INCREF, Py_TYPE) for debug builds".

    @vstinner
    Copy link
    Member

    vstinner commented Sep 6, 2021

    Another *working* approach to fix this issue: bpo-45115 "Windows: enable compiler optimizations when building Python in debug mode".

    @zooba
    Copy link
    Member

    zooba commented Sep 6, 2021

    It should also be possible to reduce the stack size of each frame. We've done that before, because there were multiple temporary buffers allocated on the stack that were trivially moved into functions. I bet we can also reduce the number of indirections, as last time I had to investigate a native stack we seem to have bloated up to 6-7 C frames for each Python frame.

    Failing that, I'd rather have an option to build with higher stack size just for tests (or users who care that much - my experience is that people tend to care more about many Python processes rather than high recursion). That only affects python[w][_uwp].exe.

    We could also add a threading API to allow creating new threads with larger stack size. That would allow us to write tests that can do it, and also enable users who may run into it. I'd prefer Windows just be able to expand the stack better at runtime, but the reserved address space apparently wins the back-compat argument.

    @vstinner
    Copy link
    Member

    vstinner commented Sep 6, 2021

    test_recursion_in_except_handler() creates chained of exceptions. When an exception is deallocated, it calls the deallocator of another exception, etc.

    • recurse_in_except() sub-test creates chains of 11 nested deallocator calls
    • recurse_in_body_and_except() sub-test creates a chain of 8192 nested deallocator calls

    I'm not sure how recurse_in_body_and_except() can creates a chain which is so long, knowing that the test sets the recursion limit to 44 frames (default Python limit is 1000).

    @vstinner
    Copy link
    Member

    vstinner commented Sep 6, 2021

    bench.py: pyperf benchmarks on BaseException_dealloc().

    Benchmarks results on PR 28190 on Fedora 34 with LTO (I didn't use PGO nor CPU pinning). It says "faster" but it's likely noise in the build or in the benchmark (again, I didn't use CPU pinning, and I used my laptop while the benchmark was running. IMO it means that the trashcan overhead is not significant on this benchmark.

    $ python3 -m pyperf compare_to ref.json trashcan.json --table
    +----------------+---------+-----------------------+
    | Benchmark      | ref     | trashcan              |
    +================+=========+=======================+
    | list 10_000    | 1.27 ms | 1.18 ms: 1.08x faster |
    +----------------+---------+-----------------------+
    | chain 10_000   | 3.94 ms | 3.89 ms: 1.01x faster |
    +----------------+---------+-----------------------+
    | Geometric mean | (ref)   | 1.01x faster          |
    +----------------+---------+-----------------------+

    Benchmark hidden because not significant (6): list 10, list 100, list 1000, chain 10, chain 100, chain 1000

    @vstinner
    Copy link
    Member

    vstinner commented Sep 6, 2021

    Using trashcan, it's possible to create a long longer than 10 000 exceptions. But currently, Python does crash in this case, so I stopped the benchmark at 10 000.

    @vstinner
    Copy link
    Member

    vstinner commented Sep 7, 2021

    New changeset fb30509 by Victor Stinner in branch 'main':
    bpo-44348: BaseException deallocator uses trashcan (GH-28190)
    fb30509

    @vstinner
    Copy link
    Member

    vstinner commented Sep 7, 2021

    Even if Python 3.9 and 3.10 have the idea, I don't think that long exception chains are common enough to justify a backport. I prefer to leave the code as it it in stable branches, and see how things go with the change in the main branch. I close the issue.

    @vstinner vstinner removed 3.9 only security fixes 3.10 only security fixes labels Sep 7, 2021
    @vstinner vstinner closed this as completed Sep 7, 2021
    @vstinner vstinner removed 3.9 only security fixes 3.10 only security fixes labels Sep 7, 2021
    @vstinner vstinner closed this as completed Sep 7, 2021
    @vstinner
    Copy link
    Member

    vstinner commented Sep 7, 2021

    I rejected my two other ideas to fix this issue:

    • bpo-45094: Consider using __forceinline and __attribute__((always_inline)) on static inline functions (Py_INCREF, Py_TYPE) for debug builds
    • bpo-45115: Windows: enable compiler optimizations when building Python in debug mode

    @vstinner
    Copy link
    Member

    vstinner commented Sep 7, 2021

    I marked bpo-27277 as a duplicate of this issue.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.11 only security fixes OS-windows type-crash A hard crash of the interpreter, possibly with a core dump
    Projects
    None yet
    Development

    No branches or pull requests

    5 participants