Title: test_exceptions.ExceptionTests.test_recursion_in_except_handler stack overflow on Windows debug builds
Type: crash Stage: patch review
Components: Windows Versions: Python 3.11, Python 3.10, Python 3.9
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: BTaskaya, Mark.Shannon, kj, orsenthil, pablogsal, paul.moore, steve.dower, tim.golden, vstinner, zach.ware
Priority: normal Keywords: patch

Created on 2021-06-08 09:47 by kj, last changed 2021-06-10 07:47 by Mark.Shannon.

Pull Requests
URL Status Linked Edit
PR 26596 merged pablogsal, 2021-06-08 11:07
PR 26623 merged Mark.Shannon, 2021-06-09 11:01
PR 26627 open kj, 2021-06-09 18:06
Messages (7)
msg395316 - (view) Author: Ken Jin (kj) * (Python triager) Date: 2021-06-08 09:47
Hi all, I created this issue after discussion in

In issue39573 "[C API] Make PyObject an opaque structure in the limited C API" the commit f3fa63ec75fdbb4a08a10957a5c631bf0c4a5970 ("Py_TYPE becomes a static inline function") may have caused test_recursion_in_except_handler to fail only on windows debug builds (according to git bisect anyways) due to higher stack consumption causing a stack overflow.

That test was added in 4e7a69bdb63a104587759d7784124492dcdd496e (Dec 2020), and was passing until last week. Currently another test (bpo-11105, test_ast test_recursion_direct) is masking the broken test, but it's also a stack overflow in windows debug. When I reverted commit f3fa63ec75fdbb4a08a10957a5c631bf0c4a5970, test_recursion_in_except_handler passes but test_recursion_direct still fails.

For test_recursion_direct, the overflow occurs after the Py_Enter/LeaveRecursiveCall guard, when calling some other function like _PyObject_LookupAttr.

I can think of a few possible ways to fix this (in no particular order):

1. Increase stack size for windows builds from 2MB (some valid points against this though, see msg395177).
2. Decrease the global recursion limit only for windows debug.
3. Custom recursion limits for each module depending on OS, which Batuhan has been working on for AST module at GH-26550.
4. Skip the tests on windows debug, since most people on windows use release builds anyways which are unaffected.

Thanks for your time! Which approach would you prefer?
msg395320 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021-06-08 11:02
There are many buildbot failures today and this is getting a bit out of hand so I'm starting with a revert of f3fa63ec75fdbb4a08a10957a5c631bf0c4a5970 until we have a better plan.
msg395322 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021-06-08 11:24
New changeset 6d518bb3a11f9b16098f45b21a13ebe8f537f045 by Pablo Galindo in branch 'main':
bpo-44348: Revert "bpo-39573: Py_TYPE becomes a static inline function (GH-26493)" (GH-26596)
msg395340 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2021-06-08 16:51
> Decrease the global recursion limit only for windows debug.

This is already what we do, so if someone has increased stack usage, they should also decrease the value here.

Hopefully the non-debug value doesn't have to be reduced. In that case, it's worth looking at the stack allocations to reduce it (e.g. I once fixed this by moving a rarely used buffer into a dynamic allocation rather than a local, which saved about 1KB/frame).
msg395392 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-06-09 08:06
See also bpo-44360 "test_compile segfault on AMD64 Ubuntu 3.x".
msg395451 - (view) Author: Ken Jin (kj) * (Python triager) Date: 2021-06-09 18:07
@Steve thanks for the info,

> This is already what we do, so if someone has increased stack usage, they should also decrease the value here.

I see the recursion limit set in _PyEval_InitState but I don't see any special settings for windows. Is it somewhere else or am I missing something here :( ?

BTW, I'm thinking about changing the CI checks for windows to test on debug builds. I find it strange that the macOS and ubuntu CIs test on debug builds while windows seemingly doesn't[1]. If we change it, similar errors may be caught earlier at the PR stage rather than when it's committed and causes the buildbots to fail. What do you think?

msg395514 - (view) Author: Mark Shannon (Mark.Shannon) * (Python committer) Date: 2021-06-10 07:47
New changeset 54cb63863f19a7c64d9a3a5fd97bdfc0dd7ab374 by Mark Shannon in branch 'main':
bpo-44348: Move trace-info to thread-state (GH-26623)
Date User Action Args
2021-06-10 07:47:08Mark.Shannonsetmessages: + msg395514
2021-06-09 18:07:33kjsetmessages: + msg395451
2021-06-09 18:06:59kjsetpull_requests: + pull_request25213
2021-06-09 11:01:27Mark.Shannonsetnosy: + Mark.Shannon
pull_requests: + pull_request25208
2021-06-09 10:31:02orsenthilsetnosy: + orsenthil
2021-06-09 08:06:57vstinnersetmessages: + msg395392
2021-06-08 16:51:19steve.dowersetmessages: + msg395340
2021-06-08 11:24:57pablogsalsetmessages: + msg395322
2021-06-08 11:07:12pablogsalsetkeywords: + patch
stage: patch review
pull_requests: + pull_request25179
2021-06-08 11:02:30pablogsalsetmessages: + msg395320
2021-06-08 09:47:57kjcreate