Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test_marshal: crash in Python 3.7b5 on Windows 10 #77901

Closed
vstinner opened this issue May 31, 2018 · 29 comments
Closed

test_marshal: crash in Python 3.7b5 on Windows 10 #77901

vstinner opened this issue May 31, 2018 · 29 comments
Assignees
Labels
3.7 (EOL) end of life 3.8 only security fixes OS-windows tests Tests in the Lib/test dir type-crash A hard crash of the interpreter, possibly with a core dump

Comments

@vstinner
Copy link
Member

BPO 33720
Nosy @pfmoore, @vstinner, @tjguk, @ned-deily, @zware, @serhiy-storchaka, @zooba, @miss-islington
PRs
  • bpo-33720: Improve test for the stack overflow in marshal.loads(). #7336
  • bpo-33720: Reduces maximum marshal recursion depth on release builds. #7401
  • [3.7] bpo-33720: Reduces maximum marshal recursion depth on release builds. (GH-7401) #7405
  • bpo-33720: Refactor marshalling/unmarshalling floats. #8071
  • [3.7] bpo-33720: Improve tests for the stack overflow in marshal.loads(). (GH-7336) #8104
  • [3.6] bpo-33720: Improve tests for the stack overflow in marshal.loads(). (GH-7336) #8105
  • [2.7] bpo-33720: Improve tests for the stack overflow in marshal.loads(). (GH-7336). #8107
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/zooba'
    closed_at = <Date 2018-07-24.08:07:53.063>
    created_at = <Date 2018-05-31.15:13:35.091>
    labels = ['3.8', '3.7', 'tests', 'OS-windows', 'type-crash']
    title = 'test_marshal: crash in Python 3.7b5 on Windows 10'
    updated_at = <Date 2018-07-24.08:07:53.062>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2018-07-24.08:07:53.062>
    actor = 'serhiy.storchaka'
    assignee = 'steve.dower'
    closed = True
    closed_date = <Date 2018-07-24.08:07:53.063>
    closer = 'serhiy.storchaka'
    components = ['Tests', 'Windows']
    creation = <Date 2018-05-31.15:13:35.091>
    creator = 'vstinner'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 33720
    keywords = ['patch']
    message_count = 29.0
    messages = ['318323', '318325', '318326', '318327', '318328', '318329', '318330', '318331', '318442', '318497', '318689', '318690', '318693', '318695', '318696', '318698', '318714', '318771', '320053', '320221', '321012', '321054', '321055', '321059', '321084', '321087', '321089', '321092', '322279']
    nosy_count = 8.0
    nosy_names = ['paul.moore', 'vstinner', 'tim.golden', 'ned.deily', 'zach.ware', 'serhiy.storchaka', 'steve.dower', 'miss-islington']
    pr_nums = ['7336', '7401', '7405', '8071', '8104', '8105', '8107']
    priority = 'high'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'crash'
    url = 'https://bugs.python.org/issue33720'
    versions = ['Python 3.7', 'Python 3.8']

    @vstinner
    Copy link
    Member Author

    Follow-up of bpo-33719.

    C:\Users\vstinner\AppData\Local\Programs\Python\Python37>python.exe -m test test_marshal -v
    == CPython 3.7.0b5 (v3.7.0b5:abb8802389, May 31 2018, 01:54:01) [MSC v.1913 64 bit (AMD64)]
    == Windows-10-10.0.16299-SP0 little-endian
    == cwd: C:\Users\vstinner\AppData\Local\Temp\test_python_3836
    == CPU count: 2
    == encodings: locale=cp1252, FS=utf-8
    Run tests sequentially
    0:00:00 [1/1] test_marshal
    (...)
    test_loads_2x_code (test.test_marshal.BugsTestCase) ... Windows fatal exception: stack overflow

    Current thread 0x000003a0 (most recent call first):
    File "C:\Users\vstinner\AppData\Local\Programs\Python\Python37\lib\unittest\case.py", line 178 in handle
    File "C:\Users\vstinner\AppData\Local\Programs\Python\Python37\lib\unittest\case.py", line 743 in assertRaises
    File "C:\Users\vstinner\AppData\Local\Programs\Python\Python37\lib\test\test_marshal.py", line 215 in test_loads_2x_code
    (...)

    Crashes in test_marshal is on old topic:

    Current stack size: 2 million bytes (1.9 MiB)

    PCbuild/python.vcxproj: <StackReserveSize>2000000</StackReserveSize>
    PCbuild/pythonw.vcxproj: <StackReserveSize>2000000</StackReserveSize>

    @vstinner vstinner added 3.8 only security fixes tests Tests in the Lib/test dir OS-windows type-crash A hard crash of the interpreter, possibly with a core dump labels May 31, 2018
    @vstinner
    Copy link
    Member Author

    I compiled the master branch of Python in release mode using VS2015 (MSC v.1912 64 bit) and I failed to reproduce the crash:

    @vstinner vstinner added 3.7 (EOL) end of life and removed 3.8 only security fixes labels May 31, 2018
    @vstinner
    Copy link
    Member Author

    I compiled the master branch of Python in release mode using VS2015 (MSC v.1912 64 bit) and I failed to reproduce the crash

    I also failed to reproduce the crash in the 3.7 branch.

    I guess that the python.org binary has been compiled differently.

    @zooba
    Copy link
    Member

    zooba commented May 31, 2018

    The uploaded binary is compiled with PGO enabled (and trained on most of the test suite). I'll check it out - hopefully we don't need to do anything drastic and can get away with either a compiler update or disabling optimizations on a single function.

    @zooba zooba self-assigned this May 31, 2018
    @zooba
    Copy link
    Member

    zooba commented May 31, 2018

    Ned, FYI

    @zooba zooba added 3.8 only security fixes release-blocker labels May 31, 2018
    @vstinner
    Copy link
    Member Author

    priority: normal -> release blocker

    I don't think that it's a release blocker. test_marshal does only crash on corner cases which should not occur on usual "valid" data.

    @zooba
    Copy link
    Member

    zooba commented May 31, 2018

    A crash in the test suite should be fixed, especially since we have protection against this crash (and a test that validates it).

    In this case, apparently the stack allocation for each frame of r_object grew and now there isn't room for 2000 calls (the value of MAX_MARSHAL_STACK_DEPTH). This isn't really a robust way of handling it anyway, so I'll check out whether there's an easy way to safely probe the stack before recursing, otherwise we'll just have to cut the number a bit further.

    @zooba
    Copy link
    Member

    zooba commented May 31, 2018

    I need to stop working on this right now, but here's the locals layout in a normal release build in r_object:

    @rdi @rdi p = 0x00000034655ea3d0 0000003465403f6 @rsp+0x0080 v = 0x0000000000000000 0000003465403fc @rsp+0x00e0 buf = char [256] ""
    00000034`654040c0 @rsp+0x01e0 buf = char [256] ""

    In the PGO build, it looks like this:
    000000be1e003b50 @rsp+0x0080 v = 0x000000000000000
    000000be1e003b58 @rsp+0x0088 is_interned = 0n0 000000be1e003ef0 @rsp+0x0420 buf = char [256] ""
    000000be`1e003ff0 @rsp+0x0520 buf = char [256] ""

    I need to ping the team and figure out why the buffers are so far removed from the rest of the stack, and figure out what's in the gap. That seems to be the core of the problem.

    @zooba
    Copy link
    Member

    zooba commented Jun 1, 2018

    FYI, I posted the problem at https://developercommunity.visualstudio.com/content/problem/265778/pgo-generates-large-stack-frame.html (I _think_ it's public).

    Until then, it's probably best to reduce the recursion limit. Going to a depth of 1000 seems okay, and it certainly seems less bad than disabling optimisations.

    @serhiy-storchaka
    Copy link
    Member

    As for the test itself, the original test was added in dc78cc6. It tested that the stack overflow is not happen when unmarshal a fake code object with a deeply nested dict instead of co_code. It was renamed to test_loads_recursion in cf0fab2. In bpo-16475 (d7009c6) it was updated to mirror the changed structure of the code object in Python 3, but the original test was kept with the name test_loads_2x_code. Actually both tests work the same, because the value of few first bytes doesn't affect the result of these tests.

    Yet this test is fragile, it depends on the unstable structure of the code object, and it serves its purpose only because input data checking is not very strong in the current implementation. Future changes in the marshal module can make this test invalid, and this will not be noticed.

    In PR 7336 this test is rewritten in more reliable way.

    This likely is not directly related to the crash.

    @zooba
    Copy link
    Member

    zooba commented Jun 4, 2018

    Ned - PR 7401 fixes the crash at the cost of reducing the permitted recursion depth in marshal.

    Your call whether we take this fix or disable PGO, but I'd much rather take this fix (PGO has full-interpreter effects whereas this is very narrow). When I hear back about the compiler bug - if it is one - we should be able to raise the limit again without causing any compatibility issues.

    @ned-deily
    Copy link
    Member

    Steve, I'm OK with the proposed reduction as a workaround. We have other platform- and build-specific limits based on stack sizes etc. It's hard to have one-size-fits-all.

    @zooba
    Copy link
    Member

    zooba commented Jun 4, 2018

    New changeset 2a4a62b by Steve Dower in branch 'master':
    bpo-33720: Reduces maximum marshal recursion depth on release builds. (GH-7401)
    2a4a62b

    @zooba
    Copy link
    Member

    zooba commented Jun 4, 2018

    Thanks. I merged, and miss-islington is backporting it. I guess at this stage you're manually cherry-picking into your own 3.7 branch for release? I'll merge for 3.7.1 though.

    @ned-deily
    Copy link
    Member

    I guess at this stage you're manually cherry-picking into your own 3.7 branch for release?

    No, not until 3.7.0rc1 next week. No need to cherry pick. Thanks!

    @miss-islington
    Copy link
    Contributor

    New changeset 103058e by Miss Islington (bot) in branch '3.7':
    bpo-33720: Reduces maximum marshal recursion depth on release builds. (GH-7401)
    103058e

    @ned-deily
    Copy link
    Member

    With the workaround merged for 3.7.0rc1, I'm downgrading from "release blocker" to "high" since I believe Steve wants to keep this issue open for a possible compiler resolution.

    @zooba
    Copy link
    Member

    zooba commented Jun 5, 2018

    Thanks. Agreed this should stay open, but it is blocked on a compiler fix.

    @vstinner
    Copy link
    Member Author

    Thanks. Agreed this should stay open, but it is blocked on a compiler fix.

    Do you want to have a different limit depending on the compiler and the compiler version? I'm not sure that it's worth it. I'm happy with the heuristic "use lower limit on Windows". By design, the test is fragile.

    I would suggest to close the issue, but you can keep it open if you prefer to increase the limit on a fixed compiler.

    @serhiy-storchaka
    Copy link
    Member

    What are your thoughts about rewriting the test (PR 7336)?

    @serhiy-storchaka
    Copy link
    Member

    The alternate solution is provided by PR 8071. The culprit of the main stack consumption is the code for unmarshalling float and complex in protocols 0 and 1 which uses 256-bytes buffers on the stack. Moving it to the separate function allows to decrease the stack consumption and increase the maximum marshal recursion depth on Windows.

    @zooba
    Copy link
    Member

    zooba commented Jul 4, 2018

    Thanks for writing that second PR, Serhiy! I'd already spotted that as the likely fix on our side, though I do want to complete the investigation into why MSVC generated unnecessarily sparse frames. But that is basically an internal discussion for me to follow up on at work.

    @zooba
    Copy link
    Member

    zooba commented Jul 4, 2018

    I added the backport to 3.7 label to your PR, as I want to restore the marshal stack depth limit in 3.7.1.

    @serhiy-storchaka
    Copy link
    Member

    Unfortunately PR 8071 doesn't fix the crash on PGO builds. :-( Initially I tested it only on the debug build, and now have tested on the PGO build.

    @serhiy-storchaka
    Copy link
    Member

    New changeset fc05e68 by Serhiy Storchaka in branch 'master':
    bpo-33720: Improve tests for the stack overflow in marshal.loads(). (GH-7336)
    fc05e68

    @miss-islington
    Copy link
    Contributor

    New changeset 3f121a4 by Miss Islington (bot) in branch '3.7':
    bpo-33720: Improve tests for the stack overflow in marshal.loads(). (GH-7336)
    3f121a4

    @miss-islington
    Copy link
    Contributor

    New changeset 878c4fe by Miss Islington (bot) in branch '3.6':
    bpo-33720: Improve tests for the stack overflow in marshal.loads(). (GH-7336)
    878c4fe

    @serhiy-storchaka
    Copy link
    Member

    New changeset 9720f60 by Serhiy Storchaka in branch '2.7':
    [2.7] bpo-33720: Improve tests for the stack overflow in marshal.loads(). (GH-7336). (GH-8107)
    9720f60

    @serhiy-storchaka
    Copy link
    Member

    New changeset c573499 by Serhiy Storchaka in branch 'master':
    bpo-33720: Refactor marshalling/unmarshalling floats. (GH-8071)
    c573499

    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life 3.8 only security fixes OS-windows tests Tests in the Lib/test dir type-crash A hard crash of the interpreter, possibly with a core dump
    Projects
    None yet
    Development

    No branches or pull requests

    5 participants