classification
Title: test_marshal: crash in Python 3.7b5 on Windows 10
Type: crash Stage: resolved
Components: Tests, Windows Versions: Python 3.8, Python 3.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: steve.dower Nosy List: miss-islington, ned.deily, paul.moore, serhiy.storchaka, steve.dower, tim.golden, vstinner, zach.ware
Priority: high Keywords: patch

Created on 2018-05-31 15:13 by vstinner, last changed 2018-07-24 08:07 by serhiy.storchaka. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 7336 merged serhiy.storchaka, 2018-06-02 09:54
PR 7401 merged steve.dower, 2018-06-04 16:22
PR 7405 merged miss-islington, 2018-06-04 20:25
PR 8071 merged serhiy.storchaka, 2018-07-03 20:29
PR 8104 merged miss-islington, 2018-07-05 08:18
PR 8105 merged miss-islington, 2018-07-05 08:20
PR 8107 merged serhiy.storchaka, 2018-07-05 08:50
Messages (29)
msg318323 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-05-31 15:13
Follow-up of bpo-33719.

C:\Users\vstinner\AppData\Local\Programs\Python\Python37>python.exe -m test test_marshal -v
== CPython 3.7.0b5 (v3.7.0b5:abb8802389, May 31 2018, 01:54:01) [MSC v.1913 64 bit (AMD64)]
== Windows-10-10.0.16299-SP0 little-endian
== cwd: C:\Users\vstinner\AppData\Local\Temp\test_python_3836
== CPU count: 2
== encodings: locale=cp1252, FS=utf-8
Run tests sequentially
0:00:00 [1/1] test_marshal
(...)
test_loads_2x_code (test.test_marshal.BugsTestCase) ... Windows fatal exception: stack overflow

Current thread 0x000003a0 (most recent call first):
  File "C:\Users\vstinner\AppData\Local\Programs\Python\Python37\lib\unittest\case.py", line 178 in handle
  File "C:\Users\vstinner\AppData\Local\Programs\Python\Python37\lib\unittest\case.py", line 743 in assertRaises
  File "C:\Users\vstinner\AppData\Local\Programs\Python\Python37\lib\test\test_marshal.py", line 215 in test_loads_2x_code
  (...)

Crashes in test_marshal is on old topic:

* bpo-1050
* bpo-2286
* bpo-25264
* bpo-22734
* bpo-27019

Current stack size: 2 million bytes (1.9 MiB)

PCbuild/python.vcxproj:      <StackReserveSize>2000000</StackReserveSize>
PCbuild/pythonw.vcxproj:      <StackReserveSize>2000000</StackReserveSize>
msg318325 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-05-31 15:17
I compiled the master branch of Python in release mode using VS2015 (MSC v.1912 64 bit) and I failed to reproduce the crash:

* PCbuild/build.bat -e -p x64
* python -m test -v test_marshal
* no crash
msg318326 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-05-31 15:26
> I compiled the master branch of Python in release mode using VS2015 (MSC v.1912 64 bit) and I failed to reproduce the crash

I also failed to reproduce the crash in the 3.7 branch.

I guess that the python.org binary has been compiled differently.
msg318327 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2018-05-31 15:36
The uploaded binary is compiled with PGO enabled (and trained on most of the test suite). I'll check it out - hopefully we don't need to do anything drastic and can get away with either a compiler update or disabling optimizations on a single function.
msg318328 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2018-05-31 15:36
Ned, FYI
msg318329 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-05-31 15:39
> priority: normal -> release blocker

I don't think that it's a release blocker. test_marshal does only crash on corner cases which should not occur on usual "valid" data.
msg318330 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2018-05-31 15:48
A crash in the test suite should be fixed, especially since we have protection against this crash (and a test that validates it).

In this case, apparently the stack allocation for each frame of r_object grew and now there isn't room for 2000 calls (the value of MAX_MARSHAL_STACK_DEPTH). This isn't really a robust way of handling it anyway, so I'll check out whether there's an easy way to safely probe the stack before recursing, otherwise we'll just have to cut the number a bit further.
msg318331 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2018-05-31 16:13
I need to stop working on this right now, but here's the locals layout in a normal release build in r_object:

@rdi              @rdi                                p = 0x00000034`655ea3d0
00000034`65403f60 @rsp+0x0080                         v = 0x00000000`00000000
00000034`65403fc0 @rsp+0x00e0                       buf = char [256] ""
00000034`654040c0 @rsp+0x01e0                       buf = char [256] ""

In the PGO build, it looks like this:
000000be`1e003b50 @rsp+0x0080                         v = 0x00000000`00000000
000000be`1e003b58 @rsp+0x0088               is_interned = 0n0
000000be`1e003ef0 @rsp+0x0420                       buf = char [256] ""
000000be`1e003ff0 @rsp+0x0520                       buf = char [256] ""

I need to ping the team and figure out why the buffers are so far removed from the rest of the stack, and figure out what's in the gap. That seems to be the core of the problem.
msg318442 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2018-06-01 17:46
FYI, I posted the problem at https://developercommunity.visualstudio.com/content/problem/265778/pgo-generates-large-stack-frame.html (I _think_ it's public).

Until then, it's probably best to reduce the recursion limit. Going to a depth of 1000 seems okay, and it certainly seems less bad than disabling optimisations.
msg318497 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-06-02 10:52
As for the test itself, the original test was added in dc78cc6f7cea2040114a350aa9db835cc433ae05. It tested that the stack overflow is not happen when unmarshal a fake code object with a deeply nested dict instead of co_code. It was renamed to test_loads_recursion in cf0fab2686799b562c9c6f6255557e36f9af095e. In issue16475 (d7009c69136a3809282804f460902ab42e9972f6) it was updated to mirror the changed structure of the code object in Python 3, but the original test was kept with the name test_loads_2x_code. Actually both tests work the same, because the value of few first bytes doesn't affect the result of these tests.

Yet this test is fragile, it depends on the unstable structure of the code object, and it serves its purpose only because input data checking is not very strong in the current implementation. Future changes in the marshal module can make this test invalid, and this will not be noticed.

In PR 7336 this test is rewritten in more reliable way.

This likely is not directly related to the crash.
msg318689 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2018-06-04 19:24
Ned - PR 7401 fixes the crash at the cost of reducing the permitted recursion depth in marshal.

Your call whether we take this fix or disable PGO, but I'd much rather take this fix (PGO has full-interpreter effects whereas this is very narrow). When I hear back about the compiler bug - if it is one - we should be able to raise the limit again without causing any compatibility issues.
msg318690 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2018-06-04 19:28
Steve, I'm OK with the proposed reduction as a workaround.  We have other platform- and build-specific limits based on stack sizes etc.  It's hard to have one-size-fits-all.
msg318693 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2018-06-04 20:25
New changeset 2a4a62ba4ae770bbc7b7fdec0760031c83fe1f7b by Steve Dower in branch 'master':
bpo-33720: Reduces maximum marshal recursion depth on release builds. (GH-7401)
https://github.com/python/cpython/commit/2a4a62ba4ae770bbc7b7fdec0760031c83fe1f7b
msg318695 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2018-06-04 20:26
Thanks. I merged, and miss-islington is backporting it. I guess at this stage you're manually cherry-picking into your own 3.7 branch for release? I'll merge for 3.7.1 though.
msg318696 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2018-06-04 20:28
> I guess at this stage you're manually cherry-picking into your own 3.7 branch for release?

No, not until 3.7.0rc1 next week.  No need to cherry pick.  Thanks!
msg318698 - (view) Author: miss-islington (miss-islington) Date: 2018-06-04 20:41
New changeset 103058e19bea48f4a80afde51df04338a753e952 by Miss Islington (bot) in branch '3.7':
bpo-33720: Reduces maximum marshal recursion depth on release builds. (GH-7401)
https://github.com/python/cpython/commit/103058e19bea48f4a80afde51df04338a753e952
msg318714 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2018-06-04 22:04
With the workaround merged for 3.7.0rc1, I'm downgrading from "release blocker" to "high" since I believe Steve wants to keep this issue open for a possible compiler resolution.
msg318771 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2018-06-05 16:59
Thanks. Agreed this should stay open, but it is blocked on a compiler fix.
msg320053 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-06-20 10:18
> Thanks. Agreed this should stay open, but it is blocked on a compiler fix.

Do you want to have a different limit depending on the compiler and the compiler version? I'm not sure that it's worth it. I'm happy with the heuristic "use lower limit on Windows". By design, the test is fragile.

I would suggest to close the issue, but you can keep it open if you prefer to increase the limit on a fixed compiler.
msg320221 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-06-22 10:29
What are your thoughts about rewriting the test (PR 7336)?
msg321012 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-07-04 06:47
The alternate solution is provided by PR 8071. The culprit of the main stack consumption is the code for unmarshalling float and complex in protocols 0 and 1 which uses 256-bytes buffers on the stack. Moving it to the separate function allows to decrease the stack consumption and increase the maximum marshal recursion depth on Windows.
msg321054 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2018-07-04 15:45
Thanks for writing that second PR, Serhiy! I'd already spotted that as the likely fix on our side, though I do want to complete the investigation into why MSVC generated unnecessarily sparse frames. But that is basically an internal discussion for me to follow up on at work.
msg321055 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2018-07-04 15:45
I added the backport to 3.7 label to your PR, as I want to restore the marshal stack depth limit in 3.7.1.
msg321059 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-07-04 17:17
Unfortunately PR 8071 doesn't fix the crash on PGO builds. :-(  Initially I tested it only on the debug build, and now have tested on the PGO build.
msg321084 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-07-05 08:17
New changeset fc05e68d8fac70349b7ea17ec14e7e0cfa956121 by Serhiy Storchaka in branch 'master':
bpo-33720: Improve tests for the stack overflow in marshal.loads(). (GH-7336)
https://github.com/python/cpython/commit/fc05e68d8fac70349b7ea17ec14e7e0cfa956121
msg321087 - (view) Author: miss-islington (miss-islington) Date: 2018-07-05 08:45
New changeset 3f121a40c7526aedb2a40a1be8f0ae96b267bf26 by Miss Islington (bot) in branch '3.7':
bpo-33720: Improve tests for the stack overflow in marshal.loads(). (GH-7336)
https://github.com/python/cpython/commit/3f121a40c7526aedb2a40a1be8f0ae96b267bf26
msg321089 - (view) Author: miss-islington (miss-islington) Date: 2018-07-05 08:48
New changeset 878c4fe40e8954372e9bf80ae555045ecae68e73 by Miss Islington (bot) in branch '3.6':
bpo-33720: Improve tests for the stack overflow in marshal.loads(). (GH-7336)
https://github.com/python/cpython/commit/878c4fe40e8954372e9bf80ae555045ecae68e73
msg321092 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-07-05 09:20
New changeset 9720f60f2aba457121bfe42d09aa3ed91f28b86f by Serhiy Storchaka in branch '2.7':
[2.7] bpo-33720: Improve tests for the stack overflow in marshal.loads(). (GH-7336). (GH-8107)
https://github.com/python/cpython/commit/9720f60f2aba457121bfe42d09aa3ed91f28b86f
msg322279 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-07-24 07:55
New changeset c5734998d91e9953fd179ba6ed7015b6343e8191 by Serhiy Storchaka in branch 'master':
bpo-33720: Refactor marshalling/unmarshalling floats. (GH-8071)
https://github.com/python/cpython/commit/c5734998d91e9953fd179ba6ed7015b6343e8191
History
Date User Action Args
2018-07-24 08:07:53serhiy.storchakasetstatus: open -> closed
resolution: fixed
stage: patch review -> resolved
2018-07-24 07:55:52serhiy.storchakasetmessages: + msg322279
2018-07-05 09:20:21serhiy.storchakasetmessages: + msg321092
2018-07-05 08:50:16serhiy.storchakasetpull_requests: + pull_request7701
2018-07-05 08:48:49miss-islingtonsetmessages: + msg321089
2018-07-05 08:45:26miss-islingtonsetmessages: + msg321087
2018-07-05 08:20:31miss-islingtonsetpull_requests: + pull_request7700
2018-07-05 08:18:32miss-islingtonsetpull_requests: + pull_request7699
2018-07-05 08:17:29serhiy.storchakasetmessages: + msg321084
2018-07-04 17:17:17serhiy.storchakasetmessages: + msg321059
2018-07-04 15:45:58steve.dowersetmessages: + msg321055
2018-07-04 15:45:14steve.dowersetmessages: + msg321054
2018-07-04 06:47:06serhiy.storchakasetmessages: + msg321012
2018-07-03 20:29:02serhiy.storchakasetpull_requests: + pull_request7681
2018-06-22 10:29:24serhiy.storchakasetmessages: + msg320221
2018-06-20 10:18:25vstinnersetmessages: + msg320053
2018-06-05 16:59:36steve.dowersetmessages: + msg318771
2018-06-04 22:04:14ned.deilysetpriority: release blocker -> high

messages: + msg318714
2018-06-04 20:41:51miss-islingtonsetnosy: + miss-islington
messages: + msg318698
2018-06-04 20:28:17ned.deilysetmessages: + msg318696
2018-06-04 20:26:25steve.dowersetmessages: + msg318695
2018-06-04 20:25:27miss-islingtonsetpull_requests: + pull_request7031
2018-06-04 20:25:03steve.dowersetmessages: + msg318693
2018-06-04 19:28:37ned.deilysetmessages: + msg318690
2018-06-04 19:24:19steve.dowersetmessages: + msg318689
2018-06-04 16:22:24steve.dowersetpull_requests: + pull_request7027
2018-06-02 10:52:15serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg318497
2018-06-02 09:54:47serhiy.storchakasetkeywords: + patch
stage: patch review
pull_requests: + pull_request6967
2018-06-01 17:46:37steve.dowersetmessages: + msg318442
2018-05-31 16:13:29steve.dowersetmessages: + msg318331
2018-05-31 15:48:12steve.dowersetmessages: + msg318330
2018-05-31 15:39:24vstinnersetmessages: + msg318329
2018-05-31 15:36:53steve.dowersetpriority: normal -> release blocker
versions: + Python 3.8
nosy: + ned.deily

messages: + msg318328
2018-05-31 15:36:27steve.dowersetassignee: steve.dower
messages: + msg318327
2018-05-31 15:26:26vstinnersetmessages: + msg318326
2018-05-31 15:17:52vstinnersetversions: + Python 3.7, - Python 3.8
2018-05-31 15:17:44vstinnersetmessages: + msg318325
2018-05-31 15:14:15vstinnersettype: crash
2018-05-31 15:13:35vstinnercreate