msg281573 - (view) |
Author: Martin Richard (martius) * |
Date: 2016-11-23 18:03 |
Hi,
I stumbled upon a SEGFAULT while trying Python 3.6.0 on a project using
asyncio. I can't really figure out what's happening, so I reduced the original
code triggering the bug down to a reproducible case (which looks a bit clunky,
sorry). The case has been tested on two Linux systems (Archlinux and Debian),
and with several versions of Python.
The bug appears between 3.6.0a4 (most recent version tested not affected) and
3.60b1 (so before the C asyncio module I believe), and is not fixed in the
current repository tip (changeset: 105345:3addf93f4111).
I also produced a traceback using gdb (see bellow).
The segfault happens around the "await" in the body of Cursor._read_data(),
interestingly, if I change anything in the body of the method, the SEGFAULT
disappears and the code works as expected. Also, it seems that calling it from
an other coroutine (main() in the example) is required to trigger the bug.
Cheers,
Martin
Case (also attached as test.py) :
import asyncio
loop = asyncio.get_event_loop()
class Connection:
def read_until(self, *args, **kwargs):
return self
async def all(self):
return b"\n"
class Cursor:
def __init__(self):
self._connection = Connection()
self._max_bytes = 100
self._data = bytearray()
async def _read_data(self):
# XXX segfault there, if I change anything in the code, it works...
while True:
data = await self._connection.read_until(
b'\n', max_bytes=self._max_bytes).all()
self._max_bytes -= len(data)
if data == b'\n':
break
self._data.extend(data)
async def main():
await Cursor()._read_data()
loop.run_until_complete(main())
Traceback extract (with Python3.6.0b4, --with-pydebug on Linux):
Program received signal SIGSEGV, Segmentation fault.
0x000000000046d177 in _PyGen_yf (gen=gen@entry=0x7ffff34bdaf8) at Objects/genobject.c:361
361 Py_INCREF(yf);
(gdb) bt
#0 0x000000000046d177 in _PyGen_yf (gen=gen@entry=0x7ffff34bdaf8) at Objects/genobject.c:361
#1 0x000000000052f49c in _PyEval_EvalFrameDefault (f=0x7ffff67067d8, throwflag=<optimized out>)
at Python/ceval.c:1992
#2 0x000000000052a0fc in PyEval_EvalFrameEx (f=f@entry=0x7ffff67067d8, throwflag=throwflag@entry=0)
at Python/ceval.c:718
#3 0x000000000046d393 in gen_send_ex (gen=gen@entry=0x7ffff34bdc08, arg=<optimized out>,
exc=exc@entry=0, closing=closing@entry=0) at Objects/genobject.c:189
#4 0x000000000046de8d in _PyGen_Send (gen=gen@entry=0x7ffff34bdc08, arg=<optimized out>)
at Objects/genobject.c:308
#5 0x00007ffff384ba2c in task_step_impl (task=task@entry=0x7ffff3263bd8, exc=exc@entry=0x0)
at (...)/Python-3.6.0b4/Modules/_asynciomodule.c:1963
#6 0x00007ffff384c72e in task_step (task=0x7ffff3263bd8, exc=0x0)
at (...)/Python-3.6.0b4/Modules/_asynciomodule.c:2247
#7 0x00007ffff384ca79 in task_call_step (arg=<optimized out>, task=<optimized out>)
at (...)/Python-3.6.0b4/Modules/_asynciomodule.c:1848
#8 TaskSendMethWrapper_call (o=<optimized out>, args=<optimized out>, kwds=<optimized out>)
at (...)/Python-3.6.0b4/Modules/_asynciomodule.c:1167
#9 0x0000000000446702 in PyObject_Call (func=0x7ffff37d7f60, args=0x7ffff7fb8058, kwargs=0x0)
at Objects/abstract.c:2246
#10 0x00000000005295c8 in do_call_core (func=func@entry=0x7ffff37d7f60,
callargs=callargs@entry=0x7ffff7fb8058, kwdict=kwdict@entry=0x0) at Python/ceval.c:5054
#11 0x0000000000534c64 in _PyEval_EvalFrameDefault (f=0xb4cb48, throwflag=<optimized out>)
at Python/ceval.c:3354
#12 0x000000000052a0fc in PyEval_EvalFrameEx (f=f@entry=0xb4cb48, throwflag=throwflag@entry=0)
at Python/ceval.c:718
#13 0x000000000052a1cc in _PyFunction_FastCall (co=<optimized out>, args=0xb4c5b0, nargs=nargs@entry=1,
globals=<optimized out>) at Python/ceval.c:4867
(...)
|
msg281575 - (view) |
Author: R. David Murray (r.david.murray) *  |
Date: 2016-11-23 18:10 |
I'm marking this as a release blocker since it is a (new) segfault.
|
msg281582 - (view) |
Author: Ned Deily (ned.deily) *  |
Date: 2016-11-23 19:25 |
FWIW, hg bisect finds:
changeset 103444:a77756e480c2: bad
The first bad revision is:
changeset: 103444:a77756e480c2
parent: 103442:914a81781291
user: Victor Stinner <victor.stinner@gmail.com>
date: Fri Sep 09 10:17:08 2016 -0700
summary: Rework CALL_FUNCTION* opcodes
Also, FWIW, the successful runs prior to this revision report:
base_events.py:481: ResourceWarning: unclosed event loop <_UnixSelectorEventLoop running=False closed=False debug=False>
|
msg281587 - (view) |
Author: Yury Selivanov (yselivanov) *  |
Date: 2016-11-23 21:40 |
Wow, Martin, this is a very interesting one. Thanks so much for tracking this down and reducing the test case.
So far this doesn't seem to be related to async/await: the interpreter stack seems to enter some strange state under some conditions, and async/await just happens to be the thing that explodes.
To those trying to debug this: you don't need asyncio, just replace `loop.run_until_complete(main())` with `main().send(None)`.
|
msg281591 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2016-11-23 23:11 |
Oh wow, the bug is tricky.
_PyGen_yf() checks if the next instruction is YIELD_FROM using code[f_lasti+2]. The problem is that WORDCODE kept f_lasti == -1 for a frame not executed yet: f_lasti+2 is 1 in this case.
Problem: code[1] is not an operation code, but the argment. Python 3.6 bytecode now uses the "wordcode" format: 16-bit units of (operation, argument).
The obvious and simple fix is to add a special case for f_lasti==-1 in _PyGen_yf(): see attached patch.
pygen_yf.patch fixes the crash.
--
Another much larger change would be to change f_lasti to -2... In the rewiew of the huge WORDCODE patch, if I recall correctly, I asked Demur to use -1 for backward compatibility. Or maybe I asked to kept the backward compatibility at the Python level using a getter converting -2 to -1... I don't recall correctly.
See http://bugs.python.org/issue26647 for wordcode.
|
msg281592 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2016-11-23 23:12 |
It would be nice if Demur Rumed and/or Serhiy Storchaka can review pygen_yf.patch since Demur wrote the wordcode patch and Serhiy reviewed the wordcode patch.
|
msg281593 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2016-11-23 23:57 |
> Another much larger change would be to change f_lasti to -2...
Attached lasti.patch implements this idea. I consider that it makes the C code simpler because getting the next instruction (f_lasti + 2) doesn't require a special case anymore.
My patch keeps f_lasti == -1 at the Python level for backward compatibility.
lasti.patch is only a backward incompatible change at the C level.
--
Between pygen_yf.patch and lasti.patch, I prefer lasti.patch even if 3.6 is at its last beta version before the final version. I prefer to fix the C API. Later it will be much harder to fix it.
--
I read again the wordcode issue #26647:
I wrote on the review of wpy7.patch: "The overall change LGTM, but I'm no more 100% sure that starting f_lasti=-1 is safe."
http://bugs.python.org/review/26647/#msg17
I wrote: "IMHO it's ok to break the C API, but I would prefer to keep the backward compatibility for the Python API (replace any negative number with -1 for the Python API)."
http://bugs.python.org/issue26647#msg262758
Serhiy: "I think we should make yet few related changes: (...) * Change f_lasti, tb_lasti etc to count code units instead of bytes."
http://bugs.python.org/issue26647#msg262758
|
msg281597 - (view) |
Author: Philip Dubé (Demur Rumed) * |
Date: 2016-11-24 02:02 |
I had considered this, for some reason I didn't realize code[1] could be equal to YIELD_FROM, felt it'd always be false
f_lasti being -2 has always been my preference, lasti.patch lgtm
|
msg281598 - (view) |
Author: Yury Selivanov (yselivanov) *  |
Date: 2016-11-24 02:06 |
> f_lasti being -2 has always been my preference, lasti.patch lgtm
How about we commit pygen_yf.patch to 3.6 and whatever else in 3.7?
lasti.patch is too invasive for 3.6 at this point, -1 for it in 3.6.
|
msg281608 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2016-11-24 06:08 |
I prefer pygen_yf.patch (with addressing Yury's suggestions). It is much simpler, and in any case I want to change f_lasti to count words rather than bytes (issue27129). It would be again -1 at the start.
|
msg281632 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2016-11-24 14:25 |
Updated patch. I added a NEWS entry and the two assertions from lasti.patch. These assertions are basic sanity checks, not directly related to this issue, they shouldn't harm. A started frame must never go back to the not-started state (f_lasti < 0).
Yury: "How about we commit pygen_yf.patch to 3.6 and whatever else in 3.7?"
Wordcode is already a massive change in our bytecode, I would prefer to not touch f_lasti just as a late update of WORDCODE in 3.7.
I'm ok to keep f_lasti == -1 for frames not started yet.
Serhiy: "I prefer pygen_yf.patch (with addressing Yury's suggestions)."
Ok, fine. I'm also more confident in smaller changes when we are very close to the final release! So let's forget lasti.patch ;-) (Sorry Demur!)
--
FYI attached test.py reproduces the bug because Cursor._read_data() starts with the instruction "SETUP_LOOP 72" and 72 is the code of the operation YIELD_FROM :-)
To reproduce the bug, you must have a code object where the second byte is 72. It's not easy to control the generated bytecode from the Python code, so I decided to not write an unit test for this bug.
|
msg281635 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2016-11-24 14:52 |
pygen_yf-2.patch LGTM.
Agreed about tests.
|
msg281644 - (view) |
Author: Yury Selivanov (yselivanov) *  |
Date: 2016-11-24 18:52 |
LGTM
|
msg281657 - (view) |
Author: Roundup Robot (python-dev)  |
Date: 2016-11-24 21:35 |
New changeset 303cedfb9e7a by Victor Stinner in branch '3.6':
Fix _PyGen_yf()
https://hg.python.org/cpython/rev/303cedfb9e7a
|
msg281660 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2016-11-24 21:53 |
Thanks Martin Richard to continue to find (and report!) crazy corner cases ;-)
|
msg281694 - (view) |
Author: Martin Richard (martius) * |
Date: 2016-11-25 10:28 |
Thank you all for fixing this so quickly, it's been done amazingly fast!
|
msg281695 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2016-11-25 10:31 |
Martin Richard: "Thank you all for fixing this so quickly, it's been done amazingly fast!"
Even if I didn't write the commit a77756e480c2 which introduced the regression, I pushed it and so I felt responsible. It's a good motivation to fix it quickly. So nobody notice :-D
|
msg281936 - (view) |
Author: Ned Deily (ned.deily) *  |
Date: 2016-11-29 02:56 |
(Victor also merged this into default branch for 3.7.0 in 6453ff3328b8)
|
|
Date |
User |
Action |
Args |
2022-04-11 14:58:39 | admin | set | github: 72968 |
2017-03-31 16:36:24 | dstufft | set | pull_requests:
+ pull_request978 |
2016-11-29 02:56:36 | ned.deily | set | priority: release blocker ->
messages:
+ msg281936 stage: commit review -> resolved |
2016-11-25 10:31:49 | vstinner | set | messages:
+ msg281695 |
2016-11-25 10:28:56 | martius | set | messages:
+ msg281694 |
2016-11-24 21:53:41 | vstinner | set | status: open -> closed resolution: fixed messages:
+ msg281660
|
2016-11-24 21:35:18 | python-dev | set | nosy:
+ python-dev messages:
+ msg281657
|
2016-11-24 18:52:37 | yselivanov | set | messages:
+ msg281644 |
2016-11-24 14:52:48 | serhiy.storchaka | set | assignee: vstinner messages:
+ msg281635 stage: commit review |
2016-11-24 14:25:29 | vstinner | set | files:
+ pygen_yf-2.patch
messages:
+ msg281632 |
2016-11-24 06:08:52 | serhiy.storchaka | set | messages:
+ msg281608 |
2016-11-24 02:06:27 | yselivanov | set | messages:
+ msg281598 |
2016-11-24 02:02:48 | Demur Rumed | set | messages:
+ msg281597 |
2016-11-23 23:57:51 | vstinner | set | files:
+ lasti.patch
messages:
+ msg281593 |
2016-11-23 23:12:34 | vstinner | set | nosy:
+ Demur Rumed messages:
+ msg281592
|
2016-11-23 23:11:37 | vstinner | set | files:
+ pygen_yf.patch keywords:
+ patch |
2016-11-23 23:11:25 | vstinner | set | messages:
+ msg281591 |
2016-11-23 21:42:46 | yselivanov | set | nosy:
+ serhiy.storchaka
|
2016-11-23 21:40:08 | yselivanov | set | messages:
+ msg281587 |
2016-11-23 19:25:30 | ned.deily | set | nosy:
+ vstinner messages:
+ msg281582
|
2016-11-23 18:10:14 | r.david.murray | set | priority: normal -> release blocker nosy:
+ r.david.murray, ned.deily messages:
+ msg281575
|
2016-11-23 18:03:27 | martius | create | |