classification
Title: SEGFAULT when running a given coroutine
Type: crash Stage: resolved
Components: Interpreter Core Versions: Python 3.7, Python 3.6
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: vstinner Nosy List: Demur Rumed, martius, ned.deily, python-dev, r.david.murray, serhiy.storchaka, vstinner, yselivanov
Priority: Keywords: patch

Created on 2016-11-23 18:03 by martius, last changed 2017-03-31 16:36 by dstufft. This issue is now closed.

Files
File name Uploaded Description Edit
test.py martius, 2016-11-23 18:03 reduced test case
pygen_yf.patch vstinner, 2016-11-23 23:11 review
lasti.patch vstinner, 2016-11-23 23:57 review
pygen_yf-2.patch vstinner, 2016-11-24 14:25 review
Pull Requests
URL Status Linked Edit
PR 552 closed dstufft, 2017-03-31 16:36
Messages (18)
msg281573 - (view) Author: Martin Richard (martius) * Date: 2016-11-23 18:03
Hi,

I stumbled upon a SEGFAULT while trying Python 3.6.0 on a project using
asyncio. I can't really figure out what's happening, so I reduced the original
code triggering the bug down to a reproducible case (which looks a bit clunky,
sorry). The case has been tested on two Linux systems (Archlinux and Debian),
and with several versions of Python.

The bug appears between 3.6.0a4 (most recent version tested not affected) and
3.60b1 (so before the C asyncio module I believe), and is not fixed in the
current repository tip (changeset: 105345:3addf93f4111).

I also produced a traceback using gdb (see bellow).

The segfault happens around the "await" in the body of Cursor._read_data(),
interestingly, if I change anything in the body of the method, the SEGFAULT
disappears and the code works as expected. Also, it seems that calling it from
an other coroutine (main() in the example) is required to trigger the bug.

Cheers,
Martin


Case (also attached as test.py) : 
import asyncio

loop = asyncio.get_event_loop()


class Connection:
    def read_until(self, *args, **kwargs):
        return self

    async def all(self):
        return b"\n"


class Cursor:
    def __init__(self):
        self._connection = Connection()
        self._max_bytes = 100
        self._data = bytearray()

    async def _read_data(self):
        # XXX segfault there, if I change anything in the code, it works...
        while True:
            data = await self._connection.read_until(
                b'\n', max_bytes=self._max_bytes).all()
            self._max_bytes -= len(data)
            if data == b'\n':
                break
            self._data.extend(data)


async def main():
    await Cursor()._read_data()


loop.run_until_complete(main())


Traceback extract (with Python3.6.0b4, --with-pydebug on Linux):

Program received signal SIGSEGV, Segmentation fault.
0x000000000046d177 in _PyGen_yf (gen=gen@entry=0x7ffff34bdaf8) at Objects/genobject.c:361
361	        Py_INCREF(yf);
(gdb) bt
#0  0x000000000046d177 in _PyGen_yf (gen=gen@entry=0x7ffff34bdaf8) at Objects/genobject.c:361
#1  0x000000000052f49c in _PyEval_EvalFrameDefault (f=0x7ffff67067d8, throwflag=<optimized out>)
    at Python/ceval.c:1992
#2  0x000000000052a0fc in PyEval_EvalFrameEx (f=f@entry=0x7ffff67067d8, throwflag=throwflag@entry=0)
    at Python/ceval.c:718
#3  0x000000000046d393 in gen_send_ex (gen=gen@entry=0x7ffff34bdc08, arg=<optimized out>, 
    exc=exc@entry=0, closing=closing@entry=0) at Objects/genobject.c:189
#4  0x000000000046de8d in _PyGen_Send (gen=gen@entry=0x7ffff34bdc08, arg=<optimized out>)
    at Objects/genobject.c:308
#5  0x00007ffff384ba2c in task_step_impl (task=task@entry=0x7ffff3263bd8, exc=exc@entry=0x0)
    at (...)/Python-3.6.0b4/Modules/_asynciomodule.c:1963
#6  0x00007ffff384c72e in task_step (task=0x7ffff3263bd8, exc=0x0)
    at (...)/Python-3.6.0b4/Modules/_asynciomodule.c:2247
#7  0x00007ffff384ca79 in task_call_step (arg=<optimized out>, task=<optimized out>)
    at (...)/Python-3.6.0b4/Modules/_asynciomodule.c:1848
#8  TaskSendMethWrapper_call (o=<optimized out>, args=<optimized out>, kwds=<optimized out>)
    at (...)/Python-3.6.0b4/Modules/_asynciomodule.c:1167
#9  0x0000000000446702 in PyObject_Call (func=0x7ffff37d7f60, args=0x7ffff7fb8058, kwargs=0x0)
    at Objects/abstract.c:2246
#10 0x00000000005295c8 in do_call_core (func=func@entry=0x7ffff37d7f60, 
    callargs=callargs@entry=0x7ffff7fb8058, kwdict=kwdict@entry=0x0) at Python/ceval.c:5054
#11 0x0000000000534c64 in _PyEval_EvalFrameDefault (f=0xb4cb48, throwflag=<optimized out>)
    at Python/ceval.c:3354
#12 0x000000000052a0fc in PyEval_EvalFrameEx (f=f@entry=0xb4cb48, throwflag=throwflag@entry=0)
    at Python/ceval.c:718
#13 0x000000000052a1cc in _PyFunction_FastCall (co=<optimized out>, args=0xb4c5b0, nargs=nargs@entry=1, 
    globals=<optimized out>) at Python/ceval.c:4867
(...)
msg281575 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2016-11-23 18:10
I'm marking this as a release blocker since it is a (new) segfault.
msg281582 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2016-11-23 19:25
FWIW, hg bisect finds:

changeset 103444:a77756e480c2: bad
The first bad revision is:
changeset:   103444:a77756e480c2
parent:      103442:914a81781291
user:        Victor Stinner <victor.stinner@gmail.com>
date:        Fri Sep 09 10:17:08 2016 -0700
summary:     Rework CALL_FUNCTION* opcodes

Also, FWIW, the successful runs prior to this revision report:

base_events.py:481: ResourceWarning: unclosed event loop <_UnixSelectorEventLoop running=False closed=False debug=False>
msg281587 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-11-23 21:40
Wow, Martin, this is a very interesting one.  Thanks so much for tracking this down and reducing the test case.

So far this doesn't seem to be related to async/await: the interpreter stack seems to enter some strange state under some conditions, and async/await just happens to be the thing that explodes.

To those trying to debug this: you don't need asyncio, just replace `loop.run_until_complete(main())` with `main().send(None)`.
msg281591 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-11-23 23:11
Oh wow, the bug is tricky.

_PyGen_yf() checks if the next instruction is YIELD_FROM using code[f_lasti+2]. The problem is that WORDCODE kept f_lasti == -1 for a frame not executed yet: f_lasti+2 is 1 in this case.

Problem: code[1] is not an operation code, but the argment. Python 3.6 bytecode now uses the "wordcode" format: 16-bit units of (operation, argument).

The obvious and simple fix is to add a special case for f_lasti==-1 in _PyGen_yf(): see attached patch.

pygen_yf.patch fixes the crash.

--

Another much larger change would be to change f_lasti to -2... In the rewiew of the huge WORDCODE patch, if I recall correctly, I asked Demur to use -1 for backward compatibility. Or maybe I asked to kept the backward compatibility at the Python level using a getter converting -2 to -1... I don't recall correctly.

See http://bugs.python.org/issue26647 for wordcode.
msg281592 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-11-23 23:12
It would be nice if Demur Rumed and/or Serhiy Storchaka can review  pygen_yf.patch since Demur wrote the wordcode patch and Serhiy reviewed the wordcode patch.
msg281593 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-11-23 23:57
> Another much larger change would be to change f_lasti to -2...

Attached lasti.patch implements this idea. I consider that it makes the C code simpler because getting the next instruction (f_lasti + 2) doesn't require a special case anymore.

My patch keeps f_lasti == -1 at the Python level for backward compatibility.

lasti.patch is only a backward incompatible change at the C level.

--

Between pygen_yf.patch and lasti.patch, I prefer lasti.patch even if 3.6 is at its last beta version before the final version. I prefer to fix the C API. Later it will be much harder to fix it.

--

I read again the wordcode issue #26647:

I wrote on the review of wpy7.patch: "The overall change LGTM, but I'm no more 100% sure that starting f_lasti=-1 is safe."
http://bugs.python.org/review/26647/#msg17

I wrote: "IMHO it's ok to break the C API, but I would prefer to keep the backward compatibility for the Python API (replace any negative number with -1 for the Python API)."
http://bugs.python.org/issue26647#msg262758

Serhiy: "I think we should make yet few related changes: (...) * Change f_lasti, tb_lasti etc to count code units instead of bytes."
http://bugs.python.org/issue26647#msg262758
msg281597 - (view) Author: Demur Rumed (Demur Rumed) * Date: 2016-11-24 02:02
I had considered this, for some reason I didn't realize code[1] could be equal to YIELD_FROM, felt it'd always be false

f_lasti being -2 has always been my preference, lasti.patch lgtm
msg281598 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-11-24 02:06
> f_lasti being -2 has always been my preference, lasti.patch lgtm

How about we commit pygen_yf.patch to 3.6 and whatever else in 3.7?  

lasti.patch is too invasive for 3.6 at this point, -1 for it in 3.6.
msg281608 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-11-24 06:08
I prefer pygen_yf.patch (with addressing Yury's suggestions). It is much simpler, and in any case I want to change f_lasti to count words rather than bytes (issue27129). It would be again -1 at the start.
msg281632 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-11-24 14:25
Updated patch. I added a NEWS entry and the two assertions from  lasti.patch. These assertions are basic sanity checks, not directly related to this issue, they shouldn't harm. A started frame must never go back to the not-started state (f_lasti < 0).

Yury: "How about we commit pygen_yf.patch to 3.6 and whatever else in 3.7?"

Wordcode is already a massive change in our bytecode, I would prefer to not touch f_lasti just as a late update of WORDCODE in 3.7.

I'm ok to keep f_lasti == -1 for frames not started yet.


Serhiy: "I prefer pygen_yf.patch (with addressing Yury's suggestions)."

Ok, fine. I'm also more confident in smaller changes when we are very close to the final release! So let's forget lasti.patch ;-) (Sorry Demur!)

--

FYI attached test.py reproduces the bug because Cursor._read_data() starts with the instruction  "SETUP_LOOP 72" and 72 is the code of the operation YIELD_FROM :-)

To reproduce the bug, you must have a code object where the second byte is 72. It's not easy to control the generated bytecode from the Python code, so I decided to not write an unit test for this bug.
msg281635 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-11-24 14:52
pygen_yf-2.patch LGTM.

Agreed about tests.
msg281644 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-11-24 18:52
LGTM
msg281657 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2016-11-24 21:35
New changeset 303cedfb9e7a by Victor Stinner in branch '3.6':
Fix _PyGen_yf()
https://hg.python.org/cpython/rev/303cedfb9e7a
msg281660 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-11-24 21:53
Thanks Martin Richard to continue to find (and report!) crazy corner cases ;-)
msg281694 - (view) Author: Martin Richard (martius) * Date: 2016-11-25 10:28
Thank you all for fixing this so quickly, it's been done amazingly fast!
msg281695 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-11-25 10:31
Martin Richard: "Thank you all for fixing this so quickly, it's been done amazingly fast!"

Even if I didn't write the commit a77756e480c2 which introduced the regression, I pushed it and so I felt responsible. It's a good motivation to fix it quickly. So nobody notice :-D
msg281936 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2016-11-29 02:56
(Victor also merged this into default branch for 3.7.0 in 6453ff3328b8)
History
Date User Action Args
2017-03-31 16:36:24dstufftsetpull_requests: + pull_request978
2016-11-29 02:56:36ned.deilysetpriority: release blocker ->

messages: + msg281936
stage: commit review -> resolved
2016-11-25 10:31:49vstinnersetmessages: + msg281695
2016-11-25 10:28:56martiussetmessages: + msg281694
2016-11-24 21:53:41vstinnersetstatus: open -> closed
resolution: fixed
messages: + msg281660
2016-11-24 21:35:18python-devsetnosy: + python-dev
messages: + msg281657
2016-11-24 18:52:37yselivanovsetmessages: + msg281644
2016-11-24 14:52:48serhiy.storchakasetassignee: vstinner
messages: + msg281635
stage: commit review
2016-11-24 14:25:29vstinnersetfiles: + pygen_yf-2.patch

messages: + msg281632
2016-11-24 06:08:52serhiy.storchakasetmessages: + msg281608
2016-11-24 02:06:27yselivanovsetmessages: + msg281598
2016-11-24 02:02:48Demur Rumedsetmessages: + msg281597
2016-11-23 23:57:51vstinnersetfiles: + lasti.patch

messages: + msg281593
2016-11-23 23:12:34vstinnersetnosy: + Demur Rumed
messages: + msg281592
2016-11-23 23:11:37vstinnersetfiles: + pygen_yf.patch
keywords: + patch
2016-11-23 23:11:25vstinnersetmessages: + msg281591
2016-11-23 21:42:46yselivanovsetnosy: + serhiy.storchaka
2016-11-23 21:40:08yselivanovsetmessages: + msg281587
2016-11-23 19:25:30ned.deilysetnosy: + vstinner
messages: + msg281582
2016-11-23 18:10:14r.david.murraysetpriority: normal -> release blocker
nosy: + r.david.murray, ned.deily
messages: + msg281575

2016-11-23 18:03:27martiuscreate