Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SEGFAULT when running a given coroutine #72968

Closed
Martiusweb opened this issue Nov 23, 2016 · 18 comments
Closed

SEGFAULT when running a given coroutine #72968

Martiusweb opened this issue Nov 23, 2016 · 18 comments
Assignees
Labels
3.7 (EOL) end of life interpreter-core (Objects, Python, Grammar, and Parser dirs) type-crash A hard crash of the interpreter, possibly with a core dump

Comments

@Martiusweb
Copy link
Member

BPO 28782
Nosy @vstinner, @ned-deily, @bitdancer, @serhiy-storchaka, @1st1, @Martiusweb, @serprex
PRs
  • [Do Not Merge] Convert Misc/NEWS so that it is managed by towncrier #552
  • Files
  • test.py: reduced test case
  • pygen_yf.patch
  • lasti.patch
  • pygen_yf-2.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/vstinner'
    closed_at = <Date 2016-11-24.21:53:41.147>
    created_at = <Date 2016-11-23.18:03:27.409>
    labels = ['interpreter-core', '3.7', 'type-crash']
    title = 'SEGFAULT when running a given coroutine'
    updated_at = <Date 2017-03-31.16:36:24.369>
    user = 'https://github.com/Martiusweb'

    bugs.python.org fields:

    activity = <Date 2017-03-31.16:36:24.369>
    actor = 'dstufft'
    assignee = 'vstinner'
    closed = True
    closed_date = <Date 2016-11-24.21:53:41.147>
    closer = 'vstinner'
    components = ['Interpreter Core']
    creation = <Date 2016-11-23.18:03:27.409>
    creator = 'martius'
    dependencies = []
    files = ['45612', '45614', '45615', '45625']
    hgrepos = []
    issue_num = 28782
    keywords = ['patch']
    message_count = 18.0
    messages = ['281573', '281575', '281582', '281587', '281591', '281592', '281593', '281597', '281598', '281608', '281632', '281635', '281644', '281657', '281660', '281694', '281695', '281936']
    nosy_count = 8.0
    nosy_names = ['vstinner', 'ned.deily', 'r.david.murray', 'python-dev', 'serhiy.storchaka', 'yselivanov', 'martius', 'Demur Rumed']
    pr_nums = ['552']
    priority = None
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'crash'
    url = 'https://bugs.python.org/issue28782'
    versions = ['Python 3.6', 'Python 3.7']

    @Martiusweb
    Copy link
    Member Author

    Hi,

    I stumbled upon a SEGFAULT while trying Python 3.6.0 on a project using
    asyncio. I can't really figure out what's happening, so I reduced the original
    code triggering the bug down to a reproducible case (which looks a bit clunky,
    sorry). The case has been tested on two Linux systems (Archlinux and Debian),
    and with several versions of Python.

    The bug appears between 3.6.0a4 (most recent version tested not affected) and
    3.60b1 (so before the C asyncio module I believe), and is not fixed in the
    current repository tip (changeset: 105345:3addf93f4111).

    I also produced a traceback using gdb (see bellow).

    The segfault happens around the "await" in the body of Cursor._read_data(),
    interestingly, if I change anything in the body of the method, the SEGFAULT
    disappears and the code works as expected. Also, it seems that calling it from
    an other coroutine (main() in the example) is required to trigger the bug.

    Cheers,
    Martin

    Case (also attached as test.py) :
    import asyncio

    loop = asyncio.get_event_loop()
    
    
    class Connection:
        def read_until(self, *args, **kwargs):
            return self
    
        async def all(self):
            return b"\n"
    
    
    class Cursor:
        def __init__(self):
            self._connection = Connection()
            self._max_bytes = 100
            self._data = bytearray()
    
        async def _read_data(self):
            # XXX segfault there, if I change anything in the code, it works...
            while True:
                data = await self._connection.read_until(
                    b'\n', max_bytes=self._max_bytes).all()
                self._max_bytes -= len(data)
                if data == b'\n':
                    break
                self._data.extend(data)
    
    
    async def main():
        await Cursor()._read_data()
    
    
    loop.run_until_complete(main())

    Traceback extract (with Python3.6.0b4, --with-pydebug on Linux):

    Program received signal SIGSEGV, Segmentation fault.
    0x000000000046d177 in _PyGen_yf (gen=gen@entry=0x7ffff34bdaf8) at Objects/genobject.c:361
    361 Py_INCREF(yf);
    (gdb) bt
    #0 0x000000000046d177 in _PyGen_yf (gen=gen@entry=0x7ffff34bdaf8) at Objects/genobject.c:361
    #1 0x000000000052f49c in _PyEval_EvalFrameDefault (f=0x7ffff67067d8, throwflag=<optimized out>)
    at Python/ceval.c:1992
    #2 0x000000000052a0fc in PyEval_EvalFrameEx (f=f@entry=0x7ffff67067d8, throwflag=throwflag@entry=0)
    at Python/ceval.c:718
    #3 0x000000000046d393 in gen_send_ex (gen=gen@entry=0x7ffff34bdc08, arg=<optimized out>,
    exc=exc@entry=0, closing=closing@entry=0) at Objects/genobject.c:189
    #4 0x000000000046de8d in _PyGen_Send (gen=gen@entry=0x7ffff34bdc08, arg=<optimized out>)
    at Objects/genobject.c:308
    #5 0x00007ffff384ba2c in task_step_impl (task=task@entry=0x7ffff3263bd8, exc=exc@entry=0x0)
    at (...)/Python-3.6.0b4/Modules/_asynciomodule.c:1963
    #6 0x00007ffff384c72e in task_step (task=0x7ffff3263bd8, exc=0x0)
    at (...)/Python-3.6.0b4/Modules/_asynciomodule.c:2247
    #7 0x00007ffff384ca79 in task_call_step (arg=<optimized out>, task=<optimized out>)
    at (...)/Python-3.6.0b4/Modules/_asynciomodule.c:1848
    #8 TaskSendMethWrapper_call (o=<optimized out>, args=<optimized out>, kwds=<optimized out>)
    at (...)/Python-3.6.0b4/Modules/_asynciomodule.c:1167
    #9 0x0000000000446702 in PyObject_Call (func=0x7ffff37d7f60, args=0x7ffff7fb8058, kwargs=0x0)
    at Objects/abstract.c:2246
    #10 0x00000000005295c8 in do_call_core (func=func@entry=0x7ffff37d7f60,
    callargs=callargs@entry=0x7ffff7fb8058, kwdict=kwdict@entry=0x0) at Python/ceval.c:5054
    #11 0x0000000000534c64 in _PyEval_EvalFrameDefault (f=0xb4cb48, throwflag=<optimized out>)
    at Python/ceval.c:3354
    #12 0x000000000052a0fc in PyEval_EvalFrameEx (f=f@entry=0xb4cb48, throwflag=throwflag@entry=0)
    at Python/ceval.c:718
    #13 0x000000000052a1cc in _PyFunction_FastCall (co=<optimized out>, args=0xb4c5b0, nargs=nargs@entry=1,
    globals=<optimized out>) at Python/ceval.c:4867
    (...)

    @Martiusweb Martiusweb added 3.7 (EOL) end of life interpreter-core (Objects, Python, Grammar, and Parser dirs) type-crash A hard crash of the interpreter, possibly with a core dump labels Nov 23, 2016
    @bitdancer
    Copy link
    Member

    I'm marking this as a release blocker since it is a (new) segfault.

    @ned-deily
    Copy link
    Member

    FWIW, hg bisect finds:

    changeset 103444:a77756e480c2: bad
    The first bad revision is:
    changeset: 103444:a77756e480c2
    parent: 103442:914a81781291
    user: Victor Stinner <victor.stinner@gmail.com>
    date: Fri Sep 09 10:17:08 2016 -0700
    summary: Rework CALL_FUNCTION* opcodes

    Also, FWIW, the successful runs prior to this revision report:

    base_events.py:481: ResourceWarning: unclosed event loop <_UnixSelectorEventLoop running=False closed=False debug=False>

    @1st1
    Copy link
    Member

    1st1 commented Nov 23, 2016

    Wow, Martin, this is a very interesting one. Thanks so much for tracking this down and reducing the test case.

    So far this doesn't seem to be related to async/await: the interpreter stack seems to enter some strange state under some conditions, and async/await just happens to be the thing that explodes.

    To those trying to debug this: you don't need asyncio, just replace loop.run_until_complete(main()) with main().send(None).

    @vstinner
    Copy link
    Member

    Oh wow, the bug is tricky.

    _PyGen_yf() checks if the next instruction is YIELD_FROM using code[f_lasti+2]. The problem is that WORDCODE kept f_lasti == -1 for a frame not executed yet: f_lasti+2 is 1 in this case.

    Problem: code[1] is not an operation code, but the argment. Python 3.6 bytecode now uses the "wordcode" format: 16-bit units of (operation, argument).

    The obvious and simple fix is to add a special case for f_lasti==-1 in _PyGen_yf(): see attached patch.

    pygen_yf.patch fixes the crash.

    --

    Another much larger change would be to change f_lasti to -2... In the rewiew of the huge WORDCODE patch, if I recall correctly, I asked Demur to use -1 for backward compatibility. Or maybe I asked to kept the backward compatibility at the Python level using a getter converting -2 to -1... I don't recall correctly.

    See http://bugs.python.org/issue26647 for wordcode.

    @vstinner
    Copy link
    Member

    It would be nice if Demur Rumed and/or Serhiy Storchaka can review pygen_yf.patch since Demur wrote the wordcode patch and Serhiy reviewed the wordcode patch.

    @vstinner
    Copy link
    Member

    Another much larger change would be to change f_lasti to -2...

    Attached lasti.patch implements this idea. I consider that it makes the C code simpler because getting the next instruction (f_lasti + 2) doesn't require a special case anymore.

    My patch keeps f_lasti == -1 at the Python level for backward compatibility.

    lasti.patch is only a backward incompatible change at the C level.

    --

    Between pygen_yf.patch and lasti.patch, I prefer lasti.patch even if 3.6 is at its last beta version before the final version. I prefer to fix the C API. Later it will be much harder to fix it.

    --

    I read again the wordcode issue bpo-26647:

    I wrote on the review of wpy7.patch: "The overall change LGTM, but I'm no more 100% sure that starting f_lasti=-1 is safe."
    http://bugs.python.org/review/26647/#msg17

    I wrote: "IMHO it's ok to break the C API, but I would prefer to keep the backward compatibility for the Python API (replace any negative number with -1 for the Python API)."
    http://bugs.python.org/issue26647#msg262758

    Serhiy: "I think we should make yet few related changes: (...) * Change f_lasti, tb_lasti etc to count code units instead of bytes."
    http://bugs.python.org/issue26647#msg262758

    @serprex
    Copy link
    Mannequin

    serprex mannequin commented Nov 24, 2016

    I had considered this, for some reason I didn't realize code[1] could be equal to YIELD_FROM, felt it'd always be false

    f_lasti being -2 has always been my preference, lasti.patch lgtm

    @1st1
    Copy link
    Member

    1st1 commented Nov 24, 2016

    f_lasti being -2 has always been my preference, lasti.patch lgtm

    How about we commit pygen_yf.patch to 3.6 and whatever else in 3.7?

    lasti.patch is too invasive for 3.6 at this point, -1 for it in 3.6.

    @serhiy-storchaka
    Copy link
    Member

    I prefer pygen_yf.patch (with addressing Yury's suggestions). It is much simpler, and in any case I want to change f_lasti to count words rather than bytes (bpo-27129). It would be again -1 at the start.

    @vstinner
    Copy link
    Member

    Updated patch. I added a NEWS entry and the two assertions from lasti.patch. These assertions are basic sanity checks, not directly related to this issue, they shouldn't harm. A started frame must never go back to the not-started state (f_lasti < 0).

    Yury: "How about we commit pygen_yf.patch to 3.6 and whatever else in 3.7?"

    Wordcode is already a massive change in our bytecode, I would prefer to not touch f_lasti just as a late update of WORDCODE in 3.7.

    I'm ok to keep f_lasti == -1 for frames not started yet.

    Serhiy: "I prefer pygen_yf.patch (with addressing Yury's suggestions)."

    Ok, fine. I'm also more confident in smaller changes when we are very close to the final release! So let's forget lasti.patch ;-) (Sorry Demur!)

    --

    FYI attached test.py reproduces the bug because Cursor._read_data() starts with the instruction "SETUP_LOOP 72" and 72 is the code of the operation YIELD_FROM :-)

    To reproduce the bug, you must have a code object where the second byte is 72. It's not easy to control the generated bytecode from the Python code, so I decided to not write an unit test for this bug.

    @serhiy-storchaka
    Copy link
    Member

    pygen_yf-2.patch LGTM.

    Agreed about tests.

    @1st1
    Copy link
    Member

    1st1 commented Nov 24, 2016

    LGTM

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Nov 24, 2016

    New changeset 303cedfb9e7a by Victor Stinner in branch '3.6':
    Fix _PyGen_yf()
    https://hg.python.org/cpython/rev/303cedfb9e7a

    @vstinner
    Copy link
    Member

    Thanks Martin Richard to continue to find (and report!) crazy corner cases ;-)

    @Martiusweb
    Copy link
    Member Author

    Thank you all for fixing this so quickly, it's been done amazingly fast!

    @vstinner
    Copy link
    Member

    Martin Richard: "Thank you all for fixing this so quickly, it's been done amazingly fast!"

    Even if I didn't write the commit a77756e480c2 which introduced the regression, I pushed it and so I felt responsible. It's a good motivation to fix it quickly. So nobody notice :-D

    @ned-deily
    Copy link
    Member

    (Victor also merged this into default branch for 3.7.0 in 6453ff3328b8)

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life interpreter-core (Objects, Python, Grammar, and Parser dirs) type-crash A hard crash of the interpreter, possibly with a core dump
    Projects
    None yet
    Development

    No branches or pull requests

    6 participants