Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

asyncio segfault when using threadpool and "_asyncio" native module #75244

Closed
thehesiod mannequin opened this issue Jul 27, 2017 · 23 comments
Closed

asyncio segfault when using threadpool and "_asyncio" native module #75244

thehesiod mannequin opened this issue Jul 27, 2017 · 23 comments
Labels
topic-asyncio type-crash A hard crash of the interpreter, possibly with a core dump

Comments

@thehesiod
Copy link
Mannequin

thehesiod mannequin commented Jul 27, 2017

BPO 31061
Nosy @methane, @1st1, @thehesiod
PRs
  • bpo-31061: fix crash in asyncio speedup module #2966
  • [3.6] bpo-31061: fix crash in asyncio speedup module #2984
  • Files
  • native___python_crash_stacks.txt: Python and native stack trace during segfault
  • python crash.txt
  • python_crash.py
  • python crash2.txt
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2017-08-02.16:21:59.872>
    created_at = <Date 2017-07-27.20:16:30.230>
    labels = ['type-crash', 'expert-asyncio']
    title = 'asyncio segfault when using threadpool and "_asyncio" native module'
    updated_at = <Date 2017-08-07.19:32:20.767>
    user = 'https://github.com/thehesiod'

    bugs.python.org fields:

    activity = <Date 2017-08-07.19:32:20.767>
    actor = 'thehesiod'
    assignee = 'none'
    closed = True
    closed_date = <Date 2017-08-02.16:21:59.872>
    closer = 'yselivanov'
    components = ['asyncio']
    creation = <Date 2017-07-27.20:16:30.230>
    creator = 'thehesiod'
    dependencies = []
    files = ['47043', '47049', '47050', '47051']
    hgrepos = []
    issue_num = 31061
    keywords = []
    message_count = 23.0
    messages = ['299346', '299347', '299349', '299421', '299422', '299426', '299427', '299512', '299515', '299516', '299517', '299522', '299574', '299581', '299583', '299590', '299591', '299644', '299647', '299662', '299663', '299792', '299863']
    nosy_count = 3.0
    nosy_names = ['methane', 'yselivanov', 'thehesiod']
    pr_nums = ['2966', '2984']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'crash'
    url = 'https://bugs.python.org/issue31061'
    versions = ['Python 3.5', 'Python 3.6']

    @thehesiod
    Copy link
    Mannequin Author

    thehesiod mannequin commented Jul 27, 2017

    I have a project in a prod environment which heavily uses asyncio and a threadpool. It uses the threadpool to run CPU heavy tasks (in this case populating a defaultdict) to avoid blocking the main thread (no async code in thread). For some time now my service has been randomly crashing at the same place in the thread which does the dict updating. I've finally got both the python and native stack traces, and based on the information presented it looked very similar to the issue found by the devs at home-assistant (home-assistant/core#7752 (comment), which points to home-assistant/core#7848). So I tried their fix of disabling the "_asyncio" module, and lo and behold python no longer segfaults.

    Per the stacktrace it's crashing in PyObject_GC_Del, and the only place this is used in the asyncio module seems to be here: https://github.com/python/cpython/blob/master/Modules/_asynciomodule.c#L996

    does anyone have any idea why it's crashing on this line? Are there thread protections missing in this file?

    I'm trying to reproduce this in a testcase but it's proving very difficult as I'm guessing it's timing related.

    @thehesiod thehesiod mannequin added topic-asyncio type-crash A hard crash of the interpreter, possibly with a core dump labels Jul 27, 2017
    @thehesiod
    Copy link
    Mannequin Author

    thehesiod mannequin commented Jul 27, 2017

    btw I've seen this issue in 3.5.2 + 3.6.2 on debian jessie + stretch

    @1st1
    Copy link
    Member

    1st1 commented Jul 27, 2017

    Interesting. A code to reproduce this crash would help tremendously. So far this is the first time I hear about this crash.

    @thehesiod
    Copy link
    Mannequin Author

    thehesiod mannequin commented Jul 28, 2017

    so looks like disabling the _asyncio module just caused the crash to happen less often, closing and will continue investigating after a get a core file

    @thehesiod thehesiod mannequin closed this as completed Jul 28, 2017
    @1st1
    Copy link
    Member

    1st1 commented Jul 28, 2017

    so looks like disabling the _asyncio module just caused the crash to happen less often, closing and will continue investigating after a get a core file

    Interesting. So it crashes even with pure Python Future/Task implementations?

    Another idea to debug this is to try to see what happens when you run your code with CFuture and PyTask, and with PyFuture/CTask.

    @thehesiod
    Copy link
    Mannequin Author

    thehesiod mannequin commented Jul 28, 2017

    the problem with this crash is that it only happens periodically in our prod environment :( If I try running the exact same docker container with the same inputs locally it doesn't reproduce, so frustrating. I've created a whole workflow now for deploying with a debug python to get a core file with symbols. Hopefully have some more info w/in a day. Thanks for the tips!

    @1st1
    Copy link
    Member

    1st1 commented Jul 28, 2017

    Thank you for trying to get to the bottom of this!

    @thehesiod
    Copy link
    Mannequin Author

    thehesiod mannequin commented Jul 30, 2017

    ok got a full debug core file, let me know what other information I can provide.

    @thehesiod thehesiod mannequin reopened this Jul 30, 2017
    @1st1
    Copy link
    Member

    1st1 commented Jul 30, 2017

    ok got a full debug core file, let me know what other information I can provide.

    Thank you. Does the crash happen with python versions of Tasks and Future?

    @thehesiod
    Copy link
    Mannequin Author

    thehesiod mannequin commented Jul 30, 2017

    this is the comment on the assert:

    /* Python's cyclic gc should never see an incoming refcount

    • of 0: if something decref'ed to 0, it should have been
    • deallocated immediately at that time.
    • Possible cause (if the assert triggers): a tp_dealloc
    • routine left a gc-aware object tracked during its teardown
    • phase, and did something-- or allowed something to happen --
    • that called back into Python. gc can trigger then, and may
    • see the still-tracked dying object. Before this assert
    • was added, such mistakes went on to allow gc to try to
    • delete the object again. In a debug build, that caused
    • a mysterious segfault, when _Py_ForgetReference tried
    • to remove the object from the doubly-linked list of all
    • objects a second time. In a release build, an actual
    • double deallocation occurred, which leads to corruption
    • of the allocator's internal bookkeeping pointers. That's
    • so serious that maybe this should be a release-build
    • check instead of an assert?
      */

    I've also attached a file that's similar to the code we run in production, however couldn't get it to reproduce the crash. In the datafile it uses it has some tuples like the following:
    StationTuple = namedtuple('StationTuple', ['stationname', 'stationsubtype', 's2id'])

    @thehesiod
    Copy link
    Mannequin Author

    thehesiod mannequin commented Jul 30, 2017

    hmm, how would I do that? btw I'm not 100% sure this is due to asyncio.

    @thehesiod
    Copy link
    Mannequin Author

    thehesiod mannequin commented Jul 30, 2017

    btw got slightly difference stacktrace on second core file

    @thehesiod
    Copy link
    Mannequin Author

    thehesiod mannequin commented Jul 31, 2017

    oh, so this is looking like an asyncio issue, the "gc" that is causing the crash is:

    (gdb) print *FROM_GC(gc)->ob_type
    $8 = {ob_base = {ob_base = {_ob_next = 0x7f80c8aafc88, _ob_prev = 0x7f80c8aafd00, ob_refcnt = 7, ob_type = 0x7f80cd8c86e0 <PyType_Type>}, ob_size = 0}, tp_name = 0x7f80c8aa5c38 "_GatheringFuture", tp_basicsize = 104, tp_itemsize = 0,
    tp_dealloc = 0x7f80cd41bee7 <subtype_dealloc>, tp_print = 0x0, tp_getattr = 0x0, tp_setattr = 0x0, tp_as_async = 0x556ba4342d58, tp_repr = 0x7f80c8870a9b <FutureObj_repr>, tp_as_number = 0x556ba4342d70, tp_as_sequence = 0x556ba4342ea8,
    tp_as_mapping = 0x556ba4342e90, tp_hash = 0x7f80cd54c995 <_Py_HashPointer>, tp_call = 0x0, tp_str = 0x7f80cd423d05 <object_str>, tp_getattro = 0x7f80cd3fec8b <PyObject_GenericGetAttr>, tp_setattro = 0x7f80cd3ff0b5 <PyObject_GenericSetAttr>,
    tp_as_buffer = 0x556ba4342ef8, tp_flags = 808449, tp_doc = 0x7f80c8cd7380 "Helper for gather().\n\n This overrides cancel() to cancel all the children and act more\n like Task.cancel(), which doesn't immediately mark itself as\n cancelled.\n ",
    tp_traverse = 0x7f80cd41baae <subtype_traverse>, tp_clear = 0x7f80cd41bd5c <subtype_clear>, tp_richcompare = 0x7f80cd423d42 <object_richcompare>, tp_weaklistoffset = 96, tp_iter = 0x7f80c8871d4f <future_new_iter>,
    tp_iternext = 0x7f80cd3fe6d6 <_PyObject_NextNotImplemented>, tp_methods = 0x0, tp_members = 0x556ba4342f28, tp_getset = 0x0, tp_base = 0x7f80c8a7b3c0 <FutureType>, tp_dict = 0x7f80c8aafc88, tp_descr_get = 0x0, tp_descr_set = 0x0, tp_dictoffset = 88,
    tp_init = 0x7f80cd431000 <slot_tp_init>, tp_alloc = 0x7f80cd41b7f8 <PyType_GenericAlloc>, tp_new = 0x7f80cd41b9c6 <PyType_GenericNew>, tp_free = 0x7f80cd583d8b <PyObject_GC_Del>, tp_is_gc = 0x0, tp_bases = 0x7f80c8ab20c0, tp_mro = 0x7f80c8aafd00,
    tp_cache = 0x0, tp_subclasses = 0x0, tp_weaklist = 0x7f80c8aae5d8, tp_del = 0x0, tp_version_tag = 791, tp_finalize = 0x7f80c8870ddb <FutureObj_finalize>}

    note: it's a _GatheringFuture.

    @thehesiod
    Copy link
    Mannequin Author

    thehesiod mannequin commented Jul 31, 2017

    another core had a different gc object:

    $1 = {ob_base = {ob_base = {ob_next = 0x7f801eac3158, _ob_prev = 0x7f801eab95a0, ob_refcnt = 41, ob_type = 0x7f80238e76e0 <PyType_Type>}, ob_size = 0}, tp_name = 0x7f801e8967af "_asyncio.Task", tp_basicsize = 128, tp_itemsize = 0,
    tp_dealloc = 0x7f801e8926e5 <TaskObj_dealloc>, tp_print = 0x0, tp_getattr = 0x0, tp_setattr = 0x0, tp_as_async = 0x7f801ea99720 <FutureType_as_async>, tp_repr = 0x7f801e88fa9b <FutureObj_repr>, tp_as_number = 0x0, tp_as_sequence = 0x0, tp_as_mapping = 0x0,
    tp_hash = 0x7f802356b995 <Py_HashPointer>, tp_call = 0x0, tp_str = 0x7f8023442d05 <object_str>, tp_getattro = 0x7f802341dc8b <PyObject_GenericGetAttr>, tp_setattro = 0x7f802341e0b5 <PyObject_GenericSetAttr>, tp_as_buffer = 0x0, tp_flags = 807937,
    tp_doc = 0x7f801ea98bc0 <asyncio_Task___init____doc
    > "Task(coro, *, loop=None)\n--\n\nA coroutine wrapped in a Future.", tp_traverse = 0x7f801e891658 <TaskObj_traverse>, tp_clear = 0x7f801e89150b <TaskObj_clear>,
    tp_richcompare = 0x7f8023442d42 <object_richcompare>, tp_weaklistoffset = 96, tp_iter = 0x7f801e890d4f <future_new_iter>, tp_iternext = 0x0, tp_methods = 0x7f801ea99b20 <TaskType_methods>, tp_members = 0x0, tp_getset = 0x7f801ea99d40 <TaskType_getsetlist>,
    tp_base = 0x7f801ea9a3c0 <FutureType>, tp_dict = 0x7f801eac2238, tp_descr_get = 0x0, tp_descr_set = 0x0, tp_dictoffset = 88, tp_init = 0x7f801e88d84d <asyncio_Task___init
    >, tp_alloc = 0x7f802343a7f8 <PyType_GenericAlloc>,
    tp_new = 0x7f802343a9c6 <PyType_GenericNew>, tp_free = 0x7f80235a2d8b <PyObject_GC_Del>, tp_is_gc = 0x0, tp_bases = 0x7f801eab95a0, tp_mro = 0x7f801eabc508, tp_cache = 0x0, tp_subclasses = 0x0, tp_weaklist = 0x7f801eac3458, tp_del = 0x0, tp_version_tag = 4303,
    tp_finalize = 0x7f801e8922fd <TaskObj_finalize>}

    @thehesiod
    Copy link
    Mannequin Author

    thehesiod mannequin commented Jul 31, 2017

    I'm hoping this is the fix:

    --- Modules/_asynciomodule.c.orig	2017-07-31 12:16:16.000000000 -0700
    +++ Modules/_asynciomodule.c	2017-07-31 13:08:52.000000000 -0700
    @@ -953,15 +953,18 @@
     FutureObj_dealloc(PyObject *self)
     {
         FutureObj *fut = (FutureObj *)self;
    +    PyObject_GC_UnTrack(self);
     
         if (Future_CheckExact(fut)) {
             /* When fut is subclass of Future, finalizer is called from
              * subtype_dealloc.
              */
    +        _PyObject_GC_TRACK(self);
             if (PyObject_CallFinalizerFromDealloc(self) < 0) {
                 // resurrected.
                 return;
             }
    +        _PyObject_GC_UNTRACK(self);
         }
     
         if (fut->fut_weakreflist != NULL) {
    @@ -1828,14 +1831,18 @@
     {
         TaskObj *task = (TaskObj *)self;
     
    +    PyObject_GC_UnTrack(self);
    +
         if (Task_CheckExact(self)) {
             /* When fut is subclass of Task, finalizer is called from
              * subtype_dealloc.
              */
    +         _PyObject_GC_TRACK(self);
             if (PyObject_CallFinalizerFromDealloc(self) < 0) {
                 // resurrected.
                 return;
             }
    +        _PyObject_GC_UNTRACK(self);
         }
     
         if (task->task_weakreflist != NULL) {

    @1st1
    Copy link
    Member

    1st1 commented Aug 1, 2017

    Alexander, I think this is it, you found it! Can you open a PR against master with a fix? I'll review & merge it.

    @thehesiod
    Copy link
    Mannequin Author

    thehesiod mannequin commented Aug 1, 2017

    ok, created: #2966

    there are some other deallocs in there, mind verifying the rest?

    @methane
    Copy link
    Member

    methane commented Aug 2, 2017

    New changeset de34cbe by INADA Naoki (Alexander Mohr) in branch 'master':
    bpo-31061: fix crash in asyncio speedup module (GH-2966)
    de34cbe

    @methane
    Copy link
    Member

    methane commented Aug 2, 2017

    New changeset f142e85 by INADA Naoki in branch '3.6':
    bpo-31061: fix crash in asyncio speedup module (GH-2984)
    f142e85

    @thehesiod
    Copy link
    Mannequin Author

    thehesiod mannequin commented Aug 2, 2017

    I've verified that this along with the changes in 31095 resolve the crashes I've been seeing in our production environment

    @1st1
    Copy link
    Member

    1st1 commented Aug 2, 2017

    Thank you, Alexander. Bugs like this are notoriously hard to fix, you saved us from hours, if not days, of debugging.

    @1st1 1st1 closed this as completed Aug 2, 2017
    @thehesiod
    Copy link
    Mannequin Author

    thehesiod mannequin commented Aug 5, 2017

    bad news, I just got a crash in the same place (updating defaultdict) after running for a week with the fixes from this and inada naoki's patches. I think the threadpool may be leaking threads too as I had > 40 threads after running for a week when I use no more than ~10. I'm going to switch to debug build and will update when I get more details.

    @thehesiod
    Copy link
    Mannequin Author

    thehesiod mannequin commented Aug 7, 2017

    hmm, may be my fault due to docker image tagging issue. Will redeploy and update if the issue persists. If I don't reply again sorry for the noise.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    topic-asyncio type-crash A hard crash of the interpreter, possibly with a core dump
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants