classification
Title: asyncio segfault when using threadpool and "_asyncio" native module
Type: crash Stage: resolved
Components: asyncio Versions: Python 3.6, Python 3.5
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: inada.naoki, thehesiod, yselivanov
Priority: normal Keywords:

Created on 2017-07-27 20:16 by thehesiod, last changed 2017-08-07 19:32 by thehesiod. This issue is now closed.

Files
File name Uploaded Description Edit
native___python_crash_stacks.txt thehesiod, 2017-07-27 20:16 Python and native stack trace during segfault
python crash.txt thehesiod, 2017-07-30 19:23
python_crash.py thehesiod, 2017-07-30 19:39
python crash2.txt thehesiod, 2017-07-30 21:28
Pull Requests
URL Status Linked Edit
PR 2966 merged thehesiod, 2017-08-01 01:25
PR 2984 merged inada.naoki, 2017-08-02 07:11
Messages (23)
msg299346 - (view) Author: Alexander Mohr (thehesiod) * Date: 2017-07-27 20:16
I have a project in a prod environment which heavily uses asyncio and a threadpool.  It uses the threadpool to run CPU heavy tasks (in this case populating a defaultdict) to avoid blocking the main thread (no async code in thread).  For some time now my service has been randomly crashing at the same place in the thread which does the dict updating.  I've finally got both the python and native stack traces, and based on the information presented it looked very similar to the issue found by the devs at home-assistant (https://github.com/home-assistant/home-assistant/issues/7752#issuecomment-305100009, which points to https://github.com/home-assistant/home-assistant/pull/7848).  So I tried their fix of disabling the "_asyncio" module, and lo and behold python no longer segfaults.

Per the stacktrace it's crashing in PyObject_GC_Del, and the only place this is used in the asyncio module seems to be here: https://github.com/python/cpython/blob/master/Modules/_asynciomodule.c#L996

does anyone have any idea why it's crashing on this line?  Are there thread protections missing in this file?

I'm trying to reproduce this in a testcase but it's proving very difficult as I'm guessing it's timing related.
msg299347 - (view) Author: Alexander Mohr (thehesiod) * Date: 2017-07-27 20:19
btw I've seen this issue in 3.5.2 + 3.6.2 on debian jessie + stretch
msg299349 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2017-07-27 20:24
Interesting. A code to reproduce this crash would help tremendously. So far this is the first time I hear about this crash.
msg299421 - (view) Author: Alexander Mohr (thehesiod) * Date: 2017-07-28 19:03
so looks like disabling the _asyncio module just caused the crash to happen less often, closing and will continue investigating after a get a core file
msg299422 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2017-07-28 19:11
> so looks like disabling the _asyncio module just caused the crash to happen less often, closing and will continue investigating after a get a core file

Interesting. So it crashes even with pure Python Future/Task implementations?

Another idea to debug this is to try to see what happens when you run your code with CFuture and PyTask, and with PyFuture/CTask.
msg299426 - (view) Author: Alexander Mohr (thehesiod) * Date: 2017-07-28 19:29
the problem with this crash is that it only happens periodically in our prod environment :( If I try running the exact same docker container with the same inputs locally it doesn't reproduce, so frustrating.  I've created a whole workflow now for deploying with a debug python to get a core file with symbols.  Hopefully have some more info w/in a day.  Thanks for the tips!
msg299427 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2017-07-28 19:30
Thank you for trying to get to the bottom of this!
msg299512 - (view) Author: Alexander Mohr (thehesiod) * Date: 2017-07-30 19:23
ok got a full debug core file, let me know what other information I can provide.
msg299515 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2017-07-30 19:39
>ok got a full debug core file, let me know what other information I can provide.


Thank you. Does the crash happen with python versions of Tasks and Future?
msg299516 - (view) Author: Alexander Mohr (thehesiod) * Date: 2017-07-30 19:39
this is the comment on the assert:

/* Python's cyclic gc should never see an incoming refcount 
 * of 0:  if something decref'ed to 0, it should have been 
 * deallocated immediately at that time. 
 * Possible cause (if the assert triggers):  a tp_dealloc  
 * routine left a gc-aware object tracked during its teardown 
 * phase, and did something-- or allowed something to happen -- 
 * that called back into Python.  gc can trigger then, and may 
 * see the still-tracked dying object.  Before this assert 
 * was added, such mistakes went on to allow gc to try to 
 * delete the object again.  In a debug build, that caused 
 * a mysterious segfault, when _Py_ForgetReference tried 
 * to remove the object from the doubly-linked list of all 
 * objects a second time.  In a release build, an actual 
 * double deallocation occurred, which leads to corruption 
 * of the allocator's internal bookkeeping pointers.  That's 
 * so serious that maybe this should be a release-build 
 * check instead of an assert?  
 */  

I've also attached a file that's similar to the code we run in production, however couldn't get it to reproduce the crash.  In the datafile it uses it has some tuples like the following:
StationTuple = namedtuple('StationTuple', ['stationname', 'stationsubtype', 's2id'])
msg299517 - (view) Author: Alexander Mohr (thehesiod) * Date: 2017-07-30 19:46
hmm, how would I do that? btw I'm not 100% sure this is due to asyncio.
msg299522 - (view) Author: Alexander Mohr (thehesiod) * Date: 2017-07-30 21:28
btw got slightly difference stacktrace on second core file
msg299574 - (view) Author: Alexander Mohr (thehesiod) * Date: 2017-07-31 18:15
oh, so this is looking like an asyncio issue, the "gc" that is causing the crash is:

(gdb) print *FROM_GC(gc)->ob_type
$8 = {ob_base = {ob_base = {_ob_next = 0x7f80c8aafc88, _ob_prev = 0x7f80c8aafd00, ob_refcnt = 7, ob_type = 0x7f80cd8c86e0 <PyType_Type>}, ob_size = 0}, tp_name = 0x7f80c8aa5c38 "_GatheringFuture", tp_basicsize = 104, tp_itemsize = 0,
  tp_dealloc = 0x7f80cd41bee7 <subtype_dealloc>, tp_print = 0x0, tp_getattr = 0x0, tp_setattr = 0x0, tp_as_async = 0x556ba4342d58, tp_repr = 0x7f80c8870a9b <FutureObj_repr>, tp_as_number = 0x556ba4342d70, tp_as_sequence = 0x556ba4342ea8,
  tp_as_mapping = 0x556ba4342e90, tp_hash = 0x7f80cd54c995 <_Py_HashPointer>, tp_call = 0x0, tp_str = 0x7f80cd423d05 <object_str>, tp_getattro = 0x7f80cd3fec8b <PyObject_GenericGetAttr>, tp_setattro = 0x7f80cd3ff0b5 <PyObject_GenericSetAttr>,
  tp_as_buffer = 0x556ba4342ef8, tp_flags = 808449, tp_doc = 0x7f80c8cd7380 "Helper for gather().\n\n    This overrides cancel() to cancel all the children and act more\n    like Task.cancel(), which doesn't immediately mark itself as\n    cancelled.\n    ",
  tp_traverse = 0x7f80cd41baae <subtype_traverse>, tp_clear = 0x7f80cd41bd5c <subtype_clear>, tp_richcompare = 0x7f80cd423d42 <object_richcompare>, tp_weaklistoffset = 96, tp_iter = 0x7f80c8871d4f <future_new_iter>,
  tp_iternext = 0x7f80cd3fe6d6 <_PyObject_NextNotImplemented>, tp_methods = 0x0, tp_members = 0x556ba4342f28, tp_getset = 0x0, tp_base = 0x7f80c8a7b3c0 <FutureType>, tp_dict = 0x7f80c8aafc88, tp_descr_get = 0x0, tp_descr_set = 0x0, tp_dictoffset = 88,
  tp_init = 0x7f80cd431000 <slot_tp_init>, tp_alloc = 0x7f80cd41b7f8 <PyType_GenericAlloc>, tp_new = 0x7f80cd41b9c6 <PyType_GenericNew>, tp_free = 0x7f80cd583d8b <PyObject_GC_Del>, tp_is_gc = 0x0, tp_bases = 0x7f80c8ab20c0, tp_mro = 0x7f80c8aafd00, 
  tp_cache = 0x0, tp_subclasses = 0x0, tp_weaklist = 0x7f80c8aae5d8, tp_del = 0x0, tp_version_tag = 791, tp_finalize = 0x7f80c8870ddb <FutureObj_finalize>}

note: it's a _GatheringFuture.
msg299581 - (view) Author: Alexander Mohr (thehesiod) * Date: 2017-07-31 20:00
another core had a different gc object:

$1 = {ob_base = {ob_base = {_ob_next = 0x7f801eac3158, _ob_prev = 0x7f801eab95a0, ob_refcnt = 41, ob_type = 0x7f80238e76e0 <PyType_Type>}, ob_size = 0}, tp_name = 0x7f801e8967af "_asyncio.Task", tp_basicsize = 128, tp_itemsize = 0,
  tp_dealloc = 0x7f801e8926e5 <TaskObj_dealloc>, tp_print = 0x0, tp_getattr = 0x0, tp_setattr = 0x0, tp_as_async = 0x7f801ea99720 <FutureType_as_async>, tp_repr = 0x7f801e88fa9b <FutureObj_repr>, tp_as_number = 0x0, tp_as_sequence = 0x0, tp_as_mapping = 0x0,
  tp_hash = 0x7f802356b995 <_Py_HashPointer>, tp_call = 0x0, tp_str = 0x7f8023442d05 <object_str>, tp_getattro = 0x7f802341dc8b <PyObject_GenericGetAttr>, tp_setattro = 0x7f802341e0b5 <PyObject_GenericSetAttr>, tp_as_buffer = 0x0, tp_flags = 807937,
  tp_doc = 0x7f801ea98bc0 <_asyncio_Task___init____doc__> "Task(coro, *, loop=None)\n--\n\nA coroutine wrapped in a Future.", tp_traverse = 0x7f801e891658 <TaskObj_traverse>, tp_clear = 0x7f801e89150b <TaskObj_clear>,
  tp_richcompare = 0x7f8023442d42 <object_richcompare>, tp_weaklistoffset = 96, tp_iter = 0x7f801e890d4f <future_new_iter>, tp_iternext = 0x0, tp_methods = 0x7f801ea99b20 <TaskType_methods>, tp_members = 0x0, tp_getset = 0x7f801ea99d40 <TaskType_getsetlist>,
  tp_base = 0x7f801ea9a3c0 <FutureType>, tp_dict = 0x7f801eac2238, tp_descr_get = 0x0, tp_descr_set = 0x0, tp_dictoffset = 88, tp_init = 0x7f801e88d84d <_asyncio_Task___init__>, tp_alloc = 0x7f802343a7f8 <PyType_GenericAlloc>,
  tp_new = 0x7f802343a9c6 <PyType_GenericNew>, tp_free = 0x7f80235a2d8b <PyObject_GC_Del>, tp_is_gc = 0x0, tp_bases = 0x7f801eab95a0, tp_mro = 0x7f801eabc508, tp_cache = 0x0, tp_subclasses = 0x0, tp_weaklist = 0x7f801eac3458, tp_del = 0x0, tp_version_tag = 4303,
  tp_finalize = 0x7f801e8922fd <TaskObj_finalize>}
msg299583 - (view) Author: Alexander Mohr (thehesiod) * Date: 2017-07-31 20:19
I'm hoping this is the fix:

--- Modules/_asynciomodule.c.orig	2017-07-31 12:16:16.000000000 -0700
+++ Modules/_asynciomodule.c	2017-07-31 13:08:52.000000000 -0700
@@ -953,15 +953,18 @@
 FutureObj_dealloc(PyObject *self)
 {
     FutureObj *fut = (FutureObj *)self;
+    PyObject_GC_UnTrack(self);
 
     if (Future_CheckExact(fut)) {
         /* When fut is subclass of Future, finalizer is called from
          * subtype_dealloc.
          */
+        _PyObject_GC_TRACK(self);
         if (PyObject_CallFinalizerFromDealloc(self) < 0) {
             // resurrected.
             return;
         }
+        _PyObject_GC_UNTRACK(self);
     }
 
     if (fut->fut_weakreflist != NULL) {
@@ -1828,14 +1831,18 @@
 {
     TaskObj *task = (TaskObj *)self;
 
+    PyObject_GC_UnTrack(self);
+
     if (Task_CheckExact(self)) {
         /* When fut is subclass of Task, finalizer is called from
          * subtype_dealloc.
          */
+         _PyObject_GC_TRACK(self);
         if (PyObject_CallFinalizerFromDealloc(self) < 0) {
             // resurrected.
             return;
         }
+        _PyObject_GC_UNTRACK(self);
     }
 
     if (task->task_weakreflist != NULL) {
msg299590 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2017-08-01 00:27
Alexander, I think this is it, you found it! Can you open a PR against master with a fix? I'll review & merge it.
msg299591 - (view) Author: Alexander Mohr (thehesiod) * Date: 2017-08-01 01:25
ok, created: https://github.com/python/cpython/pull/2966

there are some other deallocs in there, mind verifying the rest?
msg299644 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2017-08-02 06:31
New changeset de34cbe9cdaaf7b85fed86f99c2fd071e1a7b1d2 by INADA Naoki (Alexander Mohr) in branch 'master':
bpo-31061: fix crash in asyncio speedup module (GH-2966)
https://github.com/python/cpython/commit/de34cbe9cdaaf7b85fed86f99c2fd071e1a7b1d2
msg299647 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2017-08-02 07:50
New changeset f142e85d22ba135d5205280240f3a2fe1df2649f by INADA Naoki in branch '3.6':
bpo-31061: fix crash in asyncio speedup module (GH-2984)
https://github.com/python/cpython/commit/f142e85d22ba135d5205280240f3a2fe1df2649f
msg299662 - (view) Author: Alexander Mohr (thehesiod) * Date: 2017-08-02 16:18
I've verified that this along with the changes in 31095 resolve the crashes I've been seeing in our production environment
msg299663 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2017-08-02 16:21
Thank you, Alexander. Bugs like this are notoriously hard to fix, you saved us from hours, if not days, of debugging.
msg299792 - (view) Author: Alexander Mohr (thehesiod) * Date: 2017-08-05 18:51
bad news, I just got a crash in the same place (updating defaultdict) after running for a week with the fixes from this and inada naoki's patches.  I think the threadpool may be leaking threads too as I had > 40 threads after running for a week when I use no more than ~10.  I'm going to switch to debug build and will update when I get more details.
msg299863 - (view) Author: Alexander Mohr (thehesiod) * Date: 2017-08-07 19:32
hmm, may be my fault due to docker image tagging issue.  Will redeploy and update if the issue persists.  If I don't reply again sorry for the noise.
History
Date User Action Args
2017-08-07 19:32:20thehesiodsetmessages: + msg299863
2017-08-05 18:51:01thehesiodsetmessages: + msg299792
2017-08-02 16:21:59yselivanovsetstatus: open -> closed
resolution: fixed
messages: + msg299663

stage: resolved
2017-08-02 16:18:41thehesiodsetmessages: + msg299662
2017-08-02 07:50:43inada.naokisetmessages: + msg299647
2017-08-02 07:11:43inada.naokisetpull_requests: + pull_request3023
2017-08-02 06:31:09inada.naokisetmessages: + msg299644
2017-08-01 01:25:02thehesiodsetmessages: + msg299591
pull_requests: + pull_request3014
2017-08-01 00:27:15yselivanovsetmessages: + msg299590
2017-07-31 20:26:27pitrousetstage: resolved -> (no value)
2017-07-31 20:19:42thehesiodsetmessages: + msg299583
2017-07-31 20:00:00thehesiodsetmessages: + msg299581
2017-07-31 18:15:51thehesiodsetmessages: + msg299574
2017-07-30 21:28:44thehesiodsetfiles: + python crash2.txt

messages: + msg299522
2017-07-30 19:46:38thehesiodsetmessages: + msg299517
2017-07-30 19:39:53thehesiodsetfiles: + python_crash.py

messages: + msg299516
2017-07-30 19:39:28yselivanovsetmessages: + msg299515
2017-07-30 19:23:35thehesiodsetstatus: closed -> open
files: + python crash.txt
messages: + msg299512
2017-07-28 19:30:57yselivanovsetmessages: + msg299427
2017-07-28 19:29:58thehesiodsetmessages: + msg299426
2017-07-28 19:11:04yselivanovsetmessages: + msg299422
2017-07-28 19:03:46thehesiodsetstatus: open -> closed

messages: + msg299421
stage: resolved
2017-07-27 20:24:00yselivanovsetnosy: + inada.naoki
messages: + msg299349
2017-07-27 20:19:44thehesiodsetmessages: + msg299347
2017-07-27 20:16:30thehesiodcreate