classification
Title: Segfault in gc with cyclic trash
Type: crash Stage: resolved
Components: Versions: Python 3.5, Python 3.4
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: tim.peters Nosy List: Guido.van.Rossum, asvetlov, gvanrossum, haypo, inglesp, larry, pitrou, python-dev, tim.golden, tim.peters
Priority: normal Keywords: patch

Created on 2014-05-05 11:59 by inglesp, last changed 2014-05-10 08:55 by asvetlov. This issue is now closed.

Files
File name Uploaded Description Edit
problemreport.txt inglesp, 2014-05-05 11:59
finalize.patch tim.peters, 2014-05-06 23:22 Redo iteration logic in finalize_garbage(). review
finalize2.patch tim.peters, 2014-05-07 03:28 Cleaner patch, but same idea. review
finalize3.patch tim.peters, 2014-05-07 03:33 Restore incref/decref pair. review
finalize4.patch tim.peters, 2014-05-07 23:25 Typo repair and minor improvements review
xxx.py tim.peters, 2014-05-08 00:56 Self-contained (no imports) crasher
finalize42.patch tim.peters, 2014-05-08 21:35 Code changes + new test_gc.py test. review
Messages (32)
msg217918 - (view) Author: Peter Inglesby (inglesp) * Date: 2014-05-05 11:59
The following code causes a segfault when run under Python3.4+ on OSX10.9.

# segfaulter.py
import asyncio
 
class A:
    pass
 
class B:
    def __init__(self, future):
        self.future = future
 
    def __del__(self):
        self.a = None
 
@asyncio.coroutine
def do_work(future):
    a = A()
    b = B(asyncio.Future())
 
    a.b = b
    b.a = a
 
    future.set_result(None)
 
future = asyncio.Future()
asyncio.Task(do_work(future))
loop = asyncio.get_event_loop()
loop.run_until_complete(future)


It does matter that the Future is passed to b's __init__ method.

It doesn't matter whether the Future has been resolved.

Attached is the problem report generated by OSX on the crash.
msg217931 - (view) Author: Tim Golden (tim.golden) * (Python committer) Date: 2014-05-05 16:29
I can confirm that this crashes on Windows as well. Failure actually occurs in gcmodule.c:finalize_garbage. Adding Guido as it looks as it's tied to asyncio.
msg217936 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2014-05-05 17:23
Confirmed too (OSX 10.9 again, CPython 3.4 or 3.5; but it doesn't crash with CPython 3.3). There is no C code in asyncio, so I'm not sure how asyncio can be directly responsible for this crash. Probably some of the GC improvements have an edge case that is triggered by the example program. Adding Antoine and Victor, they probably understand that code better.
msg217956 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-05-05 22:30
Here is the gdb backtrace. The `gc` variable is equal to NULL.

0x000000000043d254 in finalize_garbage (collectable=0x7fffffffdc40, old=0x8fdae0 <generations+64>)
    at Modules/gcmodule.c:788
788	        if (!_PyGCHead_FINALIZED(gc) &&
(gdb) bt
#0  0x000000000043d254 in finalize_garbage (collectable=0x7fffffffdc40, old=0x8fdae0 <generations+64>)
    at Modules/gcmodule.c:788
#1  0x000000000043dac0 in collect (generation=2, n_collected=0x7fffffffdca8, n_uncollectable=0x7fffffffdcb0, nofail=0)
    at Modules/gcmodule.c:1009
#2  0x000000000043dffc in collect_with_callback (generation=2) at Modules/gcmodule.c:1128
#3  0x000000000043eda0 in PyGC_Collect () at Modules/gcmodule.c:1604
#4  0x000000000041f332 in Py_Finalize () at Python/pythonrun.c:607
#5  0x000000000043c401 in Py_Main (argc=2, argv=0x980020) at Modules/main.c:788
#6  0x000000000041af96 in main (argc=2, argv=0x7fffffffdfc8) at ./Modules/python.c:69

(gdb) p gc
$1 = (PyGC_Head *) 0x0
msg217959 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2014-05-06 01:06
So should we just add

    if (!gc)
        break;

at the top of the for-loop body?  That prevents the crash for me.  But why is it needed?
msg217960 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2014-05-06 01:25
Guido, that's no good.  The outer loop is traversing a doubly-linked circular list, and it should be flatly impossible for gc to ever be NULL - the list structure is insanely damaged if any gc_next or gc_prev field reachable from it is NULL (gc always comes from a gc_next or gc_prev field)

I'd wonder whether the `collectable` argument passed in was NULL to begin with, but Antoine's traceback shows that it's not.

So, like most crashes in gc, we're just seeing the end symptom of something that went wrong long before.  Guarding against gc != NULL is just hiding this particular symptom.
msg217961 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2014-05-06 01:36
Thought question:  suppose finalize_garbage() is called with a collectable list all of whose members are in fact going to be destroyed?

I don't see how the loop iteration logic could reliably work then.  For concreteness, suppose there's only object - A - in the list.  It's a circular list so gc starts as A.  A is finalized, we see it's refcount is 1, `gc = gc->gc.gc_prev` sets gc to A again, and A is destroyed by the Py_DECREF.  We go back to the top of the loop, and then `gc = gc->gc.gc_next` reads up trash (free'd) memory.

Or so it seems to me ;-)
msg217962 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2014-05-06 01:49
Noting that in a Windows debug build, at death gc is 0xdbdbdbdb, so we are in reading up free'd memory (0xdb is pymalloc's "dead byte" marker).
msg217963 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2014-05-06 02:08
Also noting that the loop "looks funny" because it appears never to process the first element in the input list (unless the input list contains only one element).  That is, the loop starts by processing the second element in the list, and exits as soon as gc makes it back to the first element in the list.
msg217964 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2014-05-06 02:15
Ah, fudge - I think I forgot that these circular lists also have dummy heads, in which case it's A Good Thing "the first element" is skipped, and  it should be impossible for one to become physically empty (the dummy header should always survive intact).  Never mind ;-)
msg217968 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2014-05-06 04:51
Sorry for the earlier noise.  I'm fighting a flu and my head is mush :-(

Anyway, this doesn't look obvious.  We get to this point:

            if (Py_REFCNT(op) == 1) {
                /* op will be destroyed */
                gc = gc->gc.gc_prev;
            }

and op is the type object for class B.  gc gets set to the previous object, a list.  Everything looks fine at this point.  But when we get back from:

            Py_DECREF(op);

the list's gc.gc_next field has been overwritten with NULL.  That's why gc gets set to NULL on the next trip through the loop.

I spaced out stepping through all the type deallocation code, and didn't find exactly when the list's gc_next is overwritten.  The list's gc_prev is still fine.  Perhaps some code called _PyObject_GC_UNTRACK on the list object (which NULLs out the gc_next pointer but not the gc_prev pointer).
msg217971 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-05-06 09:29
Similar(?) issue #20526:
Modules/gcmodule.c:379: visit_decref: Assertion `((gc)->gc.gc_refs >> (1)) != 0' failed.
msg218028 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2014-05-06 23:22
A bit more info:  recall that, when deleting a type object (for class B), the previous (list) object's gc_next got overwritten with NULL.  The new info:  its (the list object's) gc_refs also changed from GC_TENTATIVELY_UNREACHABLE to GC_UNTRACKED,  That the object became untracked is wholly consistent with that its gc_next became NULL but not its gc_prev.

I haven't tracked it down all the way to the offending code, but I wonder whether that's worth the bother.  What reason do we have to believe that

            Py_DECREF(op);

CANNOT cause other objects in the `collectable` list to become recognized as trash too?  We're in cyclic trash, and one decref can trigger an arbitrary number of other objects to become trash.

So I'll attach a patch that doesn't assume the Py_DECREF is harmless, by moving `collectable` objects one at a time to a temporary list, and then - at the end - moves all survivors back to `collectable`.  That makes the error go away, but I can't be certain it's address the real problem (since I stopped looking for the code that messed with the list as a side effect of doing a decref on the type object for class B).
msg218035 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2014-05-07 03:28
Attaching a marginally cleaner version of the patch, with comments.

I think it's clear this addresses _some_ real fatal problems, but they're rare:  for a problem to show up, it has to be the case that finalize() reduces the refcount of the current object to 0, and that this also causes "the immediately previous" object in the gc list to become free'd, or at least untracked from gc.  Then the next "current object" will be taken from an insane gc_next member, and all bets are off.
msg218036 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2014-05-07 03:33
Oh, fudge.  There are other failure modes in the test suite after I took out the seemingly redundant incref/decref pair.  Adding it back in finalize3.patch.
msg218037 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2014-05-07 03:36
Thanks Tim! I'm not sure who should review the patch, but it's not me. :-) I've verified that the patch stops the example program from segfaulting on OSX, and I assume you've verified it on Windows -- that's good enough for me. We don't see this happening much in asyncio, so it's a likely story that the demo program happens to arrange for the refcounts to be just so.
msg218038 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2014-05-07 03:55
Guido, it's best if Antoine reviewed it - he wrote the relevant PEP, and the code I'm changing.  I'm sure he'll agree it's a necessary change.

About the rarity of the bug, not only do the refcounts and cyclic structure have to be just right, we also need exactly the right parts of the cyclic trash to appear in exactly the right order in gc's generation list.  It really has nothing to do with asyncio.  But it's hard to provoke!  Even knowing the cause, I have yet to create a simple failing test case.
msg218041 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-05-07 08:37
Tim's analysis is spot on, and finalize3.patch looks good to me (there's some strange commenting style there - do the carets "^" mean something special?).

Still, I hope we can find a way to write a test case.
msg218042 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-05-07 08:39
> The new info:  its (the list object's) gc_refs also changed from
> GC_TENTATIVELY_UNREACHABLE to GC_UNTRACKED,  That the object became
> untracked is wholly consistent with that its gc_next became NULL but
> not its gc_prev.

Could that be the trashcan mecanism?
msg218043 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-05-07 08:42
Larry, once this patch is finalized, I think it is a good candidate for 3.4.1.
msg218072 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2014-05-07 19:13
Antoine, the carets are a symptom of my flu.  My fingers jump on the keyboard from sneezing and coughing, and my eyes are watery so I don't see the typos.  But I believe that can be fixed ;-)

I doubt the trashcan cruft is responsible, for several reasons:  I doubt the stack gets deep enough to trigger the trashcan in this little test;  the trashcan xxx_deposit_object() functions assert-fail unless the object's state is _already_ untracked (trashcan does not itself untrack anything); and the trashcan list is linked via the gc_prev member, not the gc_next member (trashcan doesn't touch gc_next, so could not have set gc_next to NULL).  Because of the second reason, even if the trashcan is involved the object must have gotten untracked earlier.

I'll do another debugger session after a nap ;-)
msg218089 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2014-05-07 22:47
OK!  This has nothing to do with the trashcan mechanism.

The list object whose gc_next gets stomped on is not itself in a cycle.  It's an empty list, and just happens to be a value in a dict, which in turn is a value in another dict.  Its refcount falls to 0 as an ordinary part of its containing dict getting deallocated, and that's why the list becomes untracked.

This was confusing me because the memory for the list object was apparently not deallocated:  if it had been, pymalloc would have sprayed 0xdb into most of it, and gc_next would have appeared to me as 0xdbdbdbdb, not as 0.  But after calling PyObject_GC_UnTrack on it (which sets gc_next to NULL), list_dealloc() just pushed the list object onto a free list, so no other kind of list destruction got done.

That pretty much explains everything.  Cute:  it so happens that the _entire_ `collectable` list gets cleared out as a side effect of a single

            finalize(op);

call.  The iteration approach in the patch is robust against that, but it's hard to imagine that anything simpler could be.
msg218091 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2014-05-07 23:25
finalize4.patch repairs the comment typos, adds a new comment, and removes the unused `old` argument.  I think the code is ready to ship with this.  

I still don't have a reasonably simple fails-before-works-after test case.    But that doesn't bother me much, since "the problem" is obvious once it's been seen, and the patch obviously fixes it, and any way of trying to provoke this from pure Python will rely on implementation accidents to get exactly the right pieces of cyclic trash in exactly the right order in gc's internal lists.  Still, hope springs eternal ;-)
msg218094 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2014-05-08 00:56
xxx.py provokes a closely related death on my box, in a debug build (where 0xdbdbdbdb happened to be an invalid memory address).  There are no imports so asyncio is definitely off the hook ;-)

The code is senseless, and changing just about anything makes the problem go away.  As mentioned before, provoking this class of error relies on all sorts of implementation accidents to get the internal gc lists into just the right state to fail.
msg218115 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-05-08 15:02
> finalize4.patch repairs the comment typos, adds a new comment, and
> removes the unused `old` argument.  I think the code is ready to ship 
> with this.

Thanks! So do I.
msg218116 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2014-05-08 15:08
I'm totally on board with you guys checking this in for 3.4.1.
msg218125 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2014-05-08 21:35
finalize42.patch includes a test case.  If nobody objects within a few hours, I'll commit it.
msg218134 - (view) Author: Roundup Robot (python-dev) Date: 2014-05-08 22:43
New changeset 64ba3f2de99c by Tim Peters in branch '3.4':
Issue #21435: Segfault in gc with cyclic trash
http://hg.python.org/cpython/rev/64ba3f2de99c

New changeset cb9a3985df00 by Tim Peters in branch 'default':
Merge from 3.4.
http://hg.python.org/cpython/rev/cb9a3985df00
msg218138 - (view) Author: Andrew Svetlov (asvetlov) * (Python committer) Date: 2014-05-09 06:12
Thanks a lot!
The patch fixes crush dump issue with __del__ in aiohttp library tests also.
msg218194 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2014-05-09 18:57
@asvetlov, glad this fixes crashes in aiohttp library tests too, but I hadn't heard about that before.  Is there an open bug report about it on this tracker (so we can close it)?
msg218205 - (view) Author: Peter Inglesby (inglesp) * Date: 2014-05-09 22:54
It was actually through playing with aiohttp that I first hit this issue.  I think I originally hit the problem with something like:

import asyncio
import aiohttp

@asyncio.coroutine
def do_work(future):
    response = yield from aiohttp.request('get', 'http://google.com')
    future.set_result(None)

loop = asyncio.get_event_loop()
future = asyncio.Future()
asyncio.Task(do_work(future))
future.add_done_callback(print)
loop.run_forever()
msg218214 - (view) Author: Andrew Svetlov (asvetlov) * (Python committer) Date: 2014-05-10 08:55
@Tim nothing to close, aiohttp is separate library based on asyncio.
It just uses constructions like self.attr = None in __del__
History
Date User Action Args
2014-05-10 08:55:38asvetlovsetmessages: + msg218214
2014-05-09 22:54:11inglespsetmessages: + msg218205
2014-05-09 18:57:29tim.peterssetmessages: + msg218194
2014-05-09 06:12:11asvetlovsetnosy: + asvetlov
messages: + msg218138
2014-05-08 22:46:10tim.peterssetstatus: open -> closed
resolution: fixed
stage: resolved
2014-05-08 22:43:40python-devsetnosy: + python-dev
messages: + msg218134
2014-05-08 21:35:25tim.peterssetfiles: + finalize42.patch

messages: + msg218125
2014-05-08 21:09:06tim.peterssettitle: Segfault with cyclic reference and asyncio.Future -> Segfault in gc with cyclic trash
2014-05-08 20:23:26tim.peterssetassignee: tim.peters
2014-05-08 15:08:27larrysetmessages: + msg218116
2014-05-08 15:02:52pitrousetmessages: + msg218115
2014-05-08 00:56:46tim.peterssetfiles: + xxx.py

messages: + msg218094
2014-05-07 23:25:15tim.peterssetfiles: + finalize4.patch

messages: + msg218091
2014-05-07 22:47:10tim.peterssetmessages: + msg218089
2014-05-07 19:13:34tim.peterssetmessages: + msg218072
2014-05-07 08:42:57pitrousetnosy: + larry
messages: + msg218043
2014-05-07 08:39:31pitrousetmessages: + msg218042
2014-05-07 08:37:27pitrousetmessages: + msg218041
2014-05-07 03:55:07tim.peterssetmessages: + msg218038
2014-05-07 03:36:33gvanrossumsetmessages: + msg218037
2014-05-07 03:33:52tim.peterssetfiles: + finalize3.patch

messages: + msg218036
2014-05-07 03:29:00tim.peterssetfiles: + finalize2.patch

messages: + msg218035
2014-05-06 23:22:19tim.peterssetfiles: + finalize.patch
keywords: + patch
messages: + msg218028
2014-05-06 09:29:29hayposetmessages: + msg217971
2014-05-06 04:51:51tim.peterssetmessages: + msg217968
2014-05-06 02:15:31tim.peterssetmessages: + msg217964
2014-05-06 02:08:16tim.peterssetmessages: + msg217963
2014-05-06 01:49:39tim.peterssetmessages: + msg217962
2014-05-06 01:36:19tim.peterssetmessages: + msg217961
2014-05-06 01:25:23tim.peterssetnosy: + tim.peters
messages: + msg217960
2014-05-06 01:06:15gvanrossumsetmessages: + msg217959
2014-05-05 22:30:30pitrousetmessages: + msg217956
2014-05-05 17:23:04gvanrossumsetnosy: + pitrou, gvanrossum, haypo
messages: + msg217936
2014-05-05 16:29:18tim.goldensetnosy: + tim.golden, Guido.van.Rossum
messages: + msg217931
2014-05-05 11:59:58inglespcreate