classification
Title: [3.5] crash in gen_traverse(): gi_frame.ob_type=NULL, called by subtract_refs() during a GC collection
Type: crash
Stage:
Components: Interpreter Core
Versions: Python 3.5
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: iwienand, pitrou, serhiy.storchaka, vstinner, yselivanov
Priority: normal
Keywords:

Created on 2017-09-12 08:35 by iwienand, last changed 2017-09-12 09:42 by vstinner.

Files
File name | Uploaded | Description
crash-bt.txt | iwienand, 2017-09-12 08:34 | gdb backtrace of segfault
crash-py-bt.txt | iwienand, 2017-09-12 08:35 | GDB Python backtrace
Messages (4)
msg301943 - Author: Ian Wienand (iwienand) Date: 2017-09-12 08:34
Using 3.5.2-2ubuntu0~16.04.3 (Xenial), we see an occasional segfault during garbage collection of a generator object.

A full backtrace is attached, but the crash appears to be triggered inside gen_traverse during GC.

---
(gdb) info args
gen = 0x7f22385f0150
visit = 0x50eaa0 <visit_decref>
arg = 0x0

(gdb) print *gen
$109 = {ob_base = {ob_refcnt = 1, ob_type = 0xa35760 <PyGen_Type>}, gi_frame = 0x386aed8, gi_running = 1 '\001', gi_code = <code at remote 0x7f223bb42f60>, gi_weakreflist = 0x0, gi_name = 'linesplit', gi_qualname = 'linesplit'}
---

I believe gen_traverse is doing the following:

---
static int
gen_traverse(PyGenObject *gen, visitproc visit, void *arg)
{
    Py_VISIT((PyObject *)gen->gi_frame);
    Py_VISIT(gen->gi_code);
    Py_VISIT(gen->gi_name);
    Py_VISIT(gen->gi_qualname);
    return 0;
}
---

The problem here is that this generator's gen->gi_frame has somehow acquired a NULL object type (ob_type) while still having references:

---
(gdb) print *gen->gi_frame
$112 = {ob_base = {ob_base = {ob_refcnt = 2, ob_type = 0x0}, ob_size = 0}, f_back = 0x0, f_code = 0xca3e4fd8950fef91, ...
---

So the frame still gets visited during the traversal, and the visit crashes on its NULL type.
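
To make the failure mode easier to picture, here is a toy Python model of the refcount-subtraction pass. It is only a sketch: `ToyObject` and its fields are hypothetical names invented for illustration, and the claim that the visitor inspects the object's type before touching it is an assumption based on the CPython 3.5 sources, not something read from the attached backtrace.

```python
class ToyObject:
    """Toy stand-in for a PyObject header plus its outgoing references."""

    def __init__(self, name, refcnt, type_name, referents=()):
        self.name = name
        self.refcnt = refcnt              # models ob_refcnt
        self.type_name = type_name        # models ob_type; None stands for NULL
        self.referents = list(referents)  # what tp_traverse would visit
        self.gc_refs = 0                  # collector's working copy of refcnt


def visit_decref(obj):
    # Assumption: the visitor has to consult the type before it can touch
    # the object. With a NULL type the real interpreter would segfault here;
    # the model raises an exception instead.
    if obj.type_name is None:
        raise RuntimeError("%s: ob_type is NULL" % obj.name)
    if obj.gc_refs > 0:
        obj.gc_refs -= 1


def subtract_refs(tracked):
    # Copy each refcount into the working field first ...
    for obj in tracked:
        obj.gc_refs = obj.refcnt
    # ... then cancel out references held by other tracked objects.
    for obj in tracked:
        for referent in obj.referents:
            visit_decref(referent)


# Healthy case: the frame's only reference comes from the generator.
frame = ToyObject("gi_frame", refcnt=1, type_name="frame")
gen = ToyObject("linesplit", refcnt=1, type_name="generator", referents=[frame])
subtract_refs([gen, frame])

# Corrupted case, mirroring the gdb dump: refcnt 2, type NULL.
bad_frame = ToyObject("gi_frame", refcnt=2, type_name=None)
bad_gen = ToyObject("linesplit", refcnt=1, type_name="generator",
                    referents=[bad_frame])
```

Calling `subtract_refs([bad_gen])` in this model raises as soon as the generator's traversal reaches the corrupted frame, which is the toy equivalent of the segfault.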

I have attached the py-bt as well. It is very deep, with ansible, multiprocessing forking, imp.load_source() importing ... basically a nightmare. Unfortunately, I have not managed to reduce it to any sort of minimal test case. The crash happens fairly infrequently, which suggests a race. The generator in question has a socket involved:

---
def linesplit(socket):
    buff = socket.recv(4096).decode("utf-8")
    buffering = True
    while buffering:
        if "\n" in buff:
            (line, buff) = buff.split("\n", 1)
            yield line + "\n"
        else:
            more = socket.recv(4096).decode("utf-8")
            if not more:
                buffering = False
            else:
                buff += more
    if buff:
        yield buff
---

Wild speculation, but maybe this has something to do with finalizing generators with file descriptors across fork()?

At this point we are trying a workaround: moving the above socket-reading routine out of a generator and into a "regular" loop. As the crash triggers as part of a production roll-out, I'm not sure we can do much more debugging. Unless this rings any immediate bells for people, we can probably just keep this open for tracking. [1] is the original upstream issue.
 
[1] https://storyboard.openstack.org/#!/story/2001186#comment-17441
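
For illustration, a non-generator version of the routine might look roughly like the sketch below. The name `read_lines` and the `bufsize` parameter are hypothetical, and this is not the code actually deployed. Note that it changes behaviour: it blocks until the peer closes the socket and returns every line at once instead of yielding lazily. Like the generator, it decodes each chunk separately, so a multi-byte character split across two recv() calls would still fail to decode.

```python
import socket


def read_lines(sock, bufsize=4096):
    """Read until the peer closes the socket; return the lines as a list."""
    lines = []
    buff = ""
    while True:
        more = sock.recv(bufsize).decode("utf-8")
        if not more:
            # An empty read means the peer closed the connection.
            break
        buff += more
        # Peel off every complete line currently sitting in the buffer.
        while "\n" in buff:
            line, buff = buff.split("\n", 1)
            lines.append(line + "\n")
    if buff:
        # Trailing data without a final newline.
        lines.append(buff)
    return lines
```

Because no generator object (and so no suspended generator frame) exists between reads, the collector has nothing equivalent to gi_frame to traverse here. Whether that avoids the underlying corruption or only hides it is an open question.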
msg301944 - Author: STINNER Victor (vstinner) (Python committer) Date: 2017-09-12 09:22
Python 3.5 recently moved to security-only fixes; it doesn't accept bug fixes anymore:
https://devguide.python.org/#status-of-python-branches

It would be nice to test Python 3.5.4 at least, or better: Python 3.6.x.


> (gdb) print *gen->gi_frame
> $112 = {ob_base = {ob_base = {ob_refcnt = 2, ob_type = 0x0}, ob_size = 0}, f_back = 0x0, f_code = 0xca3e4fd8950fef91, ...

ob_type should never be NULL for an object that is still reachable and has a nonzero reference count. It looks like a bug in a C extension. It would help to test your application on a Python compiled in debug mode.
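
As a rough aid for that kind of testing, the sketch below checks whether the running interpreter is a debug build and turns on faulthandler so that a segfault also dumps the Python-level tracebacks. The helper names `is_debug_build` and `enable_crash_diagnostics` are hypothetical; `sys.gettotalrefcount` and the `faulthandler` module are standard library features.

```python
import faulthandler
import sys


def is_debug_build():
    """Return True when the interpreter was compiled in debug mode."""
    # sys.gettotalrefcount() only exists in debug builds of CPython.
    return hasattr(sys, "gettotalrefcount")


def enable_crash_diagnostics(file=None):
    """Dump the tracebacks of all threads on SIGSEGV and similar signals."""
    if file is None:
        file = sys.stderr
    # The file must stay open for as long as faulthandler is enabled.
    faulthandler.enable(file=file, all_threads=True)
    return is_debug_build()
```

Calling `enable_crash_diagnostics()` early in the entry point of each forked worker would be one way to use it; the same effect is available without code changes through the `-X faulthandler` command line option or the `PYTHONFAULTHANDLER` environment variable.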
msg301945 - Author: STINNER Victor (vstinner) (Python committer) Date: 2017-09-12 09:31
I pointed Ian to bpo-26617, since Python 3.5.2 contains that GC crash, but it seems like it's not the same bug.
msg301946 - Author: STINNER Victor (vstinner) (Python committer) Date: 2017-09-12 09:36
> I pointed Ian to bpo-26617, since Python 3.5.2 contains that GC crash, but it seems like it's not the same bug.

Ah, I found a report where the bpo-26617 crash showed up in subtract_refs():
https://stackoverflow.com/questions/39990934/debugging-python-segmentation-faults-in-garbage-collection

So that crash is not limited to update_refs() during a GC collection.
History
Date | User | Action | Args
2017-09-12 09:42:55 | vstinner | set | title: [3.5] gen_traverse(): gi_frame.ob_type=NULL when called by subtract_refs() during a GC collection -> [3.5] crash in gen_traverse(): gi_frame.ob_type=NULL, called by subtract_refs() during a GC collection
2017-09-12 09:36:19 | vstinner | set | messages: + msg301946
2017-09-12 09:34:02 | vstinner | set | nosy: + yselivanov
2017-09-12 09:33:35 | vstinner | set | title: Segfault during GC of generator object; invalid gi_frame? -> [3.5] gen_traverse(): gi_frame.ob_type=NULL when called by subtract_refs() during a GC collection
2017-09-12 09:31:19 | vstinner | set | nosy: + pitrou, serhiy.storchaka; messages: + msg301945
2017-09-12 09:22:31 | vstinner | set | nosy: + vstinner; messages: + msg301944
2017-09-12 08:35:51 | iwienand | set | files: + crash-py-bt.txt
2017-09-12 08:35:09 | iwienand | create |