classification
Title: [3.5] crash in gen_traverse(): gi_frame.ob_type=NULL, called by subtract_refs() during a GC collection
Type: crash Stage: resolved
Components: Interpreter Core Versions: Python 3.6, Python 3.5
process
Status: closed Resolution: out of date
Dependencies: Superseder:
Assigned To: Nosy List: iwienand, mcepl, pitrou, serhiy.storchaka, vstinner, vzhestkov, yselivanov
Priority: normal Keywords:

Created on 2017-09-12 08:35 by iwienand, last changed 2021-08-04 15:32 by vstinner. This issue is now closed.

Files
File name Uploaded Description Edit
crash-bt.txt iwienand, 2017-09-12 08:34 gdb backtrace of segfault
crash-py-bt.txt iwienand, 2017-09-12 08:35 GDB Python backtrace
gdb-bt-full.txt vzhestkov, 2021-07-20 07:34 gdb bt full
py-bt.txt vzhestkov, 2021-07-20 07:35
gbd-bt-brief.txt vzhestkov, 2021-07-20 07:35
Messages (6)
msg301943 - (view) Author: Ian Wienand (iwienand) * Date: 2017-09-12 08:34
Using 3.5.2-2ubuntu0~16.04.3 (Xenial), we see an occasional segfault during garbage collection of a generator object.

A full backtrace is attached, but the crash appears to be triggered inside gen_traverse during GC.

---
(gdb) info args
gen = 0x7f22385f0150
visit = 0x50eaa0 <visit_decref>
arg = 0x0

(gdb) print *gen
$109 = {ob_base = {ob_refcnt = 1, ob_type = 0xa35760 <PyGen_Type>}, gi_frame = 0x386aed8, gi_running = 1 '\001', gi_code = <code at remote 0x7f223bb42f60>, gi_weakreflist = 0x0, gi_name = 'linesplit', gi_qualname = 'linesplit'}
---

I believe gen_traverse is doing the following

---
static int
gen_traverse(PyGenObject *gen, visitproc visit, void *arg)
{
    Py_VISIT((PyObject *)gen->gi_frame);
    Py_VISIT(gen->gi_code);
    Py_VISIT(gen->gi_name);
    Py_VISIT(gen->gi_qualname);
    return 0;
}
---
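
For readers who want to see the traversal from the Python side, here is a minimal sketch. gc.get_referents() returns the objects visited by an object's C-level tp_traverse, which for a generator is gen_traverse. The countdown generator is a hypothetical stand-in for linesplit, and the exact list of referents is an assumption that depends on the interpreter version: 3.5 visits the frame, code, name and qualname as shown above, while newer releases lay out generator frames differently.

```python
import gc


def countdown(n):
    # Hypothetical stand-in for the linesplit generator in this report.
    while n > 0:
        yield n
        n -= 1


gen = countdown(3)
next(gen)  # start the generator so that it is suspended with live state

# gc.get_referents() reports what the generator's tp_traverse visits.
referents = gc.get_referents(gen)
for obj in referents:
    print(type(obj).__name__, repr(obj)[:60])
```

During a collection, subtract_refs() walks the same set of objects with visit_decref, so a corrupted object among them is touched on every GC pass.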

The problem here is that this generator's gen->gi_frame has somehow acquired a NULL object type while still holding references:

---
(gdb) print *gen->gi_frame
$112 = {ob_base = {ob_base = {ob_refcnt = 2, ob_type = 0x0}, ob_size = 0}, f_back = 0x0, f_code = 0xca3e4fd8950fef91, ...
---

Thus it gets visited and it doesn't go well.

I have attached the py-bt as well. It is very deep, with ansible, multiprocessing forking, imp.load_source() importing ... basically a nightmare.  I have not managed to get it down to any sort of minimal test case, unfortunately.  This happens fairly infrequently, which suggests a race.  The generator in question has a socket involved:

---
def linesplit(socket):
    buff = socket.recv(4096).decode("utf-8")
    buffering = True
    while buffering:
        if "\n" in buff:
            (line, buff) = buff.split("\n", 1)
            yield line + "\n"
        else:
            more = socket.recv(4096).decode("utf-8")
            if not more:
                buffering = False
            else:
                buff += more
    if buff:
        yield buff
---

Wild speculation but maybe something to do with finalizing generators with file-descriptors across fork()?

At this point we are trying a work-around: moving the above socket-reading routine out of a generator and into a "regular" loop.  Since the crash triggers as part of a production roll-out, I'm not sure we can do much more debugging.  Unless this rings any immediate bells for people, we can probably just keep this issue for tracking at this point.  [1] is the original upstream issue.
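
A rough sketch of what such a non-generator variant could look like follows. The name read_lines and the bufsize parameter are hypothetical, not taken from the actual fix. Unlike linesplit, this version buffers bytes and decodes each complete line, an assumption made here so that a multi-byte UTF-8 character split across two recv() calls does not raise.

```python
import socket


def read_lines(sock, bufsize=4096):
    """Read until EOF and return all lines as a list (no generator involved)."""
    lines = []
    buff = b""
    while True:
        chunk = sock.recv(bufsize)
        if not chunk:
            break  # peer closed the connection
        buff += chunk
        # Emit every complete line currently held in the buffer.
        while b"\n" in buff:
            line, buff = buff.split(b"\n", 1)
            lines.append(line.decode("utf-8") + "\n")
    if buff:
        # Trailing data without a final newline.
        lines.append(buff.decode("utf-8"))
    return lines


if __name__ == "__main__":
    # Small demonstration over a local socket pair.
    writer, reader = socket.socketpair()
    writer.sendall(b"one\ntwo\nthree")
    writer.close()
    print(read_lines(reader))
    reader.close()
```

The trade-off is that the caller only gets lines once the peer closes the connection, instead of receiving them incrementally.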
 
[1] https://storyboard.openstack.org/#!/story/2001186#comment-17441
msg301944 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-09-12 09:22
Python 3.5 recently moved to security-only fixes; it no longer accepts bug fixes:
https://devguide.python.org/#status-of-python-branches

It would be nice to test with Python 3.5.4 at least, or better: Python 3.6.x.


> (gdb) print *gen->gi_frame
> $112 = {ob_base = {ob_base = {ob_refcnt = 2, ob_type = 0x0}, ob_size = 0}, f_back = 0x0, f_code = 0xca3e4fd8950fef91, ...

ob_type should never be NULL for an object still reachable and with a reference count different than zero. It seems like a bug in a C extension. It would help to test your application on a Python compiled in debug mode.
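
As a small aid, the following sketch checks whether the running interpreter is a debug build. The helper name is_debug_build is hypothetical. It relies on sys.gettotalrefcount, which only exists when CPython is compiled with reference debugging (enabled by Py_DEBUG).

```python
import sys
import sysconfig


def is_debug_build():
    # sys.gettotalrefcount is only present in builds with reference
    # debugging, which Py_DEBUG (./configure --with-pydebug) turns on.
    return hasattr(sys, "gettotalrefcount")


print("debug build:", is_debug_build())
# May be None or 0 on release builds and on platforms without a Makefile.
print("Py_DEBUG config var:", sysconfig.get_config_var("Py_DEBUG"))
```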
msg301945 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-09-12 09:31
I pointed bpo-26617 to Ian since Python 3.5.2 contains this GC crash, but it seems like it's not the same bug.
msg301946 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-09-12 09:36
> I pointed bpo-26617 to Ian since Python 3.5.2 contains this GC crash, but it seems like it's not the same bug.

Ah, I found a report of bpo-26617 crashing in subtract_refs():
https://stackoverflow.com/questions/39990934/debugging-python-segmentation-faults-in-garbage-collection

So that bug does not only show up in update_refs() during a GC collection.
msg397858 - (view) Author: Victor Zhestkov (vzhestkov) Date: 2021-07-20 07:34
It seems I have the same segfault, but with the Python 3.6.13 shipped with SLE15SP2. It's a salt-api process under intensive usage. I'm able to reproduce it, but can't isolate it due to the complexity of the service. In some cases it crashes after about 5 minutes, but in others it can run without crashing for an hour or more (I keep the service under load with a kind of stress test).
msg398905 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-08-04 15:32
This bug report mentions Python 3.5 and 3.6, which no longer accept bugfixes. Since nobody has reported this issue on Python 3.9 and newer (which still accept bugfixes), I am closing the issue as out of date.

Victor Zhestkov:
> It seems I have the same segfault, but with 3.6.13 python shipped with SLE15SP2. It's salt-api process under intensive usage. I'm able to reproduce it, but can't isolate due to the service complexity. In some cases it takes about 5 minutes to be crashed, but in others it could run with no crash for about an hour or more (I keep the workload on this service with a kind of stress test).

See my notes to debug crashes happening during GC collections:
https://pythondev.readthedocs.io/debug_tools.html#debug-crash-in-garbage-collection-visit-decref

You can try to use a much smaller GC threshold: call gc.set_threshold(5) at the very beginning of your application.
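
A minimal sketch of that suggestion follows. The wrapper name enable_gc_stress is hypothetical; only gc.set_threshold() and gc.get_threshold() are real APIs. Passing a single argument changes threshold0 only, so generation 0 is collected after roughly that many net allocations of tracked objects.

```python
import gc


def enable_gc_stress(threshold0=5):
    """Make generation-0 collections run far more often than the default."""
    # Only threshold0 is changed; threshold1 and threshold2 keep their values.
    gc.set_threshold(threshold0)
    return gc.get_threshold()


# Call this as early as possible, before the application creates its objects.
print(enable_gc_stress(5))
```

Running collections this often makes the program slower, but a corrupted object is then visited much sooner after the corruption happens, which narrows down where to look.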

I strongly advise you to use a debug build of Python, since it includes many more debug checks.

I also strongly advise you to upgrade Python. I added many debug checks for object consistency in the GC in recent Python releases (3.8, 3.9, 3.10), and when a bug arises, Python dumps much more information about the faulty Python object.

Good luck debugging it. But please don't comment on this closed issue. Python 3.6 is no longer supported.
History
Date User Action Args
2021-08-04 15:32:05 vstinner set status: open -> closed
resolution: out of date
messages: + msg398905
stage: resolved
2021-07-20 08:50:43 mcepl set nosy: + mcepl
versions: + Python 3.6
2021-07-20 07:35:35 vzhestkov set files: + gbd-bt-brief.txt
2021-07-20 07:35:18 vzhestkov set files: + py-bt.txt
2021-07-20 07:35:00 vzhestkov set files: + gdb-bt-full.txt
nosy: + vzhestkov
messages: + msg397858
2017-09-12 09:42:55 vstinner set title: [3.5] gen_traverse(): gi_frame.ob_type=NULL when called by subtract_refs() during a GC collection -> [3.5] crash in gen_traverse(): gi_frame.ob_type=NULL, called by subtract_refs() during a GC collection
2017-09-12 09:36:19 vstinner set messages: + msg301946
2017-09-12 09:34:02 vstinner set nosy: + yselivanov
2017-09-12 09:33:35 vstinner set title: Segfault during GC of generator object; invalid gi_frame? -> [3.5] gen_traverse(): gi_frame.ob_type=NULL when called by subtract_refs() during a GC collection
2017-09-12 09:31:19 vstinner set nosy: + pitrou, serhiy.storchaka
messages: + msg301945
2017-09-12 09:22:31 vstinner set nosy: + vstinner
messages: + msg301944
2017-09-12 08:35:51 iwienand set files: + crash-py-bt.txt
2017-09-12 08:35:09 iwienand create