classification
Title: [C API] Heap types (PyType_FromSpec) must fully implement the GC protocol
Type: Stage: patch review
Components: C API Versions: Python 3.10
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: corona10, erlendaasland, shihai1991, vstinner
Priority: normal Keywords: patch

Created on 2021-01-19 21:54 by vstinner, last changed 2021-02-01 08:34 by erlendaasland.

Pull Requests
URL Status Linked Edit
PR 23428 erlendaasland, 2021-02-01 08:34
Messages (3)
msg385297 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-01-19 21:54
Copy of my email sent to python-dev:
https://mail.python.org/archives/list/python-dev@python.org/thread/C4ILXGPKBJQYUN5YDMTJOEOX7RHOD4S3/

Hi,

In the Python stdlib, many heap types currently don't "properly"
(fully?) implement the GC protocol, which can prevent these types from
being destroyed at Python exit. As a side effect, some other Python
objects can also remain alive, and so are not destroyed either.

There is an ongoing effort to destroy all Python objects at exit
(bpo-1635741). This problem gets worse when subinterpreters are
involved: Refleaks buildbot failures prevent spotting other
regressions, so these "leaks" / "GC bugs" must be fixed as soon as
possible. In my experience, many leaks spotted by tests using
subinterpreters were quite old; they were simply ignored
previously.

It's a hard problem and I don't see any simple/obvious solution right
now, except for workarounds that I dislike. Maybe the only good
solution is to fix all heap types, one by one.

== Only the Python stdlib should be affected ==

PyType_FromSpec() was added to Python 3.2 by PEP 384 to define
"heap types" in C, but I'm not sure if it's popular in practice (ex:
Cython doesn't use it, but defines static types). I expect most
types to still be defined the old way (static types) in the vast
majority of third party extension modules.

To be clear, static types are not affected by this email.

Third party extension modules using the limited C API (to use the
stable ABI) and PyType_FromSpec() can be affected (if they don't fully
implement the GC protocol).

== Heap type instances now store a strong reference to their type ==

In March 2019, the PyObject_Init() function was modified in bpo-35810
to keep a strong reference (INCREF) to the type if the type is a heap
type. The problem being fixed was that a heap type could be destroyed
before its last instance was destroyed.

== GC and heap types ==

The new problem is that most heap types don't collaborate well with
the garbage collector. The garbage collector doesn't know anything
about Python objects, types or reference counting. It only
uses the PyGC_Head header and the traverse functions. If an object
holds a strong reference to another object but its type does not define
a traverse function, the GC cannot guess/infer this reference.

A heap type must respect the following 3 conditions to collaborate with the GC:

    Have the Py_TPFLAGS_HAVE_GC flag;
    Define a traverse function (tp_traverse) which visits the type: Py_VISIT(Py_TYPE(self));
    Instances must be tracked by the GC.

If one of these conditions is not met, the GC can fail to destroy a
type during a GC collection. If an instance is kept alive late while a
Python interpreter is being deleted, it's possible that the type is
never deleted, which can indirectly keep many objects alive, so they
are not deleted either.
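
As a rough illustration, two of the three conditions (the flag and
instance tracking) can be inspected from Python. This is only a sketch:
the helper name gc_report() and the class Plain are made up for this
example, and the flag values are copied by hand from CPython's
Include/object.h, so treat them as an assumption that holds for CPython
only. The second condition (the traverse function visits the type)
cannot be read from the flags.

```python
import gc

# Flag values as defined in CPython's Include/object.h (assumption: CPython).
PY_TPFLAGS_HEAPTYPE = 1 << 9
PY_TPFLAGS_HAVE_GC = 1 << 14


def gc_report(instance):
    """Hypothetical helper: summarize how type(instance) relates to the GC."""
    tp = type(instance)
    return {
        # True for types created at runtime (class statement, PyType_FromSpec)
        "heap_type": bool(tp.__flags__ & PY_TPFLAGS_HEAPTYPE),
        # Condition 1: the type has the Py_TPFLAGS_HAVE_GC flag
        "have_gc": bool(tp.__flags__ & PY_TPFLAGS_HAVE_GC),
        # Condition 3: this instance is tracked by the GC
        "tracked": gc.is_tracked(instance),
    }


class Plain:
    """A class defined in Python is always a heap type."""


print(gc_report(Plain()))
print(gc_report(1))
```

A heap type defined in C for which "have_gc" or "tracked" is false is a
candidate for the kind of bug described here.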

In practice, when a type is not deleted, a test using subinterpreters
starts to fail on the Refleaks buildbot since it leaks references.
Without subinterpreters, such a leak is simply ignored, even though
there is an ongoing effort to delete Python objects at exit (bpo-1635741).

== Boring traverse functions ==

Currently, there is no default traverse implementation which visits the type.

For example, I had to implement the following function for _thread.LockType:

static int
lock_traverse(lockobject *self, visitproc visit, void *arg)
{
    Py_VISIT(Py_TYPE(self));
    return 0;
}

It's a little bit annoying to have to implement the GC protocol
when a lock cannot contain other Python objects: it's not a
container, just a thin wrapper around a C lock.

There is exactly one strong reference: to the type.

== Workaround: loop on gc.collect() ==

A workaround is to run gc.collect() in a loop until it returns 0 (no
object was collected).
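
A minimal sketch of this workaround, assuming it runs in regular Python
code (the function name collect_until_stable() and the max_rounds
safety cap are inventions of this example, not an existing API):

```python
import gc


def collect_until_stable(max_rounds=10):
    """Call gc.collect() until it finds no unreachable object.

    Returns the number of unreachable objects found by each round.
    max_rounds is a safety cap added for this sketch.
    """
    counts = []
    for _ in range(max_rounds):
        found = gc.collect()  # number of unreachable objects found
        counts.append(found)
        if found == 0:
            break
    return counts


print(collect_until_stable())
```

Several rounds can be needed because destroying the objects found by
one collection can make other objects unreachable for the next one.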

== Traverse automatically? Nope. ==

Pablo Galindo attempted to automatically visit the type in the traverse function:

https://bugs.python.org/issue40217
https://github.com/python/cpython/commit/0169d3003be3d072751dd14a5c84748ab63...

Moreover, What's New in Python 3.9 contains a long section suggesting
implementing a traverse function for this problem, but it doesn't
suggest tracking instances:
https://docs.python.org/dev/whatsnew/3.9.html#changes-in-the-c-api

Visiting the type automatically caused too much trouble, so instead,
traverse functions were defined on heap types to visit the type.

Currently in the master branch, 89 types are defined as heap types out
of a total of 206 types (117 types are defined statically). I don't think
that these 89 heap types respect the 3 conditions to collaborate with
the GC.

== How should we address this issue? ==

I'm not sure what should be done. Working around the issue by
triggering multiple GC collections? Emit a warning in development mode
if a heap type doesn't collaborate well with the GC?

If core developers miss these bugs and have trouble debugging them, I
expect that extension module authors would suffer even more.

== GC+heap type bugs became common ==

I have been fixing such GC issues for a year as part of the work on
cleaning up Python objects at exit, and also indirectly related to
subinterpreters. The behavior is surprising, and it's really hard to dig
into GC internals and understand what's going on. I wrote an article
on this kind of "GC bugs":
https://vstinner.github.io/subinterpreter-leaks.html

Today, I learnt the hard way that defining a traverse function is not
enough. The type constructor (tp_new) must also track instances! See my
fix for _multibytecodec related to CJK codecs:

https://github.com/python/cpython/commit/11ef53aefbecfac18b63cee518a7184f771...
https://bugs.python.org/issue42866

== Reference cycles are common ==

The GC only serves to break reference cycles. But reference cycles are
rare, right? Well...

First of all, most types create reference cycles involving themselves.
For example, a type's __mro__ tuple contains the type, which already
creates a ref cycle. Type methods can also contain a reference to the
type.

=> The GC must break the cycle, otherwise the type cannot be destroyed
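
This can be observed from Python with a weak reference. The sketch
below assumes CPython; the function and class names are made up for the
example, and automatic collections are disabled during the experiment
so that only the explicit gc.collect() call can break the cycle.

```python
import gc
import weakref


def type_survives_until_gc():
    """Show that dropping the last name of a class does not destroy it."""
    was_enabled = gc.isenabled()
    gc.disable()  # no automatic collection during the experiment
    try:
        class Temp:
            pass

        in_own_mro = Temp.__mro__[0] is Temp  # the type references itself
        ref = weakref.ref(Temp)
        del Temp
        # Reference counting alone cannot free the type: it is in a cycle.
        alive_after_del = ref() is not None
        gc.collect()
        alive_after_collect = ref() is not None
    finally:
        if was_enabled:
            gc.enable()
    return in_own_mro, alive_after_del, alive_after_collect


print(type_survives_until_gc())
```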

When a function is defined in a Python module, the function's
__globals__ is the module namespace (module.__dict__) which...
contains the function. Defining a function in a Python module thus
creates a reference cycle which prevents the module namespace from
being deleted.
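
A small sketch of that cycle, using a throwaway module built with
types.ModuleType (the module name "demo_module" and the function name
callback() are hypothetical, chosen only for this example):

```python
import types

source = "def callback():\n    return 42\n"

# Build a module object by hand and run the source in its namespace.
mod = types.ModuleType("demo_module")
exec(source, mod.__dict__)

fn = mod.callback
# The function points to the namespace, and the namespace to the function.
namespace_is_globals = fn.__globals__ is mod.__dict__
namespace_contains_fn = mod.__dict__["callback"] is fn

# Dropping the module object is not enough: anything holding the
# function still reaches the whole namespace through __globals__.
del mod
still_reachable = fn.__globals__["callback"] is fn

print(namespace_is_globals, namespace_contains_fn, still_reachable)
```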

If a function is used as a callback somewhere, the whole module
remains "alive" until the reference to the callback is cleared.
Example: os.register_at_fork() and codecs.register() callbacks are
cleared really late during Python finalization. Currently, they are
basically the last objects cleared at Python exit. After
that, there is exactly one final GC collection.

=> The GC

== Debug GC issues ==

    gc.get_referents() and gc.get_referrers() can be used to check traverse functions.
    gc.is_tracked() can be used to check if the GC tracks an object.
    Using the gdb debugger on gc_collect_main() helps to see which objects are collected. See for example the finalize_garbage() function, which calls finalizers on unreachable objects.
    The solution is usually a missing traverse function or a missing Py_VISIT() in an existing traverse function.
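
For instance, the gc functions listed above can be combined like this.
It is only a sketch: traverse_visits_type() is a hypothetical helper
written for this example, not part of the gc module.

```python
import gc


def traverse_visits_type(obj):
    """Return True if type(obj) is among the objects visited by its traverse function."""
    return any(ref is type(obj) for ref in gc.get_referents(obj))


item = object()
container = [item]

# Objects that the list's traverse function reports to the GC.
referents = gc.get_referents(container)
# GC-tracked objects whose traverse function reports `item`.
referrers = gc.get_referrers(item)

print(any(r is item for r in referents))
print(any(r is container for r in referrers))
print(gc.is_tracked(container), gc.is_tracked(42))
# list is a static type: its traverse function does not visit the type.
print(traverse_visits_type(container))
```

Running traverse_visits_type() on an instance of a heap type defined in
C is a quick way to see whether its traverse function visits the type.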

== __del__ hack for debugging ==

If you want to play with the issue or if you have to debug a GC issue,
you can use an object which logs a message when it's being deleted:

class VerboseDel:
    def __del__(self):
        print("DELETE OBJECT")
obj = VerboseDel()

Warning: creating such an object in a module also prevents the module
namespace from being destroyed when the last reference to the module is deleted!
__del__.__globals__ contains a reference to the module namespace, and
obj.__class__ contains a reference to the type... Yeah, ref cycle and
GC issues are fun!

== Long email ==

Yeah, I like to put titles in my long emails. Enjoy. Happy hacking!
Victor

--
Night gathers, and now my watch begins. It shall not end until my death
msg385299 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-01-19 21:56
In June 2020, I created PR 20983 to attempt to automatically traverse the type:
"Provide a default tp_traverse implementation for the base object
type for heap types which have no tp_traverse function. The
traverse function visits the type if the type is a heap type."

I abandoned my PR.

I marked bpo-41036 as a duplicate of this issue.
msg385883 - (view) Author: Erlend Egeberg Aasland (erlendaasland) * Date: 2021-01-28 20:45
Should we proceed with fixing GC for all heap types before continuing work with bpo-40077?
History
Date                 User           Action  Args
2021-02-01 08:34:15  erlendaasland  set     keywords: + patch
                                            stage: patch review
                                            pull_requests: + pull_request23226
2021-01-28 20:45:54  erlendaasland  set     messages: + msg385883
2021-01-21 04:34:28  shihai1991     set     nosy: + shihai1991
2021-01-20 20:23:51  erlendaasland  set     nosy: + erlendaasland
2021-01-20 14:06:43  corona10       set     nosy: + corona10
2021-01-19 21:56:39  vstinner       set     messages: + msg385299
2021-01-19 21:55:23  vstinner       link    issue41036 superseder
2021-01-19 21:54:26  vstinner       create