Message 385297 - Python tracker


Author	vstinner
Recipients	vstinner
Date	2021-01-19.21:54:25
Message-id	<1611093266.11.0.721387186389.issue42972@roundup.psfhosted.org>
In-reply-to

Content

Copy of my email sent to python-dev:
https://mail.python.org/archives/list/python-dev@python.org/thread/C4ILXGPKBJQYUN5YDMTJOEOX7RHOD4S3/

Hi,

In the Python stdlib, many heap types currently don't "properly"
(fully?) implement the GC protocol, which can prevent these types from
being destroyed at Python exit. As a side effect, some other Python
objects can also remain alive, and so are not destroyed either.

There is an ongoing effort to destroy all Python objects at exit
(bpo-1635741). This problem gets worse when subinterpreters are
involved: Refleaks buildbot failures prevent spotting other
regressions, and so these "leaks" / "GC bugs" must be fixed as soon as
possible. In my experience, many leaks spotted by tests using
subinterpreters were quite old; they were simply ignored previously.

It's a hard problem and I don't see any simple/obvious solution right
now, except for workarounds that I dislike. Maybe the only good
solution is to fix all heap types, one by one.

== Only the Python stdlib should be affected ==

PyType_FromSpec() was added to Python 3.2 by PEP 384 to define
"heap types" in C, but I'm not sure if it's popular in practice (ex:
Cython doesn't use it, but defines static types). I expect most
types to still be defined the old way (static types) in the vast
majority of third party extension modules.

To be clear, static types are not affected by this email.

Third party extension modules using the limited C API (to use the
stable ABI) and PyType_FromSpec() can be affected (if they don't fully
implement the GC protocol).

== Heap type instances now store a strong reference to their type ==

In March 2019, the PyObject_Init() function was modified in bpo-35810
to keep a strong reference (INCREF) to the type if the type is a heap
type. The problem being fixed was that a heap type could be destroyed
before its last instance was destroyed.

== GC and heap types ==

The new problem is that most heap types don't collaborate well with
the garbage collector. The garbage collector doesn't know anything
about Python objects, types, reference counting or anything else. It
only uses the PyGC_Head header and the traverse functions. If an
object holds a strong reference to another object but its type does
not define a traverse function, the GC cannot guess/infer this
reference.

A heap type must meet the following 3 conditions to collaborate with the GC:

    Have the Py_TPFLAGS_HAVE_GC flag;
    Define a traverse function (tp_traverse) which visits the type: Py_VISIT(Py_TYPE(self));
    Have its instances tracked by the GC.

If one of these conditions is not met, the GC can fail to destroy a
type during a GC collection. If an instance is kept alive late while a
Python interpreter is being deleted, it's possible that the type is
never deleted, which can indirectly keep many objects alive, so
they are not deleted either.
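
The three conditions can be roughly checked from the Python side. This
is only a sketch, not the C-level fix: the helper name
gc_cooperation_report() is made up for illustration, the flag value is
copied from CPython's object.h (Py_TPFLAGS_HAVE_GC is 1 << 14), and it
assumes that gc.get_referents() reports what tp_traverse visits.

```python
import gc

# Value of Py_TPFLAGS_HAVE_GC in CPython's Include/object.h (assumption:
# the bit has this value in the CPython version being inspected).
PY_TPFLAGS_HAVE_GC = 1 << 14


def gc_cooperation_report(obj):
    """Return one boolean per condition (hypothetical helper)."""
    tp = type(obj)
    return {
        # Condition 1: the type has the Py_TPFLAGS_HAVE_GC flag.
        "have_gc_flag": bool(tp.__flags__ & PY_TPFLAGS_HAVE_GC),
        # Condition 2: the traverse function visits the type.
        # gc.get_referents() returns the objects visited by tp_traverse.
        "traverse_visits_type": any(
            ref is tp for ref in gc.get_referents(obj)
        ),
        # Condition 3: the instance is tracked by the GC.
        "instance_tracked": gc.is_tracked(obj),
    }


class Example:
    """A class defined in Python is a heap type."""


report = gc_cooperation_report(Example())
print(report)
```

For an instance of a heap type defined in C, a False entry in the
report hints at which of the three conditions to look at first.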

In practice, when a type is not deleted, a test using subinterpreters
starts to fail on the Refleaks buildbot since it leaks references.
Without subinterpreters, such a leak is simply ignored, even though
there is an ongoing effort to delete Python objects at exit
(bpo-1635741).

== Boring traverse functions ==

Currently, there is no default traverse implementation which visits the type.

For example, I had to implement the following function for _thread.LockType:

static int
lock_traverse(lockobject *self, visitproc visit, void *arg)
{
    /* The only strong reference held by a lock is to its heap type. */
    Py_VISIT(Py_TYPE(self));
    return 0;
}

It's a little bit annoying to have to implement the GC protocol
even though a lock cannot contain other Python objects: it's not a
container. It's just a thin wrapper around a C lock.

There is exactly one strong reference: to the type.

== Workaround: loop on gc.collect() ==

A workaround is to run gc.collect() in a loop until it returns 0 (no
object was collected).
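
A minimal sketch of that loop, for illustration only. The function name
collect_until_stable() and the max_rounds safety limit are assumptions
added here, not part of any CPython API.

```python
import gc


def collect_until_stable(max_rounds=10):
    """Call gc.collect() until it reports 0 unreachable objects.

    max_rounds is a safety limit added for this sketch.
    Returns the number of gc.collect() calls that were made.
    """
    for round_number in range(1, max_rounds + 1):
        if gc.collect() == 0:
            return round_number
    return max_rounds


class Node:
    pass


# Build a small reference cycle so there is something to collect.
a, b = Node(), Node()
a.other, b.other = b, a
del a, b

rounds = collect_until_stable()
print(f"gc.collect() was called {rounds} time(s)")
```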

== Traverse automatically? Nope. ==

Pablo Galindo attempted to automatically visit the type in the traverse function:

https://bugs.python.org/issue40217
https://github.com/python/cpython/commit/0169d3003be3d072751dd14a5c84748ab63...

Moreover, What's New in Python 3.9 contains a long section suggesting
implementing a traverse function for this problem, but it doesn't
suggest tracking instances:
https://docs.python.org/dev/whatsnew/3.9.html#changes-in-the-c-api

The automatic approach caused too many problems, and so instead,
traverse functions were defined on heap types to visit the type.

Currently in the master branch, 89 types are defined as heap types out
of a total of 206 types (117 types are defined statically). I don't
think that all of these 89 heap types respect the 3 conditions to
collaborate with the GC.

== How should we address this issue? ==

I'm not sure what should be done. Working around the issue by
triggering multiple GC collections? Emit a warning in development mode
if a heap type doesn't collaborate well with the GC?

If core developers miss these bugs and have trouble debugging them, I
expect that extension module authors would suffer even more.

== GC+heap type bugs became common ==

I have been fixing such GC issues for a year as part of the work on
cleaning up Python objects at exit, which is also indirectly related to
subinterpreters. The behavior is surprising: it's really hard to dig
into GC internals and understand what's going on. I wrote an article
on this kind of "GC bugs":
https://vstinner.github.io/subinterpreter-leaks.html

Today, I learnt the hard way that defining a traverse function is not
enough. The type constructor (tp_new) must also track instances! See
my fix for _multibytecodec related to CJK codecs:

https://github.com/python/cpython/commit/11ef53aefbecfac18b63cee518a7184f771...
https://bugs.python.org/issue42866

== Reference cycles are common ==

The GC only serves to break reference cycles. But reference cycles are
rare, right? Well...

First of all, most types create reference cycles involving themselves.
For example, a type's __mro__ tuple contains the type, which already
creates a ref cycle. Type methods can also contain a reference to the
type.

=> The GC must break the cycle, otherwise the type cannot be destroyed
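
The cycle through __mro__ can be observed from Python. The sketch below
assumes CPython; make_class() and the Temp class are throwaway names
used only for this demonstration.

```python
import gc
import weakref


def make_class():
    class Temp:
        def method(self):
            return None
    return Temp


gc.disable()  # keep automatic collections out of the demonstration
try:
    cls = make_class()
    in_own_mro = cls.__mro__[0] is cls      # the type is in its own __mro__
    ref = weakref.ref(cls)
    del cls                                  # drop the last outside reference
    alive_after_del = ref() is not None      # refcounting alone cannot free it
    gc.collect()
    alive_after_gc = ref() is not None       # the GC breaks the cycle
finally:
    gc.enable()

print(in_own_mro, alive_after_del, alive_after_gc)
```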

When a function is defined in a Python module, the function
__globals__ is the module namespace (module.__dict__) which...
contains the function. Defining a function in a Python module also
creates a reference cycle which prevents the module namespace from
being deleted.
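
This cycle is easy to see with a throwaway module. In the sketch below,
the module name "demo_mod" and the function name callback() are made up
for illustration.

```python
import types

# Build a throwaway module and define a function inside its namespace.
module = types.ModuleType("demo_mod")
exec("def callback():\n    return 'called'\n", module.__dict__)

func = module.callback

# The function's __globals__ is the very dict that holds the function:
# namespace -> function -> namespace.
same_namespace = func.__globals__ is module.__dict__
contains_itself = func.__globals__["callback"] is func

print(same_namespace, contains_itself)
```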

If a function is used as a callback somewhere, the whole module
remains "alive" until the reference to the callback is cleared.
For example, os.register_at_fork() and codecs.register() callbacks are
cleared really late during Python finalization. Currently, they are
basically the last objects cleared at Python exit. After that, there
is exactly one final GC collection.

=> The GC

== Debug GC issues ==

    gc.get_referents() and gc.get_referrers() can be used to check traverse functions.
    gc.is_tracked() can be used to check if the GC tracks an object.
    Using the gdb debugger on gc_collect_main() helps to see which objects are collected. See for example the finalize_garbage() function, which calls finalizers on unreachable objects.
    The solution is usually to add a missing traverse function or a missing Py_VISIT() in an existing traverse function.
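
A short sketch of the three gc helpers listed above, using a plain list
as the container. The variable names are illustrative only.

```python
import gc

payload = ["data"]
container = [payload]       # a list is a GC-tracked container

# gc.get_referents() returns what the object's traverse function
# visits, so it shows which references the GC can see.
seen_by_traverse = any(
    ref is payload for ref in gc.get_referents(container)
)

# gc.get_referrers() scans tracked objects and returns those whose
# traverse function visits the argument.
found_as_referrer = any(
    ref is container for ref in gc.get_referrers(payload)
)

# gc.is_tracked() tells whether the GC tracks an object at all.
container_tracked = gc.is_tracked(container)
int_tracked = gc.is_tracked(42)   # atomic objects are not tracked

print(seen_by_traverse, found_as_referrer, container_tracked, int_tracked)
```

If a reference that an object is known to hold does not show up in
gc.get_referents(), the traverse function is a likely suspect.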

== __del__ hack for debugging ==

If you want to play with the issue or if you have to debug a GC issue,
you can use an object which logs a message when it's being deleted:

class VerboseDel:
    def __del__(self):
        print("DELETE OBJECT")
obj = VerboseDel()

Warning: creating such an object in a module also prevents the
module namespace from being destroyed when the last reference to the
module is deleted! __del__.__globals__ contains a reference to the
module namespace, and obj.__class__ contains a reference to the
type... Yeah, ref cycles and GC issues are fun!
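
A variation on the same hack, sketched under the assumption that
recording deletions in a list is more convenient than printing: it
shows that an object caught in a reference cycle is only finalized when
the GC runs. The log list and the name argument are additions made for
this sketch.

```python
import gc

log = []


class VerboseDel:
    def __init__(self, name):
        self.name = name

    def __del__(self):
        # Record the deletion instead of printing, so it can be checked.
        log.append(f"DELETE {self.name}")


gc.disable()  # avoid an automatic collection in the middle of the demo
try:
    obj = VerboseDel("in-cycle")
    obj.me = obj            # reference cycle: obj -> obj
    del obj
    before = list(log)      # nothing logged yet: only the GC can free it
    gc.collect()
    after = list(log)       # the finalizer ran during the collection
finally:
    gc.enable()

print(before, after)
```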

== Long email ==

Yeah, I like to put titles in my long emails. Enjoy. Happy hacking!
Victor

--
Night gathers, and now my watch begins. It shall not end until my death

History
Date	User	Action	Args
2021-01-19 21:54:26	vstinner	set	recipients: + vstinner
2021-01-19 21:54:26	vstinner	set	messageid: <1611093266.11.0.721387186389.issue42972@roundup.psfhosted.org>
2021-01-19 21:54:26	vstinner	link	issue42972 messages
2021-01-19 21:54:25	vstinner	create