classification
Title: The garbage collector doesn't take in account that objects of heap allocated types hold a strong reference to their type
Type: Stage: resolved
Components: Interpreter Core Versions: Python 3.9
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: corona10, lukasz.langa, miss-islington, pablogsal, petr.viktorin, serhiy.storchaka, shihai1991, tim.peters, vstinner
Priority: release blocker Keywords: patch

Created on 2020-04-07 16:23 by vstinner, last changed 2020-06-01 07:35 by petr.viktorin. This issue is now closed.

Files
File name Uploaded Description Edit
reproducer.c petr.viktorin, 2020-05-22 18:53
Pull Requests
URL Status Linked Edit
PR 19414 merged pablogsal, 2020-04-07 19:16
PR 19733 merged pablogsal, 2020-04-27 13:01
PR 20264 merged pablogsal, 2020-05-20 16:28
PR 20433 closed petr.viktorin, 2020-05-26 15:47
PR 20490 merged miss-islington, 2020-05-28 14:43
Messages (44)
msg365911 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-04-07 16:23
The bpo-35810 modified the object allocate to hold a *strong* reference to the type in PyObject.ob_type, whereas PyObject.ob_type is a *borrowed* references if the type is statically allocated.

commit 364f0b0f19cc3f0d5e63f571ec9163cf41c62958
Author: Eddie Elizondo <eduardo.elizondorueda@gmail.com>
Date:   Wed Mar 27 07:52:18 2019 -0400

    bpo-35810: Incref heap-allocated types in PyObject_Init (GH-11661)
    
    * Incref heap-allocated types in PyObject_Init
    * Add documentation and porting notes to What's New


The problem is now in some corner cases, the GC fails to visit all referrer of a type and so considers that the type is still alive.

bpo-40149 is a concrete example of such bug.

I propose to modify the GC to take bpo-35810 in account.

... or maybe I just misunderstood bpo-40149 bug :-)
msg365918 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-04-07 16:54
> I propose to modify the GC to take bpo-35810 in account.

What? The GC is agnostic of what it receives. It works with objects in general that implement the gc support, but it does not care about what those objects are. The only specal case are weakreferences and because those have inherit GC semantics.

I am not sure about what you are proposing, could you elaborate?
msg365921 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-04-07 17:20
Also, heap types (created with type_new()) are already taking into account the type when visiting references:

https://github.com/python/cpython/blob/master/Objects/typeobject.c#L1111
msg365922 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2020-04-07 17:22
If object A owns a strong reference to object B, and A participates in cyclic gc, and B may be part of a cycle, then it's necessary and sufficient that A's type's tp_traverse implementation invoke Py_VISIT() passing A's pointer to B.

It would be a Really Bad Idea to add special cases to the gc module to spare some specific type(s) from following that (currently) utterly uniform rule.
msg365923 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-04-07 17:28
It would be inconvenient to require adding Py_VISIT(Py_TYPE(self)) in all tp_visit implementations of heap allocated types (including third-party extensions). Since tp_visit is GC specific thing, I think it wold be more convenient make a special case for object types.
msg365924 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-04-07 17:35
>>> Since tp_visit is GC specific thing, I think it wold be more convenient make a special case for object types.

I don't think that follows: The gc defers the logic of what and what should not be visited to the tp_traverse implementation of every object. The GC must be agnostic of the types it receives and heap types here are not special.

--

Also, if we make an exception in the GC for this special case, all tp_traverse of all those functions will be incorrect. Someone reading those tp_traverse can say "Oh, why these functions do not visit the type if they own s reference to it, this looks incorrect" and then we need to explain that that is an exception in the GC just because we were lazy to implement them correctly.
msg365925 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-04-07 17:36
Interesting, this issue may be related to issue24379. The problem with the proposed implementation of subscript was that it created a reference loop, and not all links in this loop were visible by GC.
msg365926 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-04-07 17:37
Also, as I mentioned, you don't need to modify all objects tp_traverse, only it's type.tp_traverse slot. For instance, all python objects know how to traverse stuff because they share the same tp_traverse:

https://github.com/python/cpython/blob/master/Objects/typeobject.c#L1082

So unless I am missing something, if you want to affect all heap types you just need to modify one tp_traverse in one place: the superclass.
msg365927 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-04-07 17:41
Hummm, I think we may just be missing a Py_VISIT(Py_TYPE(self))here:

https://github.com/python/cpython/blob/master/Objects/typeobject.c#L3562

A class owns references to its type (type) or another metaclass

Tim, could you confirm that that is the case?
msg365928 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-04-07 17:47
> Hummm, I think we may just be missing a Py_VISIT(Py_TYPE(self))here:

Checking it more closely, that is incorrect, so we are not missing that visitation :(
msg365929 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-04-07 17:48
Recently many static allocated types were converted to heap allocated types (using PyType_FromSpec). Now you need to add Py_VISIT(Py_TYPE(self)) in all corresponding tp_visit implementations.

And since even official example for heap allocated types do not contain it (and it was not needed before 3.9), I am sure that all existing code do not contain it.

We cannot change all user code, so we should change the interpreter code so that it will work correctly with existing user code.
msg365930 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-04-07 17:52
> We cannot change all user code, so we should change the interpreter code so that it will work correctly with existing user code.

If we made a change that make all user code suddenly incorrect, that change should be reverted. The GC has clear rules about what tp_traverse should and should not do, and we should not violate those rules and make special cases in the gc just because we forced some classes to be incorrect. This will make much more difficult to reason about GC bugs, the tp_traverse implementation of classes and much difficult to maintain the GC itself.
msg365931 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-04-07 18:03
The problem is that we suddenly changed rules. It was not required that the object's type should be visited in tp_visit. It was incorrect to visit it, because object did not have strong reference to its type. User never created it, and it was not created implicitly.

Now we changed rules. A strong reference is created implicitly. Who is responsible to manage a strong reference? Whose who created it, ant it is the interpreter, not the user. User does not know anything about it. If we pass the responsibility for the strong reference to the type on the user, we makes all user code incorrect, and we add a burden of fixing it and maintaining compatibility with incompatible Python versions on the user. I think it is very bad.
msg365932 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-04-07 18:15
> Now we changed rules. A strong reference is created implicitly. Who is responsible to manage a strong reference? Whose who created it, ant it is the interpreter, not the user.

Even if we decide that the patch that introduced the new rules should not be reverted, then what we should do is wrap the tp_traverse of the user in something that also calls Py_VISIT(Py_TYPE(self)) so basically the tp_traverse of the type created by PyType_FromSpec will do

static int
PyType_FromSpec_tp_traverse(_abc_data *self, visitproc visit, void *arg)
{
    Py_VISIT(Py_TYPE(self))
    return self->user_provided_tp_traverse(self, visit, arg);
}

That way, we can still reason about what the tp_traverse should do, we don't break more rules and we don't need to make maintaining the GC even more difficult.
msg365933 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-04-07 19:17
In https://github.com/python/cpython/pull/19414 I have attached a proof of concept of a wrapper for tp_traverse that only affects types created by PyType_FromSpec that will call Py_VISIT(Py_TYPE(self)) and then the user function.
msg365963 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-04-08 00:03
I have made some investigation, and I think a form of this bug was there from a long time and does not really relate to heap types.
For instance consider this code on Python3.7 (so commit 364f0b0f19cc3f0d5e63f571ec9163cf41c62958 is not present). 

If you consider the more simple heap type:

>>> class A:
...    ...
... 
>>> A().__class__
<class '__main__.A'>

As the class instance refers to its type, it must visit it in the gc and we can see that indeed, that is the case:

>>> import gc
>>> gc.get_referents(A())
[<class '__main__.A'>]

But for instance, consider ast.AST, which in 3.7 is a static type created with PyType_Ready:


>>> import ast
>>> x = ast.AST()
>>> x.__class__
<class '_ast.AST'>

we can see that the object refers to its type as the other one but....oh no, the gc does not know about that link:

>>> import gc
>>> gc.get_referents(x)
[]

This is because its traverse function does not visit the type:

static int
ast_traverse(AST_object *self, visitproc visit, void *arg)
{
     Py_VISIT(self->dict);
     return 0;
}


This is not a problem if the type is *static* because __although is technically an error__ because the GC cannot resolve
cycles that go through the type, as the type is eternal those cycles will never be collected. 

The problem appears when the type is not eternal and can be destroyed. For instance, to understand this consider a function
in the _testcapi that creates a heap type using PyType_FromSpec:

>>> import gc, _testcapi
>>> import weakref
>>> import sys

# Create a new heap type with custom traverse function
>>> x = _testcapi.construct_new_gc_type()
>>> sys.getrefcount(x)
5
>>> new_obj = x()
>>> sys.getrefcount(x)
6
# We know that now there should be a link between new_obj and x because one points to the other
>>> x in gc.get_referents(new_obj)
False
# Oh no, there is no link
>>> class A:
...    def __del__(self):
...       print("Ouch")
... 
>>> x_w = weakref.ref(x)
# Create a reference cycle  a -> new_obj -> x -> a
>>> a = A()
>>> 
>>> x.a = a
>>> a.y = new_obj
>>> a.x = x
>>> gc.collect()
0
>>> del x,a
>>> gc.collect()
0
>>> sys.getrefcount(x_w())
6
>>> del new_obj
# At this point all variables are gone and the collector should clean everything
>>> gc.collect()
0
# Oh, no! The type is still alive
>>> sys.getrefcount(x_w())
6


If we do the same experiment after PR19314:

>>> import sys, gc, _testcapi
>>> import weakref
>>> x = _testcapi.construct_new_gc_type()
>>> sys.getrefcount(x)
5
>>> new_obj = x()
>>> sys.getrefcount(x)
6
# We know that now there should be a link between new_obj and x because one points to the other
>>> x in gc.get_referents(new_obj)
True
# Good!
>>> class A:
...   def __del__(self):
...      print("Ouch")
... 

>>> x_w = weakref.ref(x)
# Create a reference cycle  a -> new_obj -> x -> a
>>> a = A()
>>> x.a = a
>>> a.y = new_obj
>>> a.x = x
>>> gc.collect()
Traversed!
Traversed!
36
>>> del x,a
>>> gc.collect()
Traversed!
Traversed!
0
>>> sys.getrefcount(x_w())
6
>>> del new_obj
# At this point all variables are gone and the collector should clean everything
>>> gc.collect()
Ouch
8
# Nice, the collector cleaned the cycle


-------

So the conclusion:

* This bug affects all types but is only really relevant for types that are not eternal (because eternal types are already "leaked").
* The only real problem and leaks will appear for heap types that are not eternal with custom traverse functions.
* After commit 364f0b0f19cc3f0d5e63f571ec9163cf41c62958 now all heap types that are not eternal, for instance 
  the ones created with PyType_FromSpec, need to traverse the type because they own it. Falining to do this can create leaks in the GC.
* The *correct* thing to do is modify the tp_traverse of all non-eternal heap types, but sadly the only way to do this in a backwards-compatible
  without modifying all user functions is injecting automatically the behaviour as PR 19414 is doing.
msg365965 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-04-08 00:05
Sorry when I said:

> If we do the same experiment after PR19314:

I meant:

If we do the same experiment after PR 19414:
msg367119 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-04-23 14:54
I'm not comfortable with requesting authors of all C extensions to modify their tp_traverse function. As Pablo explained, the type is already visited on subtypes instances. I approved PR 19414.
msg367120 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-04-23 14:56
Tests should be added, maybe based on msg365963 examples. But I would prefer to get this bug fixed in 3.9.0a6 (if possible), tests can be added later.
msg367122 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-04-23 15:18
I do not think it is right. Please wait, I'll submit an alternate solution.
msg367124 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-04-23 15:33
There are limited options to fix this issue:

(A) Revert commit 364f0b0f19cc3f0d5e63f571ec9163cf41c62958

It would reintroduced bpo-35810 bug. Moreover, C extension modules which have been modified to call Py_DECREF(Py_TYPE(self)) in tp_dealloc which suddenly crash.

I don't think that anyone wants this option.


(B) Require all C extension modules authors to modify their tp_traverse function.

It requires to communicate well that all C extension modules which use PyType_FromSpec() need to modify their tp_traverse method to add code specific to Python 3.8 or newer.

Serhiy and me are against this option.


(C) Modify gcmodule.c to traverse the type.

This option is not trivial since *subtype* are already traversed. Moreover, Pablo and Tim are strongly against this option.


(D) Modify PyType_FromSpec() to hook into tp_traverse and transparently visit the type, as already done for subtypes.

Option chosen by PR 19414.

Honestly, I'm unhappy that we have to hook into tp_traverse, but this option sounds like the least bad option to me.

Developers don't have to modify their C extensions code, the bug is fixed.
msg367125 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-04-23 15:35
> I do not think it is right. Please wait, I'll submit an alternate solution.

What is your idea? How would it be different than PR 19414? Refleak buildbots are broken for almost one month (bpo-40149), it would be nice to get a fix soon. In Python 3.9.0a6 if possible.
msg367150 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2020-04-23 21:53
There is no possible world in which the best answer is "hack gcmodule.c" ;-)

I haven't tried to dig into the details here, but Pablo's approach looked spot-on to me:  put the solution near the source of the problem.  The problem is specific to a relatively tiny number of objects, and all gc asks is that they play by the rules.
msg367270 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-04-25 10:16
Serhiy: would it be possible to fix this issue in alpha6? Can we merge PR 19414? In the lack of reply, I think that the best to merge PR 19414.
msg367412 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-04-27 12:22
New changeset 0169d3003be3d072751dd14a5c84748ab63a249f by Pablo Galindo in branch 'master':
bpo-40217: Ensure Py_VISIT(Py_TYPE(self)) is always called for PyType_FromSpec types (GH-19414)
https://github.com/python/cpython/commit/0169d3003be3d072751dd14a5c84748ab63a249f
msg367413 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-04-27 12:26
> Serhiy: would it be possible to fix this issue in alpha6? Can we merge PR 19414? In the lack of reply, I think that the best to merge PR 19414.

I asked Lukasz (Python 3.9 release manager) in private and he is ok to merge this PR. He plans to wait until Refleak buildbots turn green again before tagging Python 3.9.0alpha6.


Serhiy:
> I do not think it is right. Please wait, I'll submit an alternate solution.

Serhiy: I merged Pablo's PR to unblock Refleak buildbots, but I'm still curious to see what you have to propose. We can still change the code before 3.9.0beta1.

Don't hesitate to propose your PR! (your PR will have to revert commit 0169d3003be3d072751dd14a5c84748ab63a249f).

--

Pablo: would you mind to write a NEWS entry (in the C API category) for your change? IMHO the commit title is good enough to be reused as the NEWS entry. I added "skip news" label just to unblock buildbots.
msg367416 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-04-27 13:01
FYI I closed bpo-40149: the commit 0169d3003be3d072751dd14a5c84748ab63a249f fixed test_threading leak.

--

For master, it would be nice to have a NEWS entry, and maybe also a What's New in Python 3.9 entry, maybe in the "Changes in the C API" section.

--

Now, what about Python 3.8? The commit 364f0b0f19cc3f0d5e63f571ec9163cf41c62958 is part of Python 3.8.0. Is it ok to backport the commit 0169d3003be3d072751dd14a5c84748ab63a249f into Python 3.8.3?
msg367424 - (view) Author: Ɓukasz Langa (lukasz.langa) * (Python committer) Date: 2020-04-27 14:02
Please backport to 3.8, then it will become part of 3.8.3rc1 which I'll be releasing tomorrow.
msg367425 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-04-27 14:24
New changeset 91a5ae18351027867e99c96db5ea235d9c42e47a by Pablo Galindo in branch 'master':
bpo-40217: Clean code in PyType_FromSpec_Alloc and add NEWS entry (GH-19733)
https://github.com/python/cpython/commit/91a5ae18351027867e99c96db5ea235d9c42e47a
msg367681 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-04-29 17:26
> Please backport to 3.8, then it will become part of 3.8.3rc1 which I'll be releasing tomorrow.

I propose to *not* fix this bug in Python 3.8:

* Python 3.8 stdlib doesn't seem to be impacted by this bug
* The number of third party C extension modules impated by this bug is very low or even zero
* The bug severity is minor *in practice*
* A backport can cause a C extension to crash
* The implementation may still change before Python 3.9.0 final release

--

The worst thing which can happen with this bug is that a C extension module creates many types and these types are never deleted. Well, it's annoying, but it's unlikely to happen. Usually, types are created exactly once in C extensions.

--

I'm not 100% comfortable with backporting the fix. Python 3.8.0, 3.8.1 and 3.8.2 have been released with the bug, and this issue is the first report. But I only saw the bug because many C extension modules of the Python 3.9 stdlib were converted to PyModuleDef_Init() and PyType_FromSpec().

--

I'm not comfortable to change the GC behavior in a subtle way in a 3.8.x minor release.

If a C extension module was impacted by this bug and decided to explicitly visit the type in its traverse function, this fix will crash the C extension crash...

--

At April 23, Serhiy proposed an alternative fix. But then he didn't propose a PR.

--

The bug only impacts C extension modules which use PyModuleDef_Init() and PyType_FromSpec().

Python 3.8 only uses PyModuleDef_Init() in the following modules:

vstinner@apu$ grep -l PyModuleDef_Init Modules/*.c
Modules/arraymodule.c
Modules/atexitmodule.c
Modules/binascii.c
Modules/_testmultiphase.c
Modules/xxlimited.c
Modules/xxmodule.c
Modules/xxsubtype.c

Python 3.8 only uses PyType_FromSpec() in the following modules:

$ grep -l PyType_FromSpec Modules/*.c
Modules/_curses_panel.c
Modules/_ssl.c
Modules/_testcapimodule.c
Modules/_testmultiphase.c
Modules/_tkinter.c
Modules/xxlimited.c

Intersection of the two sets: _testmultiphase and xxlimited, which are modules only used by the Python test suite.

It means that Python 3.8 standard library is *not* impacted by this bug.

Only third party C extension modules which use PyModuleDef_Init() *and* PyType_FromSpec() are impacted. Most C extension modules use statically allocated types and so are not impacted.
msg368813 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-05-14 01:21
The issue has been fixed in the master branch, thanks Pablo! I close the issue.
msg369460 - (view) Author: Petr Viktorin (petr.viktorin) * (Python committer) Date: 2020-05-20 15:00
It looks like the fix breaks types that override Py_tp_alloc but not Py_tp_traverse (as some types in PySide do.)

I'll investigate more and work on fixing that if someone doesn't beat me to it.
msg369462 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-05-20 15:27
Re-opening the issue
msg369464 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-05-20 15:35
Maybe we should revert this change and modify the "how to port to ..." section as we did in bpo-35810 to tell users that they need to manually visit the parent in classes created with PyType_FromSpec
msg369532 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-05-21 18:51
Actually, I have been thinking about this more and I cannot really trace how the bug could happen. We control the type allocation, but Py_tp_alloc controls the instance allocation and that can be anything, the current patch should not interfere with that.

Petr, do you have a reproducer? I tried to create a reproducer of a type that overrides Py_tp_alloc but not Py_tp_traverse (therefore it does not implement gc support) and I cannot make it crash.
msg369599 - (view) Author: Petr Viktorin (petr.viktorin) * (Python committer) Date: 2020-05-22 11:48
Indeed, it seems my initial analysis was incorrect. I can't reproduce this outside of PySide. 
I'll close this bug and invesatigate more.

Sorry for the noise.
msg369632 - (view) Author: Petr Viktorin (petr.viktorin) * (Python committer) Date: 2020-05-22 18:52
Ha! I think I found the issue in PySide now. It's different, but it's still a CPython issue. It's actually a variant of the "defining class" problem from PEP 573.


It's legal to get the value of a slot, using PyType_GetSlot.
It's reasonable to assume that what you put in the tp_traverse slot is what you'll get out using PyType_GetSlot.
If a type's tp_traverse tries to call its superclass's tp_traverse, you can get an infinite loop.
See the situation below or run attached reproducer (build the module, import it and exit
the interpreter).

Getting one's superclass is a bit tricky when dealing entirely with FromSpec
types, but it is definitely *possible*. There's a lot of assumptions the code
can reasonably make. In my reproducer, I stored the superclass in a C static
variable (which is perfectly valid if the module cleanly refuses to work with
subinterpreters or interpreter reload).


Base:
- real tp_traverse is PyType_FromSpec_tp_traverse
- user_provided_tp_traverse is the base's original

Subclass:
- real tp_traverse is PyType_FromSpec_tp_traverse
- user_provided_tp_traverse is the subclass' original

So when the subclass is traversed:
- subclass real tp_traverse calls the subclass user_provided_tp_traverse
- subclass user_provided_tp_traverse calls PyType_GetSlot(base, tp_traverse)
  - that is, the base's real tp_traverse
- the base's real tp_traverse calls the **SUBCLASS** user_provided_tp_traverse,
  since that's what's recorded in type(self)



Another issue is that `PyType_FromSpec_Alloc`ated types lie about their size:
`tp->tp_basicsize + Py_SIZE(self) * tp->itemsize` is not the actual
allocated amount. This could technically be solved by redefining `__sizeof__`,
but I'm more worried that it's another edge case of a hack, and There will
probably be other edge cases. Since this is API we want to build on, I'd sleep
easier if it's kept clean.


What would be the downsides of reverting and documenting that tp_traverse
must visit Py_TYPE(self)?
It seems that you visit Py_TYPE(self), it could just end up with unbreakable
reference cycles. We're dealing with modules and classes, which are usually
effectively immortal (and if not, the author probably knows what they're doing).
I don't think it would hurt that much.
msg369636 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-05-22 20:10
> What would be the downsides of reverting and documenting that tp_traverse must visit Py_TYPE(self)?

Not much IMHO, I actually would prefer this solution over any automatic "hacks". The reason is that any hack will be technically violating the contract: if I read the tp_traverse of any of these classes I would say "wait a minute....this is wrong: the class owns a strong reference to its parent so...why is not visiting it?". But the answer would be that we are hacking because we didn't want to force users to migrate once we added the strong reference.


Notice that the root of this problem was bpo-35810. In that issue we added a guide to the docs: https://docs.python.org/3/whatsnew/3.8.html#changes-in-the-c-api. In  there it says that you should decref now the parent on tp_dealloc:

static void
foo_dealloc(foo_struct *instance) {
    PyObject *type = Py_TYPE(instance);
    PyObject_GC_Del(instance);
#if PY_VERSION_HEX >= 0x03080000
    // This was not needed before Python 3.8 (Python issue 35810)
    Py_DECREF(type);
#endif
}

but it forgot about that you should also visit the parent in tp_traverse.

So, I propose to do the following:

* Revert the hack.
* Fix the tp_traverse of all the relevant classes in the stdlib to correctly visit its parent.
* Modify the documentation of master, 3.9 and 3.8 to add the missing information: You MUST visit the parent in tp_traverse:

What do you think?
msg369668 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-05-22 23:29
I have been playing with the reproducer and I am a bit confused: The reproducer crashes in the same way even after reverting PR 19414 so it does not seem related to it. This is what I get:

>>> import reproducer
>>> reproducer.Modules/gcmodule.c:114: gc_decref: Assertion "gc_get_refs(g) > 0" failed: refcount is too small
Enable tracemalloc to get the memory block allocation traceback

object address  : 0x55cf429e4080
object refcount : 4
object type     : 0x55cf418e9c60
object type name: type
object repr     : <class 'reproducer.Base'>

Fatal Python error: _PyObject_AssertFailed: _PyObject_AssertFailed
Python runtime state: initialized

Current thread 0x00007fc65d2f8740 (most recent call first):
  File "<frozen importlib._bootstrap>", line 917 in _find_spec
  File "<frozen importlib._bootstrap>", line 982 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 1007 in _find_and_load
  File "/home/pablogsal/github/python/master/Lib/re.py", line 124 in <module>
  File "<frozen importlib._bootstrap>", line 228 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 790 in exec_module
  File "<frozen importlib._bootstrap>", line 680 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 986 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 1007 in _find_and_load
  File "/home/pablogsal/github/python/master/Lib/rlcompleter.py", line 142 in attr_matches
  File "/home/pablogsal/github/python/master/Lib/rlcompleter.py", line 89 in complete
[1]    162397 abort (core dumped)  ./python

This seems to indicate that that reproducer is already implementing incorrectly tp_traverse.
msg369669 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-05-22 23:57
Ok, I found the problem. The problem is that the reproduced does not correctly work the reference count of base_class because when construction get tuple of bases:

PyObject *bases = PyTuple_New(1);
result = PyTuple_SetItem(bases, 0, base_class);
if (result) return -1;

PyObject *subclass = PyType_FromModuleAndSpec(m, &subclass_spec, bases);

"PyTuple_SetItem" steals a reference to base_class but "PyModule_AddObject" also does the same, and the refcount is incorrect.

If you add a Py_INCREF before, the crash disappears.
msg369675 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-05-23 00:30
> If you add a Py_INCREF before, the crash disappears.

To be clear: the other crash is still in the reproduced: the one that Petr describes in his comment. In PR 20264 I have prepared the changes that I proposed previously (including the revert of the hack). Petr, do you mind reviewing it?
msg369713 - (view) Author: Petr Viktorin (petr.viktorin) * (Python committer) Date: 2020-05-23 10:56
Thank you for responding so quickly!

> Petr, do you mind reviewing it?

Yes, but I'll only get to it next week.
msg370057 - (view) Author: miss-islington (miss-islington) Date: 2020-05-27 09:03
New changeset 1cf15af9a6f28750f37b08c028ada31d38e818dd by Pablo Galindo in branch 'master':
bpo-40217:  Ensure Py_VISIT(Py_TYPE(self)) is always called for PyType_FromSpec types (reverts GH-19414) (GH-20264)
https://github.com/python/cpython/commit/1cf15af9a6f28750f37b08c028ada31d38e818dd
msg370225 - (view) Author: miss-islington (miss-islington) Date: 2020-05-28 15:12
New changeset bcbe5c59dde5fcb9ad21991c2afd91837b14bbd5 by Miss Islington (bot) in branch '3.9':
bpo-40217:  Ensure Py_VISIT(Py_TYPE(self)) is always called for PyType_FromSpec types (reverts GH-19414) (GH-20264)
https://github.com/python/cpython/commit/bcbe5c59dde5fcb9ad21991c2afd91837b14bbd5
History
Date User Action Args
2020-06-01 07:35:10petr.viktorinsetstatus: open -> closed
resolution: fixed
stage: patch review -> resolved
2020-05-28 15:12:30miss-islingtonsetmessages: + msg370225
2020-05-28 14:43:25miss-islingtonsetpull_requests: + pull_request19739
2020-05-27 09:03:46miss-islingtonsetnosy: + miss-islington
messages: + msg370057
2020-05-26 15:47:48petr.viktorinsetstage: resolved -> patch review
pull_requests: + pull_request19690
2020-05-23 10:56:50petr.viktorinsetmessages: + msg369713
2020-05-23 00:30:39pablogsalsetmessages: + msg369675
2020-05-22 23:57:26pablogsalsetmessages: + msg369669
2020-05-22 23:29:32pablogsalsetmessages: + msg369668
2020-05-22 20:10:55pablogsalsetmessages: + msg369636
2020-05-22 18:53:15petr.viktorinsetfiles: + reproducer.c
2020-05-22 18:52:46petr.viktorinsetstatus: closed -> open
resolution: fixed -> (no value)
messages: + msg369632
2020-05-22 11:48:39petr.viktorinsetstatus: open -> closed
resolution: fixed
messages: + msg369599

stage: patch review -> resolved
2020-05-21 18:51:31pablogsalsetmessages: + msg369532
2020-05-20 16:28:27pablogsalsetstage: resolved -> patch review
pull_requests: + pull_request19550
2020-05-20 15:35:56pablogsalsetmessages: + msg369464
2020-05-20 15:34:20pablogsalsetmessages: - msg369463
2020-05-20 15:32:04pablogsalsetpriority: normal -> release blocker

messages: + msg369463
2020-05-20 15:27:50pablogsalsetstatus: closed -> open
resolution: fixed -> (no value)
messages: + msg369462
2020-05-20 15:00:08petr.viktorinsetnosy: + petr.viktorin
messages: + msg369460
2020-05-14 01:21:29vstinnersetstatus: open -> closed
resolution: fixed
messages: + msg368813

stage: patch review -> resolved
2020-04-29 17:26:02vstinnersetmessages: + msg367681
2020-04-27 14:24:38pablogsalsetmessages: + msg367425
2020-04-27 14:02:36lukasz.langasetnosy: + lukasz.langa
messages: + msg367424
2020-04-27 13:01:59pablogsalsetpull_requests: + pull_request19054
2020-04-27 13:01:46vstinnersetmessages: + msg367416
2020-04-27 12:26:06vstinnersetmessages: + msg367413
2020-04-27 12:22:26vstinnersetmessages: + msg367412
2020-04-25 10:16:33vstinnersetmessages: + msg367270
2020-04-23 21:53:05tim.peterssetmessages: + msg367150
2020-04-23 15:35:15vstinnersetmessages: + msg367125
2020-04-23 15:33:56vstinnersetmessages: + msg367124
2020-04-23 15:18:03serhiy.storchakasetmessages: + msg367122
2020-04-23 14:56:00vstinnersetmessages: + msg367120
2020-04-23 14:54:00vstinnersetmessages: + msg367119
2020-04-08 00:05:40pablogsalsetmessages: + msg365965
2020-04-08 00:03:48pablogsalsetmessages: + msg365963
2020-04-07 19:17:48pablogsalsetmessages: + msg365933
2020-04-07 19:16:32pablogsalsetkeywords: + patch
stage: patch review
pull_requests: + pull_request18774
2020-04-07 18:15:28pablogsalsetmessages: + msg365932
2020-04-07 18:03:58serhiy.storchakasetmessages: + msg365931
2020-04-07 17:52:01pablogsalsetmessages: + msg365930
2020-04-07 17:48:08serhiy.storchakasetmessages: + msg365929
2020-04-07 17:47:11pablogsalsetmessages: + msg365928
2020-04-07 17:41:28pablogsalsetmessages: + msg365927
2020-04-07 17:37:32pablogsalsetmessages: + msg365926
2020-04-07 17:36:01serhiy.storchakasetmessages: + msg365925
2020-04-07 17:35:41pablogsalsetmessages: + msg365924
2020-04-07 17:28:47serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg365923
2020-04-07 17:22:39tim.peterssetnosy: + tim.peters
messages: + msg365922
2020-04-07 17:20:27pablogsalsetmessages: + msg365921
2020-04-07 16:54:33pablogsalsetmessages: + msg365918
2020-04-07 16:49:45shihai1991setnosy: + shihai1991
2020-04-07 16:30:43corona10setnosy: + corona10
2020-04-07 16:23:41vstinnercreate