Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The garbage collector doesn't take in account that objects of heap allocated types hold a strong reference to their type #84398

Closed
vstinner opened this issue Apr 7, 2020 · 44 comments
Labels
3.9 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) release-blocker

Comments

@vstinner
Copy link
Member

vstinner commented Apr 7, 2020

BPO 40217
Nosy @tim-one, @vstinner, @encukou, @ambv, @serhiy-storchaka, @corona10, @pablogsal, @miss-islington, @shihai1991
PRs
  • bpo-40217: Ensure Py_VISIT(Py_TYPE(self)) is always called for PyType_FromSpec types #19414
  • bpo-40217: Clean code in PyType_FromSpec_Alloc and add NEWS entry #19733
  • bpo-40217: Ensure Py_VISIT(Py_TYPE(self)) is always called for PyType_FromSpec types (reverts GH-19414) #20264
  • bpo-40217: Make heap types always visit Py_TYPE(self) from tp_traverse #20433
  • [3.9] bpo-40217: Ensure Py_VISIT(Py_TYPE(self)) is always called for PyType_FromSpec types (reverts GH-19414) (GH-20264) #20490
  • Files
  • reproducer.c
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2020-06-01.07:35:10.755>
    created_at = <Date 2020-04-07.16:23:41.763>
    labels = ['interpreter-core', '3.9', 'release-blocker']
    title = "The garbage collector doesn't take in account that objects of heap allocated types hold a strong reference to their type"
    updated_at = <Date 2020-06-01.07:35:10.755>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2020-06-01.07:35:10.755>
    actor = 'petr.viktorin'
    assignee = 'none'
    closed = True
    closed_date = <Date 2020-06-01.07:35:10.755>
    closer = 'petr.viktorin'
    components = ['Interpreter Core']
    creation = <Date 2020-04-07.16:23:41.763>
    creator = 'vstinner'
    dependencies = []
    files = ['49182']
    hgrepos = []
    issue_num = 40217
    keywords = ['patch']
    message_count = 44.0
    messages = ['365911', '365918', '365921', '365922', '365923', '365924', '365925', '365926', '365927', '365928', '365929', '365930', '365931', '365932', '365933', '365963', '365965', '367119', '367120', '367122', '367124', '367125', '367150', '367270', '367412', '367413', '367416', '367424', '367425', '367681', '368813', '369460', '369462', '369464', '369532', '369599', '369632', '369636', '369668', '369669', '369675', '369713', '370057', '370225']
    nosy_count = 9.0
    nosy_names = ['tim.peters', 'vstinner', 'petr.viktorin', 'lukasz.langa', 'serhiy.storchaka', 'corona10', 'pablogsal', 'miss-islington', 'shihai1991']
    pr_nums = ['19414', '19733', '20264', '20433', '20490']
    priority = 'release blocker'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue40217'
    versions = ['Python 3.9']

    @vstinner
    Copy link
    Member Author

    vstinner commented Apr 7, 2020

    The bpo-35810 modified the object allocate to hold a *strong* reference to the type in PyObject.ob_type, whereas PyObject.ob_type is a *borrowed* references if the type is statically allocated.

    commit 364f0b0
    Author: Eddie Elizondo <eduardo.elizondorueda@gmail.com>
    Date: Wed Mar 27 07:52:18 2019 -0400

    bpo-35810: Incref heap-allocated types in PyObject_Init (GH-11661)
    
    * Incref heap-allocated types in PyObject_Init
    * Add documentation and porting notes to What's New
    

    The problem is now in some corner cases, the GC fails to visit all referrer of a type and so considers that the type is still alive.

    bpo-40149 is a concrete example of such bug.

    I propose to modify the GC to take bpo-35810 in account.

    ... or maybe I just misunderstood bpo-40149 bug :-)

    @vstinner vstinner added 3.9 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) labels Apr 7, 2020
    @pablogsal
    Copy link
    Member

    I propose to modify the GC to take bpo-35810 in account.

    What? The GC is agnostic of what it receives. It works with objects in general that implement the gc support, but it does not care about what those objects are. The only specal case are weakreferences and because those have inherit GC semantics.

    I am not sure about what you are proposing, could you elaborate?

    @pablogsal
    Copy link
    Member

    Also, heap types (created with type_new()) are already taking into account the type when visiting references:

    https://github.com/python/cpython/blob/master/Objects/typeobject.c#L1111

    @tim-one
    Copy link
    Member

    tim-one commented Apr 7, 2020

    If object A owns a strong reference to object B, and A participates in cyclic gc, and B may be part of a cycle, then it's necessary and sufficient that A's type's tp_traverse implementation invoke Py_VISIT() passing A's pointer to B.

    It would be a Really Bad Idea to add special cases to the gc module to spare some specific type(s) from following that (currently) utterly uniform rule.

    @serhiy-storchaka
    Copy link
    Member

    It would be inconvenient to require adding Py_VISIT(Py_TYPE(self)) in all tp_visit implementations of heap allocated types (including third-party extensions). Since tp_visit is GC specific thing, I think it wold be more convenient make a special case for object types.

    @pablogsal
    Copy link
    Member

    >> Since tp_visit is GC specific thing, I think it wold be more convenient make a special case for object types.

    I don't think that follows: The gc defers the logic of what and what should not be visited to the tp_traverse implementation of every object. The GC must be agnostic of the types it receives and heap types here are not special.

    --

    Also, if we make an exception in the GC for this special case, all tp_traverse of all those functions will be incorrect. Someone reading those tp_traverse can say "Oh, why these functions do not visit the type if they own s reference to it, this looks incorrect" and then we need to explain that that is an exception in the GC just because we were lazy to implement them correctly.

    @serhiy-storchaka
    Copy link
    Member

    Interesting, this issue may be related to bpo-24379. The problem with the proposed implementation of subscript was that it created a reference loop, and not all links in this loop were visible by GC.

    @pablogsal
    Copy link
    Member

    Also, as I mentioned, you don't need to modify all objects tp_traverse, only it's type.tp_traverse slot. For instance, all python objects know how to traverse stuff because they share the same tp_traverse:

    https://github.com/python/cpython/blob/master/Objects/typeobject.c#L1082

    So unless I am missing something, if you want to affect all heap types you just need to modify one tp_traverse in one place: the superclass.

    @pablogsal
    Copy link
    Member

    Hummm, I think we may just be missing a Py_VISIT(Py_TYPE(self))here:

    https://github.com/python/cpython/blob/master/Objects/typeobject.c#L3562

    A class owns references to its type (type) or another metaclass

    Tim, could you confirm that that is the case?

    @pablogsal
    Copy link
    Member

    Hummm, I think we may just be missing a Py_VISIT(Py_TYPE(self))here:

    Checking it more closely, that is incorrect, so we are not missing that visitation :(

    @serhiy-storchaka
    Copy link
    Member

    Recently many static allocated types were converted to heap allocated types (using PyType_FromSpec). Now you need to add Py_VISIT(Py_TYPE(self)) in all corresponding tp_visit implementations.

    And since even official example for heap allocated types do not contain it (and it was not needed before 3.9), I am sure that all existing code do not contain it.

    We cannot change all user code, so we should change the interpreter code so that it will work correctly with existing user code.

    @pablogsal
    Copy link
    Member

    We cannot change all user code, so we should change the interpreter code so that it will work correctly with existing user code.

    If we made a change that make all user code suddenly incorrect, that change should be reverted. The GC has clear rules about what tp_traverse should and should not do, and we should not violate those rules and make special cases in the gc just because we forced some classes to be incorrect. This will make much more difficult to reason about GC bugs, the tp_traverse implementation of classes and much difficult to maintain the GC itself.

    @serhiy-storchaka
    Copy link
    Member

    The problem is that we suddenly changed rules. It was not required that the object's type should be visited in tp_visit. It was incorrect to visit it, because object did not have strong reference to its type. User never created it, and it was not created implicitly.

    Now we changed rules. A strong reference is created implicitly. Who is responsible to manage a strong reference? Whose who created it, ant it is the interpreter, not the user. User does not know anything about it. If we pass the responsibility for the strong reference to the type on the user, we makes all user code incorrect, and we add a burden of fixing it and maintaining compatibility with incompatible Python versions on the user. I think it is very bad.

    @pablogsal
    Copy link
    Member

    Now we changed rules. A strong reference is created implicitly. Who is responsible to manage a strong reference? Whose who created it, ant it is the interpreter, not the user.

    Even if we decide that the patch that introduced the new rules should not be reverted, then what we should do is wrap the tp_traverse of the user in something that also calls Py_VISIT(Py_TYPE(self)) so basically the tp_traverse of the type created by PyType_FromSpec will do

    static int
    PyType_FromSpec_tp_traverse(_abc_data *self, visitproc visit, void *arg)
    {
        Py_VISIT(Py_TYPE(self))
        return self->user_provided_tp_traverse(self, visit, arg);
    }

    That way, we can still reason about what the tp_traverse should do, we don't break more rules and we don't need to make maintaining the GC even more difficult.

    @pablogsal
    Copy link
    Member

    In #19414 I have attached a proof of concept of a wrapper for tp_traverse that only affects types created by PyType_FromSpec that will call Py_VISIT(Py_TYPE(self)) and then the user function.

    @pablogsal
    Copy link
    Member

    I have made some investigation, and I think a form of this bug was there from a long time and does not really relate to heap types.
    For instance consider this code on Python3.7 (so commit 364f0b0 is not present).

    If you consider the more simple heap type:

    >>> class A:
    ...    ...
    ... 
    >>> A().__class__
    <class '__main__.A'>

    As the class instance refers to its type, it must visit it in the gc and we can see that indeed, that is the case:

    >>> import gc
    >>> gc.get_referents(A())
    [<class '__main__.A'>]

    But for instance, consider ast.AST, which in 3.7 is a static type created with PyType_Ready:

    >>> import ast
    >>> x = ast.AST()
    >>> x.__class__
    <class '_ast.AST'>

    we can see that the object refers to its type as the other one but....oh no, the gc does not know about that link:

    >>> import gc
    >>> gc.get_referents(x)
    []

    This is because its traverse function does not visit the type:

    static int
    ast_traverse(AST_object *self, visitproc visit, void *arg)
    {
         Py_VISIT(self->dict);
         return 0;
    }

    This is not a problem if the type is *static* because __although is technically an error__ because the GC cannot resolve
    cycles that go through the type, as the type is eternal those cycles will never be collected.

    The problem appears when the type is not eternal and can be destroyed. For instance, to understand this consider a function
    in the _testcapi that creates a heap type using PyType_FromSpec:

    >> import gc, _testcapi
    >> import weakref
    >> import sys

    # Create a new heap type with custom traverse function
    >>> x = _testcapi.construct_new_gc_type()
    >>> sys.getrefcount(x)
    5
    >>> new_obj = x()
    >>> sys.getrefcount(x)
    6
    # We know that now there should be a link between new_obj and x because one points to the other
    >>> x in gc.get_referents(new_obj)
    False
    # Oh no, there is no link
    >>> class A:
    ...    def __del__(self):
    ...       print("Ouch")
    ... 
    >>> x_w = weakref.ref(x)
    # Create a reference cycle  a -> new_obj -> x -> a
    >>> a = A()
    >>> 
    >>> x.a = a
    >>> a.y = new_obj
    >>> a.x = x
    >>> gc.collect()
    0
    >>> del x,a
    >>> gc.collect()
    0
    >>> sys.getrefcount(x_w())
    6
    >>> del new_obj
    # At this point all variables are gone and the collector should clean everything
    >>> gc.collect()
    0
    # Oh, no! The type is still alive
    >>> sys.getrefcount(x_w())
    6

    If we do the same experiment after PR19314:

    >>> import sys, gc, _testcapi
    >>> import weakref
    >>> x = _testcapi.construct_new_gc_type()
    >>> sys.getrefcount(x)
    5
    >>> new_obj = x()
    >>> sys.getrefcount(x)
    6
    # We know that now there should be a link between new_obj and x because one points to the other
    >>> x in gc.get_referents(new_obj)
    True
    # Good!
    >>> class A:
    ...   def __del__(self):
    ...      print("Ouch")
    ... 
    
    >>> x_w = weakref.ref(x)
    # Create a reference cycle  a -> new_obj -> x -> a
    >>> a = A()
    >>> x.a = a
    >>> a.y = new_obj
    >>> a.x = x
    >>> gc.collect()
    Traversed!
    Traversed!
    36
    >>> del x,a
    >>> gc.collect()
    Traversed!
    Traversed!
    0
    >>> sys.getrefcount(x_w())
    6
    >>> del new_obj
    # At this point all variables are gone and the collector should clean everything
    >>> gc.collect()
    Ouch
    8
    # Nice, the collector cleaned the cycle

    So the conclusion:

    • This bug affects all types but is only really relevant for types that are not eternal (because eternal types are already "leaked").
    • The only real problem and leaks will appear for heap types that are not eternal with custom traverse functions.
    • After commit 364f0b0 now all heap types that are not eternal, for instance
      the ones created with PyType_FromSpec, need to traverse the type because they own it. Falining to do this can create leaks in the GC.
    • The *correct* thing to do is modify the tp_traverse of all non-eternal heap types, but sadly the only way to do this in a backwards-compatible
      without modifying all user functions is injecting automatically the behaviour as PR 19414 is doing.

    @pablogsal
    Copy link
    Member

    Sorry when I said:

    If we do the same experiment after PR19314:

    I meant:

    If we do the same experiment after PR 19414:

    @vstinner
    Copy link
    Member Author

    I'm not comfortable with requesting authors of all C extensions to modify their tp_traverse function. As Pablo explained, the type is already visited on subtypes instances. I approved PR 19414.

    @vstinner
    Copy link
    Member Author

    Tests should be added, maybe based on msg365963 examples. But I would prefer to get this bug fixed in 3.9.0a6 (if possible), tests can be added later.

    @serhiy-storchaka
    Copy link
    Member

    I do not think it is right. Please wait, I'll submit an alternate solution.

    @vstinner
    Copy link
    Member Author

    There are limited options to fix this issue:

    (A) Revert commit 364f0b0

    It would reintroduced bpo-35810 bug. Moreover, C extension modules which have been modified to call Py_DECREF(Py_TYPE(self)) in tp_dealloc which suddenly crash.

    I don't think that anyone wants this option.

    (B) Require all C extension modules authors to modify their tp_traverse function.

    It requires to communicate well that all C extension modules which use PyType_FromSpec() need to modify their tp_traverse method to add code specific to Python 3.8 or newer.

    Serhiy and me are against this option.

    (C) Modify gcmodule.c to traverse the type.

    This option is not trivial since *subtype* are already traversed. Moreover, Pablo and Tim are strongly against this option.

    (D) Modify PyType_FromSpec() to hook into tp_traverse and transparently visit the type, as already done for subtypes.

    Option chosen by PR 19414.

    Honestly, I'm unhappy that we have to hook into tp_traverse, but this option sounds like the least bad option to me.

    Developers don't have to modify their C extensions code, the bug is fixed.

    @vstinner
    Copy link
    Member Author

    I do not think it is right. Please wait, I'll submit an alternate solution.

    What is your idea? How would it be different than PR 19414? Refleak buildbots are broken for almost one month (bpo-40149), it would be nice to get a fix soon. In Python 3.9.0a6 if possible.

    @tim-one
    Copy link
    Member

    tim-one commented Apr 23, 2020

    There is no possible world in which the best answer is "hack gcmodule.c" ;-)

    I haven't tried to dig into the details here, but Pablo's approach looked spot-on to me: put the solution near the source of the problem. The problem is specific to a relatively tiny number of objects, and all gc asks is that they play by the rules.

    @vstinner
    Copy link
    Member Author

    Serhiy: would it be possible to fix this issue in alpha6? Can we merge PR 19414? In the lack of reply, I think that the best to merge PR 19414.

    @vstinner
    Copy link
    Member Author

    New changeset 0169d30 by Pablo Galindo in branch 'master':
    bpo-40217: Ensure Py_VISIT(Py_TYPE(self)) is always called for PyType_FromSpec types (GH-19414)
    0169d30

    @vstinner
    Copy link
    Member Author

    Serhiy: would it be possible to fix this issue in alpha6? Can we merge PR 19414? In the lack of reply, I think that the best to merge PR 19414.

    I asked Lukasz (Python 3.9 release manager) in private and he is ok to merge this PR. He plans to wait until Refleak buildbots turn green again before tagging Python 3.9.0alpha6.

    Serhiy:

    I do not think it is right. Please wait, I'll submit an alternate solution.

    Serhiy: I merged Pablo's PR to unblock Refleak buildbots, but I'm still curious to see what you have to propose. We can still change the code before 3.9.0beta1.

    Don't hesitate to propose your PR! (your PR will have to revert commit 0169d30).

    --

    Pablo: would you mind to write a NEWS entry (in the C API category) for your change? IMHO the commit title is good enough to be reused as the NEWS entry. I added "skip news" label just to unblock buildbots.

    @ambv
    Copy link
    Contributor

    ambv commented Apr 27, 2020

    Please backport to 3.8, then it will become part of 3.8.3rc1 which I'll be releasing tomorrow.

    @pablogsal
    Copy link
    Member

    New changeset 91a5ae1 by Pablo Galindo in branch 'master':
    bpo-40217: Clean code in PyType_FromSpec_Alloc and add NEWS entry (GH-19733)
    91a5ae1

    @vstinner
    Copy link
    Member Author

    Please backport to 3.8, then it will become part of 3.8.3rc1 which I'll be releasing tomorrow.

    I propose to *not* fix this bug in Python 3.8:

    • Python 3.8 stdlib doesn't seem to be impacted by this bug
    • The number of third party C extension modules impated by this bug is very low or even zero
    • The bug severity is minor *in practice*
    • A backport can cause a C extension to crash
    • The implementation may still change before Python 3.9.0 final release

    --

    The worst thing which can happen with this bug is that a C extension module creates many types and these types are never deleted. Well, it's annoying, but it's unlikely to happen. Usually, types are created exactly once in C extensions.

    --

    I'm not 100% comfortable with backporting the fix. Python 3.8.0, 3.8.1 and 3.8.2 have been released with the bug, and this issue is the first report. But I only saw the bug because many C extension modules of the Python 3.9 stdlib were converted to PyModuleDef_Init() and PyType_FromSpec().

    --

    I'm not comfortable to change the GC behavior in a subtle way in a 3.8.x minor release.

    If a C extension module was impacted by this bug and decided to explicitly visit the type in its traverse function, this fix will crash the C extension crash...

    --

    At April 23, Serhiy proposed an alternative fix. But then he didn't propose a PR.

    --

    The bug only impacts C extension modules which use PyModuleDef_Init() and PyType_FromSpec().

    Python 3.8 only uses PyModuleDef_Init() in the following modules:

    vstinner@apu$ grep -l PyModuleDef_Init Modules/*.c
    Modules/arraymodule.c
    Modules/atexitmodule.c
    Modules/binascii.c
    Modules/_testmultiphase.c
    Modules/xxlimited.c
    Modules/xxmodule.c
    Modules/xxsubtype.c

    Python 3.8 only uses PyType_FromSpec() in the following modules:

    $ grep -l PyType_FromSpec Modules/*.c
    Modules/_curses_panel.c
    Modules/_ssl.c
    Modules/_testcapimodule.c
    Modules/_testmultiphase.c
    Modules/_tkinter.c
    Modules/xxlimited.c

    Intersection of the two sets: _testmultiphase and xxlimited, which are modules only used by the Python test suite.

    It means that Python 3.8 standard library is *not* impacted by this bug.

    Only third party C extension modules which use PyModuleDef_Init() *and* PyType_FromSpec() are impacted. Most C extension modules use statically allocated types and so are not impacted.

    @vstinner
    Copy link
    Member Author

    The issue has been fixed in the master branch, thanks Pablo! I close the issue.

    @encukou
    Copy link
    Member

    encukou commented May 20, 2020

    It looks like the fix breaks types that override Py_tp_alloc but not Py_tp_traverse (as some types in PySide do.)

    I'll investigate more and work on fixing that if someone doesn't beat me to it.

    @pablogsal
    Copy link
    Member

    Re-opening the issue

    @pablogsal
    Copy link
    Member

    Maybe we should revert this change and modify the "how to port to ..." section as we did in bpo-35810 to tell users that they need to manually visit the parent in classes created with PyType_FromSpec

    @pablogsal
    Copy link
    Member

    Actually, I have been thinking about this more and I cannot really trace how the bug could happen. We control the type allocation, but Py_tp_alloc controls the instance allocation and that can be anything, the current patch should not interfere with that.

    Petr, do you have a reproducer? I tried to create a reproducer of a type that overrides Py_tp_alloc but not Py_tp_traverse (therefore it does not implement gc support) and I cannot make it crash.

    @encukou
    Copy link
    Member

    encukou commented May 22, 2020

    Indeed, it seems my initial analysis was incorrect. I can't reproduce this outside of PySide.
    I'll close this bug and invesatigate more.

    Sorry for the noise.

    @encukou
    Copy link
    Member

    encukou commented May 22, 2020

    Ha! I think I found the issue in PySide now. It's different, but it's still a CPython issue. It's actually a variant of the "defining class" problem from PEP-573.

    It's legal to get the value of a slot, using PyType_GetSlot.
    It's reasonable to assume that what you put in the tp_traverse slot is what you'll get out using PyType_GetSlot.
    If a type's tp_traverse tries to call its superclass's tp_traverse, you can get an infinite loop.
    See the situation below or run attached reproducer (build the module, import it and exit
    the interpreter).

    Getting one's superclass is a bit tricky when dealing entirely with FromSpec
    types, but it is definitely *possible*. There's a lot of assumptions the code
    can reasonably make. In my reproducer, I stored the superclass in a C static
    variable (which is perfectly valid if the module cleanly refuses to work with
    subinterpreters or interpreter reload).

    Base:

    • real tp_traverse is PyType_FromSpec_tp_traverse
    • user_provided_tp_traverse is the base's original

    Subclass:

    • real tp_traverse is PyType_FromSpec_tp_traverse
    • user_provided_tp_traverse is the subclass' original

    So when the subclass is traversed:

    • subclass real tp_traverse calls the subclass user_provided_tp_traverse
    • subclass user_provided_tp_traverse calls PyType_GetSlot(base, tp_traverse)
      • that is, the base's real tp_traverse
    • the base's real tp_traverse calls the **SUBCLASS** user_provided_tp_traverse,
      since that's what's recorded in type(self)

    Another issue is that PyType_FromSpec_Allocated types lie about their size:
    tp->tp_basicsize + Py_SIZE(self) * tp->itemsize is not the actual
    allocated amount. This could technically be solved by redefining __sizeof__,
    but I'm more worried that it's another edge case of a hack, and There will
    probably be other edge cases. Since this is API we want to build on, I'd sleep
    easier if it's kept clean.

    What would be the downsides of reverting and documenting that tp_traverse
    must visit Py_TYPE(self)?
    It seems that you visit Py_TYPE(self), it could just end up with unbreakable
    reference cycles. We're dealing with modules and classes, which are usually
    effectively immortal (and if not, the author probably knows what they're doing).
    I don't think it would hurt that much.

    @encukou encukou reopened this May 22, 2020
    @encukou encukou reopened this May 22, 2020
    @pablogsal
    Copy link
    Member

    What would be the downsides of reverting and documenting that tp_traverse must visit Py_TYPE(self)?

    Not much IMHO, I actually would prefer this solution over any automatic "hacks". The reason is that any hack will be technically violating the contract: if I read the tp_traverse of any of these classes I would say "wait a minute....this is wrong: the class owns a strong reference to its parent so...why is not visiting it?". But the answer would be that we are hacking because we didn't want to force users to migrate once we added the strong reference.

    Notice that the root of this problem was bpo-35810. In that issue we added a guide to the docs: https://docs.python.org/3/whatsnew/3.8.html#changes-in-the-c-api. In there it says that you should decref now the parent on tp_dealloc:

    static void
    foo_dealloc(foo_struct *instance) {
        PyObject *type = Py_TYPE(instance);
        PyObject_GC_Del(instance);
    #if PY_VERSION_HEX >= 0x03080000
        // This was not needed before Python 3.8 (Python issue 35810)
        Py_DECREF(type);
    #endif
    }

    but it forgot about that you should also visit the parent in tp_traverse.

    So, I propose to do the following:

    • Revert the hack.
    • Fix the tp_traverse of all the relevant classes in the stdlib to correctly visit its parent.
    • Modify the documentation of master, 3.9 and 3.8 to add the missing information: You MUST visit the parent in tp_traverse:

    What do you think?

    @pablogsal
    Copy link
    Member

    I have been playing with the reproducer and I am a bit confused: The reproducer crashes in the same way even after reverting PR 19414 so it does not seem related to it. This is what I get:

    >>> import reproducer
    >>> reproducer.Modules/gcmodule.c:114: gc_decref: Assertion "gc_get_refs(g) > 0" failed: refcount is too small
    Enable tracemalloc to get the memory block allocation traceback

    object address : 0x55cf429e4080
    object refcount : 4
    object type : 0x55cf418e9c60
    object type name: type
    object repr : <class 'reproducer.Base'>

    Fatal Python error: _PyObject_AssertFailed: _PyObject_AssertFailed
    Python runtime state: initialized

    Current thread 0x00007fc65d2f8740 (most recent call first):
    File "<frozen importlib._bootstrap>", line 917 in _find_spec
    File "<frozen importlib._bootstrap>", line 982 in _find_and_load_unlocked
    File "<frozen importlib._bootstrap>", line 1007 in _find_and_load
    File "/home/pablogsal/github/python/master/Lib/re.py", line 124 in <module>
    File "<frozen importlib._bootstrap>", line 228 in _call_with_frames_removed
    File "<frozen importlib._bootstrap_external>", line 790 in exec_module
    File "<frozen importlib._bootstrap>", line 680 in _load_unlocked
    File "<frozen importlib._bootstrap>", line 986 in _find_and_load_unlocked
    File "<frozen importlib._bootstrap>", line 1007 in _find_and_load
    File "/home/pablogsal/github/python/master/Lib/rlcompleter.py", line 142 in attr_matches
    File "/home/pablogsal/github/python/master/Lib/rlcompleter.py", line 89 in complete
    [1] 162397 abort (core dumped) ./python

    This seems to indicate that that reproducer is already implementing incorrectly tp_traverse.

    @pablogsal
    Copy link
    Member

    Ok, I found the problem. The problem is that the reproduced does not correctly work the reference count of base_class because when construction get tuple of bases:

    PyObject *bases = PyTuple_New(1);
    result = PyTuple_SetItem(bases, 0, base_class);
    if (result) return -1;
    
    PyObject *subclass = PyType_FromModuleAndSpec(m, &subclass_spec, bases);

    "PyTuple_SetItem" steals a reference to base_class but "PyModule_AddObject" also does the same, and the refcount is incorrect.

    If you add a Py_INCREF before, the crash disappears.

    @pablogsal
    Copy link
    Member

    If you add a Py_INCREF before, the crash disappears.

    To be clear: the other crash is still in the reproduced: the one that Petr describes in his comment. In PR 20264 I have prepared the changes that I proposed previously (including the revert of the hack). Petr, do you mind reviewing it?

    @encukou
    Copy link
    Member

    encukou commented May 23, 2020

    Thank you for responding so quickly!

    Petr, do you mind reviewing it?

    Yes, but I'll only get to it next week.

    @miss-islington
    Copy link
    Contributor

    New changeset 1cf15af by Pablo Galindo in branch 'master':
    bpo-40217: Ensure Py_VISIT(Py_TYPE(self)) is always called for PyType_FromSpec types (reverts #63613) (GH-20264)
    1cf15af

    @miss-islington
    Copy link
    Contributor

    New changeset bcbe5c5 by Miss Islington (bot) in branch '3.9':
    bpo-40217: Ensure Py_VISIT(Py_TYPE(self)) is always called for PyType_FromSpec types (reverts #63613) (GH-20264)
    bcbe5c5

    @encukou encukou closed this as completed Jun 1, 2020
    @encukou encukou closed this as completed Jun 1, 2020
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.9 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) release-blocker
    Projects
    None yet
    Development

    No branches or pull requests

    7 participants