Author nascheme
Recipients Mark.Shannon, christian.heimes, inada.naoki, jdemeyer, lukasz.langa, miss-islington, nascheme, pablogsal, petr.viktorin, pitrou, tim.peters, vstinner
Date 2019-09-27.20:51:48
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1569617509.39.0.0878980170276.issue38006@roundup.psfhosted.org>
In-reply-to
Content
A few comments from the "peanut gallery".  Thanks to Victor and Pablo for doing the hard work of investigating this bug.

First, it has been a long time since Tim and I first developed gcmodule.c.  So, some of my understanding may be inaccurate due to code changes or old age memory loss. ;-P

If I understand Victor's test case correctly, the problem is caused if you have an extension type that implements tp_traverse but not tp_clear and that there is also a weakref involved in the trash cycle.  In my original design of the GC, not implementing tp_clear should have been okay.  It would mean that the GC might find garbage reference cycles that it couldn't cleanup.  Those would leak memory but would not crash Python.

I'm not sure what has changed since to require that tp_clear actually successfully clears the cycle.  Tim was the origin of the code that handles weakrefs. The GC is (relatively) simple if not for handling weakrefs and finalizers.  Tim did almost all of the difficult and brain exploding work on that stuff.  There was a big re-org some years ago to handle "legacy finalizers" and PEP 442 finalizers.  That made things more complicated yet.  Maybe the requirement for a working tp_clear came into existence then?  I added Tim to the nosy list since he might have insight.

To me, it seems problematic that we would require a type to have a tp_clear method that actually breaks cycles.  For mutable objects like dicts and lists, we can have tp_clear do its thing while leaving the object in a valid state.  The tp_clear added for functions was not like that because other code expected structure fields to be non-NULL.  At least that's my understanding.

Is the behavior of tp_clear the key to this bug?  In the original design GC, calling tp_clear was not guaranteed to cause objects in the cylce to be freed.  So, after tp_clear, the object still needs to be in some kind of valid state.  That is one of the reasons why the tuple type doesn't have a tp_clear (also that they can't normally be directly be used to create cycles).

One last observation: the reference cycle GC is now being relied upon in ways that were never originally envisioned.  Originally the cyclic GC was an optional feature and you could build Python without it.  Reference cycles were supposed to exist in rare cases and only be created by user code.  In that world, having tp_clear on the function type is be unneeded.  Now, CPython is creating reference cycles itself and the GC is being relied on much more to free memory.  I'm afraid we are evolving into a complicated and low performance GC (refcount + cycle collection).  I don't have any good suggestions on a better approach.  Maintaining the C-API ties us to reference counting.  I mention it because it might be helpful to have a high-level view of the evolution of this part of CPython.
History
Date User Action Args
2019-09-27 20:51:49naschemesetrecipients: + nascheme, tim.peters, pitrou, vstinner, christian.heimes, petr.viktorin, inada.naoki, lukasz.langa, Mark.Shannon, jdemeyer, pablogsal, miss-islington
2019-09-27 20:51:49naschemesetmessageid: <1569617509.39.0.0878980170276.issue38006@roundup.psfhosted.org>
2019-09-27 20:51:49naschemelinkissue38006 messages
2019-09-27 20:51:48naschemecreate