classification
Title: imprecise handling of weakref callbacks
Type: behavior Stage: resolved
Components: Interpreter Core Versions: Python 3.7
process
Status: closed Resolution: postponed
Dependencies: Superseder:
Assigned To: Nosy List: konrad.schwarz, mark.dickinson, pitrou, tim.peters
Priority: normal Keywords:

Created on 2021-03-03 04:48 by konrad.schwarz, last changed 2021-03-04 13:10 by konrad.schwarz. This issue is now closed.

Messages (7)
msg387999 - (view) Author: Konrad Schwarz (konrad.schwarz) Date: 2021-03-03 04:48
I am seeing the following non-deterministic behavior:

My code processes DeviceTree, a tree-based specification format for hardware descriptions, that includes cross-references ("phandles").  For all intents and purposes, this format is similar to XML; phandles are analog to ID/IDREFS.

To prevent reference cycles and avoid the need for garbage collection, my code uses weakref.proxy for parent pointers and weakref.ref for cross-references.

My goal is to provide a "projection" operation on a DeviceTree: creating derived DeviceTrees that model subsets of the hardware (this is to partition the hardware into multiple independent sub-machines).  The projection is specified by newly introduced nodes and attributes (aka properties) in the tree; phandles are used to indicate which part belongs to which partition.

Python weak references provide a callback interface to indicate the demise of their referents and my code uses that to prune the tree: e.g., if a node modeling a partition is deleted, nodes that reference that node (i.e., indicate they belong to that partition) are deleted in the corresponding weakref callback.  So technically, the code implicitly uses the interpreters list of weak referrers (__weakref__) to find and execute code on them when the referent's state changes.

This works exactly as envisioned when single-stepping in PDB.

When running at full speed however, I see that weak reference callbacks are being triggered after the corresponding weak reference has been deleted with del (the weak reference is a value of a Python dict holding a node's attributes.)

I suspect that this is because of some batching or deferred processing in the Python interpreter.

Ultimately, this is a violation of the semantics and must be classified as a bug.

However, in my case, it would suffice to have a "memory barrier" type of operation that flushes the queue of deferred deletions before continuing.  Something like that must exist, because single stepping in PDB is successful.

Initial tests of calling the garbage collector to this end were inconclusive, unfortunately.
msg388030 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2021-03-03 16:39
If I understand correctly, the reported bug is that you're seeing a weakref callback being executed after the weakref.ref instance it's attached to is deleted. Is that correct?

Do you have a minimal example we can use to reproduce the effect? Without such an example, there's not much of a realistic path to moving this forward.

Without the code: I can't see a plausible mechanism by which the callback could execute after the weakref.ref has ceased to exist.

However, my suspicion is that that's not what's actually happening here. I suspect that when you say "after the corresponding weak reference has been deleted with del", that the "del" statement you refer to is not actually deleting the last reference to the weakref.ref object, so the weakref still exists after the "del". One easy way that this could happen is if the weakref is part of a reference cycle (and I know from personal experience that it's horribly easy to accidentally *create* reference cycles via weakref callbacks, especially if those callbacks refer to instance methods).

Furthermore, if the weakref exists as part of a cycle and that cycle is being collected by the garbage collector, I could see how the callback could be sometimes executed and sometimes not depending on the exact order in which the cycle is cleaned up.
msg388039 - (view) Author: Konrad Schwarz (konrad.schwarz) Date: 2021-03-03 18:31
> If I understand correctly, the reported bug is that you're seeing a weakref callback being executed
> after the weakref.ref instance it's attached to is deleted. Is that correct?

Exactly.  I del what should be the only reference to the weakref.ref/proxy, then del the weakref.ref's referent.
The weakref.ref's callback is executed.

> Do you have a minimal example we can use to reproduce the effect? Without such an example, there's not
> much of a realistic path to moving this forward.

Unfortunately, not at the moment.

> However, my suspicion is that that's not what's actually happening here. I suspect that when you say
> "after the corresponding weak reference has been deleted with del", that the "del" statement you refer
> to is not actually deleting the last reference to the weakref.ref object, so the weakref still exists
> after the "del". One easy way that this could happen is if the weakref is part of a reference cycle
> (and I know from personal experience that it's horribly easy to accidentally *create* reference cycles
> via weakref callbacks, especially if those callbacks refer to instance methods).

I tried to be as punctilious as I could to prevent this sort of thing from happening.
The weird thing is that the error does not occur when I single step the code in PDB.
It must be something internal to CPython, some sort of optimization.
Are local variables/stack frames cleaned lazily?  Is the last accessed value of a
dictionary cached internally, increasing its reference count?

I've worked around the problem by double checking in the callback method that it actually should execute;
i.e. in my case, that the object has not been removed (del) from a dictionary (which should be the
single reference to the object).

If the problem does indeed lie with CPython, but has too much of a performance/maintenance impact,
perhaps a caveat could be added to the documentation to the effect of "The lifetime of an object
may be longer than what a programmer expects causing a weakref's callback to be invoked at
a surprising time; it is good practice to program the callback defensively."
msg388041 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2021-03-03 18:45
Have you tried doing a "gc.get_referrers" on your weakref, to verify that there aren't any references that you don't expect to be there?

Another diagnostic would be to run with cyclic garbage collection disabled (via gc.disable). If you're seeing different behaviour with gc disabled versus enabled, that would at least confirm the presence of reference cycles.

> Are local variables/stack frames cleaned lazily? Is the last accessed
> value of a dictionary cached internally, increasing its reference count?

My knowledge of current Python internals isn't as deep as it used to be, but I'd be quite surprised if either of these were true.
msg388042 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2021-03-03 18:49
One more thing: I'm assuming that everything in sight is single-threaded? With more than one Python thread (even if that thread is auxiliary to anything you're doing - e.g., if you're working at an IPython prompt, there will likely be at least a couple of extra Python threads around), there's always the exciting possibility that the cyclic garbage collection can end up running on an essentially arbitrary Python thread; it would be good to rule that possibility out.
msg388058 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2021-03-03 21:07
This won't go anywhere without code (preferably minimal) we can run to reproduce the complaint. If there were a "general principle" at work here, someone else would surely have reported it over the last few decades ;-)

To the contrary, the common confusion is in the _other_ direction: a weakref callback _not_ getting invoked when a programmer thinks it "should be". The cause for that is always the same: the weakref object died before the weakly referenced object died. That was the primary motivation for introducing `weakref.finalize()`.

CPython does not, in general, "batch" (or in any other way delay) object destruction. An object is destroyed - immediately - when its refcount falls to 0. In a technical sense, that's also true in the case of cycles (in which case the gc module artificially drives refcounts to 0, based on liveness and reachability analysis). With very few exceptions, neither does it hang on to "hidden" references. The primary exception to that is in interactive shells, where the module-global identifier "_" is typically bound, by magic, to the object most recently displayed. In the case of exceptions, it's also possible for programs to accidentally hang on to the exception object, from which the entire chain of stack frames back to the source of the exception can be reached.

So, based on what you said, this is the best attempt I can make:

    import sys, weakref

    class C:
        def __del__(self):
            print("__del__ called")

    def wrcb(x):
        print("weakref callback called")

    c = C()
    d = {"x" : weakref.ref(c, wrcb)}
    print(sys.getrefcount(d["x"]))

    #del d["x"]
    del c

which displays:

    2
    __del__ called
    weakref callback called

Note the 2! If the moral equivalent in your code displays a number larger than 2, then there are more references to the weakref object than just as a dict value (that's one; the other reference comes from temporarily incrementing the refcount of `d["x"]` to pass it as an argument to `getrefcount()`).

If I uncomment the penultimate line, to destroy the weakref before `c` is destroyed, the output changes to:

    2
    __del__ called

So it's all as expected. Can you change that code to demonstrate your case?
msg388083 - (view) Author: Konrad Schwarz (konrad.schwarz) Date: 2021-03-04 13:09
Unfortunately, my management has impressed other priorities upon me; I can't delve deeper into this subject at the moment.

My takeaway is that the error very likely lies on my side; maybe I need to re-check local variables and del them explicitly.  In any case, I can work around the situation by being extra careful in the callback.  I don't know how well all of this would work in a truly multi-threaded environment, but the application doesn't require that at all.

In any case, thank you for the insightful comments and the willingness to understand my problem!  I certainly have learned a lot.
History
Date User Action Args
2021-03-04 13:10:31konrad.schwarzsetstatus: open -> closed
resolution: postponed
stage: resolved
2021-03-04 13:09:54konrad.schwarzsetmessages: + msg388083
2021-03-03 21:07:23tim.peterssetnosy: + tim.peters
messages: + msg388058
2021-03-03 18:49:46mark.dickinsonsetmessages: + msg388042
2021-03-03 18:45:20mark.dickinsonsetmessages: + msg388041
2021-03-03 18:31:01konrad.schwarzsetmessages: + msg388039
2021-03-03 16:39:26mark.dickinsonsetnosy: + mark.dickinson, pitrou
messages: + msg388030
2021-03-03 04:48:00konrad.schwarzcreate