classification
Title: weakref callback vs gc vs threads
Type: Stage:
Components: Interpreter Core Versions:
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: tim.peters Nosy List: nascheme, tim.peters
Priority: critical Keywords:

Created on 2004-10-28 01:58 by tim.peters, last changed 2004-10-30 23:13 by tim.peters. This issue is now closed.

Files
File name Uploaded Description Edit
temp2a.py tim.peters, 2004-10-28 01:58 ZODB 3.4a1 reduced to 60 lines
temp2b.py tim.peters, 2004-10-28 02:28 2 objects and 1 thread
temp2c.py tim.peters, 2004-10-28 08:06 This hurts. A lot.
temp2d.py tim.peters, 2004-10-28 08:56 __del__ hosage; no weakrefs with callbacks
patch.txt tim.peters, 2004-10-29 06:54 First stab at a fix
patch-n1.txt nascheme, 2004-10-29 17:00 Another stab
patch2.txt tim.peters, 2004-10-30 03:23 Reworked patch.txt; no issues known
patch3.txt tim.peters, 2004-10-30 07:55 patch2.txt + comment typo repair
patch3-1.txt nascheme, 2004-10-30 15:04 patch3.txt + Neil's fixes
patch4.txt nascheme, 2004-10-30 15:27
patch42.txt tim.peters, 2004-10-30 22:49 Possibly the end of this -- for this year
Messages (25)
msg22895 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2004-10-28 01:58
Oh oh.  It's that time of year again.  I'm sure the 
attached (temp2a.py) can be simplified, perhaps down 
to two objects and one thread.  As is, I *think* it 
demonstrates that invoking a weakref callback can do 
fatal damage, not necessarily because of what the 
callback does, but because simply calling it while gc is 
running can allow other threads to run during gc too, 
and resurrect a piece of cyclic trash T that's already 
been tp_clear()'ed, via invoking a still-living weakref to 
T.

If so, this isn't new in 2.4.  It's a real problem since 
temp2a.py is what's left of ZODB 3.4a1 <wink>.
msg22896 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2004-10-28 02:28
Logged In: YES 
user_id=31435

Yup, temp2b.py probably shows the same problem, with 2 
objects and one thread.  This one is definitely more strained, 
though, since the weakref callback does the damage 
directly.  In temp2a.py, nothing is trying to fool anything, and 
the only damage done by the wr callbacks there is simply in 
releasing the GIL (thus allowing other threads to do perfectly 
ordinary things with weakrefs).
msg22897 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2004-10-28 02:37
Logged In: YES 
user_id=31435

Noting that temp2b.py fails in current CVS, 2.3.4, and 2.2.3.  
That's all the Pythons I have handy right now.
msg22898 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2004-10-28 08:06
Logged In: YES 
user_id=31435

temp2c.py may be as bad as it gets.  It shows that the 
problem can occur on a gc collection that doesn't see *any* 
object that has a weakref with a callback.
msg22899 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2004-10-28 08:56
Logged In: YES 
user_id=31435

temp2d.py shows that we don't need any weakrefs with 
callbacks to get in trouble -- a __del__ method can hose us 
too.
msg22900 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2004-10-28 20:55
Logged In: YES 
user_id=31435

A fix is in progress, as sketched on Python-Dev.  I expect it 
to land Friday (or tonight yet, if I'm lucky).
msg22901 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2004-10-29 06:54
Logged In: YES 
user_id=31435

Neil, can you make some time to eyeball this (patch.txt)?  If 
not, please assign to Fred.  The Python -uall suite passes in 
debug and release builds.

I just ran the -all Zope3 test suite in a debug buld, and that 
passes as well as it does with Python 2.3.4 on Windows, but 
I see many thousands of lines like:

Exception exceptions.KeyError: <weakref at 104EA620; dead> 
in <function remove at 0x01595560> ignored

So there's something still wrong here.  I'm not sure who's 
writing those msgs; I expect it's in the guts of gc, when 
invoking delayed weakref callbacks, and triggered by the 
weak dictionary implementations (which haven't changed).
msg22902 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2004-10-29 09:11
Logged In: YES 
user_id=31435

I strongly suspect that abusing the weakref struct's `hash` 
member is responsible for the gazillions of KeyErrors.  Dang.  I 
need one lousy bit ...
msg22903 - (view) Author: Neil Schemenauer (nascheme) * (Python committer) Date: 2004-10-29 14:56
Logged In: YES 
user_id=35752

I don't see why you need the extra bit.  When you find
weakly referenced objects, clear the wr_object pointer of
the weakref.  Move trash weakref objects with callbacks to
the wr_callbacks list as we always did.  When the trash goes
away then PyObject_ClearWeakRefs() will invoke the callbacks
normally (we only cleared the wr_object pointer, the
backpointer list is still intact).

I'm going to see if I can make this work.  In the process I
will probably discover what I have been missing. :-)
msg22904 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2004-10-29 15:18
Logged In: YES 
user_id=31435

Excellent!  I think you're right about this.  It never occurred 
to me that just setting wr_object to None would be as 
effective at disabling a weakref as calling clear_weakref().  If 
that's really so (& I can't see why not offhand), it would be 
better in oh-so-many ways.
msg22905 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2004-10-29 16:01
Logged In: YES 
user_id=31435

Hmm.  Just about anything fixes the current rash of bugs.  
It's keeping the old bugs from coming back that's been the 
hangup -- don't forget to run test_weakref.py too <wink>.  
The last batch of bugs was really much subtler than this 
batch!

test_callback_in_cycle_1 is a bitch -- I think we have to stop 
its callback from ever getting invoked, not just prevent I.wr() 
from returning J ...
msg22906 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2004-10-29 16:41
Logged In: YES 
user_id=31435

Bingo.  This is a bit delicate <heh>.  It's still necessary to call 
_PyWeakref_ClearRef() on a trash weakref with a callback, to 
prevent the callback from ever triggering (that was the key 
to fixing the previous pile of bugs).  For all other weakrefs to 
trash, I think you're right that just setting wr_object to None 
is conceptually enough.  But ...

What I pissed away the last half hour discovering is that if 
you set wr_object to None *before* calling 
_PyWeakref_ClearRef(), then the latter doesn't do anything, 
because clear_weakref() doesn't do anything when wr_object 
is None.

So that leaves me a little worried:  if we just set wr_object 
to None on some weakrefs, then PyObject_ClearWeakrefs() 
will never (and for the same reason) remove such a weakref 
from its doubly-linked list either.  Doesn't look like the 
weakref code intended that this be possible, and I'm not yet 
entirely convinced it can't hurt ...
msg22907 - (view) Author: Neil Schemenauer (nascheme) * (Python committer) Date: 2004-10-29 17:00
Logged In: YES 
user_id=35752

I had to change _PyWeakref_ClearRef() since it was also
clearing the weakref list of the trash object.  Now it just
sets wr_object to Py_None.  I also made some serious
simplifications to gcmodule by just treating trash weakref
objects with callbacks the same as objects with __del__
methods (i.e. move them to the finalizers list and then do
the transitive closure of that set).

I'm attaching a patch.  It's a work in progress.
msg22908 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2004-10-29 17:13
Logged In: YES 
user_id=31435

> I had to change _PyWeakref_ClearRef() since it was also
> clearing the weakref list of the trash object.

That was really its *purpose*.  If a trash weakref with a 
callback isn't removed from the referent's list of weakrefs, 
then the callback will trigger when PyObject_ClearWeakRefs() 
is invoked on the referent.  The purpose of 
_PyWeakref_ClearRef() was to ensure that the callback never 
triggers.

> Now it just sets wr_object to Py_None.

That won't stop the callback from triggering.  It also means 
(see earlier comment) that PyObject_ClearWeakRefs() will 
never removed the weakref from the list either, although I'm 
not sure that does real harm.

> I also made some serious simplifications to gcmodule by
> just treating trash weakref objects with callbacks the same
> as objects with __del__ methods (i.e. move them to the
> finalizers list and then do the transitive closure of that set).

Does that mean they can end up in gc.garbage too?  If so, I 
don't think that's sellable.  See the end of gc_weakref.txt:

    An alternative would have been to treat objects with
    callbacks like objects with __del__ methods, refusing to
    collect them, appending them to gc.garbage instead.
    That would have been much easier.  Jim Fulton gave a
    strong argument against that (on Python-Dev):
    ...

> I'm attaching a patch.  It's a work in progress.

Aren't we all <wink>.
msg22909 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2004-10-30 03:23
Logged In: YES 
user_id=31435

patch2.txt stops abusing weakref `hash`, and seems to have 
no issues w/ Python's or Zope3's -uall/--all tests, in debug or 
release builds.  weakrefs are handled entirely in 
move_troublemakers() now; collect() doesn't even have its 
old wr_callbacks list anymore.

If there's a more elegant approach coming, great, but if it's 
not coming soon I'm going to check this in, to get as much 
developer test coverage as possible before 2.4.b1.
msg22910 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2004-10-30 07:55
Logged In: YES 
user_id=31435

patch3.txt repairs some typos in comments; semantically, it's 
the same as patch2.txt.
msg22911 - (view) Author: Neil Schemenauer (nascheme) * (Python committer) Date: 2004-10-30 15:04
Logged In: YES 
user_id=35752

I think your patch is about as elegant as it can be.  The
approach looks sound but I believe you forgot to clear the
wr_callback pointer.  I'm attaching an incremental patch
(relative to patch3.txt).

Regarding gc_weakref.txt, I think we should add a note at
the top explaining the critical rule.  I.e. after we can
computed the unreachable set (aka trash), it's okay to
invoke code that can access non-trash but we cannot allow
code to run that can reference trash.

There is only one minor issue that I can think of.  We
decide which callbacks to invoke inside
move_troublemakers().  It's possible that unreachable
weakrefs become reachable when move_finalizer_reachable() is
called (i.e. weakrefs that would be trash except that they
got revived because they were referenced by an object with a
__del__ method).  However, since those weakrefs would have
been trash I think it's okay not to invoke their callbacks.
msg22912 - (view) Author: Neil Schemenauer (nascheme) * (Python committer) Date: 2004-10-30 15:27
Logged In: YES 
user_id=35752

Attaching another patch.  This one splits
move_troublemakers() into move_finalizers() and
handle_weakrefs().  The code is a little easier to
understand and arguably more correct.  The down side is that
it adds another pass over `unreachable`.
msg22913 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2004-10-30 17:20
Logged In: YES 
user_id=31435

> The approach looks sound but I believe you forgot to clear
> the wr_callback pointer.  I'm attaching an incremental patch
> (relative to patch3.txt).

Some of the comments in that incremental patch are good 
additions, but I don't agree with the code changes.  A 
weakref object owns a (strong) reference to its wr_callback, 
and the first pass gives move_troublemakers its own new 
strong reference to each weakref object.  Because of the 
latter, nothing any callback does can cause any of these 
weakrefs to go away, so nothing can happen to cause their 
wr_callback slots to become NULL either.  The assertion that 
wr_callback isn't NULL is appropriate.  If you still think it's 
possible for wr_callback to become NULL, please concoct a 
test showing it happen.

The passes after the first give up move_troublemaker's 
references to the weakref objects.  Then they'll go away, 
when their refcounts hit 0.  Sometimes that happens right 
away, sometimes not.  In either case, it's weakref_dealloc's 
job to decrement the refcount on w->wr_callback when w 
goes away, and there's no point (that I can see) to 
duplicating that logic too inside gcmodule.

Note that the comments before _PyWeakref_ClearRef() 
explain that the weakref's tp_dealloc is expected to finish the 
job (clear the wr_callback slot) -- that's been true since 
_PyWeakref_ClearRef() was first introduced.
msg22914 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2004-10-30 17:35
Logged In: YES 
user_id=31435

> Regarding gc_weakref.txt, I think we should add a note at
> the top explaining the critical rule. 

An excellent idea -- I'll do that.
msg22915 - (view) Author: Neil Schemenauer (nascheme) * (Python committer) Date: 2004-10-30 17:51
Logged In: YES 
user_id=35752

Ah,the detail that I was missing was that wr_callback is not
invoked if the weakref list has been cleared.  I.e., calling
_PyWeakref_ClearRef() is enough ensure that the callback is
never implicitly envoked.  That's cleaner than explicitly
clearing the wr_callback pointer and also seems to a
behavior that can be depended on (it's pretty fundamental to
how weakrefs are implemented).

patch3.txt looks like a winner.
msg22916 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2004-10-30 18:28
Logged In: YES 
user_id=31435

> Ah,the detail that I was missing was that wr_callback is not
> invoked if the weakref list has been cleared.  I.e., calling
> _PyWeakref_ClearRef() is enough ensure that the callback is
> never implicitly envoked

Indeed, ensuring that the callback never triggers by magic is 
a primary *purpose* of _PyWeakref_ClearRef().

I should add something here:  this is so delicate in part 
because I didn't (during the *last* round of bug fixes) want 
to allocate any heap space during gc -- if malloc couldn't find 
more heap space when gc needed it, then the only thing I 
could do is kill the Python process.

So the deeper and somewhat ugly truth here is that gc is (re)
using the (already-allocated) weakref object as a container 
*just* to remember its wr_callback pointer.  It would have 
been easier to understand if, instead, gc could have saved 
away the wr_callback pointers in its own PyList object, and 
used weakrefobject.c's clear_weakref() instead.  Alas, there's 
no limit on how big such a list may need to become.

As things turned out, it's a very good thing I couldn't do 
that.  Fred subsequently made weakrefs subclassable, and 
now there are actual uses for passing the weakref object to 
the callback (e.g., weak value dictionaries use a weakref 
subclass now, and an instance of this subclass has the 
associated dict key as an attribute), so now a weakref object 
has to be kept around as long as its callback may still get 
invoked.
msg22917 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2004-10-30 18:47
Logged In: YES 
user_id=31435

> There is only one minor issue that I can think of.  We
> decide which callbacks to invoke inside
> move_troublemakers().  It's possible that unreachable
> weakrefs become reachable when
> move_finalizer_reachable() is called (i.e. weakrefs that
> would be trash except that they got revived because they
> were referenced by an object with a __del__ method).
>
> However, since those weakrefs would have been trash I
> think it's okay not to invoke their callbacks.

I sure agree there are grey areas here!  Since I hate __del__ 
methods, I'm afraid I tend to have a "oh, who gives a shit?" 
attitude toward them.

There are also grey areas here outside of cyclic gc.  Take a 
look at subtype_dealloc:  that calls PyObject_ClearWeakRefs
() before calling a __del__ method, so invokes weakref 
callbacks even if the object doesn't actually go away (the 
__del__ method could still resurrect it).  Is that right or 
wrong?  Beats me.

I need to think more about this.  Offhand I'm inclined to agree 
that your reworking to crawl over unreachable twice is the 
more principled approach.  I only cringe because it *is* 
twice, and so probably a little slower.

Can you write a test where it makes a visible difference?
msg22918 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2004-10-30 22:49
Logged In: YES 
user_id=31435

patch42.txt is what I'll check in (and backport to 2.3 maint) 
provided my remaining tests all pass.  This does two passes 
over `unreachable` as in Neil's patch4.txt, and incorporates 
various improvements to the comments.
msg22919 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2004-10-30 23:13
Logged In: YES 
user_id=31435

Thanks for the eyeballs and brain cells, Neil!

patch42 was checked in:

Include/weakrefobject.h 1.6
Lib/test/test_gc.py 1.31
Misc/NEWS 1.1179
Modules/gc_weakref.txt 2.2
Modules/gcmodule.c 2.77
Objects/weakrefobject.c 1.20

I'll backport to 2.3 next, and that's it, so closing this report.
History
Date User Action Args
2004-10-28 01:58:53tim.peterscreate