This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Title: Memory fault on complex weakref/weakkeydict delete
Type: Stage:
Components: Interpreter Core Versions: Python 2.2
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: gvanrossum Nosy List: gvanrossum, jacobs99, jepler, mcfletch, nnorwitz, tim.peters
Priority: normal Keywords:

Created on 2003-05-24 22:29 by mcfletch, last changed 2022-04-10 16:08 by admin. This issue is now closed.

File name Uploaded Description Edit mcfletch, 2003-05-24 22:29 Cache implementation file. mcfletch, 2003-05-24 22:30 Driver script for test-case nnorwitz, 2003-05-24 23:31 updated test to be minimal tim.peters, 2003-05-28 23:26 an even smaller test case
Messages (15)
msg16138 - (view) Author: Mike C. Fletcher (mcfletch) Date: 2003-05-24 22:29
Attached find two modules which together form a
test-case.  The file is ripped out of a
production system (OpenGLContext), and I am seeing
memory faults under both Python 2.2.2 and 2.2.3 when I
run the code.  Under 2.2.2 while single-stepping
through the code I was able to provoke an error-message:

Fatal Python error: GC object already in linked list

The error message doesn't show up under 2.2.3, but the
memory-fault does.

Modules here don't use any extension modules, so there
shouldn't be any loose memory references or the like. 
Note, you'll likely need to make weakkeydictionary's
__delitem__ use keys instead of iterkeys to even get to
the crashing.
msg16139 - (view) Author: Neal Norwitz (nnorwitz) * (Python committer) Date: 2003-05-24 23:31
Logged In: YES 

I cut out a lot of stuff from the test.  The new file is
much more minimal, but still crashes for me in a 2.3 debug
build.  You only need the one file too (not both files).

There is an issue with new style classes.  If Items doesn't
derive from object, I don't get a crash.
msg16140 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2003-05-25 02:53
Logged In: YES 

Outstanding, Neal -- thanks!  I can confirm that this crashes 
in a current 2.3 debug build on Windows too.  I'm feeling 
sick and won't pursue it now, though.  When cleaning up 
from the call to makeSome() (the body of makeSome() has 
completed, and we're cleaning up its execution frame, 
decref'ing the locals), we're dying in _Py_ForgetReference 
with a NULL-pointer derefernce.  The refcount on an Items 
object has just fallen to 0, and the code is trying to verify 
that the object is in the debug-build "list of all objects".  But 
its prev and next pointers are both NULL -- it's not in the 
list, and simply trying to check that it is blows up.

I don't have a theory for what's causing this, but it's 
probably not a good thing <heh>.
msg16141 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2003-05-25 04:49
Logged In: YES 

Assigned to Guido, because I suspect the problem lies in 
the order subtype_dealloc does things.

With reference to Neal's whittled-down case:  when 
makeSome() ends, we decref i and then decref item.  item's 
refcount hits 0 then.  There's a weakref remaining to item 
(in CacheHolder.client), but subtype_dealloc doesn't clear 
the weakref at this point.  First it clears item's instance 
dict.  That contains the last strong reference to i.  
subtype_dealloc is inovked again, and clears i's instance 
dict, and then deals with the weak reference to i.  The 
weakref to i has a callback associated with it, and 
CacheHolder.__call__() is invoked.  That invokes self.client
(), still a weakref to item, and because item's weakrefs still 
haven't been dealt with, self.client() returns item.

Now we're hosed.  item *had* a refcount of 0 at this point, 
and is still in the process of getting cleaned out by the first 
call to subtype_dealloc (it still thinks it's clearing item's 
instance dict).  We already called _Py_ForgetReference on 
item when its refcount fell to 0.  Its refcount gets boosted 
back to 1 by virtue of item getting returned by the 
self.client() weakref call.  Cleaning out the frame for 
CacheHolder.__call__() knocks the refcount down to 0 
again, and the second attempt to call _Py_ForgetReference 
on it blows up.

In a release build, nothing visibly bad happens when I try 
it.  It crashes if I add

        print client.items

at the end of __call__ in a release-build run, though.  Looks 
like that's just the luck of the draw after things have gone 

I note that instance_dealloc deals with weakrefs much 
earlier in the process, so that when Neal makes Items a 
classic class instead, the eventual call to self.client() 
returns None (instead of items), and nothing bad happens.. 
msg16142 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2003-05-28 22:16
Logged In: YES 

Tim, let's look at this when you're back in the office. My
head spins from just reading the analysis below.

Note that this is a 2.2 and 2.3 bug. I don't necessarily
want to hold up the 2.2.3 release until this is fixed,
unless we have a quick breakthrough.
msg16143 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2003-05-28 23:26
Logged In: YES 

That's cool.  The analysis is much easier to follow if you run 
Neal's little program, which crashes right away in a debug 
build.  Then the analysis just follows the call stack, and it's 
instantly obvious (if you know to look for it <wink>) that 
we're trying to deallocate a single object twice.

One thing subtype_delloc does that instance_dealloc 
doesn't is clear the instance dict before clearing the 
weakrefs.  Clearing the instance dict can end up executing 
anything, and in this case does, in particular materializing a 
strong reference to the object we're in the middle of 
deallocating (via a weakref that hasn't been cleared).  That 
seems to be the key point.

Mike found the problem in 2.2.2, I believe in his real-life 
OpenGL code.  That's why I'm keen to see it fixed for 2.2.3.

I'll be in the office Thursday, BTW.

Ah, here, I'll attach a simpler program that has the same 
kind of problem (
msg16144 - (view) Author: Mike C. Fletcher (mcfletch) Date: 2003-05-29 01:16
Logged In: YES 

To give timeline: This is from real-life code (PyOpenGL's
OpenGLContext demo/teaching module).  It's currently about
to go beta, with a release date target of end-of-summer. 
I've worked around the problem for the most common case w/in
the system (exiting from the VRML browser), but that
work-around's not usable for end-user long-running projects
until the users can deallocate the scenegraph to load
another (i.e. go from world to world within a single run of
the application).
msg16145 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2003-05-29 14:35
Logged In: YES 

Thanks to Tim for the analysis, to Neal for the simplified
test, and to Mike for the bug report! The fix was actually
simple: clear the weak ref list *before* calling __del__ or
clearing __dict__. This is the same order as used by classic
classes, so can't be wrong. Applied to 2.2.3 branch and 2.3
trunk, with test case.
msg16146 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2003-05-29 14:53
Logged In: YES 

Just changed Resolution to Fixed.  Thanks everyone!
msg16147 - (view) Author: Jeff Epler (jepler) Date: 2003-05-29 20:47
Logged In: YES 

Just one question: does the final test-case cover the
original bug?  I have a redhat 9 system with Python 2.2.2,
and runs just fine:
$ python2 
<__main__.Oops object at 0x8162ecc>

The Python version is:
Python 2.2.2 (#1, Feb 24 2003, 19:13:11) 
[GCC 3.2.2 20030222 (Red Hat Linux 3.2.2-4)] on linux2

I have not tried any other version of the test-case.  I also
didn't try a non-redhat version, so it's barely possible
that some patch they apply affected this issue.
msg16148 - (view) Author: Neal Norwitz (nnorwitz) * (Python committer) Date: 2003-05-29 21:35
Logged In: YES 

2.3 works for me with the original, but didn't before. 
Redhat 9.
msg16149 - (view) Author: Mike C. Fletcher (mcfletch) Date: 2003-05-31 19:21
Logged In: YES 

2.2.3 fixes the original problem on Win2k.
msg16150 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2003-06-01 05:21
Logged In: YES 

Which service pack <wink>?  Thanks for reporting this in 
time for 2.2.3, Mike.
msg16151 - (view) Author: Kevin Jacobs (jacobs99) Date: 2003-06-10 17:05
Logged In: YES 

The fix checked in to solve this problem causes a potentially 
more serious one in Bug 751998.  In short, it seems that 
data descriptors become unavailable in __del__  methods,
preventing finalization from completing.  I have yet to figure 
out why, though I have isolated the problem to revision 2.234
of typeobject.c.  I've not looked to see if this also affects
Python 2.2.3, though I wouldn't be too suprised if it does.
msg16152 - (view) Author: Kevin Jacobs (jacobs99) Date: 2003-06-10 18:25
Logged In: YES 

See patch attached to bug 751998 for an alternative,
friendlier fix for this bug.
Date User Action Args
2022-04-10 16:08:54adminsetgithub: 38547
2003-05-24 22:29:35mcfletchcreate