classification
Title: test_multiprocessing dictionary changed size errors and hang
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.2
process
Status: closed Resolution: fixed
Dependencies: 7105 7318 Superseder:
Assigned To: jnoller Nosy List: asksol, benjamin.peterson, gregory.p.smith, jnoller, jon, kristjan.jonsson, pitrou, r.david.murray, vstinner
Priority: normal Keywords:

Created on 2009-10-05 01:40 by r.david.murray, last changed 2018-06-27 21:30 by vstinner. This issue is now closed.

Files
File name Uploaded Description Edit
test_multiprocessing.log r.david.murray, 2009-10-05 01:40
Messages (30)
msg93570 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2009-10-05 01:40
test_multiprocessing is producing tracebacks and hanging on py3k trunk.
 I think this started happening fairly recently, as it seemed to work in
one not-too-old checkout until I did an svn up and then it started
failing.  3.1 seems fine.  I've attached the tracebacks, including the
tracebacks I get when I ctrl-C to break the hang.  My system is Gentoo
Linux.
msg93571 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2009-10-05 02:14
r75241 seems to be the cause.  Tests passes on r75240, but gives the
dict error and hang on r75241.
msg93601 - (view) Author: Jesse Noller (jnoller) * (Python committer) Date: 2009-10-05 13:18
Adding gps - I haven't had a chance to grok the changes, and me getting to 
a fix is going to take a few.
msg93605 - (view) Author: Jesse Noller (jnoller) * (Python committer) Date: 2009-10-05 15:16
It's passing for me on trunk/py3k at least on OS/X - I'm going to need to 
dredge up a linuxvm
msg93815 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2009-10-10 00:55
Benjamin tried reverting the weakref patch, but that didn't fix it. 
Turns out the problem is the other patch in that merge, that adds saving
of the exception to AssertRaises.  Adding Kristjan as that was his patch.
(The tests pass with the weakref patch applied and the unittest patch
reverted, by the way.)
msg93825 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2009-10-10 11:04
Interesting.  I don't have a linux machine to debug this.  My patch 
appears harmless enough.  We are only storing an exception _object_, not 
any tracebacks or such.
If this were happening on my windows machine I would put in breakpoints 
where the ignored assertion is being raised and find out what's going on.
msg93854 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-10-11 12:03
It was a bad idea to revert the change, because we may now forget about
the problem rather than fix it.

It is likely that the "dictionary changed size during iteration" occurs
because a GC collection gets triggered during the iteration on the
weakdict. A GC collection can destroy some objects in that case because
keeping an exception around (which the unittest change does) usually
creates reference cycles through the frame's locals.

It seems that WeakValueDictionary.items() isn't very robust in py3k ;)
We probably need to add list-returning variants, e.g. a listitems()
method (and, probably, listkeys() and listvalues() too)?
msg93855 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2009-10-11 12:06
The unittest only keeps an exception _object_ around, not the associated 
traceback.  There should be no frames and thus reference cycles associated 
with this.
I agree that the problem should be tackled, rather than swept under the 
carpet like a revert does.
msg93856 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-10-11 12:14
> The unittest only keeps an exception _object_ around, not the associated 
> traceback.

In py3k, the traceback is attached to the exception (on the
__traceback__ attribute) ;)
msg93857 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2009-10-11 12:32
I disagree about reverting it. We have this bug report, and that change
is sitting on the merge queue again...
msg93859 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2009-10-11 14:08
I didn't realize that the traceback was attached to the exception object 
in py3k.  This makes the use of such objects more dangerous because of the  
circular references.  The recommended practice of
exc_type, exc_obj = sys.exc_info()[:2]
is precisely to avoid any "accidents" with the tracebacks.
See also http://mail.python.org/pipermail/python-dev/2005-
August/055251.html
So, can one just clear the __traceback__ member?  That would make sense 
for the unittests, but it would make this manifestation of a GC bug go 
into hiding again.
msg94413 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-10-24 12:59
Kristján:
> So, can one just clear the __traceback__ member?

Yes, or use `exc_value.with_traceback(None)`.
msg95173 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2009-11-13 02:37
If this is not ported by the time we get to a 2.7 or 3.2 release, I'd
like to back it out.
msg95175 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2009-11-13 09:16
I can easily port it again, and remove the traceback from the exception.
But won't we be sweeping that GC problem under the carpet then?
msg95178 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-11-13 15:04
Please note there's a patch in issue7105 to allow safer iteration of
weak dicts.
msg95182 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2009-11-13 16:10
Ok, I ported the change again in revision 76238, removing the traceback 
from the stored exception object.
msg95183 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-11-13 16:12
> Kristján Valur Jónsson <kristjan@ccpgames.com> added the comment:
> 
> Ok, I ported the change again in revision 76238, removing the traceback 
> from the stored exception object.

"versionchanged" should be 3.2, not 3.1.
Thanks!
msg95184 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2009-11-13 16:16
Thanks. Fixed in revision 76239.
msg95188 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-11-13 17:04
I don't know if it's your last change, but test_multiprocessing is now
hanging on all buildbots.
msg95194 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2009-11-13 17:35
It is.  TestZZZNumberOfObjects hangs at r76238, and does not hang at
r76237 on my Gentoo Linux box.
msg95195 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-11-13 18:04
It seems the error is different, though. This is what I get after Ctrl-C
(I omit the tracebacks from the other processes):

Traceback (most recent call last):
  File "/home/antoine/py3k/__svn__/Lib/test/test_multiprocessing.py",
line 1071, in test_number_of_objects
    refs = self.manager._number_of_objects()
  File "Z:\py3k\__svn__\lib\multiprocessing\managers.py", line 569, in
_number_of_objects
    conn = self._Client(self._address, authkey=self._authkey)
  File "Z:\py3k\__svn__\lib\multiprocessing\connection.py", line 134, in
Client
    c = SocketClient(address)
  File "Z:\py3k\__svn__\lib\multiprocessing\connection.py", line 253, in
SocketClient
    s.connect(address)
socket.error: [Errno 2] No such file or directory
msg95196 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-11-13 19:04
Ok, the problem is that a finalizer calls _decref which tries to open an
XMLRPC connection (!) to an address on which nobody listens anymore.
Then stupid SocketClient (in multiprocessing.connection) loops endlessly
because it is designed to do so when it gets ECONNREFUSED.

(when you Ctrl-C, you don't see all this because it happens in a
finalizer, and only a terse message is displayed)

So there are two problems:
- why does test_multiprocessing think it should open an XMLRPC
connection to a closed endpoint
- why doesn't SocketClient have some kind of timeout rather than looping
stupidly
msg95203 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2009-11-13 21:39
Yes, this is quite different.
After seeing this, I created issue 7314.
I didn't realize that it had anything to do with my patch but rather 
blamed it on my brand new windows 7.
msg95204 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-11-13 21:41
> Yes, this is quite different.
> After seeing this, I created issue 7314.
> I didn't realize that it had anything to do with my patch but rather 
> blamed it on my brand new windows 7.

I don't think it has anything to do with your patch. While diagnosing
the problem, I noticed that adding some random print() lines would
sometimes make the problem disappear. So it is really timing-dependent
and your patch happens to trigger the right (or wrong ;-)) timing.
msg95205 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2009-11-13 21:42
See my analysis from issue 7314:
The other processes are all waiting to ReadFile().  Maybe they have yet 
to run to the point where they open the socket to which the master 
process is trying to connect to.

The mystery is also: Why does the harmless act of setting an instance 
attribute on a context manager (and without any traceback) trigger this 
behaviour?
msg95206 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2009-11-13 21:44
Ah, okay then.  Let's focus on solving 7414 then.
msg95372 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-11-17 10:07
I don't think this is a blocker anymore. Kristján's patch has been
ported and the test_multiprocessing freezes have been "fixed" by adding
a timeout.
As for the original purpose of this ticket, issue7105 is still (part
of?) the answer.
msg203249 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-11-18 08:13
I never seen this issue, can we close it?

#7105 has been fixed in Python 3, not in Python 2.7.
msg205226 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-12-04 15:09
Okay, let's say it is fixed. Adding Richard to nosy so that he can review the issue if he's interested.
msg320627 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-06-27 21:30
> RuntimeError: dictionary changed size during iteration

I didn't see this bug the last 5 years, so I consider that the issue has been fixed in the meanwhile :-)
History
Date User Action Args
2018-06-27 21:30:55vstinnersetstatus: open -> closed
resolution: out of date -> fixed
messages: + msg320627
2013-12-04 15:09:49pitrousetresolution: out of date
messages: + msg205226
stage: needs patch -> resolved
2013-11-18 08:13:54vstinnersetnosy: + vstinner
messages: + msg203249
2010-08-27 14:13:22asksolsetnosy: + asksol
2009-11-17 10:07:19pitrousetpriority: release blocker -> normal

messages: + msg95372
2009-11-13 21:44:04kristjan.jonssonsetmessages: + msg95206
2009-11-13 21:42:47kristjan.jonssonsetmessages: + msg95205
2009-11-13 21:41:24pitrousetmessages: + msg95204
2009-11-13 21:39:10kristjan.jonssonsetmessages: + msg95203
2009-11-13 19:29:14pitrousetdependencies: + multiprocessing should not wait endlessly
2009-11-13 19:04:46pitrousetmessages: + msg95196
2009-11-13 18:04:33pitrousetmessages: + msg95195
2009-11-13 17:35:39r.david.murraysetmessages: + msg95194
2009-11-13 17:35:22r.david.murraysetmessages: - msg95193
2009-11-13 17:34:57r.david.murraysetmessages: + msg95193
2009-11-13 17:34:35r.david.murraysetmessages: - msg95190
2009-11-13 17:26:16r.david.murraysetmessages: + msg95190
2009-11-13 17:04:40pitrousetmessages: + msg95188
2009-11-13 16:16:30kristjan.jonssonsetmessages: + msg95184
2009-11-13 16:12:57pitrousetmessages: + msg95183
2009-11-13 16:10:58kristjan.jonssonsetmessages: + msg95182
2009-11-13 15:04:44pitrousetmessages: + msg95178
2009-11-13 09:16:22kristjan.jonssonsetmessages: + msg95175
2009-11-13 02:37:21benjamin.petersonsetpriority: normal -> release blocker

messages: + msg95173
2009-10-24 12:59:50pitrousetmessages: + msg94413
2009-10-11 17:32:42jonsetnosy: + jon
2009-10-11 17:29:50pitrousetdependencies: + weak dict iterators are fragile because of unpredictable GC runs
2009-10-11 14:08:33kristjan.jonssonsetmessages: + msg93859
2009-10-11 12:32:10benjamin.petersonsetmessages: + msg93857
2009-10-11 12:14:41pitrousetmessages: + msg93856
2009-10-11 12:06:47kristjan.jonssonsetmessages: + msg93855
2009-10-11 12:03:10pitrousetnosy: + pitrou, benjamin.peterson
messages: + msg93854
2009-10-10 11:04:36kristjan.jonssonsetmessages: + msg93825
2009-10-10 00:55:10r.david.murraysetnosy: + kristjan.jonsson
messages: + msg93815
2009-10-05 15:16:21jnollersetmessages: + msg93605
2009-10-05 13:18:11jnollersetnosy: + gregory.p.smith
messages: + msg93601
2009-10-05 02:14:53r.david.murraysetmessages: + msg93571
2009-10-05 01:41:17r.david.murraysetnosy: jnoller, r.david.murray
components: + Library (Lib)
2009-10-05 01:40:55r.david.murraycreate