classification
Title: multiprocessing module is racy
Type: Stage:
Components: Tests Versions: Python 3.0, Python 2.6
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: jnoller Nosy List: benjamin.peterson, donmez, jnoller, mark.dickinson, roudkerk
Priority: release blocker Keywords: needs review, patch

Created on 2008-07-19 20:00 by donmez, last changed 2008-09-02 19:12 by jnoller. This issue is now closed.

Files
File name Uploaded Description Edit
multibug3.py mark.dickinson, 2008-08-07 21:10 script demonstrating incref bug
mp_nohang_jnoller.patch jnoller, 2008-08-11 18:02
issue3419.patch mark.dickinson, 2008-08-17 13:29 Possible fix for incref bug
Messages (23)
msg70053 - (view) Author: Ismail Donmez (donmez) * Date: 2008-07-19 20:00
Tested on MacOSX 10.5.4, running test_multiprocessing in a tight loop :

[~/Sources/py3k]> while true;do ./python ./Lib/test/regrtest.py 
test_multiprocessing;done
test_multiprocessing
1 test OK.
test_multiprocessing
1 test OK.
test_multiprocessing
1 test OK.
test_multiprocessing
Process Process-48:
Traceback (most recent call last):
  File "/Users/cartman/Sources/py3k/Lib/multiprocessing/managers.py", 
line 720, in _callmethod
    conn = self._tls.connection
AttributeError: 'ForkAwareLocal' object has no attribute 'connection'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/cartman/Sources/py3k/Lib/multiprocessing/process.py", 
line 232, in _bootstrap
    self.run()
  File "/Users/cartman/Sources/py3k/Lib/multiprocessing/process.py", 
line 88, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/cartman/Sources/py3k/Lib/test/test_multiprocessing.py", 
line 601, in f
    cond.acquire()
  File "/Users/cartman/Sources/py3k/Lib/multiprocessing/managers.py", 
line 949, in acquire
    return self._callmethod('acquire', (blocking,))
  File "/Users/cartman/Sources/py3k/Lib/multiprocessing/managers.py", 
line 724, in _callmethod
    self._connect()
  File "/Users/cartman/Sources/py3k/Lib/multiprocessing/managers.py", 
line 711, in _connect
    conn = self._Client(self._token.address, authkey=self._authkey)
  File "/Users/cartman/Sources/py3k/Lib/multiprocessing/connection.py", 
line 133, in Client
    c = SocketClient(address)
  File "/Users/cartman/Sources/py3k/Lib/multiprocessing/connection.py", 
line 254, in SocketClient
    s.connect(address)
socket.error: [Errno 61] Connection refused
Exception in thread Thread-58:
Traceback (most recent call last):
  File "/Users/cartman/Sources/py3k/Lib/multiprocessing/managers.py", 
line 720, in _callmethod
    conn = self._tls.connection
AttributeError: 'ForkAwareLocal' object has no attribute 'connection'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/cartman/Sources/py3k/Lib/threading.py", line 492, in 
_bootstrap_inner
    self.run()
  File "/Users/cartman/Sources/py3k/Lib/threading.py", line 447, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/cartman/Sources/py3k/Lib/test/test_multiprocessing.py", 
line 601, in f
    cond.acquire()
  File "/Users/cartman/Sources/py3k/Lib/multiprocessing/managers.py", 
line 949, in acquire
    return self._callmethod('acquire', (blocking,))
  File "/Users/cartman/Sources/py3k/Lib/multiprocessing/managers.py", 
line 724, in _callmethod
    self._connect()
  File "/Users/cartman/Sources/py3k/Lib/multiprocessing/managers.py", 
line 711, in _connect
    conn = self._Client(self._token.address, authkey=self._authkey)
  File "/Users/cartman/Sources/py3k/Lib/multiprocessing/connection.py", 
line 133, in Client
    c = SocketClient(address)
  File "/Users/cartman/Sources/py3k/Lib/multiprocessing/connection.py", 
line 254, in SocketClient
    s.connect(address)
socket.error: [Errno 61] Connection refused
msg70056 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2008-07-19 21:01
When running test_multiprocessing in a loop, I see:

test test_multiprocessing failed -- Traceback (most recent call last):
  File "/temp/python/trunk/Lib/test/test_multiprocessing.py", line 1157,
in test_remote
    queue = manager2.get_queue()
  File "/temp/python/trunk/Lib/multiprocessing/managers.py", line 635,
in temp
    authkey=self._authkey, exposed=exp
  File "/temp/python/trunk/Lib/multiprocessing/managers.py", line 887,
in AutoProxy
    incref=incref)
  File "/temp/python/trunk/Lib/multiprocessing/managers.py", line 696,
in __init__
    self._incref()
  File "/temp/python/trunk/Lib/multiprocessing/managers.py", line 743,
in _incref
    dispatch(conn, None, 'incref', (self._id,))
  File "/temp/python/trunk/Lib/multiprocessing/managers.py", line 79, in
dispatch
    raise convert_to_error(kind, result)
RemoteError: 
---------------------------------------------------------------------------
Traceback (most recent call last):
  File "/temp/python/trunk/Lib/multiprocessing/managers.py", line 181,
in handle_request
    result = func(c, *args, **kwds)
  File "/temp/python/trunk/Lib/multiprocessing/managers.py", line 397,
in incref
    self.id_to_refcount[ident] += 1
KeyError: '1033760'
---------------------------------------------------------------------------
msg70060 - (view) Author: Jesse Noller (jnoller) * (Python committer) Date: 2008-07-19 22:13
This also happens under trunk
msg70539 - (view) Author: Jesse Noller (jnoller) * (Python committer) Date: 2008-08-01 01:22
Ben's incref error != the connection refused error. Here's some information 
from Scott Leerssen who is using the original .52 code:

Hi Jesse,

I am having some trouble with the processing module (0.52) and thought you 
might be of some help since you were so involved with it in the PEP process.

I have a queue of Process objects from which a runner thread pulls and starts 
processes.  The producers for the queue are many, many threads.  I'm seeing 
some very odd behavior when lots of processes try to start while others are 
being reaped at the same time.  I see tracebacks that look like the one below, 
and also occasionally get "hangs" where the process module "seems" to take over 
the GIL and never give it back.  I did find a race condition bug in process.py 
where a _cleaner function is run in start(), potentially joining on processes 
that other threads may also be joining at the same time, which leads to 
KeyError exceptions coming from _cleaner.  I mutexed all my starts and joins to 
get around that for now, but am stumped on the issue below.  Does it look 
familiar to you?

Thanks, and feel free to shrug and reply "beats me".
Scott

[dmbd:32451:-1236096096] 2008-07-31 16:52:10,609 SYSMESG  Process Process-24:
[dmbd:32451:-1236096096] 2008-07-31 16:52:10,612 SYSMESG

[dmbd:32451:-1236096096] 2008-07-31 16:52:10,613 SYSMESG  Traceback (most 
recent call last):

[dmbd:32451:-1236096096] 2008-07-31 16:52:10,613 SYSMESG    File 
"/opt/race/share/sw/os/Linux_2.6_i686/python/lib/python2.5/site-
packages/processing/process.py", line 224, in _bootstrap

[dmbd:32451:-1236096096] 2008-07-31 16:52:10,615 SYSMESG      
_runAfterForkers()

[dmbd:32451:-1236096096] 2008-07-31 16:52:10,616 SYSMESG    File 
"/opt/race/share/sw/os/Linux_2.6_i686/python/lib/python2.5/site-
packages/processing/process.py", line 300, in _runAfterForkers

[dmbd:32451:-1236096096] 2008-07-31 16:52:10,617 SYSMESG      func(obj)

[dmbd:32451:-1236096096] 2008-07-31 16:52:10,617 SYSMESG    File 
"/opt/race/share/sw/os/Linux_2.6_i686/python/lib/python2.5/site-
packages/processing/managers.py", line 806, in _afterFork

[dmbd:32451:-1236096096] 2008-07-31 16:52:10,620 SYSMESG      self._incref()

[dmbd:32451:-1236096096] 2008-07-31 16:52:10,621 SYSMESG    File 
"/opt/race/share/sw/os/Linux_2.6_i686/python/lib/python2.5/site-
packages/processing/managers.py", line 760, in _incref

[dmbd:32451:-1236096096] 2008-07-31 16:52:10,621 SYSMESG      
dispatch(connection, None, 'incref', (self._id,))

[dmbd:32451:-1236096096] 2008-07-31 16:52:10,622 SYSMESG    File 
"/opt/race/share/sw/os/Linux_2.6_i686/python/lib/python2.5/site-
packages/processing/managers.py", line 92, in dispatch

[dmbd:32451:-1236096096] 2008-07-31 16:52:10,623 SYSMESG      raise result

[dmbd:32451:-1236096096] 2008-07-31 16:52:10,624 SYSMESG  RemoteError:
---------------------------------------------------------------------------
Remote Traceback (most recent call last):
 File "/opt/race/share/sw/os/Linux_2.6_i686/python/lib/python2.5/site-
packages/processing/managers.py", line 190, in handleRequest
   result = func(c, *args, **kwds)
 File "/opt/race/share/sw/os/Linux_2.6_i686/python/lib/python2.5/site-
packages/processing/managers.py", line 404, in incref
   self.id_to_refcount[ident] += 1
KeyError: 2880341100L
msg70570 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2008-08-01 16:40
I'm also seeing the incref error occasionally, on OS X 10.5.4:

======================================================================
ERROR: test_remote (__main__.WithManagerTestRemoteManager)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "Lib/test/test_multiprocessing.py", line 1157, in test_remote
    queue = manager2.get_queue()
  File "/Users/dickinsm/python_source/trunk/Lib/multiprocessing/managers.py", line 635, in temp
    authkey=self._authkey, exposed=exp
  File "/Users/dickinsm/python_source/trunk/Lib/multiprocessing/managers.py", line 887, in AutoProxy
    incref=incref)
  File "/Users/dickinsm/python_source/trunk/Lib/multiprocessing/managers.py", line 696, in __init__
    self._incref()
  File "/Users/dickinsm/python_source/trunk/Lib/multiprocessing/managers.py", line 743, in _incref
    dispatch(conn, None, 'incref', (self._id,))
  File "/Users/dickinsm/python_source/trunk/Lib/multiprocessing/managers.py", line 79, in dispatch
    raise convert_to_error(kind, result)
RemoteError: 
---------------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/dickinsm/python_source/trunk/Lib/multiprocessing/managers.py", line 181, in handle_request
    result = func(c, *args, **kwds)
  File "/Users/dickinsm/python_source/trunk/Lib/multiprocessing/managers.py", line 397, in incref
    self.id_to_refcount[ident] += 1
KeyError: '6fdb78'
---------------------------------------------------------------------------
msg70603 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2008-08-01 22:32
For the connection refused error, the following script fails for me 
(again, on OS X 10.5.4) with the message:

errno: [Errno 61] Connection refused

on most runs.  (If xrange(10) is replaced by a smaller range like 
xrange(4) then the script usually runs without error.)  Looks like
something in semaphore.c needs fixing, but I've no idea what.



from multiprocessing import Process, Manager
sleeping = Manager().Semaphore(0)

def f():
    sleeping.release()

for _ in xrange(10):
    Process(target=f).start()
for _ in xrange(10):
    sleeping.acquire()
msg70608 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2008-08-01 23:50
> Looks like something in semaphore.c needs fixing, but I've no idea what.

I take that back.  It's nothing to do with semaphore.c.  I'll keep trying.
msg70609 - (view) Author: Jesse Noller (jnoller) * (Python committer) Date: 2008-08-02 00:01
Are you looking at the conn refused or the incref error?
msg70631 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2008-08-02 13:44
> Are you looking at the conn refused or the incref error?

The connection refused error.

The attached patch fixes the problem, for me.  On my machine, the 
connection refused error code was 61 rather than 10061.  With this patch, 
I'm no longer seeing any hangs in test_multiprocessing.py (at least, not in 
the last 500 runs :-)).  (Though I am still seeing the incref error 
occasionally.)

If anyone's prepared to answer a stupid question:  I'm curious why failed 
socket connections occur at all.  Is connecting to a socket generally 
considered an unreliable operation, or is there some aspect of the 
multiprocessing module that makes it potentially unreliable?
msg70632 - (view) Author: Jesse Noller (jnoller) * (Python committer) Date: 2008-08-02 14:23
On Aug 2, 2008, at 9:44 AM, Mark Dickinson <report@bugs.python.org>  
wrote:

>
> Mark Dickinson <dickinsm@gmail.com> added the comment:
>
>> Are you looking at the conn refused or the incref error?
>
> The connection refused error.
>
> The attached patch fixes the problem, for me.  On my machine, the
> connection refused error code was 61 rather than 10061.  With this  
> patch,
> I'm no longer seeing any hangs in test_multiprocessing.py (at least,  
> not in
> the last 500 runs :-)).  (Though I am still seeing the incref error
> occasionally.)
>
> If anyone's prepared to answer a stupid question:  I'm curious why  
> failed
> socket connections occur at all.  Is connecting to a socket generally
> considered an unreliable operation, or is there some aspect of the
> multiprocessing module that makes it potentially unreliable?
>

I believe the conn refused error is another race, the child processes  
are connecting to a manager which is shutting down/gone

Thanks again mark- when I get a chance today I'll review/test/apply  
the patch
msg70803 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2008-08-06 20:54
> I believe the conn refused error is another race, the child processes  
> are connecting to a manager which is shutting down/gone

After some research, I think it actually has to do with the value of the 
'backlog' parameter when creating a Listener instance: if there are too 
many queued requests to a socket then further connection requests are 
refused;  the 'backlog' parameter appears to set the maximum size of the 
queue.

...or something like that.  I'm a novice here... :-)

Anyway, in the __init__ method of the Server class, in managers.py, the 
Server creates a Listener instance with the line

        self.listener = Listener(address=address, backlog=5)

When I change backlog to 50 instead of 5, I don't see the connection 
refused error any more.

I'm not suggesting that the backlog value should be changed---it 
probably shouldn't.  But this does at least explain why connections can 
be refused, and why they should be retried in that case.

I do still think that the patch, or something like it, should be 
applied.
msg70809 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2008-08-07 01:33
Here's what's going on with the incref error:

Look in the Server class, in managers.py:  consider a shared object with 
id 'id'.  When a reference to the shared object is created, its id is 
added to the id_to_refcount dictionary:

{id: None}

and *shortly afterwards*, the refcount is incremented to get:

{id: 1}

When the object is deleted or goes out of scope later on, the refcount 
is decref'd, and id is removed entirely from id_to_refcount.

The problem occurs when there are two processes simultaneously asking 
for access to the same object.  If a second process happens to decref 
'id' in between the first process creating the reference and 
incrementing it, then the incref from the first process will fail.  This 
is exactly what's happening (some of the time) in test_remote in the 
test_suite.  The failing sequence of operations is:

initial state:         id not in id_to_refcount
(process 1): create    id_to_refcount[id] is None
(process 1): incref    id_to_refcount[id] == 1
(process 2): create    id_to_refcount[id] == 1
(process 1): decref    id not in id_to_refcount
(process 2): incref    KeyError!
(process 2): decref    (never get here...)

I'm not really familiar enough with the whole multiprocessing module to 
know what the right fix for this is.
msg70857 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2008-08-07 21:10
Hmmm.  That last message wasn't too coherent;  I really shouldn't try to 
post at 2:30am.

Summary:  the refcounting logic in the Server class is flawed.  In 
Server.create(), the initial refcount of a newly-created shared object is 
set to None.  This is dangerous: the moment another thread gets a look-in, 
that refcount can be incremented to 1, then decremented to 0, at which 
point the shared object gets disposed of (by the code in Server.decref).   
And all this can happen before the Proxy object for the shared object gets 
initialized, at which point the KeyError occurs.

(Not much better.)

Can anyone suggest a way to fix this?  I can't see any easy fix.  
Nevertheless, I consider this a serious flaw that should be fixed before 
2.6 and 3.0 are released.

I've attached a minimal file that produces the incref error (on about 1 
run out of every 5) for me.
msg70859 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2008-08-07 21:26
Adding Richard Oudkerk to the nosy list in case he can shed any light on 
this.

Should this be considered a release-blocker?
msg70860 - (view) Author: Jesse Noller (jnoller) * (Python committer) Date: 2008-08-07 21:36
On Aug 7, 2008, at 5:26 PM, Mark Dickinson <report@bugs.python.org>  
wrote:

>
> Mark Dickinson <dickinsm@gmail.com> added the comment:
>
> Adding Richard Oudkerk to the nosy list in case he can shed any  
> light on
> this.
>
> Should this be considered a release-blocker?
>

I believe Richard is on summer break - I've dropped him a line on the  
open issues I need his input on.

As for if this should block the release, it should be resolved before  
final, but should not block the next beta.
msg71026 - (view) Author: Jesse Noller (jnoller) * (Python committer) Date: 2008-08-11 18:02
Hey Mark - I took a look at the mp_nohang patch you added - and you're 
right, the connection refused error numbers are different from platform to 
platform. So rather than use a static list, I switched the code to use the 
errno module.

Can you apply this and confirm it works for you?
msg71030 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2008-08-11 18:53
> So rather than use a static list, I switched the code to use the 
> errno module.

Yup.  That's definitely a better solution.

Your patch fixed the problem for me.

Thanks!
msg71031 - (view) Author: Jesse Noller (jnoller) * (Python committer) Date: 2008-08-11 19:00
Thanks Mark, I've checked in the connection refused fix in r65645 on trunk
msg71068 - (view) Author: Ismail Donmez (donmez) * Date: 2008-08-12 19:12
With trunk when running test_multiprocessing in a tight loop I saw 
another problem:

test_multiprocessing
Process Process-61:
Traceback (most recent call last):
  File "/Users/cartman/Sources/py3k/Lib/multiprocessing/process.py", 
line 229, in _bootstrap
Process Process-60:
Traceback (most recent call last):
  File "/Users/cartman/Sources/py3k/Lib/multiprocessing/process.py", 
line 229, in _bootstrap
Process Process-62:
Traceback (most recent call last):
  File "/Users/cartman/Sources/py3k/Lib/multiprocessing/process.py", 
line 229, in _bootstrap
    util._run_after_forkers()
  File "/Users/cartman/Sources/py3k/Lib/multiprocessing/util.py", line 
138, in _run_after_forkers
    util._run_after_forkers()
  File "/Users/cartman/Sources/py3k/Lib/multiprocessing/util.py", line 
138, in _run_after_forkers
    util._run_after_forkers()
  File "/Users/cartman/Sources/py3k/Lib/multiprocessing/util.py", line 
138, in _run_after_forkers
    items = list(_afterfork_registry.items())
    items = list(_afterfork_registry.items())
  File "/Users/cartman/Sources/py3k/Lib/weakref.py", line 103, in items
  File "/Users/cartman/Sources/py3k/Lib/weakref.py", line 103, in items
    items = list(_afterfork_registry.items())
  File "/Users/cartman/Sources/py3k/Lib/weakref.py", line 103, in items
    for key, wr in self.data.items():
RuntimeError: dictionary changed size during iteration
    for key, wr in self.data.items():
RuntimeError: dictionary changed size during iteration
    for key, wr in self.data.items():
RuntimeError: dictionary changed size during iteration

The original problem itself seems to be fixed, so cheers for that!
msg71273 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2008-08-17 13:29
Here's a patch that fixes the incref problem for me.  This *definitely* 
needs review from someone more familiar with the multiprocessing module 
than I am;  I'm not at all confident that it won't break something else.

The original problem:  the Server.create method creates shared objects 
with an effective reference count of 0, making them vulnerable to 
premature garbage collection.

The solution in the patch: create shared objects with a reference count 
of 1, and make it the responsibility of the caller of Server.create() to 
do a decref later (once a ProxyObject for the created shared object has 
been initialized).   In effect, Server.create creates and returns a 
reference, and the caller then owns that reference and is responsible 
for disposing of it.
msg71275 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2008-08-17 13:42
[jnoller]
> As for if this should block the release, it should be resolved before  
> final, but should not block the next beta.

So the priority should be 'deferred blocker', no?

[cartman]
> With trunk when running test_multiprocessing in a tight loop I saw 
> another problem:

I think this really belongs in a separate issue.  See issue 3578.
msg72358 - (view) Author: Jesse Noller (jnoller) * (Python committer) Date: 2008-09-02 18:15
Looks like Mark's patch was submitted as part of r66115 by Ben 
accidentally (as part of reverting r66114). I can confirm this patch 
resolves the incref issue as-is. I've run test_multiprocessing in a loop 
for about an hour now with no failures. 

Marking closed/fixed
msg72361 - (view) Author: Jesse Noller (jnoller) * (Python committer) Date: 2008-09-02 19:12
r66161 on py3k merges forward the fix for this
History
Date User Action Args
2008-09-02 19:12:48jnollersetmessages: + msg72361
2008-09-02 18:15:48jnollersetstatus: open -> closed
resolution: fixed
messages: + msg72358
2008-08-21 19:55:49mark.dickinsonsetkeywords: + needs review
2008-08-21 03:06:09barrysetpriority: deferred blocker -> release blocker
2008-08-17 13:42:42mark.dickinsonsetpriority: deferred blocker
messages: + msg71275
2008-08-17 13:29:25mark.dickinsonsetfiles: + issue3419.patch
messages: + msg71273
2008-08-12 19:12:38donmezsetmessages: + msg71068
2008-08-11 19:00:58jnollersetfiles: - mp_nohang.patch
2008-08-11 19:00:38jnollersetmessages: + msg71031
2008-08-11 18:53:28mark.dickinsonsetmessages: + msg71030
2008-08-11 18:02:53jnollersetfiles: + mp_nohang_jnoller.patch
messages: + msg71026
2008-08-07 21:36:07jnollersetmessages: + msg70860
2008-08-07 21:26:09mark.dickinsonsetnosy: + roudkerk
messages: + msg70859
2008-08-07 21:10:56mark.dickinsonsetfiles: + multibug3.py
messages: + msg70857
2008-08-07 01:33:32mark.dickinsonsetmessages: + msg70809
2008-08-06 20:54:12mark.dickinsonsetmessages: + msg70803
2008-08-02 14:23:56jnollersetmessages: + msg70632
2008-08-02 13:44:20mark.dickinsonsetfiles: + mp_nohang.patch
keywords: + patch
messages: + msg70631
2008-08-02 00:01:32jnollersetmessages: + msg70609
2008-08-01 23:50:49mark.dickinsonsetmessages: + msg70608
2008-08-01 22:32:19mark.dickinsonsetmessages: + msg70603
2008-08-01 16:40:53mark.dickinsonsetnosy: + mark.dickinson
messages: + msg70570
2008-08-01 01:22:33jnollersetmessages: + msg70539
2008-07-19 22:13:23jnollersetmessages: + msg70060
versions: + Python 2.6
2008-07-19 21:01:31benjamin.petersonsetassignee: jnoller
messages: + msg70056
nosy: + benjamin.peterson, jnoller
2008-07-19 20:00:42donmezcreate