Issue 3419: multiprocessing module is racy

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/47669

classification

Title:	multiprocessing module is racy
Type:		Stage:
Components:	Tests	Versions:	Python 3.0, Python 2.6

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	jnoller	Nosy List:	barry, benjamin.peterson, donmez, jnoller, mark.dickinson, roudkerk
Priority:	release blocker	Keywords:	needs review, patch

Created on 2008-07-19 20:00 by donmez, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
multibug3.py	mark.dickinson, 2008-08-07 21:10	script demonstrating incref bug
mp_nohang_jnoller.patch	jnoller, 2008-08-11 18:02
issue3419.patch	mark.dickinson, 2008-08-17 13:29	Possible fix for incref bug

Messages (23)
msg70053 - (view)	Author: Ismail Donmez (donmez) *	Date: 2008-07-19 20:00
Tested on MacOSX 10.5.4, running test_multiprocessing in a tight loop : [~/Sources/py3k]> while true;do ./python ./Lib/test/regrtest.py test_multiprocessing;done test_multiprocessing 1 test OK. test_multiprocessing 1 test OK. test_multiprocessing 1 test OK. test_multiprocessing Process Process-48: Traceback (most recent call last): File "/Users/cartman/Sources/py3k/Lib/multiprocessing/managers.py", line 720, in _callmethod conn = self._tls.connection AttributeError: 'ForkAwareLocal' object has no attribute 'connection' During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/Users/cartman/Sources/py3k/Lib/multiprocessing/process.py", line 232, in _bootstrap self.run() File "/Users/cartman/Sources/py3k/Lib/multiprocessing/process.py", line 88, in run self._target(self._args, self._kwargs) File "/Users/cartman/Sources/py3k/Lib/test/test_multiprocessing.py", line 601, in f cond.acquire() File "/Users/cartman/Sources/py3k/Lib/multiprocessing/managers.py", line 949, in acquire return self._callmethod('acquire', (blocking,)) File "/Users/cartman/Sources/py3k/Lib/multiprocessing/managers.py", line 724, in _callmethod self._connect() File "/Users/cartman/Sources/py3k/Lib/multiprocessing/managers.py", line 711, in _connect conn = self._Client(self._token.address, authkey=self._authkey) File "/Users/cartman/Sources/py3k/Lib/multiprocessing/connection.py", line 133, in Client c = SocketClient(address) File "/Users/cartman/Sources/py3k/Lib/multiprocessing/connection.py", line 254, in SocketClient s.connect(address) socket.error: [Errno 61] Connection refused Exception in thread Thread-58: Traceback (most recent call last): File "/Users/cartman/Sources/py3k/Lib/multiprocessing/managers.py", line 720, in _callmethod conn = self._tls.connection AttributeError: 'ForkAwareLocal' object has no attribute 'connection' During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/Users/cartman/Sources/py3k/Lib/threading.py", line 492, in _bootstrap_inner self.run() File "/Users/cartman/Sources/py3k/Lib/threading.py", line 447, in run self._target(self._args, **self._kwargs) File "/Users/cartman/Sources/py3k/Lib/test/test_multiprocessing.py", line 601, in f cond.acquire() File "/Users/cartman/Sources/py3k/Lib/multiprocessing/managers.py", line 949, in acquire return self._callmethod('acquire', (blocking,)) File "/Users/cartman/Sources/py3k/Lib/multiprocessing/managers.py", line 724, in _callmethod self._connect() File "/Users/cartman/Sources/py3k/Lib/multiprocessing/managers.py", line 711, in _connect conn = self._Client(self._token.address, authkey=self._authkey) File "/Users/cartman/Sources/py3k/Lib/multiprocessing/connection.py", line 133, in Client c = SocketClient(address) File "/Users/cartman/Sources/py3k/Lib/multiprocessing/connection.py", line 254, in SocketClient s.connect(address) socket.error: [Errno 61] Connection refused
msg70056 - (view)	Author: Benjamin Peterson (benjamin.peterson) *	Date: 2008-07-19 21:01
When running test_multiprocessing in a loop, I see: test test_multiprocessing failed -- Traceback (most recent call last): File "/temp/python/trunk/Lib/test/test_multiprocessing.py", line 1157, in test_remote queue = manager2.get_queue() File "/temp/python/trunk/Lib/multiprocessing/managers.py", line 635, in temp authkey=self._authkey, exposed=exp File "/temp/python/trunk/Lib/multiprocessing/managers.py", line 887, in AutoProxy incref=incref) File "/temp/python/trunk/Lib/multiprocessing/managers.py", line 696, in __init__ self._incref() File "/temp/python/trunk/Lib/multiprocessing/managers.py", line 743, in _incref dispatch(conn, None, 'incref', (self._id,)) File "/temp/python/trunk/Lib/multiprocessing/managers.py", line 79, in dispatch raise convert_to_error(kind, result) RemoteError: --------------------------------------------------------------------------- Traceback (most recent call last): File "/temp/python/trunk/Lib/multiprocessing/managers.py", line 181, in handle_request result = func(c, args, *kwds) File "/temp/python/trunk/Lib/multiprocessing/managers.py", line 397, in incref self.id_to_refcount[ident] += 1 KeyError: '1033760' ---------------------------------------------------------------------------
msg70060 - (view)	Author: Jesse Noller (jnoller) *	Date: 2008-07-19 22:13
This also happens under trunk
msg70539 - (view)	Author: Jesse Noller (jnoller) *	Date: 2008-08-01 01:22
Ben's incref error != the connection refused error. Here's some information from Scott Leerssen who is using the original .52 code: Hi Jesse, I am having some trouble with the processing module (0.52) and thought you might be of some help since you were so involved with it in the PEP process. I have a queue of Process objects from which a runner thread pulls and starts processes. The producers for the queue are many, many threads. I'm seeing some very odd behavior when lots of processes try to start while others are being reaped at the same time. I see tracebacks that look like the one below, and also occasionally get "hangs" where the process module "seems" to take over the GIL and never give it back. I did find a race condition bug in process.py where a _cleaner function is run in start(), potentially joining on processes that other threads may also be joining at the same time, which leads to KeyError exceptions coming from _cleaner. I mutexed all my starts and joins to get around that for now, but am stumped on the issue below. Does it look familiar to you? Thanks, and feel free to shrug and reply "beats me". Scott [dmbd:32451:-1236096096] 2008-07-31 16:52:10,609 SYSMESG Process Process-24: [dmbd:32451:-1236096096] 2008-07-31 16:52:10,612 SYSMESG [dmbd:32451:-1236096096] 2008-07-31 16:52:10,613 SYSMESG Traceback (most recent call last): [dmbd:32451:-1236096096] 2008-07-31 16:52:10,613 SYSMESG File "/opt/race/share/sw/os/Linux_2.6_i686/python/lib/python2.5/site- packages/processing/process.py", line 224, in _bootstrap [dmbd:32451:-1236096096] 2008-07-31 16:52:10,615 SYSMESG _runAfterForkers() [dmbd:32451:-1236096096] 2008-07-31 16:52:10,616 SYSMESG File "/opt/race/share/sw/os/Linux_2.6_i686/python/lib/python2.5/site- packages/processing/process.py", line 300, in _runAfterForkers [dmbd:32451:-1236096096] 2008-07-31 16:52:10,617 SYSMESG func(obj) [dmbd:32451:-1236096096] 2008-07-31 16:52:10,617 SYSMESG File "/opt/race/share/sw/os/Linux_2.6_i686/python/lib/python2.5/site- packages/processing/managers.py", line 806, in _afterFork [dmbd:32451:-1236096096] 2008-07-31 16:52:10,620 SYSMESG self._incref() [dmbd:32451:-1236096096] 2008-07-31 16:52:10,621 SYSMESG File "/opt/race/share/sw/os/Linux_2.6_i686/python/lib/python2.5/site- packages/processing/managers.py", line 760, in _incref [dmbd:32451:-1236096096] 2008-07-31 16:52:10,621 SYSMESG dispatch(connection, None, 'incref', (self._id,)) [dmbd:32451:-1236096096] 2008-07-31 16:52:10,622 SYSMESG File "/opt/race/share/sw/os/Linux_2.6_i686/python/lib/python2.5/site- packages/processing/managers.py", line 92, in dispatch [dmbd:32451:-1236096096] 2008-07-31 16:52:10,623 SYSMESG raise result [dmbd:32451:-1236096096] 2008-07-31 16:52:10,624 SYSMESG RemoteError: --------------------------------------------------------------------------- Remote Traceback (most recent call last): File "/opt/race/share/sw/os/Linux_2.6_i686/python/lib/python2.5/site- packages/processing/managers.py", line 190, in handleRequest result = func(c, args, *kwds) File "/opt/race/share/sw/os/Linux_2.6_i686/python/lib/python2.5/site- packages/processing/managers.py", line 404, in incref self.id_to_refcount[ident] += 1 KeyError: 2880341100L
msg70570 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2008-08-01 16:40
I'm also seeing the incref error occasionally, on OS X 10.5.4: ====================================================================== ERROR: test_remote (__main__.WithManagerTestRemoteManager) ---------------------------------------------------------------------- Traceback (most recent call last): File "Lib/test/test_multiprocessing.py", line 1157, in test_remote queue = manager2.get_queue() File "/Users/dickinsm/python_source/trunk/Lib/multiprocessing/managers.py", line 635, in temp authkey=self._authkey, exposed=exp File "/Users/dickinsm/python_source/trunk/Lib/multiprocessing/managers.py", line 887, in AutoProxy incref=incref) File "/Users/dickinsm/python_source/trunk/Lib/multiprocessing/managers.py", line 696, in __init__ self._incref() File "/Users/dickinsm/python_source/trunk/Lib/multiprocessing/managers.py", line 743, in _incref dispatch(conn, None, 'incref', (self._id,)) File "/Users/dickinsm/python_source/trunk/Lib/multiprocessing/managers.py", line 79, in dispatch raise convert_to_error(kind, result) RemoteError: --------------------------------------------------------------------------- Traceback (most recent call last): File "/Users/dickinsm/python_source/trunk/Lib/multiprocessing/managers.py", line 181, in handle_request result = func(c, args, *kwds) File "/Users/dickinsm/python_source/trunk/Lib/multiprocessing/managers.py", line 397, in incref self.id_to_refcount[ident] += 1 KeyError: '6fdb78' ---------------------------------------------------------------------------
msg70603 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2008-08-01 22:32
For the connection refused error, the following script fails for me (again, on OS X 10.5.4) with the message: errno: [Errno 61] Connection refused on most runs. (If xrange(10) is replaced by a smaller range like xrange(4) then the script usually runs without error.) Looks like something in semaphore.c needs fixing, but I've no idea what. from multiprocessing import Process, Manager sleeping = Manager().Semaphore(0) def f(): sleeping.release() for _ in xrange(10): Process(target=f).start() for _ in xrange(10): sleeping.acquire()
msg70608 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2008-08-01 23:50
> Looks like something in semaphore.c needs fixing, but I've no idea what. I take that back. It's nothing to do with semaphore.c. I'll keep trying.
msg70609 - (view)	Author: Jesse Noller (jnoller) *	Date: 2008-08-02 00:01
Are you looking at the conn refused or the incref error?
msg70631 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2008-08-02 13:44
> Are you looking at the conn refused or the incref error? The connection refused error. The attached patch fixes the problem, for me. On my machine, the connection refused error code was 61 rather than 10061. With this patch, I'm no longer seeing any hangs in test_multiprocessing.py (at least, not in the last 500 runs :-)). (Though I am still seeing the incref error occasionally.) If anyone's prepared to answer a stupid question: I'm curious why failed socket connections occur at all. Is connecting to a socket generally considered an unreliable operation, or is there some aspect of the multiprocessing module that makes it potentially unreliable?
msg70632 - (view)	Author: Jesse Noller (jnoller) *	Date: 2008-08-02 14:23
On Aug 2, 2008, at 9:44 AM, Mark Dickinson <report@bugs.python.org> wrote: > > Mark Dickinson <dickinsm@gmail.com> added the comment: > >> Are you looking at the conn refused or the incref error? > > The connection refused error. > > The attached patch fixes the problem, for me. On my machine, the > connection refused error code was 61 rather than 10061. With this > patch, > I'm no longer seeing any hangs in test_multiprocessing.py (at least, > not in > the last 500 runs :-)). (Though I am still seeing the incref error > occasionally.) > > If anyone's prepared to answer a stupid question: I'm curious why > failed > socket connections occur at all. Is connecting to a socket generally > considered an unreliable operation, or is there some aspect of the > multiprocessing module that makes it potentially unreliable? > I believe the conn refused error is another race, the child processes are connecting to a manager which is shutting down/gone Thanks again mark- when I get a chance today I'll review/test/apply the patch
msg70803 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2008-08-06 20:54
> I believe the conn refused error is another race, the child processes > are connecting to a manager which is shutting down/gone After some research, I think it actually has to do with the value of the 'backlog' parameter when creating a Listener instance: if there are too many queued requests to a socket then further connection requests are refused; the 'backlog' parameter appears to set the maximum size of the queue. ...or something like that. I'm a novice here... :-) Anyway, in the __init__ method of the Server class, in managers.py, the Server creates a Listener instance with the line self.listener = Listener(address=address, backlog=5) When I change backlog to 50 instead of 5, I don't see the connection refused error any more. I'm not suggesting that the backlog value should be changed---it probably shouldn't. But this does at least explain why connections can be refused, and why they should be retried in that case. I do still think that the patch, or something like it, should be applied.
msg70809 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2008-08-07 01:33
Here's what's going on with the incref error: Look in the Server class, in managers.py: consider a shared object with id 'id'. When a reference to the shared object is created, its id is added to the id_to_refcount dictionary: {id: None} and shortly afterwards, the refcount is incremented to get: {id: 1} When the object is deleted or goes out of scope later on, the refcount is decref'd, and id is removed entirely from id_to_refcount. The problem occurs when there are two processes simultaneously asking for access to the same object. If a second process happens to decref 'id' in between the first process creating the reference and incrementing it, then the incref from the first process will fail. This is exactly what's happening (some of the time) in test_remote in the test_suite. The failing sequence of operations is: initial state: id not in id_to_refcount (process 1): create id_to_refcount[id] is None (process 1): incref id_to_refcount[id] == 1 (process 2): create id_to_refcount[id] == 1 (process 1): decref id not in id_to_refcount (process 2): incref KeyError! (process 2): decref (never get here...) I'm not really familiar enough with the whole multiprocessing module to know what the right fix for this is.
msg70857 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2008-08-07 21:10
Hmmm. That last message wasn't too coherent; I really shouldn't try to post at 2:30am. Summary: the refcounting logic in the Server class is flawed. In Server.create(), the initial refcount of a newly-created shared object is set to None. This is dangerous: the moment another thread gets a look-in, that refcount can be incremented to 1, then decremented to 0, at which point the shared object gets disposed of (by the code in Server.decref). And all this can happen before the Proxy object for the shared object gets initialized, at which point the KeyError occurs. (Not much better.) Can anyone suggest a way to fix this? I can't see any easy fix. Nevertheless, I consider this a serious flaw that should be fixed before 2.6 and 3.0 are released. I've attached a minimal file that produces the incref error (on about 1 run out of every 5) for me.
msg70859 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2008-08-07 21:26
Adding Richard Oudkerk to the nosy list in case he can shed any light on this. Should this be considered a release-blocker?
msg70860 - (view)	Author: Jesse Noller (jnoller) *	Date: 2008-08-07 21:36
On Aug 7, 2008, at 5:26 PM, Mark Dickinson <report@bugs.python.org> wrote: > > Mark Dickinson <dickinsm@gmail.com> added the comment: > > Adding Richard Oudkerk to the nosy list in case he can shed any > light on > this. > > Should this be considered a release-blocker? > I believe Richard is on summer break - I've dropped him a line on the open issues I need his input on. As for if this should block the release, it should be resolved before final, but should not block the next beta.
msg71026 - (view)	Author: Jesse Noller (jnoller) *	Date: 2008-08-11 18:02
Hey Mark - I took a look at the mp_nohang patch you added - and you're right, the connection refused error numbers are different from platform to platform. So rather than use a static list, I switched the code to use the errno module. Can you apply this and confirm it works for you?
msg71030 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2008-08-11 18:53
> So rather than use a static list, I switched the code to use the > errno module. Yup. That's definitely a better solution. Your patch fixed the problem for me. Thanks!
msg71031 - (view)	Author: Jesse Noller (jnoller) *	Date: 2008-08-11 19:00
Thanks Mark, I've checked in the connection refused fix in r65645 on trunk
msg71068 - (view)	Author: Ismail Donmez (donmez) *	Date: 2008-08-12 19:12
With trunk when running test_multiprocessing in a tight loop I saw another problem: test_multiprocessing Process Process-61: Traceback (most recent call last): File "/Users/cartman/Sources/py3k/Lib/multiprocessing/process.py", line 229, in _bootstrap Process Process-60: Traceback (most recent call last): File "/Users/cartman/Sources/py3k/Lib/multiprocessing/process.py", line 229, in _bootstrap Process Process-62: Traceback (most recent call last): File "/Users/cartman/Sources/py3k/Lib/multiprocessing/process.py", line 229, in _bootstrap util._run_after_forkers() File "/Users/cartman/Sources/py3k/Lib/multiprocessing/util.py", line 138, in _run_after_forkers util._run_after_forkers() File "/Users/cartman/Sources/py3k/Lib/multiprocessing/util.py", line 138, in _run_after_forkers util._run_after_forkers() File "/Users/cartman/Sources/py3k/Lib/multiprocessing/util.py", line 138, in _run_after_forkers items = list(_afterfork_registry.items()) items = list(_afterfork_registry.items()) File "/Users/cartman/Sources/py3k/Lib/weakref.py", line 103, in items File "/Users/cartman/Sources/py3k/Lib/weakref.py", line 103, in items items = list(_afterfork_registry.items()) File "/Users/cartman/Sources/py3k/Lib/weakref.py", line 103, in items for key, wr in self.data.items(): RuntimeError: dictionary changed size during iteration for key, wr in self.data.items(): RuntimeError: dictionary changed size during iteration for key, wr in self.data.items(): RuntimeError: dictionary changed size during iteration The original problem itself seems to be fixed, so cheers for that!
msg71273 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2008-08-17 13:29
Here's a patch that fixes the incref problem for me. This definitely needs review from someone more familiar with the multiprocessing module than I am; I'm not at all confident that it won't break something else. The original problem: the Server.create method creates shared objects with an effective reference count of 0, making them vulnerable to premature garbage collection. The solution in the patch: create shared objects with a reference count of 1, and make it the responsibility of the caller of Server.create() to do a decref later (once a ProxyObject for the created shared object has been initialized). In effect, Server.create creates and returns a reference, and the caller then owns that reference and is responsible for disposing of it.
msg71275 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2008-08-17 13:42
[jnoller] > As for if this should block the release, it should be resolved before > final, but should not block the next beta. So the priority should be 'deferred blocker', no? [cartman] > With trunk when running test_multiprocessing in a tight loop I saw > another problem: I think this really belongs in a separate issue. See issue 3578.
msg72358 - (view)	Author: Jesse Noller (jnoller) *	Date: 2008-09-02 18:15
Looks like Mark's patch was submitted as part of r66115 by Ben accidentally (as part of reverting r66114). I can confirm this patch resolves the incref issue as-is. I've run test_multiprocessing in a loop for about an hour now with no failures. Marking closed/fixed
msg72361 - (view)	Author: Jesse Noller (jnoller) *	Date: 2008-09-02 19:12
r66161 on py3k merges forward the fix for this

History
Date	User	Action	Args
2022-04-11 14:56:36	admin	set	nosy: + barry github: 47669
2008-09-02 19:12:48	jnoller	set	messages: + msg72361
2008-09-02 18:15:48	jnoller	set	status: open -> closed resolution: fixed messages: + msg72358
2008-08-21 19:55:49	mark.dickinson	set	keywords: + needs review
2008-08-21 03:06:09	barry	set	priority: deferred blocker -> release blocker
2008-08-17 13:42:42	mark.dickinson	set	priority: deferred blocker messages: + msg71275
2008-08-17 13:29:25	mark.dickinson	set	files: + issue3419.patch messages: + msg71273
2008-08-12 19:12:38	donmez	set	messages: + msg71068
2008-08-11 19:00:58	jnoller	set	files: - mp_nohang.patch
2008-08-11 19:00:38	jnoller	set	messages: + msg71031
2008-08-11 18:53:28	mark.dickinson	set	messages: + msg71030
2008-08-11 18:02:53	jnoller	set	files: + mp_nohang_jnoller.patch messages: + msg71026
2008-08-07 21:36:07	jnoller	set	messages: + msg70860
2008-08-07 21:26:09	mark.dickinson	set	nosy: + roudkerk messages: + msg70859
2008-08-07 21:10:56	mark.dickinson	set	files: + multibug3.py messages: + msg70857
2008-08-07 01:33:32	mark.dickinson	set	messages: + msg70809
2008-08-06 20:54:12	mark.dickinson	set	messages: + msg70803
2008-08-02 14:23:56	jnoller	set	messages: + msg70632
2008-08-02 13:44:20	mark.dickinson	set	files: + mp_nohang.patch keywords: + patch messages: + msg70631
2008-08-02 00:01:32	jnoller	set	messages: + msg70609
2008-08-01 23:50:49	mark.dickinson	set	messages: + msg70608
2008-08-01 22:32:19	mark.dickinson	set	messages: + msg70603
2008-08-01 16:40:53	mark.dickinson	set	nosy: + mark.dickinson messages: + msg70570
2008-08-01 01:22:33	jnoller	set	messages: + msg70539
2008-07-19 22:13:23	jnoller	set	messages: + msg70060 versions: + Python 2.6
2008-07-19 21:01:31	benjamin.peterson	set	assignee: jnoller messages: + msg70056 nosy: + benjamin.peterson, jnoller
2008-07-19 20:00:42	donmez	create