
classification
Title: Python 3.3/3.4 regression in multiprocessing manager?
Type: performance
Stage: resolved
Components: Library (Lib)
Versions: Python 3.3, Python 3.4

process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: Irvin.Probst, neologix, pitrou, python-dev, sbt, serhiy.storchaka, vstinner
Priority: high
Keywords: 3.3regression, patch

Created on 2014-02-07 12:06 by Irvin.Probst, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name                   Uploaded                             Description
test_manager.py             Irvin.Probst, 2014-02-07 12:06       Test code
multi_nagle.diff            neologix, 2014-02-07 13:11
multi_nagle_2.patch         pitrou, 2014-02-07 19:15
multi_nagle_simplify.patch  serhiy.storchaka, 2014-02-09 12:16
Messages (16)
msg210457 - (view) Author: Irvin Probst (Irvin.Probst) Date: 2014-02-07 12:06
After checking with the kind people of help@python.org I think I found a bug in the way connections are handled in the multiprocessing module.

Using Python 3.3 I've noticed a performance drop of about 70x when running some code performing basic requests on a SyncManager. As this code was buried deep inside a big project, I made a test case, attached to this report, to reproduce this behavior.

Here is what this code does:
- define a class SharedData with several instance variables (a, b and c here)
- this class has two methods exposed through a proxy, good() and bad(); both see a huge performance drop under 3.3 and can be used to reproduce this behavior. The only difference is that good() uses a mutex whereas bad() does not; I wanted to check that mutexes were not to blame for this problem.
- create a SyncManager giving access to a SharedData instance
- launch a multiprocessing.Process() running the do_stuff() function; this function calls the good() (or bad()) method of SharedData ten times through the SyncManager, passes some values to it and gets back the result
- after each call to the proxy, the time elapsed, roughly measured with time.time(), is printed
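
For reference, here is a minimal, self-contained sketch along the lines described above (the attached test_manager.py is the authoritative version; names such as DataManager, the port and the exact arithmetic in good()/bad() are illustrative):

"""
import time
from multiprocessing import Process
from multiprocessing.managers import SyncManager
from threading import Lock

class SharedData:
    def __init__(self):
        self.a, self.b, self.c = 0, 0, 0
        self._lock = Lock()

    def good(self, x, y):
        # same work as bad(), but protected by a mutex
        with self._lock:
            self.a, self.b = x, y
            return sum(x) + sum(y)

    def bad(self, x, y):
        self.a, self.b = x, y
        return sum(x) + sum(y)

shared = SharedData()

class DataManager(SyncManager):
    pass

DataManager.register('get_proxy', callable=lambda: shared,
                     exposed=('good', 'bad'))

def do_stuff():
    client = DataManager(address=('127.0.0.1', 6666), authkey=b"foo")
    client.connect()
    proxy = client.get_proxy()
    for _ in range(10):
        t_s = time.time()
        c = proxy.good([1, 2], [3, 4])
        print(time.time() - t_s)
    print(c)

if __name__ == '__main__':
    server = DataManager(address=('127.0.0.1', 6666), authkey=b"foo")
    server.start()
    p = Process(target=do_stuff)
    p.start()
    p.join()
    server.shutdown()
"""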

System specs:
Linux turing 3.12-1-686-pae #1 SMP Debian 3.12.6-2 (2013-12-29) i686 GNU/Linux

Python version:
latest 2.6, 2.7, 3.2 and 3.3 from the standard Debian repos
3.3.0 and 3.4.0 beta3 compiled from source

time elapsed in each call to the proxy using Python 2.6, 2.7 and 3.2:
first call to proxy ~ 0.04 seconds, next calls ~0.001 sec

time elapsed in each call to the proxy using Python 3.3.0, 3.3.2, 3.3.3 , 3.4.0b3:
first call to proxy ~0.27 seconds, next calls: 0.07 sec

I reproduced this behavior using Python 2.7 and 3.3.3 on an Ubuntu computer running the latest amd64 release.

Of course I tried without a print() after each call, but it does not change anything: Python 3.3 is *much* slower here.

Using cProfile I tracked this down to the built-in read() method, and indeed timing the read() syscall in posix_read() with gettimeofday() confirms it takes ages; posix_write() is fine though. I think the problem comes from the underlying socket the interpreter is reading from.

connection.py from multiprocessing was rewritten between 3.2 and 3.3, but I can't see what's wrong in the way it was done; basic socket options seem to be exactly the same.

One interesting point is that it seems to only affect the last bytes sent through the socket: e.g. if you send a numpy.array big enough to fill the socket's read() buffer, the first calls to read() are done at normal speed and only the last one takes time.

If you confirm my test code is not to blame, this makes the SyncManager in 3.3 and 3.4 completely unusable, IMHO, for frequent data exchanges between processes.

Thanks for your time.
msg210464 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2014-02-07 13:11
Thanks for the detailed report.

> connection.py from multiprocessing was rewritten between 3.2
> and 3.3, but I can't see what's wrong in the way it was done;
> basic socket options seem to be exactly the same.

Indeed, multiprocessing.connection has been rewritten in Python (by Antoine :-).

> One interesting point is that it seems to only affect the last bytes
> sent through the socket: e.g. if you send a numpy.array big enough to
> fill the socket's read() buffer, the first calls to read() are done at
> normal speed and only the last one takes time.

And that's the catch.

Before, connection.send looked like this (for sockets):
"""
static Py_ssize_t
conn_send_string(ConnectionObject *conn, char *string, size_t length)
{
    Py_ssize_t res;
    /* The "header" of the message is a 32 bit unsigned number (in
       network order) which specifies the length of the "body".  If
       the message is shorter than about 16kb then it is quicker to
       combine the "header" and the "body" of the message and send
       them at once. */
    if (length < (16*1024)) {
        char *message;

        message = PyMem_Malloc(length+4);
        if (message == NULL)
            return MP_MEMORY_ERROR;

        *(UINT32*)message = htonl((UINT32)length);
        memcpy(message+4, string, length);
        Py_BEGIN_ALLOW_THREADS
        res = _conn_sendall(conn->handle, message, length+4);
        Py_END_ALLOW_THREADS
        PyMem_Free(message);
    } else {
        UINT32 lenbuff;

        if (length > MAX_MESSAGE_LENGTH)
            return MP_BAD_MESSAGE_LENGTH;

        lenbuff = htonl((UINT32)length);
        Py_BEGIN_ALLOW_THREADS
        res = _conn_sendall(conn->handle, (char*)&lenbuff, 4) ||
            _conn_sendall(conn->handle, string, length);
        Py_END_ALLOW_THREADS
    }
    return res;
}
"""

So, for short messages the header was combined with the payload and written to the socket at once, but for large messages they were written in two steps.

Now the code looks like this:
"""
    def _send_bytes(self, buf):
        # For wire compatibility with 3.2 and lower
        n = len(buf)
        self._send(struct.pack("!i", n))
        # The condition is necessary to avoid "broken pipe" errors
        # when sending a 0-length buffer if the other end closed the pipe.
        if n > 0:
            self._send(buf)
"""

First the header is sent, then the payload.

The problem is that this kind of pattern causes problems because of a nasty interaction between Nagle's algorithm and delayed ACK: in short, if the payload isn't large enough, the TCP stack won't send it before a short delay; see http://en.wikipedia.org/wiki/Nagle's_algorithm for more details.
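
To see the interaction in isolation, here is a hypothetical standalone demo (not one of the attached files): the client sends a small header and the payload with two separate send() calls over loopback TCP, then waits for the echo. On a typical Linux TCP stack each round trip then stalls for roughly the delayed-ACK timeout (~40 ms):

"""
import socket
import struct
import threading
import time

def echo_server(listener):
    conn, _ = listener.accept()
    with conn:
        while True:
            # simplified: assumes the 4-byte header arrives in one recv()
            header = conn.recv(4)
            if not header:
                break
            n, = struct.unpack("!i", header)
            body = b""
            while len(body) < n:
                body += conn.recv(n - len(body))
            # reply in a single write, so the echo itself is not delayed
            conn.sendall(header + body)

listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
threading.Thread(target=echo_server, args=(listener,), daemon=True).start()

client = socket.socket()
client.connect(listener.getsockname())
payload = b"x" * 2048
for _ in range(5):
    t0 = time.time()
    client.sendall(struct.pack("!i", len(payload)))  # write #1: 4-byte header
    client.sendall(payload)                          # write #2: held back by Nagle
    received = b""
    while len(received) < 4 + len(payload):          # read: peer's ACK is delayed
        received += client.recv(65536)
    print(time.time() - t0)
"""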

So, one possible fix is to always combine the payload and header.
With the patch attached, we go from:
"""
$ ./python /tmp/test_manager.py
0.28662800788879395
0.08182215690612793
0.08193612098693848
0.08193612098693848
0.08194088935852051
0.08194208145141602
0.0819399356842041
0.08194184303283691
0.08194303512573242
0.0819389820098877
10
"""

to

"""
$ ./python /tmp/test_manager.py
0.04239797592163086
0.00041413307189941406
0.0004057884216308594
0.0004088878631591797
0.0004029273986816406
0.00040793418884277344
0.0004069805145263672
0.0004069805145263672
0.0004050731658935547
0.0004069805145263672
10
"""

Another possibility is to disable Nagle with TCP_NODELAY if available, or use some heuristic similar to the one in 3.2.
msg210465 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-02-07 13:27
I would favour reusing the 3.2 heuristic. Joining the header and the payload may be expensive if the payload is very large.
msg210466 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-02-07 13:29
Modern operating systems have new functions like writev() to send two consecutive buffers. Can't we use them, and emulate them if they are missing (use the 3.2 algorithm: join if small, or send separately)?
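
For illustration only (not an attached patch): on POSIX, Python 3.3's socket.sendmsg() already offers this kind of gather-write. A rough sketch of the idea, with a hypothetical helper name:

"""
import socket
import struct

def send_with_header(sock, payload, _limit=16 * 1024):
    header = struct.pack("!i", len(payload))
    if hasattr(sock, "sendmsg"):
        # Scatter-gather write: header and payload leave in one syscall
        # without first being copied into a single buffer.  A real
        # implementation would loop on the return value, since sendmsg()
        # may perform a partial write.
        sock.sendmsg([header, payload])
    elif len(payload) < _limit:
        # 3.2-style heuristic: join if small...
        sock.sendall(header + payload)
    else:
        # ...or send separately if large.
        sock.sendall(header)
        sock.sendall(payload)
"""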
msg210471 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-02-07 13:55
> Modern operating systems have new functions like writev() to send
> two consecutive buffers. Can't we use them, and emulate them if they
> are missing (use the 3.2 algorithm: join if small, or send separately)?

What would the complication bring?
msg210478 - (view) Author: Irvin Probst (Irvin.Probst) Date: 2014-02-07 14:53
FWIW, based on your comments I tried a quick and dirty fix in my code, as I can't wait for a new Python release to make it work:

The do_stuff function now does:

"""
def do_stuff():
    client=make_client('',6666, b"foo")
    data_proxy=client.get_proxy()

    #make a dummy request to get the underlying
    #fd we are reading from (see bug #20540)
    c=data_proxy.good([1,2],[3,4])
    fd=data_proxy._tls.connection._handle

    #setting TCP_NODELAY on 3.3.x should fix the delay issue until a new release
    sock=socket.fromfd(fd, socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

    for i in range(10):
        t_s=time.time()
        c=data_proxy.good([1,2],[3,4])
        print(time.time()-t_s)
    print(c)
"""

I'm now down to 0.04s per request instead of ~0.08s; I guess the remaining delay comes from the server-side socket, which is not affected by setting TCP_NODELAY on the client side.

Regards.
msg210530 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-02-07 19:15
Updated patch with the 3.2 heuristic for deciding when to concatenate.
msg210535 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2014-02-07 19:36
> Updated patch with the 3.2 heuristic for deciding when to concatenate.

LGTM.

For posterity, a quick benchmark performing 100 trivial remote calls:
Before:
$ ./python ~/test_manager.py
8.202659845352173

After:
$ ./python ~/test_manager.py
0.13146567344665527

Note that we're still behind the pre-rewrite performance (this is on 2.7):
$ ./python ~/test_manager.py
0.0707828998566

That's probably just the C->Python overhead, but I'll double-check.
msg210681 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2014-02-08 21:24
LGTM
msg210686 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2014-02-08 21:43
BTW, I see little difference between 3.2 and the unpatched default branch on MacOSX:

$ py-32/release/python.exe ~/Downloads/test_manager.py 
0.0007331371307373047
8.20159912109375e-05
9.417533874511719e-05
8.082389831542969e-05
7.796287536621094e-05
0.00011587142944335938
0.00011396408081054688
7.891654968261719e-05
8.392333984375e-05
7.605552673339844e-05
10

$ time py-default/release/python.exe ~/Downloads/test_manager.py 
0.0007359981536865234
0.0001289844512939453
0.00018715858459472656
0.00015497207641601562
0.00012087821960449219
0.00013399124145507812
0.00011992454528808594
0.00011587142944335938
0.00010895729064941406
0.00017499923706054688
10
msg210688 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-02-08 22:01
> Note that we're still behind the pre-rewrite performance (this is on
> 2.7):

If I believe the `time` command, the difference is additional userspace CPU time, therefore probably the pure Python overhead.
msg210692 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2014-02-08 22:09
New changeset 4816ab0477d2 by Antoine Pitrou in branch '3.3':
Issue #20540: Fix a performance regression (vs. Python 3.2) when layering a multiprocessing Connection over a TCP socket.
http://hg.python.org/cpython/rev/4816ab0477d2

New changeset 125c24f47f3c by Antoine Pitrou in branch 'default':
Issue #20540: Fix a performance regression (vs. Python 3.2) when layering a multiprocessing Connection over a TCP socket.
http://hg.python.org/cpython/rev/125c24f47f3c
msg210693 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-02-08 22:09
I've committed the fix. If there's less low-hanging fruit to pick, we can do that in another issue :)
msg210752 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-02-09 12:16
_send_bytes() now looks a little complicated.

There is no need for separate branches for n == 0: header + buf where buf is b'' is fast (it is not slower than the additional n > 0 check), so this micro-optimization is not needed.

The chunks list is not needed either; we can just call self._send() directly. This gets rid of the small overhead of creating and iterating a list.
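
A rough sketch of the simplified shape (multi_nagle_simplify.patch is the authoritative version; the 16384-byte threshold is the 3.2-era heuristic already used by the committed fix):

"""
    def _send_bytes(self, buf):
        n = len(buf)
        # For wire compatibility with 3.2 and lower
        header = struct.pack("!i", n)
        if n > 16384:
            # The payload is large, so Nagle's algorithm won't be triggered
            # anyway and copying it into one buffer would be relatively
            # expensive.
            self._send(header)
            self._send(buf)
        else:
            # Issue #20540: concatenate before sending, to avoid delays due
            # to Nagle's algorithm on a TCP socket.  This also avoids sending
            # a 0-length buffer separately, which could raise "broken pipe"
            # if the other end closed the connection.
            self._send(header + buf)
"""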
msg224439 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-07-31 18:40
What would you say about multi_nagle_simplify.patch, Antoine?
msg224450 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2014-07-31 22:47
New changeset 20fd13242a45 by Antoine Pitrou in branch 'default':
Simplify code in multiprocessing.Connection.send_bytes().
http://hg.python.org/cpython/rev/20fd13242a45
History
Date                 User              Action  Args
2022-04-11 14:57:58  admin             set     github: 64739
2014-07-31 22:47:11  python-dev        set     messages: + msg224450
2014-07-31 18:40:57  serhiy.storchaka  set     messages: + msg224439
2014-02-09 12:20:06  pitrou            set     resolution: rejected -> fixed
2014-02-09 12:16:52  serhiy.storchaka  set     files: + multi_nagle_simplify.patch; nosy: + serhiy.storchaka; messages: + msg210752; keywords: + patch
2014-02-08 22:09:57  pitrou            set     status: open -> closed; resolution: rejected; messages: + msg210693; stage: patch review -> resolved
2014-02-08 22:09:17  python-dev        set     nosy: + python-dev; messages: + msg210692
2014-02-08 22:01:40  pitrou            set     messages: + msg210688
2014-02-08 21:43:36  sbt               set     messages: + msg210686
2014-02-08 21:24:10  sbt               set     messages: + msg210681
2014-02-07 19:36:28  neologix          set     messages: + msg210535
2014-02-07 19:15:17  pitrou            set     files: + multi_nagle_2.patch; priority: normal -> high; messages: + msg210530; keywords: + 3.3regression, - patch; stage: patch review
2014-02-07 14:53:52  Irvin.Probst      set     messages: + msg210478
2014-02-07 13:55:55  pitrou            set     messages: + msg210471
2014-02-07 13:29:53  vstinner          set     messages: + msg210466
2014-02-07 13:27:35  pitrou            set     messages: + msg210465
2014-02-07 13:25:35  vstinner          set     nosy: + vstinner
2014-02-07 13:11:24  neologix          set     files: + multi_nagle.diff; nosy: + neologix, pitrou, sbt; messages: + msg210464; keywords: + patch
2014-02-07 12:06:28  Irvin.Probst      create