classification
Title: Python 3.3/3.4 regression in multiprocessing manager?
Type: performance Stage: resolved
Components: Library (Lib) Versions: Python 3.4, Python 3.3
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Irvin.Probst, haypo, neologix, pitrou, python-dev, sbt, serhiy.storchaka
Priority: high Keywords: 3.3regression, patch

Created on 2014-02-07 12:06 by Irvin.Probst, last changed 2014-07-31 22:47 by python-dev. This issue is now closed.

Files
File name Uploaded Description Edit
test_manager.py Irvin.Probst, 2014-02-07 12:06 Test code
multi_nagle.diff neologix, 2014-02-07 13:11
multi_nagle_2.patch pitrou, 2014-02-07 19:15 review
multi_nagle_simplify.patch serhiy.storchaka, 2014-02-09 12:16 review
Messages (16)
msg210457 - (view) Author: Irvin Probst (Irvin.Probst) Date: 2014-02-07 12:06
After checking with the kind people of help@python.org I think I found a bug in the way connections are handled in the multiprocessing module.

Using Python 3.3 I've noticed a performance drop of roughly 70x when running some code performing basic requests on a SyncManager. As this code was buried deep in a big project, I made a test case, attached to this report, to reproduce this behavior.

Here is what this code does:
- define a class SharedData with several instance variables (a, b and c here)
- this class has two methods exposed through a proxy (good() and bad()); both see a huge performance drop under 3.3 and either can be used to reproduce this behavior. The only difference is that good() uses a mutex whereas bad() does not; I wished to check that mutexes were not to blame for this problem.
- create a SyncManager giving access to a SharedData instance
- launch a multiprocessing.Process() running the do_stuff() function; this function calls the good() (or bad()) method of SharedData 10 times through the SyncManager, passing some values to it and getting back the result
- after each call to the proxy, the elapsed time, roughly measured with time.time(), is printed

System specs:
Linux turing 3.12-1-686-pae #1 SMP Debian 3.12.6-2 (2013-12-29) i686 GNU/Linux

Python version:
latest 2.6, 2.7, 3.2 and 3.3 from the standard Debian repos
3.3.0 and 3.4.0 beta3 compiled from source

time elapsed in each call to the proxy using Python 2.6, 2.7 and 3.2:
first call to proxy ~ 0.04 seconds, next calls ~0.001 sec

time elapsed in each call to the proxy using Python 3.3.0, 3.3.2, 3.3.3, 3.4.0b3:
first call to proxy ~0.27 seconds, next calls: 0.07 sec

I reproduced this behavior using python 2.7 and 3.3.3 on an ubuntu computer running the latest amd64 release.

Of course I tried without a print() after each call, but it does not change anything: Python 3.3 is *much* slower here.

Using cProfile I tracked this down to the builtin read() method and indeed timing the read() syscall in posix_read() using gettimeofday() confirms it takes ages, posix_write() is fine though. I think the problem comes from the underlying socket the interpreter is reading from.

connection.py from multiprocessing has been rewritten between 3.2 and 3.3, but I can't see what's wrong in the way it has been done; basic socket options seem to be exactly the same.

One interesting point is that it seems to only affect the last bytes sent through the socket, e.g. if you send a numpy.array big enough to fill the socket's read() buffer, the first calls to read() proceed at normal speed and only the last one takes time.

If you confirm my test code is not to blame, this IMHO makes the SyncManager in 3.3 and 3.4 completely unusable for frequent data exchanges between processes.

Thanks for your time.
msg210464 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2014-02-07 13:11
Thanks for the detailed report.

> connections.py from multiprocessing has been rewrittend between 3.2
> and 3.3 but I can't see what's wrong in the way it has been done,
> basic socket options seem to be exactly the same.

Indeed, multiprocessing.connection has been rewritten in Python (by Antoine :-).

> One interesting point is that it seems to only affect the last bytes
> sent through the socket, e.g if you send a numpy.array big enough to
> fill the socket's read() buffer the first calls to read() are done at
> a normal speed, only the last one takes time.

And that's the catch.

Before, connection.send looked like this (for sockets):
"""
static Py_ssize_t
conn_send_string(ConnectionObject *conn, char *string, size_t length)
{
    Py_ssize_t res;
    /* The "header" of the message is a 32 bit unsigned number (in
       network order) which specifies the length of the "body".  If
       the message is shorter than about 16kb then it is quicker to
       combine the "header" and the "body" of the message and send
       them at once. */
    if (length < (16*1024)) {
        char *message;

        message = PyMem_Malloc(length+4);
        if (message == NULL)
            return MP_MEMORY_ERROR;

        *(UINT32*)message = htonl((UINT32)length);
        memcpy(message+4, string, length);
        Py_BEGIN_ALLOW_THREADS
        res = _conn_sendall(conn->handle, message, length+4);
        Py_END_ALLOW_THREADS
        PyMem_Free(message);
    } else {
        UINT32 lenbuff;

        if (length > MAX_MESSAGE_LENGTH)
            return MP_BAD_MESSAGE_LENGTH;

        lenbuff = htonl((UINT32)length);
        Py_BEGIN_ALLOW_THREADS
        res = _conn_sendall(conn->handle, (char*)&lenbuff, 4) ||
            _conn_sendall(conn->handle, string, length);
        Py_END_ALLOW_THREADS
    }
    return res;
}
"""

So, for short messages, the header was combined with the payload, and written at once to the socket, but for large messages, it was written in two steps.

Now the code looks like this:
"""
    def _send_bytes(self, buf):
        # For wire compatibility with 3.2 and lower
        n = len(buf)
        self._send(struct.pack("!i", n))
        # The condition is necessary to avoid "broken pipe" errors
        # when sending a 0-length buffer if the other end closed the pipe.
        if n > 0:
            self._send(buf)
"""

First the header is sent, then the payload.

The problem is that this write-write-read pattern triggers a nasty interaction between Nagle's algorithm and delayed ACK: in short, if the payload isn't large enough, the TCP stack won't send it until a short delay has elapsed; see http://en.wikipedia.org/wiki/Nagle's_algorithm for more details.

So, one possible fix is to always combine the payload and header.
With the patch attached, we go from:
"""
$ ./python /tmp/test_manager.py
0.28662800788879395
0.08182215690612793
0.08193612098693848
0.08193612098693848
0.08194088935852051
0.08194208145141602
0.0819399356842041
0.08194184303283691
0.08194303512573242
0.0819389820098877
10
"""

to

"""
$ ./python /tmp/test_manager.py
0.04239797592163086
0.00041413307189941406
0.0004057884216308594
0.0004088878631591797
0.0004029273986816406
0.00040793418884277344
0.0004069805145263672
0.0004069805145263672
0.0004050731658935547
0.0004069805145263672
10
"""

Another possibility is to disable Nagle with TCP_NODELAY if available, or use some heuristic similar to the one in 3.2.
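The combine-before-send idea can be sketched like this (illustration only, with hypothetical helper names; the real fix lives in multiprocessing/connection.py):

```python
import socket
import struct

def send_bytes_combined(sock, payload):
    """Send a length-prefixed message in a single write, so the 4-byte
    header is never stranded behind Nagle's algorithm (a sketch of the
    combine-and-send idea, not the actual multiprocessing patch)."""
    header = struct.pack("!i", len(payload))
    sock.sendall(header + payload)  # header and body in one write

def _recv_exact(sock, n):
    # Read exactly n bytes or raise EOFError.
    data = b""
    while len(data) < n:
        chunk = sock.recv(n - len(data))
        if not chunk:
            raise EOFError("connection closed")
        data += chunk
    return data

def recv_bytes(sock):
    """Matching reader: fixed-size header, then the payload."""
    (n,) = struct.unpack("!i", _recv_exact(sock, 4))
    return _recv_exact(sock, n)
```

Since the wire format (big-endian 32-bit length prefix) is unchanged, this stays compatible with a peer that reads header and body separately.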
msg210465 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-02-07 13:27
I would favour reusing the 3.2 heuristic. Joining the header and the payload may be expensive if the payload is very large.
msg210466 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-02-07 13:29
Modern operating systems have functions like writev() to send two consecutive buffers at once. Can't we use them, and emulate them when they are missing (using the 3.2 algorithm: join if small, send separately otherwise)?
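A writev()-based sketch of that suggestion (Unix-only; os.writev() is available since Python 3.3; the function name here is hypothetical):

```python
import os
import socket
import struct

def send_bytes_writev(sock, payload):
    """Scatter-gather send: hand the kernel the header and the body in
    a single writev() call, so they can travel together without the
    user-space copy that concatenation costs for large payloads.
    writev() may perform a partial write, so finish with sendall()."""
    header = struct.pack("!i", len(payload))
    sent = os.writev(sock.fileno(), [header, payload])
    if sent < len(header) + len(payload):
        sock.sendall((header + payload)[sent:])
```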
msg210471 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-02-07 13:55
> Modern operating systems have functions like writev() to send two
> consecutive buffers at once. Can't we use them, and emulate them when
> they are missing (using the 3.2 algorithm: join if small, send
> separately otherwise)?

What would the added complication bring?
msg210478 - (view) Author: Irvin Probst (Irvin.Probst) Date: 2014-02-07 14:53
FWIW, based on your comments I tried a quick and dirty fix in my code, as I can't wait for a new Python release:

The do_stuff function now does:

"""
def do_stuff():
    client=make_client('',6666, b"foo")
    data_proxy=client.get_proxy()

    #make a dummy request to get the underlying
    #fd we are reading from (see bug #20540)
    c=data_proxy.good([1,2],[3,4])
    fd=data_proxy._tls.connection._handle

    #setting TCP_NODELAY on 3.3.x should fix the delay issue until a new release
    sock=socket.fromfd(fd, socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

    for i in range(10):
        t_s=time.time()
        c=data_proxy.good([1,2],[3,4])
        print(time.time()-t_s)
    print(c)
"""

I'm now down to 0.04s per request instead of ~0.08s; I guess the remaining delay comes from the server-side socket, which is not affected by the TCP_NODELAY set on the client side.

Regards.
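For reference, the fromfd() trick above can be wrapped in a small helper (a sketch; it works because socket options live on the underlying open file description, so setting them through the duplicated descriptor affects the original connection too):

```python
import socket

def set_nodelay(fd):
    """Disable Nagle's algorithm on an existing connection given only
    its file descriptor.  socket.fromfd() duplicates the descriptor,
    but setsockopt() applies to the shared underlying socket, so the
    original connection is affected as well."""
    sock = socket.fromfd(fd, socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    finally:
        sock.close()  # closes only the duplicate descriptor
```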
msg210530 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-02-07 19:15
Updated patch with the 3.2 heuristic for deciding when to concatenate.
msg210535 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2014-02-07 19:36
> Updated patch with the 3.2 heuristic for deciding when to concatenate.

LGTM.

For posterity, a quick benchmark performing 100 trivial remote calls:
Before:
$ ./python ~/test_manager.py
8.202659845352173

After:
$ ./python ~/test_manager.py
0.13146567344665527

Note that we're still behind the pre-rewrite performance (this is on 2.7):
$ ./python ~/test_manager.py
0.0707828998566

That's probably just the C->Python overhead, but I'll double-check.
msg210681 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2014-02-08 21:24
LGTM
msg210686 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2014-02-08 21:43
BTW, I see little difference between 3.2 and the unpatched default branch on Mac OS X:

$ py-32/release/python.exe ~/Downloads/test_manager.py 
0.0007331371307373047
8.20159912109375e-05
9.417533874511719e-05
8.082389831542969e-05
7.796287536621094e-05
0.00011587142944335938
0.00011396408081054688
7.891654968261719e-05
8.392333984375e-05
7.605552673339844e-05
10

$ time py-default/release/python.exe ~/Downloads/test_manager.py 
0.0007359981536865234
0.0001289844512939453
0.00018715858459472656
0.00015497207641601562
0.00012087821960449219
0.00013399124145507812
0.00011992454528808594
0.00011587142944335938
0.00010895729064941406
0.00017499923706054688
10
msg210688 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-02-08 22:01
> Note that we're still behind the pre-rewrite performance (this is on
> 2.7):

If the `time` command is to be believed, the difference is additional user-space CPU time, hence probably the pure-Python overhead.
msg210692 - (view) Author: Roundup Robot (python-dev) Date: 2014-02-08 22:09
New changeset 4816ab0477d2 by Antoine Pitrou in branch '3.3':
Issue #20540: Fix a performance regression (vs. Python 3.2) when layering a multiprocessing Connection over a TCP socket.
http://hg.python.org/cpython/rev/4816ab0477d2

New changeset 125c24f47f3c by Antoine Pitrou in branch 'default':
Issue #20540: Fix a performance regression (vs. Python 3.2) when layering a multiprocessing Connection over a TCP socket.
http://hg.python.org/cpython/rev/125c24f47f3c
msg210693 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-02-08 22:09
I've committed the fix. If there's more low-hanging fruit to pick, we can do that in another issue :)
msg210752 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-02-09 12:16
_send_bytes() now looks a little complicated.

There is no need for a separate branch when n == 0: header + buf where buf is b'' is fast (no slower than the additional n > 0 check), so this micro-optimization is not needed.

The chunks list is not needed either; we can just call self._send() directly. This gets rid of the small overhead of creating and iterating a list.
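The simplification described above amounts to something like this (a sketch, not the committed patch; `send` stands in for self._send, and the 16 KiB threshold mirrors the 3.2 heuristic):

```python
import struct

def send_bytes(send, buf, threshold=16 * 1024):
    """Length-prefixed send without the n == 0 special case and
    without building a list of chunks.  `send` is any callable that
    writes bytes (e.g. socket.sendall)."""
    header = struct.pack("!i", len(buf))
    if len(buf) > threshold:
        send(header)  # large payload: skip the concatenation copy
        send(buf)
    else:
        send(header + buf)  # small payload (including empty): one write
```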
msg224439 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-07-31 18:40
What would you say about multi_nagle_simplify.patch, Antoine?
msg224450 - (view) Author: Roundup Robot (python-dev) Date: 2014-07-31 22:47
New changeset 20fd13242a45 by Antoine Pitrou in branch 'default':
Simplify code in multiprocessing.Connection.send_bytes().
http://hg.python.org/cpython/rev/20fd13242a45
History
Date User Action Args
2014-07-31 22:47:11python-devsetmessages: + msg224450
2014-07-31 18:40:57serhiy.storchakasetmessages: + msg224439
2014-02-09 12:20:06pitrousetresolution: rejected -> fixed
2014-02-09 12:16:52serhiy.storchakasetfiles: + multi_nagle_simplify.patch

nosy: + serhiy.storchaka
messages: + msg210752

keywords: + patch
2014-02-08 22:09:57pitrousetstatus: open -> closed
resolution: rejected
messages: + msg210693

stage: patch review -> resolved
2014-02-08 22:09:17python-devsetnosy: + python-dev
messages: + msg210692
2014-02-08 22:01:40pitrousetmessages: + msg210688
2014-02-08 21:43:36sbtsetmessages: + msg210686
2014-02-08 21:24:10sbtsetmessages: + msg210681
2014-02-07 19:36:28neologixsetmessages: + msg210535
2014-02-07 19:15:17pitrousetfiles: + multi_nagle_2.patch
priority: normal -> high
messages: + msg210530

keywords: + 3.3regression, - patch
stage: patch review
2014-02-07 14:53:52Irvin.Probstsetmessages: + msg210478
2014-02-07 13:55:55pitrousetmessages: + msg210471
2014-02-07 13:29:53hayposetmessages: + msg210466
2014-02-07 13:27:35pitrousetmessages: + msg210465
2014-02-07 13:25:35hayposetnosy: + haypo
2014-02-07 13:11:24neologixsetfiles: + multi_nagle.diff

nosy: + neologix, pitrou, sbt
messages: + msg210464

keywords: + patch
2014-02-07 12:06:28Irvin.Probstcreate