classification
Title: Speedup DefaultSelectors.modify() by 2x
Type: performance Stage: patch review
Components: asyncio, Library (Lib) Versions: Python 3.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: giampaolo.rodola, neologix, vstinner, yselivanov
Priority: normal Keywords: patch

Created on 2017-04-07 10:40 by giampaolo.rodola, last changed 2017-06-09 20:20 by vstinner.

Files
File name Uploaded Description Edit
selectors_modify.diff giampaolo.rodola, 2017-04-07 10:40 review
bench.py giampaolo.rodola, 2017-04-07 10:40
bench_selectors_modify.py vstinner, 2017-04-07 12:24
echo_server.py giampaolo.rodola, 2017-04-08 18:31
echo_client.py giampaolo.rodola, 2017-04-08 18:31
Pull Requests
URL Status Linked Edit
PR 1030 merged giampaolo.rodola, 2017-04-07 12:49
Messages (21)
msg291261 - (view) Author: Giampaolo Rodola' (giampaolo.rodola) * (Python committer) Date: 2017-04-07 10:40
Patch in attachment modifies DefaultSelector.modify() so that it uses the underlying selector's modify() method instead of unregister() and register() resulting in a 2x speedup.

Without patch:
~/svn/cpython {master}$ ./python bench.py 
0.006010770797729492

With patch:
~/svn/cpython {master}$ ./python bench.py 
0.00330352783203125
msg291264 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2017-04-07 12:02
Hm, do you have a realistic benchmark which would show the benefit?
Because this is really a micro-benchmark, and I'm not convinced that
Selector.modify() is a significant bottleneck in a real-world
application.
msg291265 - (view) Author: Giampaolo Rodola' (giampaolo.rodola) * (Python committer) Date: 2017-04-07 12:19
modify() can be used often in verbose protocols such as FTP and SMTP where a lot of requests and responses are continuously exchanged between client and server, so you need to switch from EVENT_READ to EVENT_WRITE all the time. I did a similar change some years ago in pyftpdlib but I don't have another benchmark other than this one.
msg291266 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-04-07 12:24
Hi Giampaolo Rodola'! It seems like you proposed the same idea 4 years ago and I wrote a similar patch: issue #18932 :-)

I suggest you to use my perf module to produce more reliable benchmarks. Here is my results on my computer smithers tuned for benchmarks:

haypo@smithers$ ./python bench_selectors.py -o ref.json
[apply the patch]
haypo@smithers$ ./python bench_selectors.py -o patch.json
haypo@smithers$ ./python -m perf compare_to ref.json patch.json  --table
+----------------------+---------+------------------------------+
| Benchmark            | ref     | patch                        |
+======================+=========+==============================+
| PollSelector.modify  | 11.3 us | 8.22 us: 1.37x faster (-27%) |
+----------------------+---------+------------------------------+
| EpollSelector.modify | 13.5 us | 8.88 us: 1.52x faster (-34%) |
+----------------------+---------+------------------------------+

Not significant (1): SelectSelector.modify


@neologix: "Hm, do you have a realistic benchmark which would show the benefit?"

I don't think that selector.modify() can be a bottleneck, but IMHO the change is simple and safe enough to be worth it. In a network server with 10k client, an optimization making .modify() 1.52x faster is welcomed.
msg291267 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-04-07 12:28
@Giampaolo Rodola: CPython development moved to GitHub, can you please create a pull request instead of a patch? Thank you.
https://docs.python.org/devguide/pullrequest.html

Hum, I see that my old patch of issue #18932 (selectors_optimize_modify-2.patch) is different. I tried to factorize code. What do you think of my change?
msg291268 - (view) Author: Giampaolo Rodola' (giampaolo.rodola) * (Python committer) Date: 2017-04-07 12:40
Hey Stinner, thanks for chiming in! Your old patch uses unregister() and register() so I'm not sure what speedup that'll take as the whole point is to avoid doing that. You may want to rewrite it and benchmark it but it looks like it'll be slower.
msg291269 - (view) Author: Giampaolo Rodola' (giampaolo.rodola) * (Python committer) Date: 2017-04-07 12:49
PR: https://github.com/python/cpython/pull/1030
msg291270 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-04-07 12:58
@neologix: Do you have a GitHub account? It's hard to me to see if https://github.com/neologix is you or not, and your GitHub username is not filled in the Bug Tracker database.
msg291271 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-04-07 13:03
Giampaolo Rodola': "Your old patch uses unregister() and register() so I'm not sure what speedup that'll take as the whole point is to avoid doing that."

My patch calls generic unregister() and register() of the base classes, but it only uses one syscall per modify() call.

Oh by the way, it seems like my patch changes KqueueSelector, but your change doesn't. Am I right? KqueueSelector is the default selector on FreeBSD and macOS. IMHO it's worth it to optimize it as well.
msg291272 - (view) Author: Giampaolo Rodola' (giampaolo.rodola) * (Python committer) Date: 2017-04-07 13:11
> My patch calls generic unregister() and register() of the base
> classes, but it only uses one syscall per modify() call.

Doesn't that mean doing 3 operations (unregister(), register(), modify()) instead of the current 2 (unregister(), register())? I don't see how it can be faster than a single modify() syscall.

> Oh by the way, it seems like my patch changes KqueueSelector, 
> but your change doesn't. Am I right?

You are right but it looks like you end up doing the same thing as unregister() and register(). kqueue() has no modify() method so I don't think it can benefit from this change.
msg291273 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-04-07 13:24
> Doesn't that mean doing 3 operations (unregister(), register(), modify()) instead of the current 2 (unregister(), register())? I don't see how it can be faster than a single modify() syscall.

The idea is to reuse _BaseSelectorImpl.register() and _BaseSelectorImpl.unregister() to factorize the code. These methods don't use syscall, they create the SelectorKey object and update _fd_to_key. So each class doesn't have to redo these things.

I don't insist to redo what I did, I'm just trying to explain my change because your change basically copy/paste the same code 3 times, and you forgot KqueueSelector, so you even may have to copy it a 4th time ;-)
msg291274 - (view) Author: Giampaolo Rodola' (giampaolo.rodola) * (Python committer) Date: 2017-04-07 13:39
> The idea is to reuse _BaseSelectorImpl.register() and
> _BaseSelectorImpl.unregister() to factorize the code.

You can't factorize the logic of modify() into those as they do two different things. I also don't like repeating the same thing 3 times but given how the module is organized I'm not sure how to do that as I need to pass 3 things around: the low-level selector (epoll, poll, whatever) and the read and write constants (POLLIN, EPOLLIN) which change depending on the selector being used.
The same thing applies to the devpoll class (http://bugs.python.org/issue18931).
I can write a second patch which to refactor the whole module if that is desirable but I prefer to do that in another PR.
msg291280 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2017-04-07 17:34
> I don't think that selector.modify() can be a bottleneck, but IMHO the change is simple and safe enough to be worth it. In a network server with 10k client, an optimization making .modify() 1.52x faster is welcomed.

IMHO it complicates the code for little benefit: that's why I asked
for a realistic benchmark. This patch could made modify() 10x faster,
if modify() only accounts for 1% of the overall overhead in a
realistic use-case, then it's not worth it.
msg291282 - (view) Author: Giampaolo Rodola' (giampaolo.rodola) * (Python committer) Date: 2017-04-07 18:09
In certain protocols modify() is supposed to be used on every interaction between client and server. E.g. an FTP server does this:

- register(fd, EVENT_READ); recv()         # wait command from client
- modify(fd, EVENT_WRITE); send(response)  # send response
- modify(fd, EVENT_READ); recv()           # wait for new command

...so it's two calls for each command received.
In asyncio modify() is also used in different circumstances. 
If you're worried about code duplication I can refactor selectors.py first, but IMO it would be a bad idea to reject this or future improvements because the current code status prevents code reuse. Right now there are already 3 classes sharing basically the same code (poll, epoll and devpoll related classes). Those can be refactored similarly to this:
https://github.com/giampaolo/pyftpdlib/blob/ab699b5f89223e03593f3e004d6a370b4c2e5308/pyftpdlib/ioloop.py#L465-L565
msg291286 - (view) Author: Giampaolo Rodola' (giampaolo.rodola) * (Python committer) Date: 2017-04-07 19:25
@neologix: here's a PR which refactors the poll-related classes: https://github.com/python/cpython/pull/1035/files
From there, we'll be able to avoid modify() code duplication.
msg291306 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2017-04-07 22:41
This refactoring was already suggested a long time ago, and at the
time both Guido and I didn't like it because it makes the code
actually more complicated: DRY in this case doesn't apply IMO.

Also, this whole thread is a repeat of:
http://bugs.python.org/issue18932

At the time, I already asked for one realistic use case demonstrating
the benefit of implementing modify() natively instead of
unregister()+register().
I know that this code makes modify() faster, but as I said multiple
times, I'd like to se the impact on something else than a
micro-benchmark: arguably if you propose this it's because you've
profiled/found it to be an actual bottleneck, do you have *one*
benchmark to support this change?
msg291340 - (view) Author: Giampaolo Rodola' (giampaolo.rodola) * (Python committer) Date: 2017-04-08 18:31
For the sake of experiment I'm attaching a toy echo server which uses modify() to switch between EVENT_READ and EVENT_WRITE. Without patch I get 35000 req/sec, with patch around 39000 req/sec (11.4% faster).
To be entirely honest a smarter echo server would switch between EVENT_READ / EVENT_WRITE only in case of BlockingIOError on send() or recv(), but unfortunately it's a condition which (for send()) does not occur often by testing on localhost (for recv() it naturally occurs +27000 times out of 39000). On the other hand, by experimenting with a real network it occurs quite soon:

import socket
sock = socket.socket()
sock.connect(('google.com', 80))
sock.setblocking(False)
print(sock.send(b'x' * 50000))
print(sock.send(b'x' * 50000))  # raise BlockingIOError

So basically my benchmark is emulating a worst case scenario in which send() always blocks on the first call and succeed on the next one, in order to mimic recv() which blocks half of the times.
The whole point is that the speedup entirely depends on how many times send() or recv() will block in a real world app connected to the internet (because that's when modify() is supposed to be used) but that's hard to determine, especially under heavy server loads which is hard to emulate. This also involves the SSL handshake when it fails with WANT_READ / WANT_WRITE, which is another circumstance where modify() is going to be used and for which I have no benchmarks.
I personally don't mind about selectors module per-se, but given that it's the core of important libs such as asyncio and curio I would like to hear a better rationale than "let's reject this 1.5x speedup because DRY does not apply in this case".
Also please consider that Tornado, Twisted and gevent use native modify() method.
msg292586 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2017-04-29 09:24
The rationale for rejecting wouldn't be "DRY does not apply in this
case", it would be that this makes the code more complicated, and that
a negligible speedup would not be worth it.
Now, thanks to your benchmark, a 10% speedup is not negligible, so
that seems like a reasonable change.
I'll have a look at your refactoring code, and once pushed, we can
optimize modify().
msg292602 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-04-29 22:23
I also like the plan starting with refactoring ;-)
msg295338 - (view) Author: Giampaolo Rodola' (giampaolo.rodola) * (Python committer) Date: 2017-06-07 14:10
OK, https://github.com/python/cpython/pull/1030 should be good to go.
msg295564 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-06-09 20:20
New changeset fbfaa6fd57f8dc8a3da808acb8422370fad2f27b by Victor Stinner (Giampaolo Rodola) in branch 'master':
bpo-30014: make poll-like selector's modify() method faster (#1030)
https://github.com/python/cpython/commit/fbfaa6fd57f8dc8a3da808acb8422370fad2f27b
History
Date User Action Args
2017-06-09 20:20:43vstinnersetmessages: + msg295564
2017-06-07 14:10:40giampaolo.rodolasetmessages: + msg295338
2017-04-29 22:23:24vstinnersetmessages: + msg292602
2017-04-29 09:24:18neologixsetmessages: + msg292586
2017-04-08 18:31:53giampaolo.rodolasetfiles: + echo_client.py
2017-04-08 18:31:40giampaolo.rodolasetfiles: + echo_server.py

messages: + msg291340
2017-04-07 22:41:26neologixsetmessages: + msg291306
2017-04-07 19:25:28giampaolo.rodolasetmessages: + msg291286
2017-04-07 18:09:29giampaolo.rodolasetmessages: + msg291282
2017-04-07 17:34:58neologixsetmessages: + msg291280
title: Speedup DefaultSelectors.modify() by 1.5x -> Speedup DefaultSelectors.modify() by 2x
2017-04-07 14:31:15gvanrossumsetnosy: - gvanrossum
2017-04-07 13:39:46giampaolo.rodolasetmessages: + msg291274
2017-04-07 13:24:04vstinnersetmessages: + msg291273
2017-04-07 13:11:44giampaolo.rodolasetmessages: + msg291272
2017-04-07 13:03:27vstinnersettitle: Speedup DefaultSelectors.modify() by 2x -> Speedup DefaultSelectors.modify() by 1.5x
2017-04-07 13:03:12vstinnersetmessages: + msg291271
2017-04-07 12:58:56vstinnersetmessages: + msg291270
2017-04-07 12:49:22giampaolo.rodolasetmessages: + msg291269
pull_requests: + pull_request1187
2017-04-07 12:40:38giampaolo.rodolasetmessages: + msg291268
2017-04-07 12:28:21vstinnersetmessages: + msg291267
2017-04-07 12:24:37vstinnersetfiles: + bench_selectors_modify.py

messages: + msg291266
2017-04-07 12:19:16giampaolo.rodolasetmessages: + msg291265
2017-04-07 12:02:25neologixsetmessages: + msg291264
2017-04-07 11:01:15giampaolo.rodolasettype: performance
components: + Library (Lib), asyncio
stage: patch review
2017-04-07 10:41:47giampaolo.rodolasetnosy: + gvanrossum, vstinner, neologix, yselivanov
2017-04-07 10:40:26giampaolo.rodolasetfiles: + bench.py
versions: + Python 3.7
2017-04-07 10:40:07giampaolo.rodolacreate