classification
Title: select.epoll.poll() should avoid calling malloc() each time
Type: performance Stage:
Components: Library (Lib) Versions: Python 3.6
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: Nosy List: josh.r, serhiy.storchaka, vstinner, yselivanov
Priority: normal Keywords: patch

Created on 2016-01-28 15:11 by vstinner, last changed 2016-02-02 22:37 by vstinner. This issue is now closed.

Files
File name Uploaded Description Edit
epoll_stack.patch vstinner, 2016-01-28 15:11 review
epoll_buffer.patch vstinner, 2016-01-28 15:52 review
bench_epoll_poll.py vstinner, 2016-01-28 16:42
Messages (10)
msg259141 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-01-28 15:11
My colleague Fabio M. Di Nitto noticed that the memory allocation on the heap (PyMem_Malloc()) in select.epoll.wait() can have an impact on performance when select.epoll.wait() is a busy applicatin.

He proposed attached patch to allocate memory on the stack for up to FD_SETSIZE-1 events: see attached patch.
msg259143 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-01-28 15:52
I dislike allocating large memory block on the stack, we should not abuse it.

I propose instead to store the memory block allocated on the heap in the epoll object. Attached patch implements this idea.

The buffer is stolen from the epoll object when epoll.poll() is called. Then it is stored again inside the epoll object, except if another thread stored a larger buffer in the meanwhile. The largest buffer is also kept.

The buffer exchange is protected by the GIL.

My patch also overallocates the buffer by 50% to avoid calling realloc() to many times. Using the selectors module, maxevents is the number of registered FD. Using asyncio, the number of registered FD changes a lot.

Side effect: the memory block is not released after the call to epoll.poll(). If you can it with an insane max_events, it will bloat your memory until your close the poller.

If the approach is considered interested and efficient, I can work on a similar patch for other pollers.
msg259144 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-01-28 16:01
Any benchmarks?
msg259146 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-01-28 16:42
> Any benchmarks?

Wait, it's easier to write a patch than to cook a benchmark :-)

Attached script bench_epoll_poll.py creates a lot of sockets and call epoll.poll() in non-blocking mode (timeout=0). You can decide to create or not events.

It looks like performance when epoll.poll() gets events are unchanged with my patch.

The performance are better when epoll.poll() doesn't get any event: between 6% and 20% faster.

I'm not completly sure that it's worth it since we are talking about nanoseconds. The largest seen difference was 94 nanoseconds: 420 ns => 326 ns.

My microbenchmark is unstable. Even if it runs the benchmark 25 or 50 times and take the minimum, I have to run it between 3 and 10 times to see the "real" minimum timing... It's a common problem for microbenchmarks with timing smaller than 1 ms.


Create read+events (10 sockets):

*  Without the patch: Best of 25 runs (100000 loops, 10 FDs): 1294 ns per call to epoll.poll()
*  With the patch: Best of 50 runs (100000 loops, 10 FDs): 1221 ns (-6%) per call to epoll.poll()

Create read+events (8000 sockets):

*  Without the patch: Best of 25 runs (10 loops, 16000 FDs): 4.6 ms per call to epoll.poll()
*  With the patch: Best of 25 runs (10 loops, 16000 FDs): 4.6 ms (same) per call to epoll.poll()

Don't create events (10 sockets):

*  Without the patch: Best of 50 runs (100000 loops, 10 FDs): 367 ns per call to epoll.poll()
*  With the patch: Best of 50 runs (100000 loops, 10 FDs): 343 ns (-7%) per call to epoll.poll()

Don't create events (1000 sockets):

*  Without the patch: Best of 50 runs (100000 loops, 1000 FDs): 420 ns per call to epoll.poll()
*  With the patch: Best of 50 runs (100000 loops, 1000 FDs): 326 ns (-22%) per call to epoll.poll()

Don't create events (16000 sockets):

*  Without the patch: Best of 50 runs (100000 loops, 16000 FDs): 422 ns per call to epoll.poll()
*  With the patch: Best of 50 runs (100000 loops, 16000 FDs): 338 ns (-20%) per call to epoll.poll()
msg259147 - (view) Author: Josh Rosenberg (josh.r) * (Python triager) Date: 2016-01-28 16:45
Overallocating by 50% might be overkill here; I wouldn't imagine most users of epoll.poll would use anything but:

1. Limit to 1 event
2. Limit to some constant K events
3. Use default maxevents (FD_SETSIZE - 1)

Even if called in pathological order, that would only involve three progressive reallocations before steady state; potentially avoiding one of them (if K and FD_SETSIZE - 1 are within the 50% overallocation) in a relatively uncommon case doesn't seem like it's worth the guaranteed waste in the common case.
msg259148 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-01-28 17:09
> Overallocating by 50% might be overkill here; I wouldn't imagine most users of epoll.poll would use anything but: (...)

See my second message:
"My patch also overallocates the buffer by 50% to avoid calling realloc() to many times. Using the selectors module, maxevents is the number of registered FD. Using asyncio, the number of registered FD changes a lot."

If you use selectors or asyncio, max_events depends on the current number of registered FD and it almost changes at each call to epoll.poll().

asyncio only registers an FD to listen for read events when the application starts to read on this FD. There is also an optimization: it first tries to read bytes in non-blocking mode. It only registers the FD if it receives 0 byte (if recv() fails with a "woud block" error). It's similar for write events.
msg259370 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-02 08:40
@Yury, Serhiy: Do you think that it's worth to avoid malloc in epoll? Aside of the performance, my colleague also told me that heavy pressure on the memory allocator can slowly create framgmentation of the heap memory, which is true.
msg259397 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-02-02 15:48
I'm not sure that it is worth to apply this optimization. The patch adds half a hundred lines of complex code for only 80 ns benefit. On my computer just incrementing an integer takes 100 ns.
msg259416 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-02 18:51
> Aside of the performance, my colleague also told me that heavy pressure on the memory allocator can slowly create framgmentation of the heap memory, which is true.

Right, but there are so many other things which contribute to the memory fragmentation.. I doubt that optimizing epoll will make any detectable improvement on that side.

All in all, I think I'm with Serhiy -- since the performance benefits are unclear, we shouldn't complicate the code.
msg259442 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-02 22:37
Serhiy Storchaka: "I'm not sure that it is worth to apply this optimization. The patch adds half a hundred lines of complex code for only 80 ns benefit. On my computer just incrementing an integer takes 100 ns."

Yury Selivanov: "(...) Right, but there are so many other things which contribute to the memory fragmentation.. I doubt that optimizing epoll will make any detectable improvement on that side."

To be honest, I expected better speedup before I ran the benchmark. That's why I repeat that any performance patch must come with a concrete benchmark showing a non-negligible speedup ;-)

I reject my optimization since it doesn't make the code faster.
History
Date User Action Args
2016-02-02 22:37:34vstinnersetstatus: open -> closed
resolution: rejected
messages: + msg259442
2016-02-02 18:51:32yselivanovsetmessages: + msg259416
2016-02-02 15:48:39serhiy.storchakasetmessages: + msg259397
2016-02-02 08:40:09vstinnersetmessages: + msg259370
2016-01-28 20:40:29vstinnersettitle: select.epoll.poll() should away calling malloc() each time -> select.epoll.poll() should avoid calling malloc() each time
2016-01-28 17:09:05vstinnersetmessages: + msg259148
2016-01-28 16:45:53josh.rsetnosy: + josh.r

messages: + msg259147
title: select.epoll.wait() should away calling malloc() each time -> select.epoll.poll() should away calling malloc() each time
2016-01-28 16:42:48vstinnersetfiles: + bench_epoll_poll.py

messages: + msg259146
2016-01-28 16:01:11serhiy.storchakasetmessages: + msg259144
2016-01-28 15:52:43vstinnersetfiles: + epoll_buffer.patch

messages: + msg259143
2016-01-28 15:11:45vstinnercreate