classification
Title: increase epoll.poll() maxevents default value, and improve documentation
Type: enhancement Stage: resolved
Components: Library (Lib) Versions: Python 3.4
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: neologix, pitrou, sbt
Priority: normal Keywords:

Created on 2013-01-06 00:33 by neologix, last changed 2013-01-06 20:37 by neologix. This issue is now closed.

Files
File name Uploaded Description Edit
test_epoll.py neologix, 2013-01-06 14:37
test_epoll_2.py sbt, 2013-01-06 19:49
Messages (12)
msg179157 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-01-06 00:33
In issue #16853, it was noted that several projects don't set the epoll.poll() maxevents argument, which effectively limits the number of events returned to FD_SETSIZE - 1 (set in selectmodule.c).

Also, the method's documentation can confuse users into thinking that, by default, the number of events is unlimited:
"""

.. method:: epoll.poll(timeout=-1, maxevents=-1)

   Wait for events. timeout in seconds (float)
"""

It would probably make sense to use a larger default value for epoll maxevents (the only downside is increased memory consumption for the events array), maybe based on the RLIMIT_NOFILE hard limit.
And the documentation should probably be improved.
msg179180 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2013-01-06 13:12
Is this actually a problem?

If events are arranged in a queue and epoll_wait() just removes the oldest events (up to maxevents) from that queue then there would be no problem with using a small value for maxevents.

I don't *know* if that is the case, but I would consider epoll to be broken if it does not do something similar.
msg179183 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-01-06 14:37
The implementation can't simply drain a queue, because epoll can be
level-triggered (which is the case by default), so you want to keep
events around. The kernel actually uses a red-black tree, but I didn't
really check how it's done (there are probably very good reasons for
that).

Anyway, it can turn out to be a problem, for two reasons:
- performance: with a maxevents value that is too low, several calls
to epoll_wait() must be made, instead of being able to process all
events at once
- the main problem that came to my mind is really starvation: let's
say you have 2*FD_SETSIZE client sockets registered in your poll
object. The first call to epoll_wait() returns sockets from 0 to
FD_SETSIZE-1: you process them, so they're not ready anymore. The next
call returns the clients from FD_SETSIZE to 2*FD_SETSIZE, same thing.
But by the time you call epoll_wait() for the third time, if the first
FD_SETSIZE clients are ready again, they will be returned, etc. So the
2*FD_SETSIZE-th client may very well never be reported ready: that's
starvation.

I actually wrote a script to reproduce this issue:
"""
$ ./python /home/cf/test_epoll.py
Working with 4080 FDs, -1 maxevents
Number of missing FDs:4080
Number of ready FDs: 1023
Number of missing FDs:3057
Number of ready FDs: 0
Number of missing FDs:3057
Number of ready FDs: 1023
Number of missing FDs:2034
Number of ready FDs: 0
Number of missing FDs:2034
Number of ready FDs: 1023
Number of missing FDs:2034
Number of ready FDs: 0
Number of missing FDs:2034
Number of ready FDs: 1023
Number of missing FDs:2034
Number of ready FDs: 0
[...]
"""

If you specify a large enough maxevents:
"""
$ ./python /home/cf/test_epoll.py 64000
Working with 4080 FDs, 64000 maxevents
Number of missing FDs:4080
Number of ready FDs: 4080
"""

Note that it's really a corner case, but I stumbled upon this problem
while writing a test in issue #16853, and several projects (Tulip,
Tornado, pyftpdlib) fall into this trap.

I see several options:
1) just keep it that way (i.e. with maxevents set to FD_SETSIZE), and
add a note in the documentation. I think it's always better to handle
problems in the library than let the users get bitten.
2) increase `maxevents` default value. The problem is that I don't
like magic values, and a too-large value could incur increased memory
consumption (and, with the current implementation, reduced performance,
because the epoll events buffer is allocated at each poll(); see issue
#16876).
3) use a simple heuristic: start with a reasonable value for
`maxevents` (FD_SETSIZE seems like a good candidate), and if
epoll_wait() ever returns `maxevents` events, double the value (that's
what libevent does, with a capping to 32000).
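Option 3 could be sketched as follows (an illustrative sketch only: the FD_SETSIZE value is a typical one rather than the one compiled into selectmodule.c, and `poll_with_growing_buffer` is a made-up helper name):

```python
# Illustrative sketch of the doubling heuristic (option 3).
FD_SETSIZE = 1024      # typical value; assumed for illustration
MAXEVENTS_CAP = 32000  # libevent-style cap mentioned above

def poll_with_growing_buffer(ep, maxevents=FD_SETSIZE, timeout=-1):
    """Poll a select.epoll object once; return (events, next_maxevents).

    If the events buffer came back full, there may be more ready fds
    than we asked for, so double maxevents (capped) for the next call.
    """
    events = ep.poll(timeout, maxevents)
    if len(events) == maxevents and maxevents < MAXEVENTS_CAP:
        maxevents = min(maxevents * 2, MAXEVENTS_CAP)
    return events, maxevents
```

The caller would feed the returned `next_maxevents` into its next poll, so a busy loop converges on a buffer large enough to drain the ready list in one call.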
msg179185 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-01-06 14:53
> 3) use a simple heuristic: start with a reasonable value for
> `maxevents` (FD_SETSIZE seems like a good candidate), and if
> epoll_wait() ever returns `maxevents` events, double the value (that's
> what libevent does, with a capping to 32000).

What if some events are edge-triggered?
msg179189 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2013-01-06 15:45
> I actually wrote a script to reproduce this issue:

The program does *not* demonstrate starvation because you are servicing the resource represented by the "starved" duplicate fds before calling poll() again.

You are creating thousands of duplicate handles for the same resource and then complaining that they do not behave independently!

I tried modifying your program by running poll() in a loop, exiting when no more unseen fds are reported as ready.  This makes the program exit immediately.

So

        ready_writers = set(fd for fd, evt in 
                            ep.poll(-1, MAXEVENTS) if fd != r)
        seen_writers |= ready_writers

becomes

        while True:
            ready_writers = set(fd for fd, evt in
                                ep.poll(-1, MAXEVENTS) if fd != r)
            if ready_writers.issubset(seen_writers):
                break
            seen_writers |= ready_writers

I still cannot see a problem with epoll().
msg179201 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-01-06 18:14
> The program does *not* demonstrate starvation because you are servicing the resource represented by the "starved" duplicate fds before calling poll() again.

No.
What the program does is the following:

while all the writer FDs have not been returned by epoll.wait() at least once:
    1) make all writer FDs ready by draining the pipe
    2) fetch the list of events reported by epoll(), and add them to
the set of seen FDs
    3) make all writer FDs not ready by filling the pipe
    4) fetch the list of events reported by epoll(), and add them to
the set of seen FDs

With the default 'maxevents' parameter, it never completes.

This shows that the FDs returned at step 2 are actually a strict
subset of all the ready FDs, forever: some FDs which are
actually ready (they are all ready at step 2) are *never* reported.
This is starvation.

By increasing epoll.wait() maxevents, it completes immediately.
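The drain/fill loop described above can be sketched roughly like this (an illustrative reconstruction, not the attached test_epoll.py; the fd count, helper names, and loop cap are made up):

```python
# Rough reconstruction of the drain/fill test described above; not
# the attached test_epoll.py. All writer fds are dups of one pipe
# write end, registered for EPOLLOUT (level-triggered by default).
import os
import select

def run(num_fds, maxevents, max_loops=50):
    """Return the number of loops needed to see every writer fd,
    or None if max_loops is hit (i.e. some fds look starved)."""
    r, w = os.pipe()
    os.set_blocking(r, False)
    os.set_blocking(w, False)
    writers = [os.dup(w) for _ in range(num_fds)]
    ep = select.epoll()
    for fd in writers:
        ep.register(fd, select.EPOLLOUT)

    def drain():  # step 1: empty the pipe -> every writer ready
        try:
            while True:
                os.read(r, 65536)
        except BlockingIOError:
            pass

    def fill():   # step 3: fill the pipe buffer -> no writer ready
        try:
            while True:
                os.write(w, b'x' * 4096)
        except BlockingIOError:
            pass

    all_writers, seen = set(writers), set()
    loops = 0
    try:
        while all_writers - seen:
            if loops == max_loops:
                return None
            loops += 1
            drain()
            seen |= {fd for fd, _ in ep.poll(0, maxevents)}  # step 2
            fill()
            seen |= {fd for fd, _ in ep.poll(0, maxevents)}  # step 4
        return loops
    finally:
        ep.close()
        for fd in writers:
            os.close(fd)
        os.close(r)
        os.close(w)
```

With maxevents at least num_fds, every writer is seen on the first loop; whether a smaller maxevents really makes the seen set stop growing is exactly the point contested in the following messages.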

> You are creating thousands of duplicate handles for the same resource and then complaining that they do not behave independently!

The fact that the FDs are duped shouldn't change anything about the
events reported: it works as long as the number of FDs is less than
FD_SETSIZE (the epoll_wait() maxevents argument).

See also epoll() documentation:
"""

       Q0  What is the key used to distinguish the file descriptors
           registered in an epoll set?

       A0  The key is the combination of the file descriptor number and
           the open file description (also known as an "open file
           handle", the kernel's internal representation of an open
           file).

       Q1  What happens if you register the same file descriptor on an
           epoll instance twice?

       A1  You will probably get EEXIST.  However, it is possible to
           add a duplicate (dup(2), dup2(2), fcntl(2) F_DUPFD)
           descriptor to the same epoll instance.  This can be a useful
           technique for filtering events, if the duplicate file
           descriptors are registered with different events masks.
"""

I just used dup() to make it easier to test, but you'd probably get
the same thing if your FDs were sockets connected to different
endpoints.

> I tried modifing your program by running poll() in a loop, exiting when no more unseen fds are reported as ready.  This makes the program exit immediately.
>
> So
>
>         ready_writers = set(fd for fd, evt in
>                             ep.poll(-1, MAXEVENTS) if fd != r)
>         seen_writers |= ready_writers
>
> becomes
>
>         while True:
>             ready_writers = set(fd for fd, evt in
>                                 ep.poll(-1, MAXEVENTS) if fd != r)
>             if ready_writers.issubset(seen_writers):
>                 break
>             seen_writers |= ready_writers

Of course it does, since the returned FDs are a subset of all the
ready file descriptors.

The point is precisely that, when there are more FDs ready than
maxevents, some FDs will never be reported.
msg179207 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2013-01-06 19:20
> The fact that that the FDs are duped shouldn't change anything to the
> events reported: it works while the number of FDs is less than
> FD_SETSIZE (epoll_wait() maxevents argument).

That assumes that epoll_wait() is supposed to return *all* ready fds.  But that is not possible because maxevents is finite.  If you want all events then obviously you may need to call epoll_wait() multiple times.

> I just used dup() to make it easier to test, but you'll probably get
> the same thing it your FDs were sockets connected to different
> endpoints.

This is the part I disagree with -- I think it makes all the difference.  Please try making such a modification.

>>         while True:
>>             ready_writers = set(fd for fd, evt in
>>                                 ep.poll(-1, MAXEVENTS) if fd != r)
>>             if ready_writers.issubset(seen_writers):
>>                 break
>>             seen_writers |= ready_writers
>
>Of course it does, since the returned FDs are a subset of all the
>ready file descriptors.
>
>The point is precisely that, when there are more FDs ready than
>maxevents, some FDs will never be reported.

The program can only terminate when the outer

    while all_writers - seen_writers:
        ...

loop terminates.  So seen_writers == all_writers, and every fd has been reported.
msg179210 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-01-06 19:42
> That assumes that epoll_wait() is supposed to return *all* ready fds.  But that is not possible because maxevents is finite.  If you want all events then obviously you may need to call epoll_wait() multiple times.

Yes, but the problem is that between two epoll_wait() calls, the
readiness of the FDs can have changed: if that happens, you'll get
the same list over and over.

> The program can only terminate when the outer
>
>     while all_writers - seen_writers:
>         ...
>
> loop terminates.  So seen_writers == all_writers, and every fd has been reported.

Yes, but no events are generated between the calls to epoll_wait() in
your inner loop.
A typical usage of select()/poll()/epoll() looks like:

while True:
    evts = poll()
    for fd, evt in evts:
        do_something(fd)

and between two calls to poll(), you can get new events (new
connections, space in the socket buffer, etc.). The pipe
filling/draining is used to generate new events. Your modification
doesn't take that into account.

> This is the part I disagree with -- I think it makes all the difference.  Please try making such a modification.

Will do.
msg179211 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2013-01-06 19:49
Here is a version which uses epoll to service a number of pipes which is larger than maxevents.  (If NUM_WRITERS is too large then I get "OSError: [Errno 24] Too many open files".)

All pipes get serviced and the output is:

Working with 20 FDs, 5 maxevents
[5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43]
[15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43]
[25, 27, 29, 31, 33, 35, 37, 39, 41, 43]
[35, 37, 39, 41, 43]

The lists show the (sorted) unseen writers at each loop.
msg179212 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-01-06 19:54
Of course it does: the write ends all get filled, so the number of ready writers drops to zero... That's not at all the problem I'm talking about (I may not have been clear; a script will make it clearer).
msg179213 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2013-01-06 19:55
> Yes, but the problem is that between two epoll_wait() calls, the
> readiness of the FDs can have changed: and if that happens, you'll get
> the same list over and over.

If an fd *was* ready but isn't anymore then why would you want to know about it?  Trying to use the fd will fail with EAGAIN.
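In code, the point reads roughly as follows (a minimal illustration, not from the attached scripts): a nonblocking write to a pipe whose buffer is full just fails with EAGAIN, so acting on a stale readiness report costs one failed syscall and nothing more.

```python
# Minimal illustration: writing to a full nonblocking pipe raises
# BlockingIOError (EAGAIN/EWOULDBLOCK), so a stale EPOLLOUT report
# is harmless -- the write simply fails and you poll again.
import errno
import os

r, w = os.pipe()
os.set_blocking(w, False)

try:
    while True:             # fill the pipe buffer completely
        os.write(w, b'x' * 4096)
except BlockingIOError as e:
    assert e.errno in (errno.EAGAIN, errno.EWOULDBLOCK)

os.close(r)
os.close(w)
```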
msg179222 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-01-06 20:37
You're right, I just saw my mistake.
Sorry for the noise (the extra performance argument is not a good enough motivation to tune this dynamically).

Closing.
History
Date User Action Args
2013-01-06 20:37:39neologixsetstatus: open -> closed
resolution: not a bug
messages: + msg179222

stage: needs patch -> resolved
2013-01-06 19:55:34sbtsetmessages: + msg179213
2013-01-06 19:54:37neologixsetmessages: + msg179212
2013-01-06 19:49:15sbtsetfiles: + test_epoll_2.py

messages: + msg179211
2013-01-06 19:42:05neologixsetmessages: + msg179210
2013-01-06 19:20:15sbtsetmessages: + msg179207
2013-01-06 18:14:48neologixsetmessages: + msg179201
2013-01-06 15:45:36sbtsetmessages: + msg179189
2013-01-06 14:53:57pitrousetnosy: + pitrou
messages: + msg179185
2013-01-06 14:37:57neologixsetfiles: + test_epoll.py

messages: + msg179183
2013-01-06 13:12:45sbtsetnosy: + sbt
messages: + msg179180
2013-01-06 00:33:24neologixcreate