classification
Title: Low FD_SETSIZE limit on Windows
Type: resource usage Stage: patch review
Components: Build, Windows Versions: Python 3.7, Python 3.6
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: David Hirschfeld, andzn, asvetlov, desbma, eryksun, njs, paul.moore, pitrou, steve.dower, tim.golden, vstinner, zach.ware
Priority: normal Keywords: patch

Created on 2016-11-15 22:52 by David Hirschfeld, last changed 2019-06-15 04:17 by andzn.

Pull Requests
URL Status Linked
PR 13842 open andzn, 2019-06-05 12:41
PR 13845 closed vstinner, 2019-06-05 16:07
Messages (20)
msg280900 - (view) Author: David Hirschfeld (David Hirschfeld) Date: 2016-11-15 22:52
Back in 1999 the compile-time constant FD_SETSIZE was raised from the Windows default of 64 to 512 open sockets to support *serious async servers*:

http://bugs.python.org/issue210843
https://github.com/python/cpython/blame/master/Modules/selectmodule.c#L29-L36

The world has moved on and serious async servers require far more than 512 sockets available. This is one of the key reasons why tornado explicitly states:

> Tornado will also run on Windows, although this configuration is not officially supported and is recommended only for development use.

Yes, using select on Windows is the wrong thing to do, but being able to use a library that relies on select, putting up with the poor performance, is far preferable to being told to use Linux, which often isn't an option.

Since there's no alternative other than recompiling Python, it would be good if this constant could be increased to a value more useful these days, say 16384.

As a data point, ZeroMQ has recently increased the value of this constant to 16k themselves:

https://github.com/zeromq/libzmq/pull/2035
msg280901 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-11-15 22:57
To implement an efficient event loop, IOCP is the way to go: the asyncio module supports it using ProactorEventLoop.
msg280902 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2016-11-15 23:42
It looks like Modules/selectmodule.c already has the handling for when this gets large (and we really want to stop allocating the static array on the stack), which is a plus, but I'd want to see the performance cost to small selects if suddenly we're allocating 100s of KB on the heap rather than small and cheap stack allocations.

For the amount of work going on here for each select() call, Victor is right - you're going to get severely diminishing returns as this increases in size.

However, I see no reason why this couldn't be totally dynamic, at least on Windows. Redefining that value just changes a static array definition, but the array size never gets passed in to the OS AFAICT, so there's no reason we couldn't stack allocate for small select()s and heap allocate for big select()s. That's considerably more work than anyone had in mind, I'd wager.
msg281058 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2016-11-17 17:25
> To implement an efficient event loop, IOCP is the way to go: the asyncio module supports it using ProactorEventLoop.

Not everyone uses asyncio, though, so it would still be nice to see the select() limit raised anyway.

Implementing your own event loop using IOCP is a hard exercise, much harder than getting something to work with select().
msg305672 - (view) Author: desbma (desbma) * Date: 2017-11-06 19:48
I just want to say that I strongly support either bumping the value of FD_SETSIZE to something a lot higher than 512, or making it configurable from Python code.

I am the author of a program that makes heavy use of asyncio. Some Windows users have reported errors when using big directory trees, that I could not reproduce on Linux. Then I found the note about the SelectorEventLoop limitation in the doc https://docs.python.org/3/library/asyncio-eventloops.html#windows

I can't use ProactorEventLoop because I support Python 3.4 which does not have it. I had to work around this limitation by calling run_until_complete periodically to limit the number of pending tasks. 

I sincerely think this is the kind of thing that can hurt the global usage of asyncio.
Hell, I can do 600 IO tasks in parallel if I want to with concurrent.futures.ThreadPoolExecutor, so why can't I do the same with asyncio?
msg344713 - (view) Author: Andrei Zene (andzn) * Date: 2019-06-05 12:42
We would also need this limit to be raised. We are using a server that was implemented using tornado, and tornado uses select.select. 

> It looks like Modules/selectmodule.c already has the handling for when this gets large (and we really want to stop allocating the static array on the stack), which is a plus, but I'd want to see the performance cost to small selects if suddenly we're allocating 100s of KB on the heap rather than small and cheap stack allocations.

I gave this a try. The code I used for the server and the client was taken from this page: https://steelkiwi.com/blog/working-tcp-sockets/. I just added some code to measure the time spent in the select call, using time.process_time(). I then connected/disconnected a single client 10,000 times in a row, discarding the measurements of the first 1,000 iterations (warm-up).

Here are the results:
FD_SETSIZE = 16384
Time elapsed (avg ns):  20029.566224207087
Time elapsed (max ns):  15625000
Time elapsed (min ns):  0

FD_SETSIZE = 512
Time elapsed (avg ns):  14671.361502347418
Time elapsed (max ns):  15625000
Time elapsed (min ns):  0

> However, I see no reason why this couldn't be totally dynamic, at least on Windows. Redefining that value just changes a static array definition, but the array size never gets passed in to the OS AFAICT, so there's no reason we couldn't stack allocate for small select()s and heap allocate for big select()s. That's considerably more work than anyone had in mind, I'd wager.
I'm not sure this can be totally dynamic on Windows. The problem is the fd_set struct, which is defined in Winsock2.h:

typedef struct fd_set {
        u_int fd_count;               /* how many are SET? */
        SOCKET  fd_array[FD_SETSIZE];   /* an array of SOCKETs */
} fd_set;

The heap-allocated Python lists are converted into fd_set structs, which are, by the way, always allocated on the stack. The problem is there's no way to create an fd_set of a desired FD_SETSIZE at runtime.

I have also submitted a PR with the change for convenience: https://github.com/python/cpython/pull/13842.
msg344715 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-06-05 12:44
Would it be possible to make FD_SETSIZE configurable at runtime, at least on Windows? IMHO it would be a better approach.
msg344716 - (view) Author: Andrei Zene (andzn) * Date: 2019-06-05 12:53
> Would it be possible to make FD_SETSIZE configurable at runtime, at least on Windows? IMHO it would be a better approach.

That would be awesome, but it doesn't look like it would be possible. As I have already pointed out in my previous comment:

> I'm not sure this can be totally dynamic on Windows. The problem is the fd_set struct, which is defined in Winsock2.h:
>
> typedef struct fd_set {
>         u_int fd_count;               /* how many are SET? */
>         SOCKET  fd_array[FD_SETSIZE];   /* an array of SOCKETs */
> } fd_set;
>
> The heap-allocated Python lists are converted into fd_set structs, which are, by the way, always allocated on the stack. The problem is there's no way to create an fd_set of a desired FD_SETSIZE at runtime.
msg344720 - (view) Author: Andrew Svetlov (asvetlov) * (Python committer) Date: 2019-06-05 13:32
I believe the number can still be configurable, but it requires more work.

The idea is:
- The FD_* macros should be replaced with custom fd_set manipulation functions.
- On each select.select() call we can calculate the number of the highest used socket and use that number plus one as the fd_set structure size.

Did not try to prototype it though.
msg344742 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-06-05 16:07
I wrote a proof-of-concept: PR 13845.

... but now I'm confused. How does select() know the FD_SETSIZE? Is it hardcoded in select() ABI? Does this PR work? Or does it make no sense?
msg344744 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2019-06-05 16:21
> How does select() know the FD_SETSIZE?

It doesn't have to. The actual number of descriptors is passed in, and the API presumably assumes that FD_SETSIZE is bigger than that (or else the caller would have crashed writing to invalid memory).

I don't see why we'd make it a parameter though. If we have all the descriptors, just allocate a big enough array and pass it in?
msg344756 - (view) Author: Andrew Svetlov (asvetlov) * (Python committer) Date: 2019-06-05 17:58
From my understanding, Steve is right.
msg344757 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2019-06-05 18:03
> It doesn't have to. The actual number of descriptors is passed in, and 

To clarify, this is the fd_count in each fd_set, i.e. Winsock uses a counted SOCKET array instead of the bitmask that's found on other platforms. Winsock select() ignores nfds [1].

If Python's select module used a dynamic fd_set on Windows, this would entail directly initializing fd_count and fd_array. The FD_SET macro can't be used, since it's intended for the static definition and is limited to FD_SETSIZE. FD_ZERO, FD_ISSET, and FD_CLR should still work.

[1] https://docs.microsoft.com/en-us/windows/desktop/api/winsock2/nf-winsock2-select
msg344758 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2019-06-05 18:04
> The actual number of descriptors is passed in

Yes, but:

"""The nfds parameter is included only for compatibility with Berkeley sockets."""

... according to https://docs.microsoft.com/en-us/windows/desktop/api/Winsock2/nf-winsock2-select

So it's really FD_SETSIZE that matters, and it's a compile-time constant.

Note that Anaconda has been doing this for a long time:
https://github.com/AnacondaRecipes/python-feedstock/blob/master/recipe/0004-Win32-Change-FD_SETSIZE-from-512-to-2048.patch
msg344760 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2019-06-05 18:56
> So it's really FD_SETSIZE that matters, and it's a compile-time constant.

The only way this constant can matter is if the array has to end with an empty slot, which doesn't appear to be the case. The pre-compiled function can't tell how you compiled the code calling it, and the only place where FD_SETSIZE is used is in the macros to set them.

Dynamic allocation ought to be just fine, and likely faster for big arrays.
msg344813 - (view) Author: Andrei Zene (andzn) * Date: 2019-06-06 13:09
Awesome work, Victor! Thanks! I'll give it a try. One thing I see missing though (if I understand correctly) is the allocation of the three fd_sets (I posted a comment on GitHub regarding that).

Those fd_sets are currently allocated on the stack, and the size of the array they contain is currently determined by FD_SETSIZE.

We'll basically need a similar structure that we can allocate dynamically, and cast it to fd_set when passing it to the select call.

I can go ahead and try to implement the feedback and test it together with the changes in Victor's PR.
msg344815 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2019-06-06 13:59
> Those fd_sets are currently allocated on the stack, and the size of
> the array they contain is currently determined by FD_SETSIZE. We'll
> basically need a similar structure that we can allocate dynamically,
> and cast it to fd_set when passing it to the select call.

This is possible in Windows. Other platforms are limited to a fixed FD_SETSIZE, in which case the old error should be retained:

        if (index >= (unsigned int)FD_SETSIZE) {
            PyErr_SetString(PyExc_ValueError,
                          "too many file descriptors in select()");
            goto finally;
        }
msg345127 - (view) Author: Andrei Zene (andzn) * Date: 2019-06-10 12:30
I have updated PR 13842 (https://github.com/python/cpython/pull/13842) to dynamically allocate the fd_sets on Windows.

I did some testing on Windows with the following two patterns:
1. 10000 transient clients: (open connection, send message, close connection)
2. 1024 permanent clients: (open connection, then in an infinite loop: (read message, send message back))

Everything seemed to be fine.

I didn't do any testing on Linux.

On Windows I also had to remove the check that vstinner added for whether the greatest file descriptor exceeds the setsize; that condition always failed for me on Windows. Apparently, according to https://docs.microsoft.com/en-us/windows/desktop/api/winsock2/nf-winsock2-select, Windows ignores nfds, which "is included only for compatibility with Berkeley sockets." The documentation also states that the FD_ "macros are compatible with those used in the Berkeley software, but the underlying representation is completely different", and: "Internally, socket handles in an fd_set structure are not represented as bit flags as in Berkeley Unix. Their data representation is opaque."

This is why I chose to determine setsize based on the inputs. If you think that's not going to work, please say why :)
msg345401 - (view) Author: Nathaniel Smith (njs) * (Python committer) Date: 2019-06-12 18:38
Traditionally on Unix, sockets are represented by file descriptors. File descriptors are small integers. (POSIX actually mandates the "small" part: whenever a new fd is created, the OS has to assign it the smallest integer value that's not already being used.)

So when select was designed, they decided to be "clever" and use a bitmask to represent the FD sets. If all your fds have values less than 64, then you can use a single 64-bit integer to represent any arbitrary subset of them, sweet, it's super efficient. It's also extremely weird. No other API cares about the actual integer value of an fd, they're just opaque tokens. Select is almost unique in being O(highest fd value).

Of course this microoptimization stopped making sense decades ago, so poll() was added. The big innovation with poll() is that it takes an array of descriptors like a normal function, instead of this wacky bitmask thing. So it's O(number of fds), and it doesn't matter whether you're checking fd #1 or fd #1000.

EXCEPT Windows has a totally different history. On Windows, sockets are represented as handles. And handles are just like fds, EXCEPT that handles are allowed to have arbitrary values; they didn't copy POSIX's weird (and expensive) rule about always using the smallest possible integer.

So when Windows went to implement select(), the bitmask optimization never made any sense at all – even if you only have 1 socket, its handle might be, like, 0x9f57be3a or something. So you'd need a bitmask with 2**32 entries, which is silly.

So on Windows, select() is secretly poll(). They copied the FD_* macros for compatibility, but fd_set is really just an array of opaque values + an explicit length, and you can pass in as many or as few as you want.

I know this is mostly rehashing conclusions that are in the thread already but I thought it might make more sense to have it laid out all together.
msg345656 - (view) Author: Andrei Zene (andzn) * Date: 2019-06-15 04:17
That's actually a great explanation Nathaniel. Thanks for putting that all together.
History
Date                 User              Action  Args
2019-06-15 04:17:42  andzn             set     messages: + msg345656
2019-06-12 18:38:28  njs               set     nosy: + njs; messages: + msg345401
2019-06-10 12:30:26  andzn             set     messages: + msg345127
2019-06-06 13:59:34  eryksun           set     messages: + msg344815
2019-06-06 13:09:35  andzn             set     messages: + msg344813
2019-06-05 18:56:16  steve.dower       set     messages: + msg344760
2019-06-05 18:04:26  pitrou            set     nosy: + eryksun
2019-06-05 18:04:19  pitrou            set     nosy: - eryksun; messages: + msg344758
2019-06-05 18:03:10  eryksun           set     nosy: + eryksun; messages: + msg344757
2019-06-05 17:58:31  asvetlov          set     messages: + msg344756
2019-06-05 16:21:33  steve.dower       set     messages: + msg344744
2019-06-05 16:07:35  vstinner          set     messages: + msg344742
2019-06-05 16:07:03  vstinner          set     pull_requests: + pull_request13722
2019-06-05 13:32:12  asvetlov          set     nosy: + asvetlov; messages: + msg344720
2019-06-05 12:53:53  andzn             set     messages: + msg344716
2019-06-05 12:44:38  vstinner          set     messages: + msg344715
2019-06-05 12:42:00  andzn             set     nosy: + andzn; messages: + msg344713
2019-06-05 12:41:40  andzn             set     keywords: + patch; stage: patch review; pull_requests: + pull_request13719
2017-11-06 19:48:06  desbma            set     nosy: + desbma; messages: + msg305672
2016-11-17 17:25:07  pitrou            set     nosy: + pitrou; messages: + msg281058
2016-11-15 23:42:24  steve.dower       set     messages: + msg280902
2016-11-15 22:57:12  vstinner          set     nosy: + vstinner; messages: + msg280901
2016-11-15 22:55:41  zach.ware         set     nosy: + paul.moore, tim.golden, zach.ware, steve.dower; components: + Build, Windows
2016-11-15 22:52:38  David Hirschfeld  create