classification
Title: Implementing Solaris "/dev/poll" in the "select" module
Type: enhancement Stage: patch review
Components: Library (Lib) Versions: Python 3.3
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: jcea Nosy List: exarkun, giampaolo.rodola, jcea, neologix, pitrou, python-dev, rosslagerwall
Priority: normal Keywords: needs review, patch

Created on 2009-07-01 16:49 by jcea, last changed 2011-11-14 18:18 by jcea. This issue is now closed.

Files
File name Uploaded Description Edit
528fdd816160.diff jcea, 2011-11-07 14:48 review
518b32ce893e.diff jcea, 2011-11-08 20:56 review
2506e49b9f71.diff jcea, 2011-11-09 12:11 review
9d687fdd924d.diff jcea, 2011-11-09 19:39 review
11f08326afd0.diff jcea, 2011-11-14 12:41 review
Repositories containing patches
http://hg.jcea.es/cpython-2011/#devpoll-issue6397
Messages (22)
msg89989 - (view) Author: Jesús Cea Avión (jcea) * (Python committer) Date: 2009-07-01 16:49
In Python 2.6 we added support for Linux "epoll" and *BSD "kqueue" in
the select module. I think we should add support for Solaris "poll"
interface too.

What do you think?.

I volunteer to do the work, if you agree this is a feature we want to
have. I think so.
msg89991 - (view) Author: Jean-Paul Calderone (exarkun) * (Python committer) Date: 2009-07-01 17:23
Solaris 10 introduced "The Event Completion Framework".  I am not
particularly familiar with Solaris, so I couldn't say whether it would
be better to target this or the older /dev/poll.  Some documentation
suggests that "The Event Completion Framework" is somewhat preferred:

http://developers.sun.com/solaris/articles/event_completion.html

It suggests that /dev/poll is not as performant, but I'm not sure I
believe it.  One feature which it does seem to have which puts it ahead
of /dev/poll is the support of various non-fd event sources (eg, timers).
msg146041 - (view) Author: Jesús Cea Avión (jcea) * (Python committer) Date: 2011-10-20 18:13
I want to move this forward.

Apparently, "/dev/poll" could be actually used transparently in python "select.poll()" implementation. The semantics seems to be the same, so we could use the "poll" syscall or "/dev/poll" statically at compiling time, or dinamically at poll object creation time (try to open /dev/poll and go to "poll" syscall if that fails).

Some details:

http://developers.sun.com/solaris/articles/using_devpoll.html
http://developers.sun.com/solaris/articles/polling_efficient.html

I agree that Solaris 10 event framework would be nice to support too, but that would be another feature request.
msg146445 - (view) Author: Jesús Cea Avión (jcea) * (Python committer) Date: 2011-10-26 17:20
First version of the patch. Review 0ee4386d8f51.diff http://bugs.python.org/file23526/0ee4386d8f51.diff

Details:

1. Current code aliases "devpoll" in platforms with "/dev/poll" (Solaris and derivatives). Considering all the other points, I think that providing a separate "select.devpoll" object would be a better idea. Opinions?.

2. I am not providing the exception contract documented in "select.poll". For instance, "poll.unregister" of a "fd" not previously registered doesn't raise an exception. Idem "poll.modify", etc. I could do it, but is pointless extra code.

3. I have added a boolean "select.isdevpoll" to indicate if "selec.poll" uses "poll()" or "/dev/poll". Ugly.

4. I release the GIL when waiting for the fds, but not when registering/unregistering fds, etc. I guess that the syscall would be very fast, but I haven't actually measure it.

5. The internal REMOVE is because if you "register" several times the same fd, the events are ORed. To avoid that I do an internal REMOVE first. I should scan the pollfds internally and update the events inplace. Or provide a separate "devpoll" and document this fact...

6. If the number of active fds is bigger that SIZE_DEVPOLL, only SIZE_DEVPOLL fds are returned. If you "poll" several times, you get the SAME fds, if they are still active. So, others active fds can suffer of starvation. Solutions: selftunning (if the module provide 20 slots and the OS provide 20 active fds, call again with 40 free slots, etc) or provide a separate "devpoll" and document this fact, posibly providing a "maxsize" parameter. "select.epoll" uses FD_SETSIZE, directly.

With this, I am starting to think that providing a separate "select.devpoll" is actually a better idea.

Opinions?.
msg146457 - (view) Author: Jesús Cea Avión (jcea) * (Python committer) Date: 2011-10-26 21:22
I have decided to segregate "select.devpoll" to a separate object, like "select.epoll".
msg146459 - (view) Author: Jesús Cea Avión (jcea) * (Python committer) Date: 2011-10-26 21:26
Solved points 1, 3 and 4.

2 will be solved with the documentation.

5 and 6 still pending.
msg146461 - (view) Author: Jesús Cea Avión (jcea) * (Python committer) Date: 2011-10-26 23:20
Documentation added. That solves 2 and 5.

I still have to solve 6.
msg147226 - (view) Author: Jesús Cea Avión (jcea) * (Python committer) Date: 2011-11-07 14:56
Please, review.

With current code, each devpoll object has capacity for managing 256 fds, by default. This is about 2048 bytes. The cost seems reasonable, since a normal program will have only a few devpoll objects around. I have considered an optional parameter to tune this, but interaction with rlimit is messy. Even if we manage 65536 fds, the memory cost is about 512Kbytes per devpoll, and you surely can affort it if you are actually managing 65536 descriptors...

The code is not threadsafe. It doesn't crash, but concurrent use of a devpoll has undefined results.

Please, review for integration.
msg147316 - (view) Author: Jesús Cea Avión (jcea) * (Python committer) Date: 2011-11-08 20:58
Please, check the new changeset, with all your feedback. Thanks!.
msg147350 - (view) Author: Jesús Cea Avión (jcea) * (Python committer) Date: 2011-11-09 12:12
Another changeset. Hopefully, final :-).

Please, review.
msg147354 - (view) Author: Ross Lagerwall (rosslagerwall) (Python committer) Date: 2011-11-09 13:39
+   increases this value, c:func:`devpoll` will return a possible
+   incomplete list of active file descriptors.

I think this should change to:

+   increases this value, c:func:`devpoll` will return a possibly
+   incomplete list of active file descriptors.

or even better:

+   increases this value, c:func:`devpoll` may return an
+   incomplete list of active file descriptors.

Cheers
msg147355 - (view) Author: Jesús Cea Avión (jcea) * (Python committer) Date: 2011-11-09 14:44
Thanks, Ross. Your suggestion has been committed in my branch.

Waiting for more feedback.
msg147358 - (view) Author: Ross Lagerwall (rosslagerwall) (Python committer) Date: 2011-11-09 16:52
Is write()ing a devpoll fd a blocking operation in the kernel?
Does it need to have Py_BEGIN_ALLOW_THREADS around it?
The same question applies for open()ing it.

Obviously, the ioctl() call *is* blocking :-)
msg147359 - (view) Author: Ross Lagerwall (rosslagerwall) (Python committer) Date: 2011-11-09 17:38
Also, you can use Py_RETURN_NONE instead of:
+    Py_INCREF(Py_None);
+    return Py_None;
msg147365 - (view) Author: Jesús Cea Avión (jcea) * (Python committer) Date: 2011-11-09 19:14
They are operations potentially slow, there are not realtime specifications.

My machine is quite capable (16 CPUs), but let's see what a bit of DTRACE scripting tells us:

First, let's time the open:

"""
syscall::open*:entry
/copyinstr(arg0)=="/dev/poll"/
{
    self->ts = timestamp;
}

syscall::open*:return
/self->ts/
{
    @stats = quantize(timestamp-self->ts);
    self->ts = 0;
}
"""

The result, times in nanoseconds:

"""
           value  ------------- Distribution ------------- count    
            2048 |                                         0        
            4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4497743  
            8192 |                                         17358    
           16384 |                                         1592     
           32768 |                                         2046     
           65536 |                                         812      
          131072 |                                         374      
          262144 |                                         0 
"""

Most "open"s finish in 4 microseconds, but we have a handful of "slow" openings of hundred of microseconds.

Anyway, argueing with the GIL here is a nonissue, since the "open" is only done at object creation, and a sane program will only create a handful of devpoll objets.

Let's see now the write:

"""
syscall::open*:entry
/copyinstr(arg0)=="/dev/poll"/
{
    self->ts = timestamp;
}

syscall::open*:return
/self->ts/
{
    @stats = quantize(timestamp-self->ts);
    self->ts = 0;
}

"""

The results for a single descriptor registered:

"""
           value  ------------- Distribution ------------- count    
             256 |                                         0        
             512 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4102701  
            1024 |                                         28514    
            2048 |                                         11269    
            4096 |                                         537      
            8192 |                                         245      
           16384 |                                         377      
           32768 |                                         193      
           65536 |                                         134      
          131072 |                                         71       
          262144 |                                         0  
"""

Most writes are really fast, half a microsecond, but we have sporadic latencies of a hundred of microseconds.

Re-registering 200 sockets per loop, I have:

"""
           value  ------------- Distribution ------------- count    
             512 |                                         0        
            1024 |                                         50       
            2048 |                                         94       
            4096 |                                         157      
            8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  1112314  
           16384 |@                                        22977    
           32768 |                                         1584     
           65536 |                                         842      
          131072 |                                         168      
          262144 |                                         0 
"""

Most writes takes around 8 microseconds.

So now the problem is to estimate how much time I need to release & readquire the GIL. For that, and while we don't have DTRACE integration in python (my next project, I hope), I change sourcecode to add a manual probe using the Solaris High Resolution timers, I get a median time of around 550 nanoseconds (half a microsecond), with a few spikes, when the GIL is not contested. 

So, unlocking the GIL is adding around half microsecond to the syscalls. This is fast, but comparable with the syscall actual duration. Liberating the GIL is doubling the time to register&unregister a socket (supposely we are doing a single socket per iteration, something realistic in real code).

Being realistic, any work we do with the descriptors (like actually reading or writing it) is going to make this 0.5 microsecond gain pointless. Freeing the GIL, we are protected of any kernel contention, and playing for the rules :-).

Anyway I am open to feedback.

PS: I also checked with GIL contention, and results are comparable. That is surprising, maybe related to the GIL improvements in 3.3. I haven't investigated the issue further.
msg147366 - (view) Author: Jesús Cea Avión (jcea) * (Python committer) Date: 2011-11-09 19:45
New changeset, after Ross feedback. Thanks!.
msg147367 - (view) Author: Ross Lagerwall (rosslagerwall) (Python committer) Date: 2011-11-09 19:46
That was thorough :-) Seems OK though.

+    if (n < size) {
+        PyErr_SetString(PyExc_IOError, "failed to write all pollfds. "
+                "Please, report in http://bugs.python.org/");

If n < size, it's not a Python error is it? I would say it's the OS's fault.

Otherwise, it looks good... although I don't have currently access to a Solaris box to test it.
msg147368 - (view) Author: Jesús Cea Avión (jcea) * (Python committer) Date: 2011-11-09 19:51
The timing for the GIL I am providing is for release&acquiring. That is, all the work. In fact I am having in count too the syscall inside the release&acquire, to account for cache issues.

That is, between the release and the acquire, there is a real syscall. Of course, I am not timing it.

So, the timing I am providing is accurate.
msg147369 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-11-09 20:01
> That was thorough :-) Seems OK though.
> 
> +    if (n < size) {
> +        PyErr_SetString(PyExc_IOError, "failed to write all pollfds. "
> +                "Please, report in http://bugs.python.org/");
> 
> If n < size, it's not a Python error is it? I would say it's the OS's fault.

No, but it's a Python error if Python wrongly assumes that write() won't
return a partial result. Ideally write() should be retried in a loop
until everything is written out.
msg147377 - (view) Author: Jesús Cea Avión (jcea) * (Python committer) Date: 2011-11-09 21:50
The problem with partial writes is that the data is not an unstructured stream of bytes, but a concrete binary dump. You can not simply retry again.

My bet is that "/dev/poll" neves does partial writes.

If I am mistaken, I will bug the Illumos people to help me to solve it. So far, this seems a theorical problema.

Just in case it is not, we raise an exception and urge the programmer to report us the issue.

The other option would be to try to be clever and do things like retrying if the partial write is a multiple of the struct size, but what if it not so?.

Instead of guessing and play games, I rather prefer to know if this is actually a problem in the wild. In my tests, with 65530 sockets, I never saw this "impossible" exception.
msg147589 - (view) Author: Jesús Cea Avión (jcea) * (Python committer) Date: 2011-11-14 12:46
New changeset. The only change is:

"""
diff --git a/Modules/selectmodule.c b/Modules/selectmodule.c
--- a/Modules/selectmodule.c
+++ b/Modules/selectmodule.c
@@ -685,8 +685,16 @@
         return -1;
     }
     if (n < size) {
-        PyErr_SetString(PyExc_IOError, "failed to write all pollfds. "
-                "Please, report in http://bugs.python.org/");
+        /*
+        ** Data writed to /dev/poll is a binary data structure. It is not
+        ** clear what to do if a partial write occurred. For now, raise
+        ** an exception and see if we actually found this problem in
+        ** the wild.
+        */
+        PyErr_Format(PyExc_IOError, "failed to write all pollfds. "
+                "Please, report at http://bugs.python.org/. "
+                "Data to report: Size tried: %d, actual size written: %d",
+                size, n);
         return -1;
     }
     return 0;
"""

If there are no disagreements, I will commit to default (3.3) soon (in a few hours/a day).

Thanks!.
msg147626 - (view) Author: Roundup Robot (python-dev) Date: 2011-11-14 18:18
New changeset 8f7ab4bf7ad9 by Jesus Cea in branch 'default':
Issue #6397: Support '/dev/poll' polling objects in select module, under Solaris & derivatives.
http://hg.python.org/cpython/rev/8f7ab4bf7ad9
History
Date User Action Args
2011-11-14 18:18:54jceasetstatus: open -> closed
resolution: fixed
2011-11-14 18:18:04python-devsetnosy: + python-dev
messages: + msg147626
2011-11-14 12:46:41jceasetmessages: + msg147589
2011-11-14 12:41:32jceasetfiles: + 11f08326afd0.diff
2011-11-09 21:50:34jceasetmessages: + msg147377
2011-11-09 20:01:17pitrousetmessages: + msg147369
2011-11-09 19:51:26jceasetmessages: + msg147368
2011-11-09 19:46:59rosslagerwallsetmessages: + msg147367
2011-11-09 19:45:15jceasetmessages: + msg147366
2011-11-09 19:39:33jceasetfiles: + 9d687fdd924d.diff
2011-11-09 19:14:20jceasetmessages: + msg147365
2011-11-09 17:40:39pitrousetnosy: + pitrou
2011-11-09 17:38:30rosslagerwallsetmessages: + msg147359
2011-11-09 16:52:11rosslagerwallsetmessages: + msg147358
2011-11-09 14:44:12jceasetmessages: + msg147355
2011-11-09 13:39:11rosslagerwallsetmessages: + msg147354
2011-11-09 12:12:02jceasetmessages: + msg147350
2011-11-09 12:11:25jceasetfiles: + 2506e49b9f71.diff
2011-11-08 20:58:01jceasetmessages: + msg147316
2011-11-08 20:57:30jceasetfiles: - c47a846bed1a.diff
2011-11-08 20:56:15jceasetfiles: + 518b32ce893e.diff
2011-11-08 17:58:46jceasetfiles: + c47a846bed1a.diff
2011-11-08 17:18:54giampaolo.rodolasetnosy: + giampaolo.rodola
2011-11-07 19:09:25rosslagerwallsetnosy: + rosslagerwall
2011-11-07 15:05:24pitrousetnosy: + neologix
2011-11-07 14:56:48jceasetmessages: + msg147226
2011-11-07 14:49:03jceasetfiles: + 528fdd816160.diff
2011-11-07 14:48:07jceasetfiles: - 0b701eb5e9e3.diff
2011-10-26 23:24:47jceasettitle: Implementing Solaris "poll" in the "select" module -> Implementing Solaris "/dev/poll" in the "select" module
2011-10-26 23:20:34jceasetmessages: + msg146461
2011-10-26 23:18:50jceasetfiles: + 0b701eb5e9e3.diff
2011-10-26 23:18:27jceasetfiles: - 6becc4e3eece.diff
2011-10-26 21:28:04jceasetfiles: + 6becc4e3eece.diff
2011-10-26 21:27:30jceasetfiles: - 0ee4386d8f51.diff
2011-10-26 21:26:37jceasetmessages: + msg146459
2011-10-26 21:22:06jceasetmessages: + msg146457
2011-10-26 17:21:07jceasetkeywords: + needs review
stage: needs patch -> patch review
2011-10-26 17:20:30jceasetmessages: + msg146445
2011-10-26 17:05:24jceasetfiles: + 0ee4386d8f51.diff
2011-10-26 17:01:56jceasetfiles: - 6ea157b9d110.diff
2011-10-26 16:43:58jceasetfiles: + 6ea157b9d110.diff
2011-10-26 15:35:23jceasetfiles: - d014fd90a487.diff
2011-10-26 15:33:16jceasetfiles: + d014fd90a487.diff
keywords: + patch
2011-10-26 15:32:11jceasethgrepos: + hgrepo86
2011-10-20 18:13:26jceasetversions: + Python 3.3, - Python 2.7, Python 3.2
2011-10-20 18:13:17jceasetmessages: + msg146041
2009-07-01 17:23:59exarkunsetnosy: + exarkun
messages: + msg89991
2009-07-01 16:49:13jceacreate