classification
Title: multiprocessing's overlapped PipeConnection on Windows
Type: behavior Stage: resolved
Components: Extension Modules Versions: Python 3.3
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: brian.curtin, jnoller, pitrou, python-dev, sbt, tim.golden
Priority: normal Keywords: patch

Created on 2011-06-13 18:53 by sbt, last changed 2014-03-20 08:28 by python-dev. This issue is now closed.

Files
File name Uploaded Description Edit
test_pipe_poll.py sbt, 2011-06-13 18:53
pipe_poll.patch sbt, 2011-06-13 19:10 review
pipe_poll_2.patch sbt, 2011-06-14 22:40 review
pipe_interruptible.patch sbt, 2011-06-15 13:17
sigint_event.patch sbt, 2011-06-26 20:27 review
pipe_poll_fix.patch sbt, 2011-11-20 18:41
PipeIO.zip sbt, 2012-01-27 13:58
pipe_poll_fix.patch sbt, 2012-02-01 14:50
pipe_poll_fix.patch sbt, 2012-03-04 17:43 review
pipe_poll_fix.patch sbt, 2012-03-05 14:48 review
Messages (29)
msg138268 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2011-06-13 18:53
There are some problems with the new Windows overlapped implementation
of PipeConnection in the default branch.

1) poll(0) can return False when an empty string is in the pipe: if
   the next message in the pipe is b"" then PeekNamedPipe() returns
   (0, 0) which is indistiguishable from the case where the pipe is
   empty.

   This affects versions 2.6-3.2 of Python on Windows as well.  For
   old versions I would just document the fact that poll() will fail
   to see that there is data in the pipe if the next message is an
   empty string.  In practice this is not a big deal since pipes are
   primarily used for pickled objects which will never be the empty
   string.

2) _poll() forgets to check self._buffered, so it can return False
   even if there is buffered data.

3) Even if (2) is fixed, _poll() with a non-zero timeout can mess up
   message boundaries if the pipe contains messages of length zero or
   one.  To fix this overlapped_GetOverlappedResult() needs to be
   changed to not swallow and forget ERROR_MORE_DATA errors.

4) In _poll() there is a race condition: if the read operation
   completes after WaitForMultipleObjects() timesout, but before the
   operation is cancelled, then the data read will be discarded.

5) The code assumes that if CancelIo()/CancelIoEx() returns
   successfully then the OVERLAPPED structure and associated buffer
   may be deallocated immediately.  But the documentation says that if
   CancelIo()/CancelIoEx() succeeds then cancellation has only been
   *requested*:

     http://msdn.microsoft.com/en-us/library/aa363792%28v=vs.85%29.aspx

     If [CancelIoEx()] succeeds, the return value is nonzero. The
     cancel operation for all pending I/O operations issued by the
     calling thread for the specified file handle was successfully
     *requested*. The application must not free or reuse the
     OVERLAPPED structure associated with the canceled I/O operations
     until they have *completed*. The thread can use the
     GetOverlappedResult function to determine when the I/O operations
     themselves have been completed.

   We should wait for cancellation to *complete* by using
   GetOverlappedResult(..., TRUE), otherwise we risk a crash.  If the
   operation was cancelled then FALSE is returned with GetLastError()
   == ERROR_OPERATION_ABORTED.

   I would suggest making it a programming error for the overlapped
   object to be deallocated while the operation is still pending, and
   to print a RuntimeError if that happens.  (This does not prevent us
   from still trying to prevent a crash using CancelIoEx().)

6) Not a bug exactly, but poll(timeout) is no longer interruptible
   with Ctrl-C.  This also means that Queue.get() is no longer
   interruptible.

--

The unit tests attached pass on Unix.  On Windows versions up to 3.2,
only test_empty_string fails.  On Windows version 3.3 all three fail.
msg138270 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2011-06-13 19:10
The attached patch hopefully fixes problems (1)-(5), but I have never used overlapped I/O before.  test_pipe_poll.py passes with these changes.
msg138285 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-06-13 22:46
Hello,

Thanks for the patch.

>    I would suggest making it a programming error for the overlapped
>    object to be deallocated while the operation is still pending, and
>    to print a RuntimeError if that happens.  (This does not prevent us
>    from still trying to prevent a crash using CancelIoEx().)

Sounds ok to me.

> 6) Not a bug exactly, but poll(timeout) is no longer interruptible
>    with Ctrl-C.  This also means that Queue.get() is no longer
>    interruptible.

There is infrastructure in _multiprocessing to handle Ctrl-C. Look for
"sigint_event" in Modules/_multiprocessing/*. This could be the topic of
a separate issue and patch.

> The unit tests attached pass on Unix.  On Windows versions up to 3.2,
> only test_empty_string fails.  On Windows version 3.3 all three fail.

Could you make the tests part of Lib/test/test_multiprocessing.py?

Also, what is the rationale for the following change:

-            elif timeout == 0.0:
+            elif timeout == 0.0 and nleft != 0:
                 return False
msg138355 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2011-06-14 22:40
> Also, what is the rationale for the following change:
> 
> -            elif timeout == 0.0:
> +            elif timeout == 0.0 and nleft != 0:
>                  return False

If PeekNamedPipe() returns (navail, nleft) there are 3 cases:

1) navail > 0: just return True
2) navail = nleft = 0: either pipe is empty or next message is empty
   so fall through to slow path to find out which.
3) navail = 0, nleft > 0: a non-empty message is coming, but nothing 
   is available yet.

The check is a shortcut for case 3, although it probably never 
occurs.  I've removed that check in the new patch, and added the 
unit tests to test_multiprocessing.

> There is infrastructure in _multiprocessing to handle Ctrl-C. Look 
> for "sigint_event" in Modules/_multiprocessing/*. This could be 
> the topic of a separate issue and patch.

I will look in to it.
msg138363 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2011-06-15 13:17
pipe_interruptible.patch is a patch to support to making poll() interruptible.  It applies on top of pipe_poll_2.patch.

I am not sure what the guarantees are for when KeyboardInterrupt will be raised.

I would have done it a bit differently if I knew a good way to test whether the current thread is the main one.  Maybe there should be something like PyThread_is_main() and/or threading.ismain().
msg138403 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-06-15 22:29
> pipe_interruptible.patch is a patch to support to making poll()
> interruptible.  It applies on top of pipe_poll_2.patch.

Hmm, it seems to me that it should be done in _poll() instead. Otherwise, recv() will not be interruptible, will it?

Also, after looking at this again, it seems sigint_event would be better provided (at the C level) by the Python core, e.g. as _PyOS_SigintEvent. The time module already uses a similar mechanism.

By setting the event in the sigint handler itself, rather than in an auxiliary HandlerRoutine callback function, it should also ensure that, when the event is set, the "tripped" bit in signalmodule.c has been set, meaning we wouldn't need to Sleep() anymore.

(I also wonder why the event isn't auto-reset. Otherwise, there's a race condition - which is probably harmless in most cases, but still.)
msg138441 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2011-06-16 14:34
> Hmm, it seems to me that it should be done in _poll() instead. 
> Otherwise, recv() will not be interruptible, will it?

Or maybe WaitForMultipleObjects() should be changed to also wait on sigint_event if called by the main thread.

> Also, after looking at this again, it seems sigint_event would be 
> better provided (at the C level) by the Python core, e.g. as 
> _PyOS_SigintEvent. The time module already uses a similar mechanism.

> By setting the event in the sigint handler itself, rather than in an 
> auxiliary HandlerRoutine callback function, it should also ensure 
> that, when the event is set, the "tripped" bit in signalmodule.c 
> has been set, meaning we wouldn't need to Sleep() anymore.

> (I also wonder why the event isn't auto-reset. Otherwise, there's a 
> race condition - which is probably harmless in most cases, but still.)

If it is an auto-reset event then we must worry about a non-main thread stealing the event.  Maybe _PyOS_SigintEvent() should return NULL if called by a non-main thread, and be documented with a loud usage warning.
msg138455 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-06-16 15:46
> > Hmm, it seems to me that it should be done in _poll() instead. 
> > Otherwise, recv() will not be interruptible, will it?
> 
> Or maybe WaitForMultipleObjects() should be changed to also wait on
> sigint_event if called by the main thread.

Indeed, this may be even better.

> If it is an auto-reset event then we must worry about a non-main
> thread stealing the event.  Maybe _PyOS_SigintEvent() should return
> NULL if called by a non-main thread, and be documented with a loud
> usage warning.

You are right, we need a manual reset *or* we must ensure that every
user of _PyOS_SigintEvent only does so from the main thread.
msg138524 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2011-06-17 17:00
> You are right, we need a manual reset *or* we must ensure that every
> user of _PyOS_SigintEvent only does so from the main thread.

On second thoughts, even using an auto-reset event, resetting the event before waiting is unavoidable.  Otherwise you are liable to catch an event that was set a long time ago.
msg139213 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2011-06-26 20:27
sigint_event.patch is a patch to make
_multiprocessing.win32.WaitForMultipleObjects interruptible.  It
applies directly on to default.

The patch also adds functions _PyOS_SigintEvent and _PyOS_IsMainThread
which are implemented in signalmodule.c and declared in intrcheck.c.

_PyOS_SigintEvent returns a manual reset event (cast to void*) which
is set whenever SIGINT is received.  It is Windows only.

_PyOS_IsMainThread returns 0 or 1 according to whether the current
thread is the main thread.

The time and _multiprocessing modules have been updated to use these
functions.

Note that WaitForMultipleObjects has a bWaitAll parameter.  When this
is true, all handles in the array are waited for, and
WaitForMultipleObjects is not interruptible.
msg139215 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2011-06-26 20:43
I have noticed a few more problems.

* Because poll() isn't thread safe on Windows, neither is Queue.empty().  Since a queue's pipe will never contain empty messages, this can be fixed easily by using (a wrapper for) win32.PeekNamedPipe().

* PipeListener/PipeClient have not been updated to use overlapped I/O.

* If more than one process is to read from a pipe connection then (even with proper synchronisation) we cannot safely use poll() since it can leave a partial message in the pipe.
msg148004 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2011-11-20 18:41
Here is an updated patch (pipe_poll_fix.patch) which should be applied on top of sigint_event.patch.

It fixes the problems with PipeConnection.poll() and Queue.empty() and makes PipeListener.accept() use overlapped I/O.  This should make all the pipe releated blocking functions/methods interruptible on Windows.
msg148085 - (view) Author: Roundup Robot (python-dev) Date: 2011-11-21 20:36
New changeset 08953a04b2e6 by Antoine Pitrou in branch 'default':
Issue #12328: Under Windows, refactor handling of Ctrl-C events and
http://hg.python.org/cpython/rev/08953a04b2e6
msg148086 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-11-21 20:38
I've re-added the fast uncontended path in semaphore.c and committed sigint_event.patch. Now I'm gonna take a look at pipe_poll_fix.patch...
msg148158 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-11-23 00:03
> Here is an updated patch (pipe_poll_fix.patch) which should be applied
> on top of sigint_event.patch.
> 
> It fixes the problems with PipeConnection.poll() and Queue.empty() and
> makes PipeListener.accept() use overlapped I/O.  This should make all
> the pipe releated blocking functions/methods interruptible on Windows.

I have the feeling that if we have to call GetLastError() at the Python
level, then there's something wrong with the APIs we're exposing from
the C extension. 
I see you check for ERROR_OPERATION_ABORTED. Is there any situation
where this can happen?
Also, it seems strange to call ov.cancel() and then
ov.GetOverlappedResult(). AFAICT, those two operations should be
mutually exclusive: you call the former if e.g. WaitForMultipleObjects
raised an exception, the latter otherwise.
msg148180 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2011-11-23 11:44
> I have the feeling that if we have to call GetLastError() at the
> Python level, then there's something wrong with the APIs we're 
> exposing from the C extension. 
> I see you check for ERROR_OPERATION_ABORTED. Is there any situation
> where this can happen?

There are three "expected" error conditions:

ERROR_OPERATION_ABORTED: operation stopped by CancelIo(Ex)()
ERROR_MORE_DATA: operation complete, but only got part of the message
ERROR_IO_INCOMPLETE: operation still has not finished

In the win32 api you need GetLastError() to distinguish between these cases, but maybe we can expose something better.

> Also, it seems strange to call ov.cancel() and then
> ov.GetOverlappedResult(). AFAICT, those two operations should be
> mutually exclusive: you call the former if e.g. WaitForMultipleObjects
> raised an exception, the latter otherwise.

If WaitForMultipleObjects() returns WAIT_OBJECT_0 + 0 then, as you say, there is no need to call ov.cancel(), but it is harmless to call it if the operation has already completed. 

The cases aren't really mutually exclusive: if you call ov.cancel() you *must* still do ov.GetOverlappedResult(True) to check for the race where the operation completes after the wait times out, but before ov.cancel() can stop it.  (Your original implementation had that bug -- see point (5) in my original bug report.)
msg148187 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-11-23 16:01
> There are three "expected" error conditions:
> 
> ERROR_OPERATION_ABORTED: operation stopped by CancelIo(Ex)()
> ERROR_MORE_DATA: operation complete, but only got part of the message
> ERROR_IO_INCOMPLETE: operation still has not finished
> 
> In the win32 api you need GetLastError() to distinguish between these
> cases, but maybe we can expose something better.

It seems to me that ERROR_OPERATION_ABORTED is a "true" error, and so
should raise an exception.

> The cases aren't really mutually exclusive: if you call ov.cancel()
> you *must* still do ov.GetOverlappedResult(True) to check for the race
> where the operation completes after the wait times out, but before
> ov.cancel() can stop it.  (Your original implementation had that bug
> -- see point (5) in my original bug report.)

Ah, right. Thanks for the explanation.
msg148215 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2011-11-23 23:38
> It seems to me that ERROR_OPERATION_ABORTED is a "true" error, and so
> should raise an exception.

I guess so, although we do expect it whenever poll() times out.  What exception would be appropriate? BlockingIOError? TimeoutError?
msg148218 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-11-24 00:27
> > It seems to me that ERROR_OPERATION_ABORTED is a "true" error, and so
> > should raise an exception.
> 
> I guess so, although we do expect it whenever poll() times out.  What
> exception would be appropriate? BlockingIOError? TimeoutError?

I would say either an OSError with the appropriate winerror, or a
specific exception (_multiprocessing.win32.OperationAbortedError?).
Or perhaps a ValueError.
msg152095 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2012-01-27 13:58
Quite honestly I don't like the way that polling a pipe reads a partial message from the pipe.  If at all possible, polling should not modify the pipe.

I think the cleanest thing would be to switch to byte oriented pipes on Windows and create PipeIO which subclasses RawIOBase.  (Since socket handles are really just overlapped file handles, PipeIO works for them too as long as closesocket() is used instead of CloseHandle().)  On Unix FileIO would be used instead.  Then Connection can just be a thin wrapper around a file object.

Polling on Windows can then be done by creating a wrapper for an overlapped object which represents a zero length read, and can be used with WaitForMultipleObjects().  This lets us implement a select-like wait() function in Python which works with both sockets and Connection objects.

Attached is an extension implementing PipeIO (and the overlapped wrapper), a Python module implementing Connection and wait(), and a test.
msg152433 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2012-02-01 14:50
I have done an updated patch.  (It does *not* switch to using bytes oriented pipes as I suggested in the previous message.)

The patch also adds a wait() function with signature

    wait(object_list, timeout=None)

for polling multiple objects at once.  On Unix it is just a wrapper for

    select.select(object_list, [], [], timeout)

except that it retries when it gets EINTR.  wait() works with "connected" sockets too, although on Windows it does not work for "listening" sockets.

The patch removes SentinelReady and changes concurrent.futures to use wait() instead.

Polling is now done by issuing zero length overlapped reads.  This means that the pipe is not modified except possibly if a zero length message is removed.

I changed ReadFile(), WriteFile() and GetOverlappedResult() to return pairs, the second entry of which is zero or an "expected" error code.  (Unexpected errors still raise an exception.)  This avoids the need to ever use GetLastError().
msg154849 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-03-03 20:00
Hmm, I tried to apply the latest patch to the default branch and it failed.
It also seems the patch was done against a changeset (508bc675af63) which doesn't exist in the repo...
msg154894 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2012-03-04 13:49
> Hmm, I tried to apply the latest patch to the default branch and it 
> failed.  It also seems the patch was done against a changeset 
> (508bc675af63) which doesn't exist in the repo...

I will do an updated patch against a "public" changeset.  (I usually use a patch for the project files to disable those extensions which my windows setup cannot compile.)
msg154903 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2012-03-04 17:43
Updated patch against 2822765e48a7.
msg154950 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2012-03-05 14:48
Updated patch addressing Antoine's comments.
msg154968 - (view) Author: Roundup Robot (python-dev) Date: 2012-03-05 18:32
New changeset 75c27daa592e by Antoine Pitrou in branch 'default':
Issue #12328: Fix multiprocessing's use of overlapped I/O on Windows.
http://hg.python.org/cpython/rev/75c27daa592e
msg154970 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-03-05 18:33
Committed, thanks!
msg155265 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-03-09 21:23
One of our buildbots seems to have recurring issues with the timeout issues. I don't know if the machine is very loaded:


======================================================================
FAIL: test_wait_integer (test.test_multiprocessing.TestWait)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\test\test_multiprocessing.py", line 2617, in test_wait_integer
    self.assertLess(delta, expected + 2)
AssertionError: 8.592355012893677 not less than 7

======================================================================
FAIL: test_wait_timeout (test.test_multiprocessing.TestWait)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\test\test_multiprocessing.py", line 2590, in test_wait_timeout
    self.assertLess(delta, expected + 0.5)
AssertionError: 1.532202959060669 not less than 1.5


http://www.python.org/dev/buildbot/all/builders/x86%20XP-4%203.x
msg214187 - (view) Author: Roundup Robot (python-dev) Date: 2014-03-20 08:28
New changeset bdad874195d6 by Victor Stinner in branch '3.4':
Isuse #12328, #20978: Add _winapi.WAIT_ABANDONED_0 symbol, needed by
http://hg.python.org/cpython/rev/bdad874195d6

New changeset 2e4692a762d5 by Victor Stinner in branch 'default':
(Merge 3.4) Issue #12328, #20978: Add _winapi.WAIT_ABANDONED_0 symbol, needed
http://hg.python.org/cpython/rev/2e4692a762d5
History
Date User Action Args
2014-03-20 08:28:43python-devsetmessages: + msg214187
2012-03-09 21:23:56pitrousetmessages: + msg155265
2012-03-05 18:33:14pitrousetstatus: open -> closed
resolution: fixed
messages: + msg154970

stage: patch review -> resolved
2012-03-05 18:32:42python-devsetmessages: + msg154968
2012-03-05 14:48:47sbtsetfiles: + pipe_poll_fix.patch

messages: + msg154950
2012-03-04 17:43:28sbtsetfiles: + pipe_poll_fix.patch

messages: + msg154903
2012-03-04 13:49:36sbtsetmessages: + msg154894
2012-03-03 20:00:06pitrousetmessages: + msg154849
2012-02-01 14:50:31sbtsetfiles: + pipe_poll_fix.patch

messages: + msg152433
2012-01-27 13:58:50sbtsetfiles: + PipeIO.zip

messages: + msg152095
2011-11-24 00:27:28pitrousetmessages: + msg148218
2011-11-23 23:38:23sbtsetmessages: + msg148215
2011-11-23 16:01:25pitrousetmessages: + msg148187
2011-11-23 11:44:22sbtsetmessages: + msg148180
2011-11-23 00:03:25pitrousetmessages: + msg148158
2011-11-21 20:38:27pitrousetstage: patch review
messages: + msg148086
versions: - Python 2.6, Python 3.1, Python 2.7, Python 3.2
2011-11-21 20:36:32python-devsetnosy: + python-dev
messages: + msg148085
2011-11-20 18:41:04sbtsetfiles: + pipe_poll_fix.patch

messages: + msg148004
2011-06-26 20:43:35sbtsetmessages: + msg139215
2011-06-26 20:27:48sbtsetfiles: + sigint_event.patch

messages: + msg139213
2011-06-17 17:00:37sbtsetmessages: + msg138524
2011-06-16 15:46:45pitrousetmessages: + msg138455
2011-06-16 14:34:21sbtsetmessages: + msg138441
2011-06-15 22:29:02pitrousetnosy: + tim.golden, brian.curtin
messages: + msg138403
2011-06-15 13:17:18sbtsetfiles: + pipe_interruptible.patch

messages: + msg138363
2011-06-14 22:40:26sbtsetfiles: + pipe_poll_2.patch

messages: + msg138355
2011-06-13 22:46:09pitrousetmessages: + msg138285
2011-06-13 19:10:36sbtsetfiles: + pipe_poll.patch
keywords: + patch
messages: + msg138270
2011-06-13 18:53:16sbtcreate