Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

multiprocessing's overlapped PipeConnection on Windows #56537

Closed
sbt mannequin opened this issue Jun 13, 2011 · 29 comments
Closed

multiprocessing's overlapped PipeConnection on Windows #56537

sbt mannequin opened this issue Jun 13, 2011 · 29 comments
Labels
extension-modules C modules in the Modules dir type-bug An unexpected behavior, bug, or error

Comments

@sbt
Copy link
Mannequin

sbt mannequin commented Jun 13, 2011

BPO 12328
Nosy @pitrou, @tjguk, @briancurtin
Files
  • test_pipe_poll.py
  • pipe_poll.patch
  • pipe_poll_2.patch
  • pipe_interruptible.patch
  • sigint_event.patch
  • pipe_poll_fix.patch
  • PipeIO.zip
  • pipe_poll_fix.patch
  • pipe_poll_fix.patch
  • pipe_poll_fix.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2012-03-05.18:33:14.011>
    created_at = <Date 2011-06-13.18:53:16.048>
    labels = ['extension-modules', 'type-bug']
    title = "multiprocessing's overlapped PipeConnection on Windows"
    updated_at = <Date 2014-03-20.08:28:43.153>
    user = 'https://bugs.python.org/sbt'

    bugs.python.org fields:

    activity = <Date 2014-03-20.08:28:43.153>
    actor = 'python-dev'
    assignee = 'none'
    closed = True
    closed_date = <Date 2012-03-05.18:33:14.011>
    closer = 'pitrou'
    components = ['Extension Modules']
    creation = <Date 2011-06-13.18:53:16.048>
    creator = 'sbt'
    dependencies = []
    files = ['22345', '22350', '22366', '22367', '22488', '23736', '24340', '24388', '24730', '24737']
    hgrepos = []
    issue_num = 12328
    keywords = ['patch']
    message_count = 29.0
    messages = ['138268', '138270', '138285', '138355', '138363', '138403', '138441', '138455', '138524', '139213', '139215', '148004', '148085', '148086', '148158', '148180', '148187', '148215', '148218', '152095', '152433', '154849', '154894', '154903', '154950', '154968', '154970', '155265', '214187']
    nosy_count = 6.0
    nosy_names = ['pitrou', 'tim.golden', 'jnoller', 'brian.curtin', 'python-dev', 'sbt']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue12328'
    versions = ['Python 3.3']

    @sbt
    Copy link
    Mannequin Author

    sbt mannequin commented Jun 13, 2011

    There are some problems with the new Windows overlapped implementation
    of PipeConnection in the default branch.

    1. poll(0) can return False when an empty string is in the pipe: if
      the next message in the pipe is b"" then PeekNamedPipe() returns
      (0, 0) which is indistiguishable from the case where the pipe is
      empty.

      This affects versions 2.6-3.2 of Python on Windows as well. For
      old versions I would just document the fact that poll() will fail
      to see that there is data in the pipe if the next message is an
      empty string. In practice this is not a big deal since pipes are
      primarily used for pickled objects which will never be the empty
      string.

    2. _poll() forgets to check self._buffered, so it can return False
      even if there is buffered data.

    3. Even if (2) is fixed, _poll() with a non-zero timeout can mess up
      message boundaries if the pipe contains messages of length zero or
      one. To fix this overlapped_GetOverlappedResult() needs to be
      changed to not swallow and forget ERROR_MORE_DATA errors.

    4. In _poll() there is a race condition: if the read operation
      completes after WaitForMultipleObjects() timesout, but before the
      operation is cancelled, then the data read will be discarded.

    5. The code assumes that if CancelIo()/CancelIoEx() returns
      successfully then the OVERLAPPED structure and associated buffer
      may be deallocated immediately. But the documentation says that if
      CancelIo()/CancelIoEx() succeeds then cancellation has only been
      *requested*:

      http://msdn.microsoft.com/en-us/library/aa363792%28v=vs.85%29.aspx

      If [CancelIoEx()] succeeds, the return value is nonzero. The
      cancel operation for all pending I/O operations issued by the
      calling thread for the specified file handle was successfully
      requested. The application must not free or reuse the
      OVERLAPPED structure associated with the canceled I/O operations
      until they have completed. The thread can use the
      GetOverlappedResult function to determine when the I/O operations
      themselves have been completed.

      We should wait for cancellation to *complete* by using
      GetOverlappedResult(..., TRUE), otherwise we risk a crash. If the
      operation was cancelled then FALSE is returned with GetLastError()
      == ERROR_OPERATION_ABORTED.

      I would suggest making it a programming error for the overlapped
      object to be deallocated while the operation is still pending, and
      to print a RuntimeError if that happens. (This does not prevent us
      from still trying to prevent a crash using CancelIoEx().)

    6. Not a bug exactly, but poll(timeout) is no longer interruptible
      with Ctrl-C. This also means that Queue.get() is no longer
      interruptible.

    --

    The unit tests attached pass on Unix. On Windows versions up to 3.2,
    only test_empty_string fails. On Windows version 3.3 all three fail.

    @sbt sbt mannequin added extension-modules C modules in the Modules dir type-bug An unexpected behavior, bug, or error labels Jun 13, 2011
    @sbt
    Copy link
    Mannequin Author

    sbt mannequin commented Jun 13, 2011

    The attached patch hopefully fixes problems (1)-(5), but I have never used overlapped I/O before. test_pipe_poll.py passes with these changes.

    @pitrou
    Copy link
    Member

    pitrou commented Jun 13, 2011

    Hello,

    Thanks for the patch.

    I would suggest making it a programming error for the overlapped
    object to be deallocated while the operation is still pending, and
    to print a RuntimeError if that happens. (This does not prevent us
    from still trying to prevent a crash using CancelIoEx().)

    Sounds ok to me.

    1. Not a bug exactly, but poll(timeout) is no longer interruptible
      with Ctrl-C. This also means that Queue.get() is no longer
      interruptible.

    There is infrastructure in _multiprocessing to handle Ctrl-C. Look for
    "sigint_event" in Modules/_multiprocessing/*. This could be the topic of
    a separate issue and patch.

    The unit tests attached pass on Unix. On Windows versions up to 3.2,
    only test_empty_string fails. On Windows version 3.3 all three fail.

    Could you make the tests part of Lib/test/test_multiprocessing.py?

    Also, what is the rationale for the following change:

    •        elif timeout == 0.0:
      

    + elif timeout == 0.0 and nleft != 0:
    return False

    @sbt
    Copy link
    Mannequin Author

    sbt mannequin commented Jun 14, 2011

    Also, what is the rationale for the following change:

    •        elif timeout == 0.0:
      
    •        elif timeout == 0.0 and nleft != 0:
                 return False
      

    If PeekNamedPipe() returns (navail, nleft) there are 3 cases:

    1. navail > 0: just return True
    2. navail = nleft = 0: either pipe is empty or next message is empty
      so fall through to slow path to find out which.
    3. navail = 0, nleft > 0: a non-empty message is coming, but nothing
      is available yet.

    The check is a shortcut for case 3, although it probably never
    occurs. I've removed that check in the new patch, and added the
    unit tests to test_multiprocessing.

    There is infrastructure in _multiprocessing to handle Ctrl-C. Look
    for "sigint_event" in Modules/_multiprocessing/*. This could be
    the topic of a separate issue and patch.

    I will look in to it.

    @sbt
    Copy link
    Mannequin Author

    sbt mannequin commented Jun 15, 2011

    pipe_interruptible.patch is a patch to support to making poll() interruptible. It applies on top of pipe_poll_2.patch.

    I am not sure what the guarantees are for when KeyboardInterrupt will be raised.

    I would have done it a bit differently if I knew a good way to test whether the current thread is the main one. Maybe there should be something like PyThread_is_main() and/or threading.ismain().

    @pitrou
    Copy link
    Member

    pitrou commented Jun 15, 2011

    pipe_interruptible.patch is a patch to support to making poll()
    interruptible. It applies on top of pipe_poll_2.patch.

    Hmm, it seems to me that it should be done in _poll() instead. Otherwise, recv() will not be interruptible, will it?

    Also, after looking at this again, it seems sigint_event would be better provided (at the C level) by the Python core, e.g. as _PyOS_SigintEvent. The time module already uses a similar mechanism.

    By setting the event in the sigint handler itself, rather than in an auxiliary HandlerRoutine callback function, it should also ensure that, when the event is set, the "tripped" bit in signalmodule.c has been set, meaning we wouldn't need to Sleep() anymore.

    (I also wonder why the event isn't auto-reset. Otherwise, there's a race condition - which is probably harmless in most cases, but still.)

    @sbt
    Copy link
    Mannequin Author

    sbt mannequin commented Jun 16, 2011

    Hmm, it seems to me that it should be done in _poll() instead.
    Otherwise, recv() will not be interruptible, will it?

    Or maybe WaitForMultipleObjects() should be changed to also wait on sigint_event if called by the main thread.

    Also, after looking at this again, it seems sigint_event would be
    better provided (at the C level) by the Python core, e.g. as
    _PyOS_SigintEvent. The time module already uses a similar mechanism.

    By setting the event in the sigint handler itself, rather than in an
    auxiliary HandlerRoutine callback function, it should also ensure
    that, when the event is set, the "tripped" bit in signalmodule.c
    has been set, meaning we wouldn't need to Sleep() anymore.

    (I also wonder why the event isn't auto-reset. Otherwise, there's a
    race condition - which is probably harmless in most cases, but still.)

    If it is an auto-reset event then we must worry about a non-main thread stealing the event. Maybe _PyOS_SigintEvent() should return NULL if called by a non-main thread, and be documented with a loud usage warning.

    @pitrou
    Copy link
    Member

    pitrou commented Jun 16, 2011

    > Hmm, it seems to me that it should be done in _poll() instead.
    > Otherwise, recv() will not be interruptible, will it?

    Or maybe WaitForMultipleObjects() should be changed to also wait on
    sigint_event if called by the main thread.

    Indeed, this may be even better.

    If it is an auto-reset event then we must worry about a non-main
    thread stealing the event. Maybe _PyOS_SigintEvent() should return
    NULL if called by a non-main thread, and be documented with a loud
    usage warning.

    You are right, we need a manual reset *or* we must ensure that every
    user of _PyOS_SigintEvent only does so from the main thread.

    @sbt
    Copy link
    Mannequin Author

    sbt mannequin commented Jun 17, 2011

    You are right, we need a manual reset *or* we must ensure that every
    user of _PyOS_SigintEvent only does so from the main thread.

    On second thoughts, even using an auto-reset event, resetting the event before waiting is unavoidable. Otherwise you are liable to catch an event that was set a long time ago.

    @sbt
    Copy link
    Mannequin Author

    sbt mannequin commented Jun 26, 2011

    sigint_event.patch is a patch to make
    _multiprocessing.win32.WaitForMultipleObjects interruptible. It
    applies directly on to default.

    The patch also adds functions _PyOS_SigintEvent and _PyOS_IsMainThread
    which are implemented in signalmodule.c and declared in intrcheck.c.

    _PyOS_SigintEvent returns a manual reset event (cast to void*) which
    is set whenever SIGINT is received. It is Windows only.

    _PyOS_IsMainThread returns 0 or 1 according to whether the current
    thread is the main thread.

    The time and _multiprocessing modules have been updated to use these
    functions.

    Note that WaitForMultipleObjects has a bWaitAll parameter. When this
    is true, all handles in the array are waited for, and
    WaitForMultipleObjects is not interruptible.

    @sbt
    Copy link
    Mannequin Author

    sbt mannequin commented Jun 26, 2011

    I have noticed a few more problems.

    • Because poll() isn't thread safe on Windows, neither is Queue.empty(). Since a queue's pipe will never contain empty messages, this can be fixed easily by using (a wrapper for) win32.PeekNamedPipe().

    • PipeListener/PipeClient have not been updated to use overlapped I/O.

    • If more than one process is to read from a pipe connection then (even with proper synchronisation) we cannot safely use poll() since it can leave a partial message in the pipe.

    @sbt
    Copy link
    Mannequin Author

    sbt mannequin commented Nov 20, 2011

    Here is an updated patch (pipe_poll_fix.patch) which should be applied on top of sigint_event.patch.

    It fixes the problems with PipeConnection.poll() and Queue.empty() and makes PipeListener.accept() use overlapped I/O. This should make all the pipe releated blocking functions/methods interruptible on Windows.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Nov 21, 2011

    New changeset 08953a04b2e6 by Antoine Pitrou in branch 'default':
    Issue bpo-12328: Under Windows, refactor handling of Ctrl-C events and
    http://hg.python.org/cpython/rev/08953a04b2e6

    @pitrou
    Copy link
    Member

    pitrou commented Nov 21, 2011

    I've re-added the fast uncontended path in semaphore.c and committed sigint_event.patch. Now I'm gonna take a look at pipe_poll_fix.patch...

    @pitrou
    Copy link
    Member

    pitrou commented Nov 23, 2011

    Here is an updated patch (pipe_poll_fix.patch) which should be applied
    on top of sigint_event.patch.

    It fixes the problems with PipeConnection.poll() and Queue.empty() and
    makes PipeListener.accept() use overlapped I/O. This should make all
    the pipe releated blocking functions/methods interruptible on Windows.

    I have the feeling that if we have to call GetLastError() at the Python
    level, then there's something wrong with the APIs we're exposing from
    the C extension.
    I see you check for ERROR_OPERATION_ABORTED. Is there any situation
    where this can happen?
    Also, it seems strange to call ov.cancel() and then
    ov.GetOverlappedResult(). AFAICT, those two operations should be
    mutually exclusive: you call the former if e.g. WaitForMultipleObjects
    raised an exception, the latter otherwise.

    @sbt
    Copy link
    Mannequin Author

    sbt mannequin commented Nov 23, 2011

    I have the feeling that if we have to call GetLastError() at the
    Python level, then there's something wrong with the APIs we're
    exposing from the C extension.
    I see you check for ERROR_OPERATION_ABORTED. Is there any situation
    where this can happen?

    There are three "expected" error conditions:

    ERROR_OPERATION_ABORTED: operation stopped by CancelIo(Ex)()
    ERROR_MORE_DATA: operation complete, but only got part of the message
    ERROR_IO_INCOMPLETE: operation still has not finished

    In the win32 api you need GetLastError() to distinguish between these cases, but maybe we can expose something better.

    Also, it seems strange to call ov.cancel() and then
    ov.GetOverlappedResult(). AFAICT, those two operations should be
    mutually exclusive: you call the former if e.g. WaitForMultipleObjects
    raised an exception, the latter otherwise.

    If WaitForMultipleObjects() returns WAIT_OBJECT_0 + 0 then, as you say, there is no need to call ov.cancel(), but it is harmless to call it if the operation has already completed.

    The cases aren't really mutually exclusive: if you call ov.cancel() you *must* still do ov.GetOverlappedResult(True) to check for the race where the operation completes after the wait times out, but before ov.cancel() can stop it. (Your original implementation had that bug -- see point (5) in my original bug report.)

    @pitrou
    Copy link
    Member

    pitrou commented Nov 23, 2011

    There are three "expected" error conditions:

    ERROR_OPERATION_ABORTED: operation stopped by CancelIo(Ex)()
    ERROR_MORE_DATA: operation complete, but only got part of the message
    ERROR_IO_INCOMPLETE: operation still has not finished

    In the win32 api you need GetLastError() to distinguish between these
    cases, but maybe we can expose something better.

    It seems to me that ERROR_OPERATION_ABORTED is a "true" error, and so
    should raise an exception.

    The cases aren't really mutually exclusive: if you call ov.cancel()
    you *must* still do ov.GetOverlappedResult(True) to check for the race
    where the operation completes after the wait times out, but before
    ov.cancel() can stop it. (Your original implementation had that bug
    -- see point (5) in my original bug report.)

    Ah, right. Thanks for the explanation.

    @sbt
    Copy link
    Mannequin Author

    sbt mannequin commented Nov 23, 2011

    It seems to me that ERROR_OPERATION_ABORTED is a "true" error, and so
    should raise an exception.

    I guess so, although we do expect it whenever poll() times out. What exception would be appropriate? BlockingIOError? TimeoutError?

    @pitrou
    Copy link
    Member

    pitrou commented Nov 24, 2011

    > It seems to me that ERROR_OPERATION_ABORTED is a "true" error, and so
    > should raise an exception.

    I guess so, although we do expect it whenever poll() times out. What
    exception would be appropriate? BlockingIOError? TimeoutError?

    I would say either an OSError with the appropriate winerror, or a
    specific exception (_multiprocessing.win32.OperationAbortedError?).
    Or perhaps a ValueError.

    @sbt
    Copy link
    Mannequin Author

    sbt mannequin commented Jan 27, 2012

    Quite honestly I don't like the way that polling a pipe reads a partial message from the pipe. If at all possible, polling should not modify the pipe.

    I think the cleanest thing would be to switch to byte oriented pipes on Windows and create PipeIO which subclasses RawIOBase. (Since socket handles are really just overlapped file handles, PipeIO works for them too as long as closesocket() is used instead of CloseHandle().) On Unix FileIO would be used instead. Then Connection can just be a thin wrapper around a file object.

    Polling on Windows can then be done by creating a wrapper for an overlapped object which represents a zero length read, and can be used with WaitForMultipleObjects(). This lets us implement a select-like wait() function in Python which works with both sockets and Connection objects.

    Attached is an extension implementing PipeIO (and the overlapped wrapper), a Python module implementing Connection and wait(), and a test.

    @sbt
    Copy link
    Mannequin Author

    sbt mannequin commented Feb 1, 2012

    I have done an updated patch. (It does *not* switch to using bytes oriented pipes as I suggested in the previous message.)

    The patch also adds a wait() function with signature

        wait(object_list, timeout=None)

    for polling multiple objects at once. On Unix it is just a wrapper for

        select.select(object_list, [], [], timeout)

    except that it retries when it gets EINTR. wait() works with "connected" sockets too, although on Windows it does not work for "listening" sockets.

    The patch removes SentinelReady and changes concurrent.futures to use wait() instead.

    Polling is now done by issuing zero length overlapped reads. This means that the pipe is not modified except possibly if a zero length message is removed.

    I changed ReadFile(), WriteFile() and GetOverlappedResult() to return pairs, the second entry of which is zero or an "expected" error code. (Unexpected errors still raise an exception.) This avoids the need to ever use GetLastError().

    @pitrou
    Copy link
    Member

    pitrou commented Mar 3, 2012

    Hmm, I tried to apply the latest patch to the default branch and it failed.
    It also seems the patch was done against a changeset (508bc675af63) which doesn't exist in the repo...

    @sbt
    Copy link
    Mannequin Author

    sbt mannequin commented Mar 4, 2012

    Hmm, I tried to apply the latest patch to the default branch and it
    failed. It also seems the patch was done against a changeset
    (508bc675af63) which doesn't exist in the repo...

    I will do an updated patch against a "public" changeset. (I usually use a patch for the project files to disable those extensions which my windows setup cannot compile.)

    @sbt
    Copy link
    Mannequin Author

    sbt mannequin commented Mar 4, 2012

    Updated patch against 2822765e48a7.

    @sbt
    Copy link
    Mannequin Author

    sbt mannequin commented Mar 5, 2012

    Updated patch addressing Antoine's comments.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Mar 5, 2012

    New changeset 75c27daa592e by Antoine Pitrou in branch 'default':
    Issue bpo-12328: Fix multiprocessing's use of overlapped I/O on Windows.
    http://hg.python.org/cpython/rev/75c27daa592e

    @pitrou
    Copy link
    Member

    pitrou commented Mar 5, 2012

    Committed, thanks!

    @pitrou pitrou closed this as completed Mar 5, 2012
    @pitrou
    Copy link
    Member

    pitrou commented Mar 9, 2012

    One of our buildbots seems to have recurring issues with the timeout issues. I don't know if the machine is very loaded:

    ======================================================================
    FAIL: test_wait_integer (test.test_multiprocessing.TestWait)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\test\test_multiprocessing.py", line 2617, in test_wait_integer
        self.assertLess(delta, expected + 2)
    AssertionError: 8.592355012893677 not less than 7

    ======================================================================
    FAIL: test_wait_timeout (test.test_multiprocessing.TestWait)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\test\test_multiprocessing.py", line 2590, in test_wait_timeout
        self.assertLess(delta, expected + 0.5)
    AssertionError: 1.532202959060669 not less than 1.5

    http://www.python.org/dev/buildbot/all/builders/x86%20XP-4%203.x

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Mar 20, 2014

    New changeset bdad874195d6 by Victor Stinner in branch '3.4':
    Isuse bpo-12328, bpo-20978: Add _winapi.WAIT_ABANDONED_0 symbol, needed by
    http://hg.python.org/cpython/rev/bdad874195d6

    New changeset 2e4692a762d5 by Victor Stinner in branch 'default':
    (Merge 3.4) Issue bpo-12328, bpo-20978: Add _winapi.WAIT_ABANDONED_0 symbol, needed
    http://hg.python.org/cpython/rev/2e4692a762d5

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    extension-modules C modules in the Modules dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    1 participant