Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test_asyncio: test_start_tls_server_1() fails on Python on x86 Windows7 3.7 and 3.x #77875

Closed
vstinner opened this issue May 30, 2018 · 32 comments
Labels
3.7 (EOL) end of life 3.8 only security fixes tests Tests in the Lib/test dir topic-asyncio

Comments

@vstinner
Copy link
Member

BPO 33694
Nosy @vstinner, @ned-deily, @miss-islington
PRs
  • WIP: bpo-33694, asyncio: fix proactor transport pause/resume #7479
  • bpo-33694: asyncio: Fix race in Proactor's Transport.set_protocol() #7486
  • bpo-33694: Fix race condition on proactor recv() #7498
  • [3.7] bpo-33694: Fix race condition in asyncio proactor (GH-7498) #7499
  • bpo-33694: Fix typo in asyncio helper function name #7522
  • [3.7] bpo-33694: Fix typo in helper function name (GH-7522) #7523
  • bpo-33694: Remove test_asyncio ProactorDatagramTransportTests #16288
  • [3.8] bpo-33694: Remove test_asyncio ProactorDatagramTransportTests (GH-16288) #16305
  • Files
  • Screen Shot 2018-05-30 at 11.38.38 AM.png
  • race.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2018-06-09.16:44:39.217>
    created_at = <Date 2018-05-30.11:19:38.205>
    labels = ['3.8', '3.7', 'tests', 'expert-asyncio']
    title = 'test_asyncio: test_start_tls_server_1() fails on Python on x86 Windows7 3.7 and 3.x'
    updated_at = <Date 2019-09-20.21:27:31.745>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2019-09-20.21:27:31.745>
    actor = 'miss-islington'
    assignee = 'none'
    closed = True
    closed_date = <Date 2018-06-09.16:44:39.217>
    closer = 'vstinner'
    components = ['Tests', 'asyncio']
    creation = <Date 2018-05-30.11:19:38.205>
    creator = 'vstinner'
    dependencies = []
    files = ['47626', '47632']
    hgrepos = []
    issue_num = 33694
    keywords = ['patch']
    message_count = 32.0
    messages = ['318164', '318165', '318193', '318194', '318195', '318209', '318424', '318569', '318705', '318707', '318809', '318869', '318872', '318873', '318874', '318875', '318876', '318915', '318921', '318977', '318979', '318993', '319034', '319035', '319155', '319175', '319209', '319220', '319299', '319450', '352891', '352894']
    nosy_count = 3.0
    nosy_names = ['vstinner', 'ned.deily', 'miss-islington']
    pr_nums = ['7479', '7486', '7498', '7499', '7522', '7523', '16288', '16305']
    priority = None
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue33694'
    versions = ['Python 3.7', 'Python 3.8']

    @vstinner
    Copy link
    Member Author

    test_asyncio.test_start_tls_server_1() got many fixes recently: see bpo-32458 and bpo-33674... but it still fails on Python on x86 Windows7 3.x at revision bb9474f which contains all these fixes.

    The test fails even when test_asyncio is re-run alone (not when other tests run in parallel).

    http://buildbot.python.org/all/#/builders/58/builds/930

    ======================================================================
    ERROR: test_start_tls_server_1 (test.test_asyncio.test_sslproto.ProactorStartTLSTests)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\test_asyncio\test_sslproto.py", line 467, in test_start_tls_server_1
        self.loop.run_until_complete(run_main())
      File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\asyncio\base_events.py", line 566, in run_until_complete
        raise RuntimeError('Event loop stopped before Future completed.')
    RuntimeError: Event loop stopped before Future completed.

    ======================================================================
    FAIL: test_start_tls_server_1 (test.test_asyncio.test_sslproto.ProactorStartTLSTests)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\test_asyncio\functional.py", line 42, in tearDown
        self.fail('unexpected calls to loop.call_exception_handler()')
    AssertionError: unexpected calls to loop.call_exception_handler()

    The test fails also on x86 Windows7 3.7:

    ======================================================================
    ERROR: test_start_tls_server_1 (test.test_asyncio.test_sslproto.ProactorStartTLSTests)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "D:\cygwin\home\db3l\buildarea\3.7.bolen-windows7\build\lib\test\test_asyncio\test_sslproto.py", line 467, in test_start_tls_server_1
        self.loop.run_until_complete(run_main())
      File "D:\cygwin\home\db3l\buildarea\3.7.bolen-windows7\build\lib\asyncio\base_events.py", line 566, in run_until_complete
        raise RuntimeError('Event loop stopped before Future completed.')
    RuntimeError: Event loop stopped before Future completed.

    ======================================================================
    FAIL: test_start_tls_server_1 (test.test_asyncio.test_sslproto.ProactorStartTLSTests)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "D:\cygwin\home\db3l\buildarea\3.7.bolen-windows7\build\lib\test\test_asyncio\functional.py", line 42, in tearDown
        self.fail('unexpected calls to loop.call_exception_handler()')
    AssertionError: unexpected calls to loop.call_exception_handler()

    Moreover, 3.7 got an additional failure:

    ======================================================================
    ERROR: test_pipe_handle (test.test_asyncio.test_windows_utils.PipeTests)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "D:\cygwin\home\db3l\buildarea\3.7.bolen-windows7\build\lib\test\test_asyncio\test_windows_utils.py", line 73, in test_pipe_handle
        raise RuntimeError('expected ERROR_INVALID_HANDLE')
    RuntimeError: expected ERROR_INVALID_HANDLE

    The Windows7 buildbot is known to be slow. For example, test_value() of test_multiprocessing_spawn failed with a timeout (15 min) on x86 Windows7 3.x:

    0:48:12 [276/416/1] test_multiprocessing_spawn crashed (Exit code 1)
    Timeout (0:15:00)!
    Thread 0x000008e4 (most recent call first):
    File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\multiprocessing\popen_spawn_win32.py", line 80 in wait
    File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\multiprocessing\process.py", line 140 in join
    File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\_test_multiprocessing.py", line 1925 in test_value

    Maybe something is wrong with my karma, I don't know. Or test_asyncio just hate me.

    @vstinner vstinner added 3.7 (EOL) end of life 3.8 only security fixes tests Tests in the Lib/test dir topic-asyncio labels May 30, 2018
    @vstinner
    Copy link
    Member Author

    The test fails also on x86 Windows7 3.7:
    http://buildbot.python.org/all/#/builders/111/builds/297

    @1st1
    Copy link
    Member

    1st1 commented May 30, 2018

    I can't reproduce test_start_tls_server_1 fails when I do (screenshot attached)

    1. run test_asyncio
    2. run test_asyncio.test_sslproto
    3. run test_asyncio.test_sslproto -m test_start_tls_server_1

    I run them in Windows 7 VM on Mac OS.

    All other tests pass fine for me except test_sendfile_close_peer_in_the_middle_of_receiving. Andrew, please look at this, this is your test, and I'm not sure I follow why we use SNDBUF at all. IMO it should be possible to write a test without using that (or if not, maybe time to use mocks).

    With regards to start_tls tests -- I give up, I don't have time to look into this before 3.7.0. I suggest to mask the test on the specific win buildbot it crashes on and fix it before 3.7.1 (or maybe when Windows users come up with a reliable way to reproduce the bug).

    @asvetlov
    Copy link
    Contributor

    I used SNDBUF to enforce send buffer overloading.
    It is not required by sendfile tests but I thought that better to have non-mocked way to test such situations.
    We can remove the socket buffers size manipulation at all without any problem.

    @1st1
    Copy link
    Member

    1st1 commented May 30, 2018

    We can remove the socket buffers size manipulation at all without any problem.

    When I tried to do that I think I was having more failures with that test. But really up to you.

    @vstinner
    Copy link
    Member Author

    I used SNDBUF to enforce send buffer overloading.

    What is that? Would you mind to elaborate? I'm curious :-)

    @vstinner
    Copy link
    Member Author

    vstinner commented Jun 1, 2018

    The test also failed on AMD64 Windows8.1 Refleaks 3.7. This buildbot is also very slow because it runs reference leak hunting. It confirms that the remaining issue is a race condition which is only seen when the system becomes slow.

    http://buildbot.python.org/all/#/builders/132/builds/154

    ======================================================================
    ERROR: test_start_tls_server_1 (test.test_asyncio.test_sslproto.ProactorStartTLSTests)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "D:\buildarea\3.7.ware-win81-release.refleak\build\lib\test\test_asyncio\test_sslproto.py", line 467, in test_start_tls_server_1
        self.loop.run_until_complete(run_main())
      File "D:\buildarea\3.7.ware-win81-release.refleak\build\lib\asyncio\base_events.py", line 566, in run_until_complete
        raise RuntimeError('Event loop stopped before Future completed.')
    RuntimeError: Event loop stopped before Future completed.

    ======================================================================
    FAIL: test_start_tls_server_1 (test.test_asyncio.test_sslproto.ProactorStartTLSTests)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "D:\buildarea\3.7.ware-win81-release.refleak\build\lib\test\test_asyncio\functional.py", line 42, in tearDown
        self.fail('unexpected calls to loop.call_exception_handler()')
    AssertionError: unexpected calls to loop.call_exception_handler()

    @pablogsal
    Copy link
    Member

    The test fails also on x86 Windows7 3.x:
    http://buildbot.python.org/all/#/builders/58/builds/947

    @vstinner
    Copy link
    Member Author

    vstinner commented Jun 4, 2018

    A variant of the bug?

    AMD64 Windows8.1 Non-Debug 3.x:
    http://buildbot.python.org/all/#/builders/12/builds/948

    ======================================================================
    ERROR: test_create_connection_ssl_failed_certificate (test.test_asyncio.test_sslproto.ProactorStartTLSTests)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "D:\buildarea\3.x.ware-win81-release\build\lib\test\test_asyncio\test_sslproto.py", line 623, in test_create_connection_ssl_failed_certificate
        self.loop.run_until_complete(client(srv.addr))
      File "D:\buildarea\3.x.ware-win81-release\build\lib\asyncio\base_events.py", line 566, in run_until_complete
        raise RuntimeError('Event loop stopped before Future completed.')
    RuntimeError: Event loop stopped before Future completed.

    @vstinner
    Copy link
    Member Author

    vstinner commented Jun 4, 2018

    AMD64 Windows8.1 Non-Debug 3.x
    ERROR: test_create_connection_ssl_failed_certificate (test.test_asyncio.test_sslproto.ProactorStartTLSTests)

    Not surprising: same error on AMD64 Windows8.1 Non-Debug 3.7:
    http://buildbot.python.org/all/#builders/133/builds/304

    @vstinner
    Copy link
    Member Author

    vstinner commented Jun 6, 2018

    Another variant seen on AppVeyor (PR 7439, master branch):

    https://ci.appveyor.com/project/python/cpython/build/3.8build16980

    ======================================================================
    ERROR: test_create_connection_ssl_failed_certificate (test.test_asyncio.test_sslproto.ProactorStartTLSTests)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "C:\projects\cpython\lib\test\test_asyncio\test_sslproto.py", line 623, in test_create_connection_ssl_failed_certificate
        self.loop.run_until_complete(client(srv.addr))
      File "C:\projects\cpython\lib\asyncio\base_events.py", line 566, in run_until_complete
        raise RuntimeError('Event loop stopped before Future completed.')
    RuntimeError: Event loop stopped before Future completed.

    @vstinner
    Copy link
    Member Author

    vstinner commented Jun 7, 2018

    The bug can be reproduced on Linux using this patch:

    diff --git a/Lib/test/test_asyncio/test_sslproto.py b/Lib/test/test_asyncio/test_sslproto.py
    index 78ab1eb822..735313152c 100644
    --- a/Lib/test/test_asyncio/test_sslproto.py
    +++ b/Lib/test/test_asyncio/test_sslproto.py
    @@ -471,6 +471,8 @@ class BaseStartTLS(func_tests.FunctionalTestCaseMixin):
     
                 self.assertEqual(proto.data, b'')
     
    +            await asyncio.sleep(1)
    +
                 new_tr = await self.loop.start_tls(
                     tr, proto, server_context,
                     server_side=True,

    @vstinner
    Copy link
    Member Author

    vstinner commented Jun 7, 2018

    Call stack when the asyncio client connects to the server when the bug occurs:

    SSLProtocol.connection_made()
    -> SSLProtocol._start_handshake()
    -> SSLProtocol._process_write_backlog()
    -> SSLPipe.do_handshake()
    -> SSLPipe.feed_ssldata(b'', only_handshake=True)
    -> SSLObject.do_handshake()
    -> C SSL_do_handshake() returns immediately because the socket is non-blocking

    It seems like SSLObject.do_handshake() is only attempted once and... then nothing. The client is supposed to send data, retry the handshake, or something, but it does *nothing*. So the handshake never completes and the test hangs to later fail with a timeout.

    @vstinner
    Copy link
    Member Author

    vstinner commented Jun 7, 2018

    The issue occurs when the server calls start_tls() because the client calls start_tls(). In that case, ServerProto.data_received() is called with the server TLS handshake and so the client SSLProtocol never this this handshake from server...

    @vstinner
    Copy link
    Member Author

    vstinner commented Jun 7, 2018

    Sorry, the patch to reproduce the issue on Linux is wrong: it introduces a bug in the test. Ignore this comment.

    --

    It seems like I found the root cause: pause_reading() / resume_reading() on the transport is not safe. Sometimes, we loose data. See the "TODO" below...

    class _ProactorReadPipeTransport(_ProactorBasePipeTransport,
                                     transports.ReadTransport):
        """Transport for read pipes."""
        (...)
        def pause_reading(self):
            if self._closing or self._paused:
                return
            self._paused = True
    
            if self._read_fut is not None and not self._read_fut.done():
                # TODO: This is an ugly hack to cancel the current read future
                # *and* avoid potential race conditions, as read cancellation
                # goes through `future.cancel()` and `loop.call_soon()`.
                # We then use this special attribute in the reader callback to
                # exit *immediately* without doing any cleanup/rescheduling.
                self._read_fut.__asyncio_cancelled_on_pause__ = True
    
                self._read_fut.cancel()
                self._read_fut = None
                self._reschedule_on_resume = True
    
            if self._loop.get_debug():
                logger.debug("%r pauses reading", self)

    If you remove the "ugly hack", the test no longer hangs...

    @vstinner
    Copy link
    Member Author

    vstinner commented Jun 7, 2018

    _ProactorReadPipeTransport.set_transport():

            if self.is_reading():
                # reset reading callback / buffers / self._read_fut
                self.pause_reading()
                self.resume_reading()

    This method cancelled the pending overlapped WSARecv(), and then create a new overlapped WSARecv().

    Even after CancelIoEx(old overlapped), the IOCP loop still gets an event for the completion of the recv. Problem: since the Python future is cancelled, the even is ignored and so 176 bytes are lost.

    I'm surprised that an overlapped WSARecv() cancelled by CancelIoEx() still returns data when IOCP polls for events.

    @vstinner
    Copy link
    Member Author

    vstinner commented Jun 7, 2018

    Something else. The bug occurs when CancelIoEx() (on the current overlapped WSARecv()) fails internally with ERROR_NOT_FOUND. According to overlapped.c, it means:

    /* CancelIoEx returns ERROR_NOT_FOUND if the I/O completed in-between */

    HasOverlappedIoCompleted() returns 0 in that case.

    The problem is that currently, Overlapped.cancel() also returns None in that case, and later the asyncio IOCP loop ignores the completion event and so drops incoming received data.

    @vstinner
    Copy link
    Member Author

    vstinner commented Jun 7, 2018

    Yury, Andrew, Ned: I set the priority to release blocker because I'm scared by what I saw. The START TLS has a race condition in its ProactorEventLoop implementation. But the bug doesn't see to be specific to START TLS, but rather to transport.set_transport(), and even more generally to transport.pause_reading() / transport.resume_reading(). The bug is quite severe: we loose data and it's really hard to know why (I spent a few hours to add many many print and try to reproduce on a very tiny reliable unit test). As an asyncio user, I expect that transports are 100% reliable, and I would first look into my look (like looking into start_tls() implementation in my case).

    If the bug was very specific to start_tls(), I would suggest to "just" "disable" start_tls() on ProactorEventLoop (sorry, Windows!). But since the data loss seems to concern basically any application using ProactorEventLoop, I don't see any simple workaround.

    My hope is that a fix can be written shortly to not block the 3.7.0 final release for too long :-(

    Yury, Andrew: Can you please just confirm that it's a regression and that a release blocker is justified?

    @vstinner
    Copy link
    Member Author

    vstinner commented Jun 7, 2018

    race.py: simple echo client and server sending packets in both directions. Pause/resume reading the client transport every 100 ms to trigger the bug.

    Using ProactorEventLoop and 2000 packets of 16 KiB, I easily reproduce the bug.

    So again, it's nothing related to start_tls(), start_tls() was just one way to spot the bug.

    The bug is in Proactor transport: the cancellation of overlapped WSARecv() sometime drops packets. The bug occurs when CancelIoEx() fails with ERROR_NOT_FOUND which means that the I/O (WSARecv()) completed.

    One solution would be to not cancel WSARecv() on pause_reading(): wait until the current WSARecv() completes, store data something but don't pass it to protocol.data_received()!, and no schedule a new WSARecv(). Once reading is resumed: call protocol.data_received() and schedule a new WSARecv().

    That would be a workaround. I don't know how to really fix WSARecv() cancellation without loosing data. A good start would be to modify Overlapped.cancel() to return a boolean to notice if the overlapped I/O completed even if we just cancelled it. Currently, the corner case (CancelIoEx() fails with ERROR_NOT_FOUND) is silently ignored, and then the IOCP loop silently ignores the event of completed I/O...

    @vstinner
    Copy link
    Member Author

    vstinner commented Jun 7, 2018

    New changeset 79790bc by Victor Stinner in branch 'master':
    bpo-33694: Fix race condition in asyncio proactor (GH-7498)
    79790bc

    @vstinner
    Copy link
    Member Author

    vstinner commented Jun 7, 2018

    New changeset 8fa398d by Victor Stinner (Miss Islington (bot)) in branch '3.7':
    bpo-33694: Fix race condition in asyncio proactor (GH-7498) (GH-7499)
    8fa398d

    @vstinner
    Copy link
    Member Author

    vstinner commented Jun 8, 2018

    I fixed the root issue, a race condition in ProactorEventLoop. But I prefer to keep the issue open a few days to see if the bug is really gone from all CIs.

    @vstinner
    Copy link
    Member Author

    vstinner commented Jun 8, 2018

    New changeset ff6c077 by Victor Stinner in branch 'master':
    bpo-33694: Fix typo in helper function name (GH-7522)
    ff6c077

    @miss-islington
    Copy link
    Contributor

    New changeset 17beebc by Miss Islington (bot) in branch '3.7':
    bpo-33694: Fix typo in helper function name (GH-7522)
    17beebc

    @vstinner
    Copy link
    Member Author

    vstinner commented Jun 9, 2018

    I ran attached race.py on the 3.6 branch:

    • 2000 packets of 16 KiB: ok
    • 50 packets of 16 MiB: ok

    It seems like Python 3.6 (at least the 3.6 development branch) doesn't have this bug.

    It didn't see the bug recently, so I close the issue.

    Python 3.6 has a different issue: pause_reading() cannot be called twice, same issue with resume_reading(). But I'm not sure that it can be called a bug. Python 3.7 allows to call these methods twice and to call these methods while closing: they just do nothing in that case.

    I close the issue.

    @1st1
    Copy link
    Member

    1st1 commented Jun 9, 2018

    pause_reading and resume_reading became idempotent only in 3.7

    @vstinner
    Copy link
    Member Author

    pause_reading and resume_reading became idempotent only in 3.7

    Do you consider that it's a new feature or a bugfix? Should we backport this change to 3.6? Currently, race.py fails with an error because of the 3.6 behaviour. (I modified asyncio to be able to run race.py on 3.6.)

    @1st1
    Copy link
    Member

    1st1 commented Jun 10, 2018

    Since it a minor change we can reconsider it as s bug fix. Feel free to make a pr.

    @vstinner
    Copy link
    Member Author

    "Since it a minor change we can reconsider it as s bug fix. Feel free
    to make a pr."

    Ok, I created #7629 for bpo-32356

    @vstinner
    Copy link
    Member Author

    New changeset 142e3c0 by Victor Stinner in branch '3.6':
    [3.6] bpo-32356: idempotent pause_/resume_reading (GH-4914) (GH-7629)
    142e3c0

    @vstinner
    Copy link
    Member Author

    New changeset bc2256e by Victor Stinner in branch 'master':
    bpo-33694: Remove test_asyncio ProactorDatagramTransportTests (GH-16288)
    bc2256e

    @miss-islington
    Copy link
    Contributor

    New changeset b50edac by Miss Islington (bot) in branch '3.8':
    bpo-33694: Remove test_asyncio ProactorDatagramTransportTests (GH-16288)
    b50edac

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life 3.8 only security fixes tests Tests in the Lib/test dir topic-asyncio
    Projects
    None yet
    Development

    No branches or pull requests

    5 participants