Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BaseSelectorEventLoop.sock_{recv,sendall}() don't remove their callbacks when canceled #74250

Closed
abacabadabacaba mannequin opened this issue Apr 13, 2017 · 41 comments
Closed
Assignees
Labels
3.9 only security fixes 3.10 only security fixes topic-asyncio type-bug An unexpected behavior, bug, or error

Comments

@abacabadabacaba
Copy link
Mannequin

abacabadabacaba mannequin commented Apr 13, 2017

BPO 30064
Nosy @fantix, @ned-deily, @asvetlov, @ambv, @1st1, @achimnol, @miss-islington, @aeros
PRs
  • bpo-30064: Refactor sock_* asyncio API #10419
  • bpo-30064: Fix asyncio loop.sock_* race condition issue #20369
  • [3.9] bpo-30064: Fix asyncio loop.sock_* race condition issue (GH-20369) #20460
  • bpo-30064: Fix unstable asyncio "racing" socket tests #20485
  • [3.9] bpo-30064: Fix unstable asyncio "racing" socket tests (GH-20485) #20487
  • bpo-30064: Properly skip unstable loop.sock_connect() racing test #20494
  • [3.9] bpo-30064: Properly skip unstable loop.sock_connect() racing test (GH-20494) #20503
  • bpo-30064: Fix slow asyncio sock test #20868
  • [3.9] bpo-30064: Fix slow asyncio sock test (GH-20868) #20869
  • Files
  • cancelled-sock-recv-race.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/asvetlov'
    closed_at = None
    created_at = <Date 2017-04-13.08:55:09.854>
    labels = ['type-bug', '3.9', '3.10', 'expert-asyncio']
    title = "BaseSelectorEventLoop.sock_{recv,sendall}() don't remove their callbacks when canceled"
    updated_at = <Date 2020-06-15.16:37:16.856>
    user = 'https://bugs.python.org/abacabadabacaba'

    bugs.python.org fields:

    activity = <Date 2020-06-15.16:37:16.856>
    actor = 'vstinner'
    assignee = 'asvetlov'
    closed = False
    closed_date = None
    closer = None
    components = ['asyncio']
    creation = <Date 2017-04-13.08:55:09.854>
    creator = 'abacabadabacaba'
    dependencies = []
    files = ['49146']
    hgrepos = []
    issue_num = 30064
    keywords = ['patch']
    message_count = 41.0
    messages = ['291593', '326294', '329746', '355272', '355273', '357989', '358011', '358098', '358108', '362637', '362671', '362690', '368553', '368701', '368702', '370123', '370131', '370143', '370147', '370159', '370161', '370205', '370215', '370218', '370219', '370234', '370268', '370269', '371240', '371247', '371467', '371478', '371479', '371485', '371489', '371490', '371491', '371492', '371495', '371498', '371499']
    nosy_count = 9.0
    nosy_names = ['fantix', 'ned.deily', 'asvetlov', 'abacabadabacaba', 'lukasz.langa', 'yselivanov', 'achimnol', 'miss-islington', 'aeros']
    pr_nums = ['10419', '20369', '20460', '20485', '20487', '20494', '20503', '20868', '20869']
    priority = 'high'
    resolution = None
    stage = 'commit review'
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue30064'
    versions = ['Python 3.9', 'Python 3.10']

    @abacabadabacaba
    Copy link
    Mannequin Author

    abacabadabacaba mannequin commented Apr 13, 2017

    Code:

        import asyncio as a
        import socket as s
    
        @a.coroutine
        def coro():
            s1, s2 = s.socketpair()
            s1.setblocking(False)
            s2.setblocking(False)
            try:
                yield from a.wait_for(loop.sock_recv(s2, 1), 1)
            except a.TimeoutError:
                pass
            yield from loop.sock_sendall(s1, b'\x00')
            yield
            s1.close()
            s2.close()
    
        loop = a.get_event_loop()
        loop.run_until_complete(coro())

    Result:

        Exception in callback BaseSelectorEventLoop._sock_recv(<Future cancelled>, True, <socket.socke...2049, proto=0>, 1)
        handle: <Handle BaseSelectorEventLoop._sock_recv(<Future cancelled>, True, <socket.socke...2049, proto=0>, 1)>
        Traceback (most recent call last):
          File "/usr/lib/python3.6/asyncio/events.py", line 127, in _run
            self._callback(*self._args)
          File "/usr/lib/python3.6/asyncio/selector_events.py", line 378, in _sock_recv
            self.remove_reader(fd)
          File "/usr/lib/python3.6/asyncio/selector_events.py", line 342, in remove_reader
            return self._remove_reader(fd)
          File "/usr/lib/python3.6/asyncio/selector_events.py", line 279, in _remove_reader
            key = self._selector.get_key(fd)
          File "/usr/lib/python3.6/selectors.py", line 189, in get_key
            return mapping[fileobj]
          File "/usr/lib/python3.6/selectors.py", line 70, in __getitem__
            fd = self._selector._fileobj_lookup(fileobj)
          File "/usr/lib/python3.6/selectors.py", line 224, in _fileobj_lookup
            return _fileobj_to_fd(fileobj)
          File "/usr/lib/python3.6/selectors.py", line 41, in _fileobj_to_fd
            raise ValueError("Invalid file descriptor: {}".format(fd))
        ValueError: Invalid file descriptor: -1

    @abacabadabacaba abacabadabacaba mannequin added topic-asyncio type-bug An unexpected behavior, bug, or error labels Apr 13, 2017
    @1st1
    Copy link
    Member

    1st1 commented Sep 24, 2018

    See also https://bugs.python.org/issue34795

    @1st1 1st1 added the 3.8 only security fixes label Sep 24, 2018
    @asvetlov asvetlov self-assigned this Oct 9, 2018
    @asvetlov
    Copy link
    Contributor

    New changeset 7438792 by Andrew Svetlov in branch 'master':
    bpo-30064: Refactor sock_* asyncio API (bpo-10419)
    7438792

    @1st1
    Copy link
    Member

    1st1 commented Oct 23, 2019

    I think this broke asyncio a bit. These two uvloop sock_* API tests no longer pass on asyncio 3.8:

    Andrew, since this was your patch, please take a look.

    @1st1 1st1 reopened this Oct 23, 2019
    @1st1
    Copy link
    Member

    1st1 commented Oct 23, 2019

    I'll elevate the status so that we don't forget before 3.8.1 is too close

    @ned-deily
    Copy link
    Member

    I'll elevate the status so that we don't forget before 3.8.1 is too close

    Andrew, Yury: ping. 3.8.1 cutoff is approaching.

    @asvetlov
    Copy link
    Contributor

    asvetlov commented Dec 8, 2019

    Not sure if failed uvloop tests are correct.
    The scenario is like the following:

    1. Suppose we have an unblocking socket connected to peer.
    2. Create a task for reading data:
      task = asyncio.create_task(loop.sock_read(sock, 1))
      Note, the task is not started yet.
    3. Just after it, cancel the task without waiting for the actual cancellation
      task.cancel()
    4. Finally, read from the socket again:
      data = await loop.sock_read(sock, 1)

    If I put context switch (await asyncio.sleep(0)) between 3) and 4) *or* replace direct sock_read() call in 4) with creation a task (data = await asyncio.create_task(loop.sock_read(sock, 1))) the cancellation of former read is performed and test passes.

    I very doubt if any sane code is organizing like this test: start delayed reading, cancel it and read again.

    The worse, neither previous not current sock_read() implementation doesn't prevent the concurrent reading which basically delivers data in an unpredictable order. Honestly, I'm not sure if we need to provide the concurrency guarantee for such low-level functions. The code built on top of these low-level primitives should handle the cuncurrent access problem IMHO.

    @ambv
    Copy link
    Contributor

    ambv commented Dec 9, 2019

    Note: this is going to miss Python 3.8.1 as I'm releasing 3.8.1rc1 right now. Please address this before 3.8.2 (due in February).

    @asvetlov
    Copy link
    Contributor

    asvetlov commented Dec 9, 2019

    As I wrote, I don't think that we should fix the code but the failed uvloop test is slightly weird instead.

    Maybe I'm wrong, I need to communicate with Yuri anyway.

    @ambv
    Copy link
    Contributor

    ambv commented Feb 25, 2020

    Downgrading priority on this one.

    @1st1
    Copy link
    Member

    1st1 commented Feb 25, 2020

    I very doubt if any sane code is organizing like this test: start delayed reading, cancel it and read again.

    Hm, cancellation should work correctly no matter how "sane" or "insane" the user code is.

    The worse, neither previous not current sock_read() implementation doesn't prevent the concurrent reading which basically delivers data in an unpredictable order.

    But we're not discussing using a socket concurrently -- asyncio explicitly does not support that for the sock_ api.

    AFAICT this issue is about consequent cancel operation not working as expected in asyncio, no?

    @asvetlov
    Copy link
    Contributor

    This is a very interesting question.
    In asyncio, task cancellation is not executed on task.cancel() call immediately but the task is marked for cancellation.
    The actual cancellation requires a context switch (await asyncio.sleep(0) or similar) to propagate asyncio.CancelledError into the task code and unwind the exception.

    As I said, the test needs a context switch between cancel() call and analyzing the state after the task cancellation.

    Futures are canceled immediately, that's why the previous implementation passed uvloop's test suite untouched. Unfortunately, it had own flaws.
    Also please note, sock_connect()/sock_accept() were implemented this way about two years before my changes, they also suffer from cancel-without-actual-cancellation problems as well.

    @fantix
    Copy link
    Mannequin

    fantix mannequin commented May 9, 2020

    Following the discussion - looks like add_reader() will cancel the previous handle on the same fd if found, so I think it is reasonable to solve this cancellation-race issue by checking if the handle is cancelled as shown in the attached patch (tests in uvloop passed with this patch). WDTY?

    @achimnol
    Copy link
    Mannequin

    achimnol mannequin commented May 12, 2020

    I just encountered this issue when doing "sys.exit(1)" on a Click-based CLI program that internally uses asyncio event loop via wrapped via a context manager, on Python 3.8.2.

    Using uvloop or adding "time.sleep(0.1)" before "sys.exit(1)" removes the error.

    @achimnol
    Copy link
    Mannequin

    achimnol mannequin commented May 12, 2020

    And I suspect that this issue is something simliar to what I did in a recent janus PR:
    https://github.com/aio-libs/janus/blob/ec8592b91254971473b508313fb91b01623f13d7/janus/__init__.py#L84
    to give a chance for specific callbacks to execute via an extra context switch.

    @1st1
    Copy link
    Member

    1st1 commented May 27, 2020

    New changeset 210a137 by Fantix King in branch 'master':
    bpo-30064: Fix asyncio loop.sock_* race condition issue (bpo-20369)
    210a137

    @miss-islington
    Copy link
    Contributor

    New changeset 3a2667d by Miss Islington (bot) in branch '3.9':
    bpo-30064: Fix asyncio loop.sock_* race condition issue (GH-20369)
    3a2667d

    @vstinner
    Copy link
    Member

    The new test_sock_client_racing() test seems to have... a race condition...

    AMD64 FreeBSD Non-Debug 3.9:
    https://buildbot.python.org/all/#/builders/750/builds/11

    ERROR: test_sock_client_racing (test.test_asyncio.test_sock_lowlevel.KqueueEventLoopTests)
    ERROR: test_sock_client_racing (test.test_asyncio.test_sock_lowlevel.PollEventLoopTests)
    ERROR: test_sock_client_racing (test.test_asyncio.test_sock_lowlevel.SelectEventLoopTests)

    Example:

    ERROR: test_sock_client_racing (test.test_asyncio.test_sock_lowlevel.PollEventLoopTests)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/usr/home/buildbot/python/3.9.koobs-freebsd-9e36.nondebug/build/Lib/test/test_asyncio/test_sock_lowlevel.py", line 216, in _basetest_sock_connect_racing
        await self.loop.sock_connect(sock, addr)
      File "/usr/home/buildbot/python/3.9.koobs-freebsd-9e36.nondebug/build/Lib/asyncio/selector_events.py", line 500, in sock_connect
        return await fut
      File "/usr/home/buildbot/python/3.9.koobs-freebsd-9e36.nondebug/build/Lib/asyncio/selector_events.py", line 505, in _sock_connect
        sock.connect(address)
    ConnectionRefusedError: [Errno 61] Connection refused
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/usr/home/buildbot/python/3.9.koobs-freebsd-9e36.nondebug/build/Lib/test/test_asyncio/test_sock_lowlevel.py", line 253, in test_sock_client_racing
        self.loop.run_until_complete(asyncio.wait_for(
      File "/usr/home/buildbot/python/3.9.koobs-freebsd-9e36.nondebug/build/Lib/asyncio/base_events.py", line 642, in run_until_complete
        return future.result()
      File "/usr/home/buildbot/python/3.9.koobs-freebsd-9e36.nondebug/build/Lib/asyncio/tasks.py", line 496, in wait_for
        return fut.result()
      File "/usr/home/buildbot/python/3.9.koobs-freebsd-9e36.nondebug/build/Lib/test/test_asyncio/test_sock_lowlevel.py", line 219, in _basetest_sock_connect_racing
        await self.loop.sock_connect(sock, addr)
      File "/usr/home/buildbot/python/3.9.koobs-freebsd-9e36.nondebug/build/Lib/asyncio/selector_events.py", line 500, in sock_connect
        return await fut
      File "/usr/home/buildbot/python/3.9.koobs-freebsd-9e36.nondebug/build/Lib/asyncio/selector_events.py", line 505, in _sock_connect
        sock.connect(address)
    ConnectionRefusedError: [Errno 61] Connection refused

    @fantix
    Copy link
    Mannequin

    fantix mannequin commented May 27, 2020

    Ouch ... looks like FreeBSD also needs a few more retries than a single retry. I'll test on a FreeBSD and create a PR for that.

    @fantix
    Copy link
    Mannequin

    fantix mannequin commented May 28, 2020

    Not a simple one - FreeBSD is returning ECONNREFUSED immediately when trying to connect to a non-listening port on the loopback address:

    https://lists.freebsd.org/pipermail/freebsd-current/2005-May/049876.html

    Then all following connect attempts on the same socket return ECONNREFUSED regardless of whether the port is listening or not. I think I'll have to skip this test on FreeBSD.

    @aeros
    Copy link
    Contributor

    aeros commented May 28, 2020

    The new test_sock_client_racing() test seems to have... a race condition...

    I'm also a bit skeptical about relying on time.sleep(0.01) at the end of the loop in _basetest_sock_recv_racing(), but I don't have an idea for a better approach at the moment.

    @vstinner
    Copy link
    Member

    I created #20485 to skip the unstable tests until someone fix them. IMO it's better to skip the test than reverting the fix, since the fix does fix a race condition.

    The test is still failing on buildbots which makes analysis of buildbot failures more difficult. For example, a SQLite change got 3 different 4 different bugs on buildbots:
    #20448 (comment)

    See the "revert on fail" policy for buildbots:
    https://pythondev.readthedocs.io/ci.html#revert-on-fail

    I'm also a bit skeptical about relying on time.sleep(0.01) at the end of the loop in _basetest_sock_recv_racing()

    Yeah, please avoid sleep as a synchronization primitive:
    https://pythondev.readthedocs.io/unstable_tests.html#don-t-use-sleep-as-synchronization

    @vstinner
    Copy link
    Member

    New changeset 84ee7e1 by Victor Stinner in branch 'master':
    bpo-30064: Fix unstable asyncio "racing" socket tests (GH-20485)
    84ee7e1

    @miss-islington
    Copy link
    Contributor

    New changeset 1d82f00 by Miss Islington (bot) in branch '3.9':
    bpo-30064: Fix unstable asyncio "racing" socket tests (GH-20485)
    1d82f00

    @vstinner
    Copy link
    Member

    bpo-30064: Fix unstable asyncio "racing" socket tests (GH-20485)

    Sorry. You should read "Skip" tests, not "Fix" tests :-p

    @fantix
    Copy link
    Mannequin

    fantix mannequin commented May 28, 2020

    Thanks for the comments! Added PR 20494 to properly fix/skip the test for (hopefully) all platforms.

    @1st1
    Copy link
    Member

    1st1 commented May 28, 2020

    New changeset dc4eee9 by Fantix King in branch 'master':
    bpo-30064: Properly skip unstable loop.sock_connect() racing test (GH-20494)
    dc4eee9

    @miss-islington
    Copy link
    Contributor

    New changeset 6637bd4 by Miss Islington (bot) in branch '3.9':
    bpo-30064: Properly skip unstable loop.sock_connect() racing test (GH-20494)
    6637bd4

    @vstinner
    Copy link
    Member

    Another kind of failure on x86 Gentoo Installed with X 3.x:
    https://buildbot.python.org/all/#/builders/128/builds/1030

    ERROR: test_sock_client_racing (test.test_asyncio.test_sock_lowlevel.EPollEventLoopTests)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.installed/build/target/lib/python3.10/test/test_asyncio/test_sock_lowlevel.py", line 200, in _basetest_sock_send_racing
        await self.loop.sock_sendall(sock, b'world')
      File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.installed/build/target/lib/python3.10/asyncio/selector_events.py", line 460, in sock_sendall
        return await fut
    asyncio.exceptions.CancelledError
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.installed/build/target/lib/python3.10/asyncio/tasks.py", line 507, in wait_for
        fut.result()
    asyncio.exceptions.CancelledError
    
    The above exception was the direct cause of the following exception:
    
    Traceback (most recent call last):
      File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.installed/build/target/lib/python3.10/test/test_asyncio/test_sock_lowlevel.py", line 256, in test_sock_client_racing
        self.loop.run_until_complete(asyncio.wait_for(
      File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.installed/build/target/lib/python3.10/asyncio/base_events.py", line 642, in run_until_complete
        return future.result()
      File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.installed/build/target/lib/python3.10/asyncio/tasks.py", line 509, in wait_for
        raise exceptions.TimeoutError() from exc
    asyncio.exceptions.TimeoutError

    @fantix
    Copy link
    Mannequin

    fantix mannequin commented Jun 10, 2020

    I'm checking now

    @fantix
    Copy link
    Mannequin

    fantix mannequin commented Jun 13, 2020

    I've been testing on local VMs with Gentoo and Ubuntu, but the issue cannot be reproduced. The error log is not really revealing anything useful, other than that the send was blocked when it was supposed to be successful. As it is not constantly failing on buildbot (https://buildbot.python.org/all/#/builders/128), I don't know if it is worth it to add some more debug code to this test in master and wait for the next hang. Please kindly advise.

    @aeros
    Copy link
    Contributor

    aeros commented Jun 14, 2020

    I was able to replicate the failure locally by running the large number of jobs in parallel. We typically use this to test for tricky race conditions, to simulate maximum load:

    [aeros:~/repos/cpython]$ ./python -m test test_asyncio.test_sock_lowlevel --match test_sock_client_racing -j100 -F -v
    # [snip]
    ======================================================================
    ERROR: test_sock_client_racing (test.test_asyncio.test_sock_lowlevel.EPollEventLoopTests)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/home/aeros/repos/cpython/Lib/test/test_asyncio/test_sock_lowlevel.py", line 200, in _basetest_sock_send_racing
        await self.loop.sock_sendall(sock, b'world')
      File "/home/aeros/repos/cpython/Lib/asyncio/selector_events.py", line 460, in sock_sendall
        return await fut
    asyncio.exceptions.CancelledError
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/home/aeros/repos/cpython/Lib/asyncio/tasks.py", line 507, in wait_for
        fut.result()
    asyncio.exceptions.CancelledError
    
    The above exception was the direct cause of the following exception:
    
    Traceback (most recent call last):
      File "/home/aeros/repos/cpython/Lib/test/test_asyncio/test_sock_lowlevel.py", line 256, in test_sock_client_racing
        self.loop.run_until_complete(asyncio.wait_for(
      File "/home/aeros/repos/cpython/Lib/asyncio/base_events.py", line 642, in run_until_complete
        return future.result()
      File "/home/aeros/repos/cpython/Lib/asyncio/tasks.py", line 509, in wait_for
        raise exceptions.TimeoutError() from exc
    asyncio.exceptions.TimeoutError

    ======================================================================
    ERROR: test_sock_client_racing (test.test_asyncio.test_sock_lowlevel.PollEventLoopTests)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/home/aeros/repos/cpython/Lib/test/test_asyncio/test_sock_lowlevel.py", line 200, in _basetest_sock_send_racing
        await self.loop.sock_sendall(sock, b'world')
      File "/home/aeros/repos/cpython/Lib/asyncio/selector_events.py", line 460, in sock_sendall
        return await fut
    asyncio.exceptions.CancelledError
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/home/aeros/repos/cpython/Lib/asyncio/tasks.py", line 507, in wait_for
        fut.result()
    asyncio.exceptions.CancelledError
    
    The above exception was the direct cause of the following exception:
    
    Traceback (most recent call last):
      File "/home/aeros/repos/cpython/Lib/test/test_asyncio/test_sock_lowlevel.py", line 256, in test_sock_client_racing
        self.loop.run_until_complete(asyncio.wait_for(
      File "/home/aeros/repos/cpython/Lib/asyncio/base_events.py", line 642, in run_until_complete
        return future.result()
      File "/home/aeros/repos/cpython/Lib/asyncio/tasks.py", line 509, in wait_for
        raise exceptions.TimeoutError() from exc
    asyncio.exceptions.TimeoutError

    ======================================================================
    ERROR: test_sock_client_racing (test.test_asyncio.test_sock_lowlevel.SelectEventLoopTests)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/home/aeros/repos/cpython/Lib/test/test_asyncio/test_sock_lowlevel.py", line 256, in test_sock_client_racing
        self.loop.run_until_complete(asyncio.wait_for(
      File "/home/aeros/repos/cpython/Lib/asyncio/base_events.py", line 629, in run_until_complete
        self.run_forever()
      File "/home/aeros/repos/cpython/Lib/asyncio/base_events.py", line 596, in run_forever
        self._run_once()
      File "/home/aeros/repos/cpython/Lib/asyncio/base_events.py", line 1854, in _run_once
        event_list = self._selector.select(timeout)
      File "/home/aeros/repos/cpython/Lib/selectors.py", line 323, in select
        r, w, _ = self._select(self._readers, self._writers, [], timeout)
    OSError: [Errno 9] Bad file descriptor

    I'll try to add some debug lugs to find the source of the failure(s). This one might be a bit tricky to assess though.

    @fantix I'd recommend trying to run the local test with the above command, and increasing the parallel job count until it occurs. If it takes more than a minute or two for the tests to run, slowly lower the number. At the moment, the failure has only occurred once in the buildbots, but we might need to temporarily disable it if it occurs repeatedly or find some form of workaround (as it will potentially mask other bugs).

    @aeros
    Copy link
    Contributor

    aeros commented Jun 14, 2020

    I believe that I might have identified a simple fix for the issue, but first I'd like to clarify on something:

    What exactly is the reason that we need to use asyncio.wait_for() with a timeout of 10 seconds in these tests? We typically avoid using short duration timeouts on individual tests, largely because we don't want them to fail based on the performance of the device they're ran on. This can be a concern for slower devices running the test suite w/ parallel jobs.

    If asyncio.wait_for() is necessary in this case, I'd advise increasing the timeout substantially. I didn't experiment with a wide variety of numbers, but increasing the duration from 10 to 60 seconds seems to have fixed the failure locally, so that might be a decent starting point. If it's not necessary, I'd prefer to simply remove the asyncio.wait_for()s in test_sock_client_racing.

    @fantix
    Copy link
    Mannequin

    fantix mannequin commented Jun 14, 2020

    Oh thank you for the clue, that's super helpful! Let me try locally.

    This fix PR was originally done to solve a hanging call in a race condition, so the test would hang forever if the fix is not working as expected. The 10 seconds wait_for() is to make sure that we don't end up with a test hanging forever.

    If enlarging the timeout works in some of your cases, that really inspires me on a potential cause of the issue in the test. To simulate the race condition specifically for loop.sock_sendall(), I needed a connected socket with a full send buffer so that the next sock_sendall() call would block. To achieve this, I wrote something like this:

        with self.assertRaises(BlockingIOError):
            while True:
                sock.send(b' ' * 5)

    And another loop in a subtask to consume the data:

        while not data:
            data = await self.loop.sock_recv(server, 1024)
            data = data.strip()

    It might be that it cost too much time in these loops (especially the 2nd one because the logs showed that it was blocked on calling sock_sendall()). But that's just a guess yet, let me add some more debug logs and test locally.

    @aeros
    Copy link
    Contributor

    aeros commented Jun 14, 2020

    No problem at all, and thanks for clarifying.

    If enlarging the timeout works in some of your cases, that really inspires me on a potential cause of the issue in the test. To simulate the race condition specifically for loop.sock_sendall(), I needed a connected socket with a full send buffer so that the next sock_sendall() call would block. To achieve this, I wrote something like this:

    It might be that it cost too much time in these loops (especially the 2nd one because the logs showed that it was blocked on calling sock_sendall()). But that's just a guess yet, let me add some more debug logs and test locally.

    That seems reasonable to me and would make the most sense at a glance. If this is confirmed to be the source of the issue, we should be able to just reduce the size of the socket's send buffer via SO_SNDBUF before sending to it so that it fills up much more quickly.

    @fantix
    Copy link
    Mannequin

    fantix mannequin commented Jun 14, 2020

    OK I think I solved the problem by improving the two loops using smarter buffer sizes:

                # fill the buffer until sending 5 chars would block
                size = 8192
                while size > 5:
                    with self.assertRaises(BlockingIOError):
                        while True:
                            sock.send(b' ' * size)
                    size = int(size / 2)

    and

                # receive everything that is not a space
                async def recv_all():
                    rv = b''
                    while True:
                        buf = await self.loop.sock_recv(server, 8192)
                        if not buf:
                            return rv
                        rv += buf.strip()

    This test is now running ~25x faster, and the load test crashed my laptop a few times before it hits a timeout.

    Last question before PR pls: given the fact that this test is to cover a fixed case when loop.sock_*() was hanging forever, should I keep the wait_for(..., timeout=10)?

    @fantix
    Copy link
    Mannequin

    fantix mannequin commented Jun 14, 2020

    and yes, updated the SO_SNDBUF too

    @aeros
    Copy link
    Contributor

    aeros commented Jun 14, 2020

    Last question before PR pls: given the fact that this test is to cover a fixed case when loop.sock_*() was hanging forever, should I keep the wait_for(..., timeout=10)?

    For now, I think we can keep the asyncio.wait_for() w/ 10 the sec timeout. I'll have to run the tests locally though w/ the patch to double check (once the PR is opened), and it may still end up needing to be increased if the test fails again in the future. It should be okay though.

    @miss-islington
    Copy link
    Contributor

    New changeset 8f04a84 by Fantix King in branch 'master':
    bpo-30064: Fix slow asyncio sock test (GH-20868)
    8f04a84

    @miss-islington
    Copy link
    Contributor

    New changeset 83cd968 by Miss Islington (bot) in branch '3.9':
    bpo-30064: Fix slow asyncio sock test (GH-20868)
    83cd968

    @aeros
    Copy link
    Contributor

    aeros commented Jun 14, 2020

    Thanks again for working on this @fantix, and for the continued vigilance on the issue after the test failures occurred in the buildbots. I think this recent PR-20868 will do the trick. At the very least, it will make the failure much more rare than before, which is definitely an improvement.

    @aeros aeros added 3.9 only security fixes 3.10 only security fixes and removed 3.8 only security fixes labels Jun 14, 2020
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.9 only security fixes 3.10 only security fixes topic-asyncio type-bug An unexpected behavior, bug, or error
    Projects
    Status: Done
    Development

    No branches or pull requests

    8 participants