classification
Title: BaseSelectorEventLoop.sock_{recv,sendall}() don't remove their callbacks when canceled
Type: behavior Stage: commit review
Components: asyncio Versions: Python 3.10, Python 3.9
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: asvetlov Nosy List: abacabadabacaba, achimnol, aeros, asvetlov, fantix, lukasz.langa, miss-islington, ned.deily, yselivanov
Priority: high Keywords: patch

Created on 2017-04-13 08:55 by abacabadabacaba, last changed 2020-06-15 16:37 by vstinner.

Files
File name Uploaded Description Edit
cancelled-sock-recv-race.patch fantix, 2020-05-09 22:52
Pull Requests
URL Status Linked Edit
PR 10419 merged asvetlov, 2018-11-08 22:41
PR 20369 merged fantix, 2020-05-24 22:58
PR 20460 merged miss-islington, 2020-05-27 19:47
PR 20485 merged vstinner, 2020-05-28 13:47
PR 20487 merged miss-islington, 2020-05-28 14:09
PR 20494 merged fantix, 2020-05-28 16:08
PR 20503 merged miss-islington, 2020-05-28 21:56
PR 20868 merged fantix, 2020-06-14 07:05
PR 20869 merged miss-islington, 2020-06-14 07:49
Messages (41)
msg291593 - (view) Author: Evgeny Kapun (abacabadabacaba) Date: 2017-04-13 08:55
Code:

    import asyncio as a
    import socket as s

    @a.coroutine
    def coro():
        s1, s2 = s.socketpair()
        s1.setblocking(False)
        s2.setblocking(False)
        try:
            yield from a.wait_for(loop.sock_recv(s2, 1), 1)
        except a.TimeoutError:
            pass
        yield from loop.sock_sendall(s1, b'\x00')
        yield
        s1.close()
        s2.close()

    loop = a.get_event_loop()
    loop.run_until_complete(coro())

Result:

    Exception in callback BaseSelectorEventLoop._sock_recv(<Future cancelled>, True, <socket.socke...2049, proto=0>, 1)
    handle: <Handle BaseSelectorEventLoop._sock_recv(<Future cancelled>, True, <socket.socke...2049, proto=0>, 1)>
    Traceback (most recent call last):
      File "/usr/lib/python3.6/asyncio/events.py", line 127, in _run
        self._callback(*self._args)
      File "/usr/lib/python3.6/asyncio/selector_events.py", line 378, in _sock_recv
        self.remove_reader(fd)
      File "/usr/lib/python3.6/asyncio/selector_events.py", line 342, in remove_reader
        return self._remove_reader(fd)
      File "/usr/lib/python3.6/asyncio/selector_events.py", line 279, in _remove_reader
        key = self._selector.get_key(fd)
      File "/usr/lib/python3.6/selectors.py", line 189, in get_key
        return mapping[fileobj]
      File "/usr/lib/python3.6/selectors.py", line 70, in __getitem__
        fd = self._selector._fileobj_lookup(fileobj)
      File "/usr/lib/python3.6/selectors.py", line 224, in _fileobj_lookup
        return _fileobj_to_fd(fileobj)
      File "/usr/lib/python3.6/selectors.py", line 41, in _fileobj_to_fd
        raise ValueError("Invalid file descriptor: {}".format(fd))
    ValueError: Invalid file descriptor: -1
msg326294 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2018-09-24 21:57
See also https://bugs.python.org/issue34795
msg329746 - (view) Author: Andrew Svetlov (asvetlov) * (Python committer) Date: 2018-11-12 17:00
New changeset 74387926072abf338a4c1cec1bf0501fc65bbee5 by Andrew Svetlov in branch 'master':
bpo-30064: Refactor sock_* asyncio API (#10419)
https://github.com/python/cpython/commit/74387926072abf338a4c1cec1bf0501fc65bbee5
msg355272 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2019-10-23 22:42
I think this broke asyncio a bit.  These two uvloop sock_* API tests no longer pass on asyncio 3.8:

* https://github.com/MagicStack/uvloop/blob/master/tests/test_sockets.py#L192

* https://github.com/MagicStack/uvloop/blob/master/tests/test_sockets.py#L238

Andrew, since this was your patch, please take a look.
msg355273 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2019-10-23 22:58
I'll elevate the status so that we don't forget before 3.8.1 is too close
msg357989 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2019-12-07 20:48
> I'll elevate the status so that we don't forget before 3.8.1 is too close

Andrew, Yury: ping. 3.8.1 cutoff is approaching.
msg358011 - (view) Author: Andrew Svetlov (asvetlov) * (Python committer) Date: 2019-12-08 16:37
Not sure if failed uvloop tests are correct.
The scenario is like the following:
1. Suppose we have an unblocking socket connected to peer.
2. Create a task for reading data: 
  task = asyncio.create_task(loop.sock_read(sock, 1))
 Note, the task is not started yet.
3. Just after it, cancel the task without waiting for the actual cancellation
  task.cancel()
4. Finally, read from the socket again:
  data = await loop.sock_read(sock, 1)

If I put context switch (await asyncio.sleep(0)) between 3) and 4) *or* replace direct sock_read() call in 4) with creation a task (data = await asyncio.create_task(loop.sock_read(sock, 1))) the cancellation of former read is performed and test passes.

I very doubt if any sane code is organizing like this test: start delayed reading, cancel it and read again.

The worse, neither previous not current sock_read() implementation doesn't prevent the concurrent reading which basically delivers data in an unpredictable order. Honestly, I'm not sure if we need to provide the concurrency guarantee for such low-level functions.  The code built on top of these low-level primitives should handle the cuncurrent access problem IMHO.
msg358098 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2019-12-09 14:19
Note: this is going to miss Python 3.8.1 as I'm releasing 3.8.1rc1 right now.  Please address this before 3.8.2 (due in February).
msg358108 - (view) Author: Andrew Svetlov (asvetlov) * (Python committer) Date: 2019-12-09 14:59
As I wrote, I don't think that we should fix the code but the failed uvloop test is slightly weird instead.

Maybe I'm wrong, I need to communicate with Yuri anyway.
msg362637 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2020-02-25 12:10
Downgrading priority on this one.
msg362671 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2020-02-25 23:38
> I very doubt if any sane code is organizing like this test: start delayed reading, cancel it and read again.

Hm, cancellation should work correctly no matter how "sane" or "insane" the user code is.

> The worse, neither previous not current sock_read() implementation doesn't prevent the concurrent reading which basically delivers data in an unpredictable order.

But we're not discussing using a socket concurrently -- asyncio explicitly does not support that for the sock_ api. 

AFAICT this issue is about consequent cancel operation not working as expected in asyncio, no?
msg362690 - (view) Author: Andrew Svetlov (asvetlov) * (Python committer) Date: 2020-02-26 12:39
This is a very interesting question.
In asyncio, task cancellation is not executed on `task.cancel()` call immediately but the task is *marked* for cancellation. 
The actual cancellation requires a context switch (await asyncio.sleep(0) or similar) to propagate `asyncio.CancelledError` into the task code and unwind the exception.

As I said, the test needs a context switch between `cancel()` call and analyzing the state after the task cancellation.

Futures are canceled immediately, that's why the previous implementation passed uvloop's test suite untouched. Unfortunately, it had own flaws.
Also please note, sock_connect()/sock_accept() were implemented this way about two years before my changes, they also suffer from cancel-without-actual-cancellation problems as well.
msg368553 - (view) Author: Fantix King (fantix) * Date: 2020-05-09 22:52
Following the discussion - looks like `add_reader()` will cancel the previous `handle` on the same `fd` if found, so I think it is reasonable to solve this cancellation-race issue by checking if the `handle` is cancelled as shown in the attached patch (tests in uvloop passed with this patch). WDTY?
msg368701 - (view) Author: Joongi Kim (achimnol) * Date: 2020-05-12 05:13
I just encountered this issue when doing "sys.exit(1)" on a Click-based CLI program that internally uses asyncio event loop via wrapped via a context manager, on Python 3.8.2.

Using uvloop or adding "time.sleep(0.1)" before "sys.exit(1)" removes the error.
msg368702 - (view) Author: Joongi Kim (achimnol) * Date: 2020-05-12 05:17
And I suspect that this issue is something simliar to what I did in a recent janus PR:
https://github.com/aio-libs/janus/blob/ec8592b91254971473b508313fb91b01623f13d7/janus/__init__.py#L84
to give a chance for specific callbacks to execute via an extra context switch.
msg370123 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2020-05-27 19:47
New changeset 210a137396979d747c2602eeef46c34fc4955448 by Fantix King in branch 'master':
bpo-30064: Fix asyncio loop.sock_* race condition issue (#20369)
https://github.com/python/cpython/commit/210a137396979d747c2602eeef46c34fc4955448
msg370131 - (view) Author: miss-islington (miss-islington) Date: 2020-05-27 20:39
New changeset 3a2667d91e33493ccde113ddf5092afefc3c89fa by Miss Islington (bot) in branch '3.9':
bpo-30064: Fix asyncio loop.sock_* race condition issue (GH-20369)
https://github.com/python/cpython/commit/3a2667d91e33493ccde113ddf5092afefc3c89fa
msg370143 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-05-27 22:17
The new test_sock_client_racing() test seems to have... a race condition...

AMD64 FreeBSD Non-Debug 3.9:
https://buildbot.python.org/all/#/builders/750/builds/11

ERROR: test_sock_client_racing (test.test_asyncio.test_sock_lowlevel.KqueueEventLoopTests)
ERROR: test_sock_client_racing (test.test_asyncio.test_sock_lowlevel.PollEventLoopTests)
ERROR: test_sock_client_racing (test.test_asyncio.test_sock_lowlevel.SelectEventLoopTests)

Example:

ERROR: test_sock_client_racing (test.test_asyncio.test_sock_lowlevel.PollEventLoopTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/home/buildbot/python/3.9.koobs-freebsd-9e36.nondebug/build/Lib/test/test_asyncio/test_sock_lowlevel.py", line 216, in _basetest_sock_connect_racing
    await self.loop.sock_connect(sock, addr)
  File "/usr/home/buildbot/python/3.9.koobs-freebsd-9e36.nondebug/build/Lib/asyncio/selector_events.py", line 500, in sock_connect
    return await fut
  File "/usr/home/buildbot/python/3.9.koobs-freebsd-9e36.nondebug/build/Lib/asyncio/selector_events.py", line 505, in _sock_connect
    sock.connect(address)
ConnectionRefusedError: [Errno 61] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/home/buildbot/python/3.9.koobs-freebsd-9e36.nondebug/build/Lib/test/test_asyncio/test_sock_lowlevel.py", line 253, in test_sock_client_racing
    self.loop.run_until_complete(asyncio.wait_for(
  File "/usr/home/buildbot/python/3.9.koobs-freebsd-9e36.nondebug/build/Lib/asyncio/base_events.py", line 642, in run_until_complete
    return future.result()
  File "/usr/home/buildbot/python/3.9.koobs-freebsd-9e36.nondebug/build/Lib/asyncio/tasks.py", line 496, in wait_for
    return fut.result()
  File "/usr/home/buildbot/python/3.9.koobs-freebsd-9e36.nondebug/build/Lib/test/test_asyncio/test_sock_lowlevel.py", line 219, in _basetest_sock_connect_racing
    await self.loop.sock_connect(sock, addr)
  File "/usr/home/buildbot/python/3.9.koobs-freebsd-9e36.nondebug/build/Lib/asyncio/selector_events.py", line 500, in sock_connect
    return await fut
  File "/usr/home/buildbot/python/3.9.koobs-freebsd-9e36.nondebug/build/Lib/asyncio/selector_events.py", line 505, in _sock_connect
    sock.connect(address)
ConnectionRefusedError: [Errno 61] Connection refused
msg370147 - (view) Author: Fantix King (fantix) * Date: 2020-05-27 22:51
Ouch ... looks like FreeBSD also needs a few more retries than a single retry. I'll test on a FreeBSD and create a PR for that.
msg370159 - (view) Author: Fantix King (fantix) * Date: 2020-05-28 00:50
Not a simple one - FreeBSD is returning ECONNREFUSED immediately when trying to connect to a non-listening port on the loopback address:

https://lists.freebsd.org/pipermail/freebsd-current/2005-May/049876.html

Then all following connect attempts on the same socket return ECONNREFUSED regardless of whether the port is listening or not. I think I'll have to skip this test on FreeBSD.
msg370161 - (view) Author: Kyle Stanley (aeros) * (Python committer) Date: 2020-05-28 02:21
> The new test_sock_client_racing() test seems to have... a race condition...

I'm also a bit skeptical about relying on `time.sleep(0.01)` at the end of the loop in `_basetest_sock_recv_racing()`, but I don't have an idea for a better approach at the moment.
msg370205 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-05-28 13:48
I created https://github.com/python/cpython/pull/20485 to skip the unstable tests until someone fix them. IMO it's better to skip the test than reverting the fix, since the fix does fix a race condition.

The test is still failing on buildbots which makes analysis of buildbot failures more difficult. For example, a SQLite change got 3 different 4 different bugs on buildbots:
https://github.com/python/cpython/pull/20448#issuecomment-635350054

See the "revert on fail" policy for buildbots:
https://pythondev.readthedocs.io/ci.html#revert-on-fail


> I'm also a bit skeptical about relying on `time.sleep(0.01)` at the end of the loop in `_basetest_sock_recv_racing()`

Yeah, please avoid sleep as a synchronization primitive:
https://pythondev.readthedocs.io/unstable_tests.html#don-t-use-sleep-as-synchronization
msg370215 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-05-28 14:08
New changeset 84ee7e1573d166fe7a9be676813e12523b62ab24 by Victor Stinner in branch 'master':
bpo-30064: Fix unstable asyncio "racing" socket tests (GH-20485)
https://github.com/python/cpython/commit/84ee7e1573d166fe7a9be676813e12523b62ab24
msg370218 - (view) Author: miss-islington (miss-islington) Date: 2020-05-28 14:28
New changeset 1d82f003678816ff8dd822452ec91669844d2d09 by Miss Islington (bot) in branch '3.9':
bpo-30064: Fix unstable asyncio "racing" socket tests (GH-20485)
https://github.com/python/cpython/commit/1d82f003678816ff8dd822452ec91669844d2d09
msg370219 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-05-28 14:29
> bpo-30064: Fix unstable asyncio "racing" socket tests (GH-20485)

Sorry. You should read "Skip" tests, not "Fix" tests :-p
msg370234 - (view) Author: Fantix King (fantix) * Date: 2020-05-28 16:11
Thanks for the comments! Added PR 20494 to properly fix/skip the test for (hopefully) all platforms.
msg370268 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2020-05-28 21:56
New changeset dc4eee9e266267498a6b783a0abccc23c06f2b87 by Fantix King in branch 'master':
bpo-30064: Properly skip unstable loop.sock_connect() racing test (GH-20494)
https://github.com/python/cpython/commit/dc4eee9e266267498a6b783a0abccc23c06f2b87
msg370269 - (view) Author: miss-islington (miss-islington) Date: 2020-05-28 22:17
New changeset 6637bd45163024f2187e40d2cc12c473f78bf5da by Miss Islington (bot) in branch '3.9':
bpo-30064: Properly skip unstable loop.sock_connect() racing test (GH-20494)
https://github.com/python/cpython/commit/6637bd45163024f2187e40d2cc12c473f78bf5da
msg371240 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-06-10 20:14
Another kind of failure on x86 Gentoo Installed with X 3.x:
https://buildbot.python.org/all/#/builders/128/builds/1030

ERROR: test_sock_client_racing (test.test_asyncio.test_sock_lowlevel.EPollEventLoopTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.installed/build/target/lib/python3.10/test/test_asyncio/test_sock_lowlevel.py", line 200, in _basetest_sock_send_racing
    await self.loop.sock_sendall(sock, b'world')
  File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.installed/build/target/lib/python3.10/asyncio/selector_events.py", line 460, in sock_sendall
    return await fut
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.installed/build/target/lib/python3.10/asyncio/tasks.py", line 507, in wait_for
    fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.installed/build/target/lib/python3.10/test/test_asyncio/test_sock_lowlevel.py", line 256, in test_sock_client_racing
    self.loop.run_until_complete(asyncio.wait_for(
  File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.installed/build/target/lib/python3.10/asyncio/base_events.py", line 642, in run_until_complete
    return future.result()
  File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.installed/build/target/lib/python3.10/asyncio/tasks.py", line 509, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
msg371247 - (view) Author: Fantix King (fantix) * Date: 2020-06-10 20:40
I'm checking now
msg371467 - (view) Author: Fantix King (fantix) * Date: 2020-06-13 19:38
I've been testing on local VMs with Gentoo and Ubuntu, but the issue cannot be reproduced. The error log is not really revealing anything useful, other than that the send was blocked when it was supposed to be successful. As it is not constantly failing on buildbot (https://buildbot.python.org/all/#/builders/128), I don't know if it is worth it to add some more debug code to this test in master and wait for the next hang. Please kindly advise.
msg371478 - (view) Author: Kyle Stanley (aeros) * (Python committer) Date: 2020-06-14 00:17
I was able to replicate the failure locally by running the large number of jobs in parallel. We typically use this to test for tricky race conditions, to simulate maximum load:

[aeros:~/repos/cpython]$ ./python -m test test_asyncio.test_sock_lowlevel --match test_sock_client_racing -j100 -F -v
# [snip]
======================================================================
ERROR: test_sock_client_racing (test.test_asyncio.test_sock_lowlevel.EPollEventLoopTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/aeros/repos/cpython/Lib/test/test_asyncio/test_sock_lowlevel.py", line 200, in _basetest_sock_send_racing
    await self.loop.sock_sendall(sock, b'world')
  File "/home/aeros/repos/cpython/Lib/asyncio/selector_events.py", line 460, in sock_sendall
    return await fut
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/aeros/repos/cpython/Lib/asyncio/tasks.py", line 507, in wait_for
    fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/aeros/repos/cpython/Lib/test/test_asyncio/test_sock_lowlevel.py", line 256, in test_sock_client_racing
    self.loop.run_until_complete(asyncio.wait_for(
  File "/home/aeros/repos/cpython/Lib/asyncio/base_events.py", line 642, in run_until_complete
    return future.result()
  File "/home/aeros/repos/cpython/Lib/asyncio/tasks.py", line 509, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

======================================================================
ERROR: test_sock_client_racing (test.test_asyncio.test_sock_lowlevel.PollEventLoopTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/aeros/repos/cpython/Lib/test/test_asyncio/test_sock_lowlevel.py", line 200, in _basetest_sock_send_racing
    await self.loop.sock_sendall(sock, b'world')
  File "/home/aeros/repos/cpython/Lib/asyncio/selector_events.py", line 460, in sock_sendall
    return await fut
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/aeros/repos/cpython/Lib/asyncio/tasks.py", line 507, in wait_for
    fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/aeros/repos/cpython/Lib/test/test_asyncio/test_sock_lowlevel.py", line 256, in test_sock_client_racing
    self.loop.run_until_complete(asyncio.wait_for(
  File "/home/aeros/repos/cpython/Lib/asyncio/base_events.py", line 642, in run_until_complete
    return future.result()
  File "/home/aeros/repos/cpython/Lib/asyncio/tasks.py", line 509, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

======================================================================
ERROR: test_sock_client_racing (test.test_asyncio.test_sock_lowlevel.SelectEventLoopTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/aeros/repos/cpython/Lib/test/test_asyncio/test_sock_lowlevel.py", line 256, in test_sock_client_racing
    self.loop.run_until_complete(asyncio.wait_for(
  File "/home/aeros/repos/cpython/Lib/asyncio/base_events.py", line 629, in run_until_complete
    self.run_forever()
  File "/home/aeros/repos/cpython/Lib/asyncio/base_events.py", line 596, in run_forever
    self._run_once()
  File "/home/aeros/repos/cpython/Lib/asyncio/base_events.py", line 1854, in _run_once
    event_list = self._selector.select(timeout)
  File "/home/aeros/repos/cpython/Lib/selectors.py", line 323, in select
    r, w, _ = self._select(self._readers, self._writers, [], timeout)
OSError: [Errno 9] Bad file descriptor

I'll try to add some debug lugs to find the source of the failure(s). This one might be a bit tricky to assess though.

@Fantix I'd recommend trying to run the local test with the above command, and increasing the parallel job count until it occurs. If it takes more than a minute or two for the tests to run, slowly lower the number. At the moment, the failure has only occurred once in the buildbots, but we might need to temporarily disable it if it occurs repeatedly or find some form of workaround (as it will potentially mask other bugs).
msg371479 - (view) Author: Kyle Stanley (aeros) * (Python committer) Date: 2020-06-14 01:14
I believe that I might have identified a simple fix for the issue, but first I'd like to clarify on something:

What exactly is the reason that we need to use asyncio.wait_for() with a timeout of 10 seconds in these tests? We typically avoid using short duration timeouts on individual tests, largely because we don't want them to fail based on the performance of the device they're ran on. This can be a concern for slower devices running the test suite w/ parallel jobs.

If asyncio.wait_for() is necessary in this case, I'd advise increasing the timeout substantially. I didn't experiment with a wide variety of numbers, but increasing the duration from 10 to 60 seconds seems to have fixed the failure locally, so that might be a decent starting point. If it's not necessary, I'd prefer to simply remove the asyncio.wait_for()s in test_sock_client_racing.
msg371485 - (view) Author: Fantix King (fantix) * Date: 2020-06-14 04:06
Oh thank you for the clue, that's super helpful! Let me try locally.

This fix PR was originally done to solve a hanging call in a race condition, so the test would hang forever if the fix is not working as expected. The 10 seconds wait_for() is to make sure that we don't end up with a test hanging forever.

If enlarging the timeout works in some of your cases, that really inspires me on a potential cause of the issue in the test. To simulate the race condition specifically for loop.sock_sendall(), I needed a connected socket with a full send buffer so that the next sock_sendall() call would block. To achieve this, I wrote something like this:

    with self.assertRaises(BlockingIOError):
        while True:
            sock.send(b' ' * 5)

And another loop in a subtask to consume the data:

    while not data:
        data = await self.loop.sock_recv(server, 1024)
        data = data.strip()

It might be that it cost too much time in these loops (especially the 2nd one because the logs showed that it was blocked on calling sock_sendall()). But that's just a guess yet, let me add some more debug logs and test locally.
msg371489 - (view) Author: Kyle Stanley (aeros) * (Python committer) Date: 2020-06-14 06:20
No problem at all, and thanks for clarifying.

> If enlarging the timeout works in some of your cases, that really inspires me on a potential cause of the issue in the test. To simulate the race condition specifically for loop.sock_sendall(), I needed a connected socket with a full send buffer so that the next sock_sendall() call would block. To achieve this, I wrote something like this:

> It might be that it cost too much time in these loops (especially the 2nd one because the logs showed that it was blocked on calling sock_sendall()). But that's just a guess yet, let me add some more debug logs and test locally.

That seems reasonable to me and would make the most sense at a glance. If this is confirmed to be the source of the issue, we should be able to just reduce the size of the socket's send buffer via SO_SNDBUF before sending to it so that it fills up much more quickly.
msg371490 - (view) Author: Fantix King (fantix) * Date: 2020-06-14 06:31
OK I think I solved the problem by improving the two loops using smarter buffer sizes:

            # fill the buffer until sending 5 chars would block
            size = 8192
            while size > 5:
                with self.assertRaises(BlockingIOError):
                    while True:
                        sock.send(b' ' * size)
                size = int(size / 2)

and

            # receive everything that is not a space
            async def recv_all():
                rv = b''
                while True:
                    buf = await self.loop.sock_recv(server, 8192)
                    if not buf:
                        return rv
                    rv += buf.strip()

This test is now running ~25x faster, and the load test crashed my laptop a few times before it hits a timeout.

Last question before PR pls: given the fact that this test is to cover a fixed case when loop.sock_*() was hanging forever, should I keep the wait_for(..., timeout=10)?
msg371491 - (view) Author: Fantix King (fantix) * Date: 2020-06-14 06:41
and yes, updated the SO_SNDBUF too
msg371492 - (view) Author: Kyle Stanley (aeros) * (Python committer) Date: 2020-06-14 06:45
> Last question before PR pls: given the fact that this test is to cover a fixed case when loop.sock_*() was hanging forever, should I keep the wait_for(..., timeout=10)?

For now, I think we can keep the asyncio.wait_for() w/ 10 the sec timeout. I'll have to run the tests locally though w/ the patch to double check (once the PR is opened), and it may still end up needing to be increased if the test fails again in the future. It should be okay though.
msg371495 - (view) Author: miss-islington (miss-islington) Date: 2020-06-14 07:44
New changeset 8f04a84755babe516ebb5304904ea7c15b865c80 by Fantix King in branch 'master':
bpo-30064: Fix slow asyncio sock test (GH-20868)
https://github.com/python/cpython/commit/8f04a84755babe516ebb5304904ea7c15b865c80
msg371498 - (view) Author: miss-islington (miss-islington) Date: 2020-06-14 08:08
New changeset 83cd968f5cc59ad60b665080087f50fefac17a07 by Miss Islington (bot) in branch '3.9':
bpo-30064: Fix slow asyncio sock test (GH-20868)
https://github.com/python/cpython/commit/83cd968f5cc59ad60b665080087f50fefac17a07
msg371499 - (view) Author: Kyle Stanley (aeros) * (Python committer) Date: 2020-06-14 08:13
Thanks again for working on this @Fantix, and for the continued vigilance on the issue after the test failures occurred in the buildbots. I think this recent PR-20868 will do the trick. At the very least, it will make the failure much more rare than before, which is definitely an improvement.
History
Date User Action Args
2020-06-15 16:37:16vstinnersetnosy: - vstinner
2020-06-14 08:13:38aerossetstage: patch review -> commit review
messages: + msg371499
versions: + Python 3.9, Python 3.10, - Python 3.8
2020-06-14 08:08:29miss-islingtonsetmessages: + msg371498
2020-06-14 07:49:04miss-islingtonsetpull_requests: + pull_request20058
2020-06-14 07:44:10miss-islingtonsetmessages: + msg371495
2020-06-14 07:05:37fantixsetpull_requests: + pull_request20057
2020-06-14 06:45:04aerossetmessages: + msg371492
2020-06-14 06:41:45fantixsetmessages: + msg371491
2020-06-14 06:31:21fantixsetmessages: + msg371490
2020-06-14 06:20:47aerossetmessages: + msg371489
2020-06-14 04:06:58fantixsetmessages: + msg371485
2020-06-14 01:14:40aerossetmessages: + msg371479
2020-06-14 00:17:45aerossetmessages: + msg371478
2020-06-13 19:38:46fantixsetmessages: + msg371467
2020-06-10 20:40:42fantixsetmessages: + msg371247
2020-06-10 20:14:49vstinnersetmessages: + msg371240
2020-05-28 22:17:40miss-islingtonsetmessages: + msg370269
2020-05-28 21:56:57miss-islingtonsetpull_requests: + pull_request19749
2020-05-28 21:56:48yselivanovsetmessages: + msg370268
2020-05-28 16:11:10fantixsetmessages: + msg370234
2020-05-28 16:08:06fantixsetpull_requests: + pull_request19743
2020-05-28 14:29:04vstinnersetmessages: + msg370219
2020-05-28 14:28:23miss-islingtonsetmessages: + msg370218
2020-05-28 14:09:09miss-islingtonsetpull_requests: + pull_request19736
2020-05-28 14:08:57vstinnersetmessages: + msg370215
2020-05-28 13:48:52vstinnersetmessages: + msg370205
2020-05-28 13:47:27vstinnersetpull_requests: + pull_request19734
2020-05-28 02:21:40aerossetmessages: + msg370161
2020-05-28 00:50:44fantixsetmessages: + msg370159
2020-05-27 22:51:03fantixsetmessages: + msg370147
2020-05-27 22:17:25vstinnersetnosy: + vstinner
messages: + msg370143
2020-05-27 20:39:07miss-islingtonsetmessages: + msg370131
2020-05-27 19:47:47miss-islingtonsetnosy: + miss-islington
pull_requests: + pull_request19713
2020-05-27 19:47:37yselivanovsetmessages: + msg370123
2020-05-24 22:58:32fantixsetstage: patch review
pull_requests: + pull_request19633
2020-05-12 05:28:07corona10setnosy: + aeros
2020-05-12 05:17:25achimnolsetmessages: + msg368702
2020-05-12 05:13:01achimnolsetnosy: + achimnol
messages: + msg368701
2020-05-09 22:52:06fantixsetfiles: + cancelled-sock-recv-race.patch
nosy: + fantix
messages: + msg368553

2020-02-26 12:39:03asvetlovsetmessages: + msg362690
2020-02-25 23:38:54yselivanovsetmessages: + msg362671
2020-02-25 12:10:38lukasz.langasetpriority: deferred blocker -> high

messages: + msg362637
2019-12-09 14:59:33asvetlovsetmessages: + msg358108
2019-12-09 14:19:33lukasz.langasetnosy: + lukasz.langa
messages: + msg358098
2019-12-08 16:37:21asvetlovsetmessages: + msg358011
2019-12-07 20:48:28ned.deilysetnosy: + ned.deily
messages: + msg357989
2019-10-23 22:58:45yselivanovsetpriority: normal -> deferred blocker

messages: + msg355273
2019-10-23 22:42:58yselivanovsetstatus: closed -> open
resolution: fixed ->
messages: + msg355272

stage: resolved -> (no value)
2018-11-12 17:02:04asvetlovsetstatus: open -> closed
resolution: fixed
stage: patch review -> resolved
2018-11-12 17:00:25asvetlovsetmessages: + msg329746
2018-11-08 22:41:33asvetlovsetkeywords: + patch
stage: patch review
pull_requests: + pull_request9699
2018-10-09 08:57:06asvetlovsetassignee: asvetlov
2018-09-24 21:57:11yselivanovsetnosy: + asvetlov

messages: + msg326294
versions: + Python 3.8, - Python 3.5, Python 3.6
2017-04-13 08:55:09abacabadabacabacreate