BaseSelectorEventLoop.sock_{recv,sendall}() don't remove their callbacks when canceled #74250
Comments
Code:

    import asyncio as a
    import socket as s

    @a.coroutine
    def coro():
        s1, s2 = s.socketpair()
        s1.setblocking(False)
        s2.setblocking(False)
        try:
            yield from a.wait_for(loop.sock_recv(s2, 1), 1)
        except a.TimeoutError:
            pass
        yield from loop.sock_sendall(s1, b'\x00')
        yield
        s1.close()
        s2.close()

    loop = a.get_event_loop()
    loop.run_until_complete(coro())

Result:

Exception in callback BaseSelectorEventLoop._sock_recv(<Future cancelled>, True, <socket.socke...2049, proto=0>, 1)
handle: <Handle BaseSelectorEventLoop._sock_recv(<Future cancelled>, True, <socket.socke...2049, proto=0>, 1)>
Traceback (most recent call last):
File "/usr/lib/python3.6/asyncio/events.py", line 127, in _run
self._callback(*self._args)
File "/usr/lib/python3.6/asyncio/selector_events.py", line 378, in _sock_recv
self.remove_reader(fd)
File "/usr/lib/python3.6/asyncio/selector_events.py", line 342, in remove_reader
return self._remove_reader(fd)
File "/usr/lib/python3.6/asyncio/selector_events.py", line 279, in _remove_reader
key = self._selector.get_key(fd)
File "/usr/lib/python3.6/selectors.py", line 189, in get_key
return mapping[fileobj]
File "/usr/lib/python3.6/selectors.py", line 70, in __getitem__
fd = self._selector._fileobj_lookup(fileobj)
File "/usr/lib/python3.6/selectors.py", line 224, in _fileobj_lookup
return _fileobj_to_fd(fileobj)
File "/usr/lib/python3.6/selectors.py", line 41, in _fileobj_to_fd
raise ValueError("Invalid file descriptor: {}".format(fd))
ValueError: Invalid file descriptor: -1 |
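The failure mode above is that the cancelled future's reader callback still fires later and tries to unregister the reader through a socket whose fileno() is already -1. A minimal sketch of the fix direction implied by the title, unregistering the reader as soon as the future completes (including cancellation); sock_recv_sketch and its body are illustrative only, not the actual CPython patch:

    import asyncio

    # Hypothetical sketch: detach the reader as soon as the future
    # completes (including cancellation), so a late callback never
    # touches a socket that has already been closed (fileno() == -1).
    def sock_recv_sketch(loop, sock, nbytes):
        fut = loop.create_future()
        fd = sock.fileno()  # capture the fd while it is still valid

        def on_done(fut):
            loop.remove_reader(fd)  # runs on cancellation too

        def on_readable():
            if fut.done():
                return
            try:
                data = sock.recv(nbytes)
            except (BlockingIOError, InterruptedError):
                return  # spurious wakeup; keep waiting
            except Exception as exc:
                fut.set_exception(exc)
            else:
                fut.set_result(data)

        fut.add_done_callback(on_done)
        loop.add_reader(fd, on_readable)
        return fut

Because the fd integer is captured up front, the cleanup in on_done works even if the socket object is closed before the done callback runs.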
See also https://bugs.python.org/issue34795 |
I think this broke asyncio a bit. These two uvloop sock_* API tests no longer pass on asyncio 3.8:
Andrew, since this was your patch, please take a look. |
I'll elevate the status so that we don't forget about this before 3.8.1 gets too close. |
Andrew, Yury: ping. 3.8.1 cutoff is approaching. |
Not sure if the failed uvloop tests are correct.
If I put a context switch (await asyncio.sleep(0)) between 3) and 4) *or* replace the direct sock_read() call in 4) with creating a task (data = await asyncio.create_task(loop.sock_read(sock, 1))), the cancellation of the former read is performed and the test passes. I very much doubt that any sane code is organized like this test: start a delayed read, cancel it, and read again. Worse, neither the previous nor the current sock_read() implementation prevents concurrent reading, which basically delivers data in an unpredictable order. Honestly, I'm not sure we need to provide a concurrency guarantee for such low-level functions. The code built on top of these low-level primitives should handle the concurrent-access problem IMHO. |
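The workaround described above, sketched in isolation; this assumes a connected non-blocking socket `sock`, and uses the actual loop.sock_recv() API in place of the sock_read() shorthand used in the comment:

    import asyncio

    # Illustrative sketch of the "context switch between cancel and the
    # next read" workaround discussed above.
    async def cancel_then_read(loop, sock):
        task = asyncio.ensure_future(loop.sock_recv(sock, 1))
        task.cancel()
        # Context switch: give the loop one iteration to process the
        # cancellation and unregister the reader callback.
        await asyncio.sleep(0)
        # Only now is it safe to start a new read on the same socket.
        return await loop.sock_recv(sock, 1)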
Note: this is going to miss Python 3.8.1 as I'm releasing 3.8.1rc1 right now. Please address this before 3.8.2 (due in February). |
As I wrote, I don't think we should fix the code; rather, the failing uvloop test itself is slightly weird. Maybe I'm wrong; I need to talk with Yuri anyway. |
Downgrading priority on this one. |
Hm, cancellation should work correctly no matter how "sane" or "insane" the user code is.
But we're not discussing using a socket concurrently -- asyncio explicitly does not support that for the sock_* API. AFAICT this issue is about a subsequent cancel operation not working as expected in asyncio, no? |
This is a very interesting question. As I said, the test needs a context switch between the cancellation and the next read. Futures are canceled immediately; that's why the previous implementation passed uvloop's test suite untouched. Unfortunately, it had its own flaws. |
Following the discussion - looks like |
I just encountered this issue when doing "sys.exit(1)" in a Click-based CLI program that internally uses an asyncio event loop wrapped in a context manager, on Python 3.8.2. Using uvloop or adding "time.sleep(0.1)" before "sys.exit(1)" removes the error. |
And I suspect that this issue is something similar to what I did in a recent janus PR:
The new test_sock_client_racing() test seems to have... a race condition...

AMD64 FreeBSD Non-Debug 3.9:

ERROR: test_sock_client_racing (test.test_asyncio.test_sock_lowlevel.KqueueEventLoopTests)

Example:

ERROR: test_sock_client_racing (test.test_asyncio.test_sock_lowlevel.PollEventLoopTests)
Traceback (most recent call last):
File "/usr/home/buildbot/python/3.9.koobs-freebsd-9e36.nondebug/build/Lib/test/test_asyncio/test_sock_lowlevel.py", line 216, in _basetest_sock_connect_racing
await self.loop.sock_connect(sock, addr)
File "/usr/home/buildbot/python/3.9.koobs-freebsd-9e36.nondebug/build/Lib/asyncio/selector_events.py", line 500, in sock_connect
return await fut
File "/usr/home/buildbot/python/3.9.koobs-freebsd-9e36.nondebug/build/Lib/asyncio/selector_events.py", line 505, in _sock_connect
sock.connect(address)
ConnectionRefusedError: [Errno 61] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/home/buildbot/python/3.9.koobs-freebsd-9e36.nondebug/build/Lib/test/test_asyncio/test_sock_lowlevel.py", line 253, in test_sock_client_racing
self.loop.run_until_complete(asyncio.wait_for(
File "/usr/home/buildbot/python/3.9.koobs-freebsd-9e36.nondebug/build/Lib/asyncio/base_events.py", line 642, in run_until_complete
return future.result()
File "/usr/home/buildbot/python/3.9.koobs-freebsd-9e36.nondebug/build/Lib/asyncio/tasks.py", line 496, in wait_for
return fut.result()
File "/usr/home/buildbot/python/3.9.koobs-freebsd-9e36.nondebug/build/Lib/test/test_asyncio/test_sock_lowlevel.py", line 219, in _basetest_sock_connect_racing
await self.loop.sock_connect(sock, addr)
File "/usr/home/buildbot/python/3.9.koobs-freebsd-9e36.nondebug/build/Lib/asyncio/selector_events.py", line 500, in sock_connect
return await fut
File "/usr/home/buildbot/python/3.9.koobs-freebsd-9e36.nondebug/build/Lib/asyncio/selector_events.py", line 505, in _sock_connect
sock.connect(address)
ConnectionRefusedError: [Errno 61] Connection refused |
Ouch ... looks like FreeBSD also needs more than a single retry. I'll test on a FreeBSD box and create a PR for that. |
Not a simple one - FreeBSD is returning ECONNREFUSED immediately when trying to connect to a non-listening port on the loopback address: https://lists.freebsd.org/pipermail/freebsd-current/2005-May/049876.html Then all following connect attempts on the same socket return ECONNREFUSED regardless of whether the port is listening or not. I think I'll have to skip this test on FreeBSD. |
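To illustrate the behavior described above: once connect() on a socket has failed with ECONNREFUSED, any retry must use a fresh socket, because the old one keeps returning the cached error. A hedged sketch; connect_with_retries and its parameters are illustrative, not test code from the PR:

    import asyncio
    import socket

    # Illustrative sketch: on (at least) FreeBSD, a socket whose connect()
    # failed with ECONNREFUSED keeps failing even after the port starts
    # listening, so each retry must use a brand-new socket.
    async def connect_with_retries(loop, addr, retries=5, delay=0.1):
        for attempt in range(retries):
            sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            sock.setblocking(False)
            try:
                await loop.sock_connect(sock, addr)
                return sock
            except ConnectionRefusedError:
                sock.close()  # this socket is burned; never reuse it
                await asyncio.sleep(delay)
        raise ConnectionRefusedError(f"could not connect to {addr}")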
I'm also a bit skeptical about relying on |
I created #20485 to skip the unstable tests until someone fixes them. IMO it's better to skip the test than to revert the fix, since the fix does fix a race condition. The test is still failing on buildbots, which makes analysis of buildbot failures more difficult. For example, a SQLite change got 3 or 4 different bugs on buildbots:
See the "revert on fail" policy for buildbots:
Yeah, please avoid sleep as a synchronization primitive: |
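For illustration, the usual alternative to a timed sleep is to synchronize on an explicit event, so the test does not depend on machine speed. A hedged sketch with illustrative names:

    import asyncio

    # Illustrative sketch: wait on an explicit signal instead of sleeping
    # for a fixed duration.
    async def server_side(ready: asyncio.Event):
        # ... bind and listen here ...
        ready.set()  # tell the client the server is actually ready

    async def client_side(ready: asyncio.Event):
        await ready.wait()  # no time.sleep() / asyncio.sleep() guessing
        # ... safe to connect now ...

    async def main():
        ready = asyncio.Event()
        await asyncio.gather(server_side(ready), client_side(ready))

    asyncio.run(main())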
Sorry. You should read "Skip" tests, not "Fix" tests :-p |
Thanks for the comments! Added PR 20494 to properly fix/skip the test for (hopefully) all platforms. |
Another kind of failure on x86 Gentoo Installed with X 3.x:

ERROR: test_sock_client_racing (test.test_asyncio.test_sock_lowlevel.EPollEventLoopTests)
Traceback (most recent call last):
File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.installed/build/target/lib/python3.10/test/test_asyncio/test_sock_lowlevel.py", line 200, in _basetest_sock_send_racing
await self.loop.sock_sendall(sock, b'world')
File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.installed/build/target/lib/python3.10/asyncio/selector_events.py", line 460, in sock_sendall
return await fut
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.installed/build/target/lib/python3.10/asyncio/tasks.py", line 507, in wait_for
fut.result()
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.installed/build/target/lib/python3.10/test/test_asyncio/test_sock_lowlevel.py", line 256, in test_sock_client_racing
self.loop.run_until_complete(asyncio.wait_for(
File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.installed/build/target/lib/python3.10/asyncio/base_events.py", line 642, in run_until_complete
return future.result()
File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.installed/build/target/lib/python3.10/asyncio/tasks.py", line 509, in wait_for
raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError |
I'm checking now |
I've been testing on local VMs with Gentoo and Ubuntu, but I cannot reproduce the issue. The error log doesn't really reveal anything useful, other than that the send was blocked when it was supposed to succeed. As it is not failing consistently on the buildbot (https://buildbot.python.org/all/#/builders/128), I don't know whether it is worth adding more debug code to this test in master and waiting for the next hang. Please kindly advise. |
I was able to replicate the failure locally by running a large number of jobs in parallel. We typically use this to test for tricky race conditions, to simulate maximum load:

[aeros:~/repos/cpython]$ ./python -m test test_asyncio.test_sock_lowlevel --match test_sock_client_racing -j100 -F -v

Traceback (most recent call last):
File "/home/aeros/repos/cpython/Lib/test/test_asyncio/test_sock_lowlevel.py", line 200, in _basetest_sock_send_racing
await self.loop.sock_sendall(sock, b'world')
File "/home/aeros/repos/cpython/Lib/asyncio/selector_events.py", line 460, in sock_sendall
return await fut
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/aeros/repos/cpython/Lib/asyncio/tasks.py", line 507, in wait_for
fut.result()
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/aeros/repos/cpython/Lib/test/test_asyncio/test_sock_lowlevel.py", line 256, in test_sock_client_racing
self.loop.run_until_complete(asyncio.wait_for(
File "/home/aeros/repos/cpython/Lib/asyncio/base_events.py", line 642, in run_until_complete
return future.result()
File "/home/aeros/repos/cpython/Lib/asyncio/tasks.py", line 509, in wait_for
raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

======================================================================

Traceback (most recent call last):
File "/home/aeros/repos/cpython/Lib/test/test_asyncio/test_sock_lowlevel.py", line 200, in _basetest_sock_send_racing
await self.loop.sock_sendall(sock, b'world')
File "/home/aeros/repos/cpython/Lib/asyncio/selector_events.py", line 460, in sock_sendall
return await fut
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/aeros/repos/cpython/Lib/asyncio/tasks.py", line 507, in wait_for
fut.result()
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/aeros/repos/cpython/Lib/test/test_asyncio/test_sock_lowlevel.py", line 256, in test_sock_client_racing
self.loop.run_until_complete(asyncio.wait_for(
File "/home/aeros/repos/cpython/Lib/asyncio/base_events.py", line 642, in run_until_complete
return future.result()
File "/home/aeros/repos/cpython/Lib/asyncio/tasks.py", line 509, in wait_for
raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

======================================================================

Traceback (most recent call last):
File "/home/aeros/repos/cpython/Lib/test/test_asyncio/test_sock_lowlevel.py", line 256, in test_sock_client_racing
self.loop.run_until_complete(asyncio.wait_for(
File "/home/aeros/repos/cpython/Lib/asyncio/base_events.py", line 629, in run_until_complete
self.run_forever()
File "/home/aeros/repos/cpython/Lib/asyncio/base_events.py", line 596, in run_forever
self._run_once()
File "/home/aeros/repos/cpython/Lib/asyncio/base_events.py", line 1854, in _run_once
event_list = self._selector.select(timeout)
File "/home/aeros/repos/cpython/Lib/selectors.py", line 323, in select
r, w, _ = self._select(self._readers, self._writers, [], timeout)
OSError: [Errno 9] Bad file descriptor

I'll try to add some debug logs to find the source of the failure(s). This one might be a bit tricky to assess, though. @fantix I'd recommend running the test locally with the above command, increasing the parallel job count until the failure occurs. If it takes more than a minute or two for the tests to run, slowly lower the number. At the moment, the failure has only occurred once on the buildbots, but we might need to temporarily disable the test if it occurs repeatedly, or find some form of workaround (as it will potentially mask other bugs). |
I believe I might have identified a simple fix for the issue, but first I'd like to clarify something: what exactly is the reason we need to use asyncio.wait_for() with a timeout of 10 seconds in these tests? We typically avoid short timeouts on individual tests, largely because we don't want them to fail based on the performance of the device they're run on. This can be a concern for slower devices running the test suite w/ parallel jobs. If asyncio.wait_for() is necessary in this case, I'd advise increasing the timeout substantially. I didn't experiment with a wide variety of numbers, but increasing the duration from 10 to 60 seconds seems to have fixed the failure locally, so that might be a decent starting point. If it's not necessary, I'd prefer to simply remove the asyncio.wait_for()s in test_sock_client_racing. |
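The suggestion above amounts to something like the following; racing_scenario is a hypothetical stand-in for the coroutine under test:

    import asyncio

    # Illustrative: a generous watchdog timeout so that a regression
    # hangs the test for at most 60 seconds instead of forever.
    async def racing_scenario():
        await asyncio.sleep(0)  # placeholder for the real racing logic

    loop = asyncio.new_event_loop()
    loop.run_until_complete(asyncio.wait_for(racing_scenario(), timeout=60))
    loop.close()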
Oh, thank you for the clue, that's super helpful! Let me try locally. This fix PR was originally done to solve a hanging call in a race condition, so the test would hang forever if the fix were not working as expected. The 10-second wait_for() is to make sure we don't end up with a test hanging forever. If enlarging the timeout works in some of your cases, that really points me toward a potential cause of the issue in the test.

To simulate the race condition specifically for loop.sock_sendall(), I needed a connected socket with a full send buffer so that the next sock_sendall() call would block. To achieve this, I wrote something like this:

    with self.assertRaises(BlockingIOError):
        while True:
            sock.send(b' ' * 5)

And another loop in a subtask to consume the data:

    while not data:
        data = await self.loop.sock_recv(server, 1024)
        data = data.strip()

It might be that these loops cost too much time (especially the second one, because the logs showed that it was blocked on calling sock_sendall()). But that's just a guess for now; let me add some more debug logs and test locally. |
No problem at all, and thanks for clarifying.
That seems reasonable to me and would make the most sense at a glance. If this is confirmed to be the source of the issue, we should be able to just reduce the size of the socket's send buffer via SO_SNDBUF before sending to it so that it fills up much more quickly. |
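A minimal sketch of that suggestion, illustrative rather than the final test code:

    import socket

    # Illustrative sketch: shrink the send buffer so it fills up quickly.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 4096)
    # SO_SNDBUF is a hint: the kernel may round the value (Linux doubles
    # it, for instance), so read back the effective size if it matters.
    effective = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)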
OK, I think I solved the problem by improving the two loops with smarter buffer sizes:

    # fill the buffer until sending 5 chars would block
    size = 8192
    while size > 5:
        with self.assertRaises(BlockingIOError):
            while True:
                sock.send(b' ' * size)
        size = int(size / 2)

and

    # receive everything that is not a space
    async def recv_all():
        rv = b''
        while True:
            buf = await self.loop.sock_recv(server, 8192)
            if not buf:
                return rv
            rv += buf.strip()

This test now runs ~25x faster, and the load test crashed my laptop a few times before it hit a timeout. Last question before the PR, please: given that this test covers a fixed case where loop.sock_*() was hanging forever, should I keep the wait_for(..., timeout=10)? |
And yes, I updated SO_SNDBUF too. |
For now, I think we can keep the asyncio.wait_for() w/ the 10 sec timeout. I'll have to run the tests locally w/ the patch to double-check (once the PR is opened), and it may still end up needing to be increased if the test fails again in the future. It should be okay, though. |