classification
Title: In some cases asyncio.wait_for can lead to socket leak.
Type: resource usage Stage: patch review
Components: asyncio Versions: Python 3.7, Python 3.6
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Elvis.Pranskevichus, Nikita Ilyasov, aaliddell, asvetlov, chris.jerdonek, lukasz.langa, miss-islington, nmatravolgyi, ods, yselivanov
Priority: normal Keywords: patch

Created on 2019-07-23 12:16 by Nikita Ilyasov, last changed 2021-05-24 13:02 by ods.

Files
File name Uploaded Description Edit
example.py Nikita Ilyasov, 2019-07-23 12:16 script that can reproduce the problem
ssl_circular_refs.png ods, 2021-05-24 13:02
Pull Requests
URL Status Linked Edit
PR 21894 merged Elvis.Pranskevichus, 2020-08-15 22:23
PR 21964 merged miss-islington, 2020-08-26 16:43
PR 21965 merged miss-islington, 2020-08-26 16:43
PR 21969 closed Elvis.Pranskevichus, 2020-08-26 19:06
PR 26097 open ods, 2021-05-13 10:41
Messages (12)
msg348328 - (view) Author: Nikita Ilyasov (Nikita Ilyasov) Date: 2019-07-23 12:16
In some cases `asyncio.wait_for` can lead to socket leak.

Condensed example:

```python
async def _create_connection(timeout=60, ssl_obj):
    loop = asyncio.get_event_loop()
    connector = loop.create_connection(MyEchoClientProtocol, '127.0.0.1', 5000, ssl=ssl_obj)
    connector = asyncio.ensure_future(connector)
    tr, pr = await asyncio.wait_for(connector, timeout=timeout, loop=loop)
    return tr, pr

async def main():
    ...
    res = await asyncio.wait_for(_acquire_impl(), timeout=timeout, loop=loop)

```

If my understanding is correct `wait_for` should work in exactly 2 ways

1. the inner task is completed and the outer task will receive the result – transport and protocol in this case
2. The inner task is cancelled and no connection was established

I provided source code for client and server so the problem can be easily reproduced on your system.

certificate and key can be easily generated with `minica`


I found out that if I catch `CancelledError` and add a `done_callback` to the inner task, like so:

```python
    try:
        tr, pr = await asyncio.wait_for(connector, timeout=timeout, loop=loop)
        return tr, pr
    except asyncio.CancelledError as e:
        connector.add_done_callback(_done_callback)
        raise e
```
then inside of `_done_callback` I can access the transport and protocol object and close the transport manually to prevent leaking.


I run `netstat -a | grep 5000 | grep ESTAB | awk '{ print $5 }' | sort | uniq -c | grep 5000` after the script is done and there are many unclosed connections.

The output depends on your hardware so you might need to tweak the timeout parameter
msg348329 - (view) Author: Nikita Ilyasov (Nikita Ilyasov) Date: 2019-07-23 12:19
server example: https://gist.github.com/hexrain/bc92aa70eebc229365f0ce4bcccf7fc4
msg375937 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2020-08-26 16:42
New changeset a2118a14627256197bddcf4fcecad4c264c1e39d by Elvis Pranskevichus in branch 'master':
bpo-37658: Fix asyncio.wait_for() to respect waited task status (#21894)
https://github.com/python/cpython/commit/a2118a14627256197bddcf4fcecad4c264c1e39d
msg375942 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2020-08-26 17:15
New changeset 9de6be4e2ae605a1deb6fa72d5c5f66b07817e4c by Miss Islington (bot) in branch '3.9':
bpo-37658: Fix asyncio.wait_for() to respect waited task status (GH-21894) (GH-21964)
https://github.com/python/cpython/commit/9de6be4e2ae605a1deb6fa72d5c5f66b07817e4c
msg375947 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2020-08-26 18:26
New changeset 6e1954cd8286e083e7f8d09516d91b6b15769a4e by Miss Islington (bot) in branch '3.8':
bpo-37658: Fix asyncio.wait_for() to respect waited task status (GH-21894) (#21965)
https://github.com/python/cpython/commit/6e1954cd8286e083e7f8d09516d91b6b15769a4e
msg379542 - (view) Author: Chris Jerdonek (chris.jerdonek) * (Python committer) Date: 2020-10-24 21:22
Issue #42130 that was recently filed appears related to this change.
msg393566 - (view) Author: Denis S. Otkidach (ods) * Date: 2021-05-13 08:39
The current solutions doesn't look correct. It swallows cancelling causing task to hang: https://bugs.python.org/issue42130 .

Proposed test case calls cancel for inner future and set_result for outer task in the same loop step. The old (prior to this patch) behaviour in this case looks correct to me. We have to elaborate on original problem to find the source of actual race.

Therefore I've tried to reproduce the problem with original code sample and published the adapted version of it here: https://github.com/ods/bpo-37658 . Nikita Ilyasov, please correct me if I've misunderstood what you meant. My result is that proposed solution actually doesn't solve the problem, but rather lowers the chance to get it. Namely, I get the traceback quoted in README both with 3.8.5, 3.9.4, and current code in main.
msg393567 - (view) Author: Adam Liddell (aaliddell) Date: 2021-05-13 08:53
Some discussion leading up to that change is here https://github.com/MagicStack/asyncpg/pull/548 and in the issues it links.
msg393584 - (view) Author: Denis S. Otkidach (ods) * Date: 2021-05-13 14:53
The original problem can be fixed by wrapping await into try-except block:

```
async def create_connection(ssl_obj):
    loop = asyncio.get_event_loop()
    connector = loop.create_connection(MyEchoClientProtocol, '127.0.0.1', 5000, ssl=ssl_obj)
    connector = asyncio.ensure_future(connector)
    try:
        tr, pr = await connector
    except asyncio.CancelledError:
        if connector.done():
            tr, pr = connector.result()
            # Uncommenting the following line fixes the problem:
            # tr.close()
        raise
    return tr, pr
```

I've updated my example to reproduce this: https://github.com/ods/bpo-37658/commit/eca3d81d60cbe129ce687674e6451836d567f6b9

I believe it's general problem with maintaining atomicity in async code, and not a bug in `wait_for`. Probably having an interface like `async with loop.create_connection(…) as transport, protocol` might simplify correct usage for this particular case.
msg393586 - (view) Author: Adam Liddell (aaliddell) Date: 2021-05-13 15:20
Wrapping every resource allocating call like that is what we were trying to avoid, since it makes wait_for go from a simple one-line helper to something you have to be very careful with.

Conceptually, a user should expect that wait_for should behave the exact same as awaiting the underlying awaitable, just with auto-cancellation. The problem with the current wait_for is that there is a gap where the underlying task may have completed but a cancellation arrives. In this case, we need to raise the cancellation to be a good asyncio citizen, but the underlying task has no opportunity to act on the cancellation (to free the resource) since it is already complete and cannot be re-entered. So the resource returned by the completed task gets stuck in limbo, since we can't return it and we can't assume a generic 'close' behaviour.

See my comment in the PR for a suggestion about an alternative structure for wait_for, which may avoid this gap and hence prevent the leak (but I have not tested it!)
msg393687 - (view) Author: nmatravolgyi (nmatravolgyi) * Date: 2021-05-14 19:41
I did a little research on how the `wait_for` could be improved. There are four alternate implementations you can find here: https://github.com/Traktormaster/wait-for2/blob/issue37658/wait_for2/impl.py

The repository (on the linked branch) has tox set-up and a test that asserts the behaviour of builtin and alternate implementations.

Quick summary:
  - wait_for (builtin): either leaks resources or ignores cancellation
  - wait_for (preferred alt): works as expected (I'm going to discuss this below)
  - wait_for_special_raise: works, but has pitfalls
  - wait_for_no_waiter*: These were proposed by @aaliddell on the related PR: https://github.com/python/cpython/pull/26097#issuecomment-840497455 I could not make them work, but I might be missing something. I think the fact that the inner coroutine gets wrapped into a Future introduces the race-condition, but I don't know how to test that. The general idea of this would be the best if an implementation was possible.


About the actually working alternate implementation I made:
In my opinion there is no way to implicitly handle losing a result properly in case of a cancellation, since it arbitrarily breaks the flow of execution. I'd look into having `wait_for(...)` support cleanup callbacks when a cancellation and completion happens at the same time. Something like:

```python

async def create_something():
    # this acquires some resource that needs explicit cleanup after some work
    return object()

def cleanup_something(inst):
    inst.close()

t = asyncio.ensure_future(create_something())
x = await asyncio.wait_for(t, timeout=10, cancel_handler=cleanup_something)
try:
    pass  # normal work with x
finally:
    cleanup_something(x)  # cleanup at normal execution
```

The inner task is still responsible for handling the resource before it returns, which means if the inner task is cancelled, there must be no leak if the implementation is correct. If no cancellation happens, everything is "fine". Finally, the waiter task would be able to handle its cancellation and the simultaneous completion of the inner task if the caller code provides a callback.

Unfortunately this requires the user of `wait_for` to be aware of this race-condition. However it makes it possible to be handled properly when the waited future's result requires a cleanup.
msg394249 - (view) Author: Denis S. Otkidach (ods) * Date: 2021-05-24 13:02
Re: msg393586
> See my comment in the PR for a suggestion about an alternative structure for wait_for, which may avoid this gap and hence prevent the leak (but I have not tested it!)

Unfortunately this didn't close the gap, so the leak persisted.

I did some research on the source of the error. There are circular references between instances of _SelectorSocketTransport and _SSLProtocolTransport (see attached image), therefore these objects are destroyed by garbage collector. The order might be arbitrary. When _SelectorSocketTransport is destroyed after _SSLProtocolTransport everything is ok, but with the opposite order we get the error printed. Here is description on how to reproduce this:
https://github.com/ods/bpo-37658#circular-references-when-using-ssltls
History
Date User Action Args
2021-05-24 13:02:10odssetfiles: + ssl_circular_refs.png

messages: + msg394249
2021-05-14 19:41:13nmatravolgyisetnosy: + nmatravolgyi
messages: + msg393687
2021-05-13 15:20:09aaliddellsetmessages: + msg393586
2021-05-13 14:53:36odssetmessages: + msg393584
2021-05-13 10:41:25odssetpull_requests: + pull_request24738
2021-05-13 08:53:43aaliddellsetmessages: + msg393567
2021-05-13 08:39:05odssetnosy: + ods
messages: + msg393566
2020-10-24 21:22:53chris.jerdoneksetnosy: + chris.jerdonek
messages: + msg379542
2020-08-26 19:06:35Elvis.Pranskevichussetpull_requests: + pull_request21078
2020-08-26 18:26:37yselivanovsetmessages: + msg375947
2020-08-26 17:15:38lukasz.langasetnosy: + lukasz.langa
messages: + msg375942
2020-08-26 16:43:21miss-islingtonsetpull_requests: + pull_request21075
2020-08-26 16:43:13miss-islingtonsetnosy: + miss-islington
pull_requests: + pull_request21074
2020-08-26 16:42:51yselivanovsetmessages: + msg375937
2020-08-16 14:25:32aaliddellsetnosy: + aaliddell
2020-08-15 22:23:50Elvis.Pranskevichussetkeywords: + patch
nosy: + Elvis.Pranskevichus

pull_requests: + pull_request21013
stage: patch review
2019-07-23 12:19:39Nikita Ilyasovsetmessages: + msg348329
2019-07-23 12:16:17Nikita Ilyasovcreate