In some cases asyncio.wait_for can lead to socket leak. #81839

Closed
NikitaIlyasov mannequin opened this issue Jul 23, 2019 · 20 comments · Fixed by #98518
Labels
topic-asyncio type-bug An unexpected behavior, bug, or error

Comments

NikitaIlyasov mannequin commented Jul 23, 2019

BPO 37658
Nosy @ods, @asvetlov, @cjerdonek, @elprans, @ambv, @1st1, @miss-islington, @Dreamsorcerer, @aaliddell, @kumaraditya303
PRs
  • bpo-37658: Fix asyncio.wait_for() to respect waited task status #21894
  • [3.9] bpo-37658: Fix asyncio.wait_for() to respect waited task status (GH-21894) #21964
  • [3.8] bpo-37658: Fix asyncio.wait_for() to respect waited task status (GH-21894) #21965
  • [3.7] bpo-37658: Fix asyncio.wait_for() to respect waited task status (GH-21894) #21969
  • bpo-42130: Fix swallowing of cancellation by wait_for #26097
  • bpo-37658: Actually return result in race condition #29202
  • [3.9] bpo-37658: Actually return result in race condition (GH-29202) #29832
  • [3.10] bpo-37658: Actually return result in race condition (GH-29202) #29831
Files
  • example.py: script that can reproduce the problem
  • ssl_circular_refs.png

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2019-07-23.12:16:17.363>
    labels = ['expert-asyncio', '3.7', 'performance']
    title = 'In some cases asyncio.wait_for can lead to socket leak.'
    updated_at = <Date 2022-03-11.17:35:41.865>
    user = 'https://bugs.python.org/NikitaIlyasov'

    bugs.python.org fields:

    activity = <Date 2022-03-11.17:35:41.865>
    actor = 'dreamsorcerer'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['asyncio']
    creation = <Date 2019-07-23.12:16:17.363>
    creator = 'Nikita Ilyasov'
    dependencies = []
    files = ['48500', '50061']
    hgrepos = []
    issue_num = 37658
    keywords = ['patch']
    message_count = 18.0
    messages = ['348328', '348329', '375937', '375942', '375947', '379542', '393566', '393567', '393584', '393586', '393687', '394249', '404998', '407248', '407252', '407373', '414797', '414925']
    nosy_count = 12.0
    nosy_names = ['ods', 'asvetlov', 'chris.jerdonek', 'Elvis.Pranskevichus', 'lukasz.langa', 'yselivanov', 'miss-islington', 'Nikita Ilyasov', 'dreamsorcerer', 'aaliddell', 'nmatravolgyi', 'kumaraditya']
    pr_nums = ['21894', '21964', '21965', '21969', '26097', '29202', '29832', '29831']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'resource usage'
    url = 'https://bugs.python.org/issue37658'
    versions = ['Python 3.6', 'Python 3.7']

    NikitaIlyasov mannequin commented Jul 23, 2019

    In some cases asyncio.wait_for can lead to socket leak.

    Condensed example:

    async def _create_connection(ssl_obj, timeout=60):
        loop = asyncio.get_event_loop()
        connector = loop.create_connection(MyEchoClientProtocol, '127.0.0.1', 5000, ssl=ssl_obj)
        connector = asyncio.ensure_future(connector)
        tr, pr = await asyncio.wait_for(connector, timeout=timeout, loop=loop)
        return tr, pr
    
    async def main():
        ...
        res = await asyncio.wait_for(_acquire_impl(), timeout=timeout, loop=loop)

    If my understanding is correct, wait_for should work in exactly one of two ways:

    1. the inner task is completed and the outer task will receive the result – transport and protocol in this case
    2. The inner task is cancelled and no connection was established

    I provided source code for client and server so the problem can be easily reproduced on your system.

    The certificate and key can be easily generated with minica.

    I found out that if I catch CancelledError and add a done_callback to the inner task, like so:

        try:
            tr, pr = await asyncio.wait_for(connector, timeout=timeout, loop=loop)
            return tr, pr
        except asyncio.CancelledError as e:
            connector.add_done_callback(_done_callback)
            raise e

    then inside of _done_callback I can access the transport and protocol object and close the transport manually to prevent leaking.
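
    For illustration, here is a minimal sketch (not part of the original report) of what such a done callback could look like, assuming the future's result is the (transport, protocol) pair returned by create_connection:

    def _done_callback(fut):
        # If the connection future actually completed (it was neither cancelled
        # nor failed), close the transport so the socket is not leaked.
        if not fut.cancelled() and fut.exception() is None:
            transport, _protocol = fut.result()
            transport.close()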

    I ran netstat -a | grep 5000 | grep ESTAB | awk '{ print $5 }' | sort | uniq -c | grep 5000 after the script finished, and there were many unclosed connections.

    The output depends on your hardware, so you might need to tweak the timeout parameter.

    @NikitaIlyasov NikitaIlyasov mannequin added 3.7 (EOL) end of life topic-asyncio performance Performance or resource usage labels Jul 23, 2019
    NikitaIlyasov mannequin commented Jul 23, 2019

    1st1 commented Aug 26, 2020

    New changeset a2118a1 by Elvis Pranskevichus in branch 'master':
    bpo-37658: Fix asyncio.wait_for() to respect waited task status (GH-21894)
    a2118a1

    ambv commented Aug 26, 2020

    New changeset 9de6be4 by Miss Islington (bot) in branch '3.9':
    bpo-37658: Fix asyncio.wait_for() to respect waited task status (GH-21894) (GH-21964)
    9de6be4

    1st1 commented Aug 26, 2020

    New changeset 6e1954c by Miss Islington (bot) in branch '3.8':
    bpo-37658: Fix asyncio.wait_for() to respect waited task status (GH-21894) (GH-21965)
    6e1954c

    @cjerdonek

    Issue bpo-42130 that was recently filed appears related to this change.

    ods mannequin commented May 13, 2021

    The current solution doesn't look correct. It swallows cancellation, causing the task to hang: https://bugs.python.org/issue42130 .

    The proposed test case calls cancel for the inner future and set_result for the outer task in the same loop step. The old behaviour (prior to this patch) looks correct to me in this case. We have to elaborate on the original problem to find the source of the actual race.

    Therefore I've tried to reproduce the problem with the original code sample and published an adapted version of it here: https://github.com/ods/bpo-37658 . Nikita Ilyasov, please correct me if I've misunderstood what you meant. My conclusion is that the proposed solution doesn't actually solve the problem, but rather lowers the chance of hitting it. Namely, I get the traceback quoted in the README with 3.8.5, 3.9.4, and the current code in main.
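
    For reference, a minimal standalone sketch (not taken from the issue) of the race being discussed: the inner task finishes and the outer task is cancelled within the same narrow window. Depending on the Python version and on exact event-loop scheduling, wait_for may either swallow the cancellation and return the result, or raise CancelledError and drop the already-created result:

    import asyncio

    async def inner():
        return "connection"  # stands in for a task that produces a resource

    async def outer():
        return await asyncio.wait_for(asyncio.ensure_future(inner()), timeout=10)

    async def main():
        task = asyncio.ensure_future(outer())
        await asyncio.sleep(0)  # let outer start and schedule inner
        await asyncio.sleep(0)  # let inner finish within this loop step
        task.cancel()           # cancel outer while the result is in flight
        try:
            print("result:", await task)  # cancellation may be swallowed here
        except asyncio.CancelledError:
            print("cancelled; the inner result, if any, is dropped")

    asyncio.run(main())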

    aaliddell mannequin commented May 13, 2021

    Some discussion leading up to that change is here MagicStack/asyncpg#548 and in the issues it links.

    ods mannequin commented May 13, 2021

    The original problem can be fixed by wrapping the await in a try/except block:

    async def create_connection(ssl_obj):
        loop = asyncio.get_event_loop()
        connector = loop.create_connection(MyEchoClientProtocol, '127.0.0.1', 5000, ssl=ssl_obj)
        connector = asyncio.ensure_future(connector)
        try:
            tr, pr = await connector
        except asyncio.CancelledError:
            if connector.done():
                tr, pr = connector.result()
                # Uncommenting the following line fixes the problem:
                # tr.close()
            raise
        return tr, pr
    

    I've updated my example to reproduce this: ods/bpo-37658@eca3d81

    I believe it's a general problem with maintaining atomicity in async code, and not a bug in wait_for. Probably having an interface like async with loop.create_connection(…) as (transport, protocol): would simplify correct usage for this particular case.
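
    As a rough illustration of that idea (a hypothetical helper, not an existing asyncio API), the interface could be approximated today with an async context manager that always closes the transport when the block exits:

    import asyncio
    from contextlib import asynccontextmanager

    @asynccontextmanager
    async def open_connection(loop, *args, **kwargs):
        # Hypothetical wrapper around loop.create_connection(): the transport
        # is closed when the block exits, including on exceptions.
        transport, protocol = await loop.create_connection(*args, **kwargs)
        try:
            yield transport, protocol
        finally:
            transport.close()

    # usage sketch:
    #   async with open_connection(loop, MyEchoClientProtocol, '127.0.0.1', 5000, ssl=ssl_obj) as (tr, pr):
    #       ...

    Note that this only guarantees cleanup once the connection has been handed to the caller; it does not, by itself, close the cancellation window inside the awaited call.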

    aaliddell mannequin commented May 13, 2021

    Wrapping every resource-allocating call like that is what we were trying to avoid, since it makes wait_for go from a simple one-line helper to something you have to be very careful with.

    Conceptually, a user should expect that wait_for should behave the exact same as awaiting the underlying awaitable, just with auto-cancellation. The problem with the current wait_for is that there is a gap where the underlying task may have completed but a cancellation arrives. In this case, we need to raise the cancellation to be a good asyncio citizen, but the underlying task has no opportunity to act on the cancellation (to free the resource) since it is already complete and cannot be re-entered. So the resource returned by the completed task gets stuck in limbo, since we can't return it and we can't assume a generic 'close' behaviour.

    See my comment in the PR for a suggestion about an alternative structure for wait_for, which may avoid this gap and hence prevent the leak (but I have not tested it!)

    nmatravolgyi mannequin commented May 14, 2021

    I did a little research on how the wait_for could be improved. There are four alternate implementations you can find here: https://github.com/Traktormaster/wait-for2/blob/issue37658/wait_for2/impl.py

    The repository (on the linked branch) has a tox setup and a test that asserts the behaviour of the built-in and alternate implementations.

    Quick summary:

    • wait_for (builtin): either leaks resources or ignores cancellation
    • wait_for (preferred alt): works as expected (I'm going to discuss this below)
    • wait_for_special_raise: works, but has pitfalls
    • wait_for_no_waiter*: These were proposed by @aaliddell on the related PR: bpo-42130: Fix swallowing of cancellation by wait_for #26097 (comment). I could not make them work, but I might be missing something. I think the fact that the inner coroutine gets wrapped into a Future introduces the race condition, but I don't know how to test that. The general idea would be the best approach if an implementation were possible.

    About the actually working alternate implementation I made:
    In my opinion there is no way to implicitly handle losing a result properly in case of a cancellation, since it arbitrarily breaks the flow of execution. I'd look into having wait_for(...) support a cleanup callback for when a cancellation and a completion happen at the same time. Something like:

    async def create_something():
        # this acquires some resource that needs explicit cleanup after some work
        return object()

    def cleanup_something(inst):
        inst.close()

    t = asyncio.ensure_future(create_something())
    # cancel_handler is the proposed new parameter: it would be called with the
    # inner task's result if the waiter is cancelled just as the task completes
    x = await asyncio.wait_for(t, timeout=10, cancel_handler=cleanup_something)
    try:
        pass  # normal work with x
    finally:
        cleanup_something(x)  # cleanup at normal execution

    The inner task is still responsible for handling the resource before it returns, which means if the inner task is cancelled, there must be no leak if the implementation is correct. If no cancellation happens, everything is "fine". Finally, the waiter task would be able to handle its cancellation and the simultaneous completion of the inner task if the caller code provides a callback.

    Unfortunately this requires the user of wait_for to be aware of this race condition. However, it makes it possible to handle it properly when the waited future's result requires cleanup.
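
    As a sketch of what that shape could look like without changing wait_for itself (hypothetical helper, not the wait-for2 implementation), the same cleanup-callback idea applied to a bare await would be:

    import asyncio

    async def await_with_cleanup(aw, cleanup):
        # If this await is cancelled in the same step in which the inner task
        # completes, hand the orphaned result to `cleanup` instead of dropping it.
        fut = asyncio.ensure_future(aw)
        try:
            return await fut
        except asyncio.CancelledError:
            if fut.done() and not fut.cancelled() and fut.exception() is None:
                cleanup(fut.result())
            raise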

    ods mannequin commented May 24, 2021

    Re: msg393586

    See my comment in the PR for a suggestion about an alternative structure for wait_for, which may avoid this gap and hence prevent the leak (but I have not tested it!)

    Unfortunately this didn't close the gap, so the leak persisted.

    I did some research on the source of the error. There are circular references between instances of _SelectorSocketTransport and _SSLProtocolTransport (see attached image), therefore these objects are destroyed by the garbage collector, and the order might be arbitrary. When _SelectorSocketTransport is destroyed after _SSLProtocolTransport everything is OK, but with the opposite order we get the error printed. Here is a description of how to reproduce this:
    https://github.com/ods/bpo-37658#circular-references-when-using-ssltls

    Dreamsorcerer mannequin commented Oct 25, 2021

    Can I get a review?
    #29202

    Seems like a simple mistake given the original description of this issue:

    1. the inner task is completed and the outer task will receive the result – transport and protocol in this case
    2. The inner task is cancelled and no connection was established

    The try/except blocks clearly add a 3rd condition, where the inner task is completed and a TimeoutError is raised without returning the result.

    @asvetlov

    New changeset 934a826 by Sam Bull in branch 'main':
    bpo-37658: Actually return result in race condition (GH-29202)
    934a826

    @miss-islington

    New changeset 305236e by Miss Islington (bot) in branch '3.10':
    bpo-37658: Actually return result in race condition (GH-29202)
    305236e

    @asvetlov

    New changeset 99a9b34 by Miss Islington (bot) in branch '3.9':
    bpo-37658: Actually return result in race condition (GH-29202) (GH-29832)
    99a9b34

    @kumaraditya303

    @asvetlov Anything left to do here, or can this be closed?

    Dreamsorcerer mannequin commented Mar 11, 2022

    There is still an issue here. I proposed a possible solution at: #28149 (comment)

    You can also scroll up through the lengthy discussion to see how I reached that conclusion. I've not had time to actually try implementing something yet.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @iritkatriel iritkatriel added type-bug An unexpected behavior, bug, or error and removed performance Performance or resource usage labels Aug 17, 2022
    @zhuofeng6

    Any questions about this issue? If not, I think it is better to close it.

    @gvanrossum

    Yes, see above:

    There is still an issue here. I proposed a possible solution at: #28149 (comment)
