Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

asyncio.Semaphore waiters deque doesn't work #90155

Closed
hyzyla mannequin opened this issue Dec 6, 2021 · 30 comments
Closed

asyncio.Semaphore waiters deque doesn't work #90155

hyzyla mannequin opened this issue Dec 6, 2021 · 30 comments
Assignees
Labels
3.10 only security fixes 3.11 only security fixes 3.12 bugs and security fixes topic-asyncio

Comments

@hyzyla
Copy link
Mannequin

hyzyla mannequin commented Dec 6, 2021

BPO 45997
Nosy @asvetlov, @1st1, @miss-islington, @hyzyla
PRs
  • bpo-45997: Fix asyncio.Semaphore waiters order #30052
  • bpo-45997: Fix asyncio.Semaphore re-acquiring order #31910
  • [3.10] bpo-45997: Fix asyncio.Semaphore re-acquiring order (GH-31910) #32047
  • [3.9] bpo-45997: Fix asyncio.Semaphore re-acquiring order (GH-31910) #32049
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/asvetlov'
    closed_at = None
    created_at = <Date 2021-12-06.14:04:23.913>
    labels = ['3.11', 'expert-asyncio']
    title = "asyncio.Semaphore waiters deque doesn't work"
    updated_at = <Date 2022-03-22.15:16:39.410>
    user = 'https://github.com/hyzyla'

    bugs.python.org fields:

    activity = <Date 2022-03-22.15:16:39.410>
    actor = 'asvetlov'
    assignee = 'asvetlov'
    closed = False
    closed_date = None
    closer = None
    components = ['asyncio']
    creation = <Date 2021-12-06.14:04:23.913>
    creator = 'hyzyla'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 45997
    keywords = ['patch']
    message_count = 9.0
    messages = ['407809', '407836', '407841', '407843', '415276', '415286', '415772', '415786', '415787']
    nosy_count = 5.0
    nosy_names = ['asvetlov', 'python-dev', 'yselivanov', 'miss-islington', 'hyzyla']
    pr_nums = ['30052', '31910', '32047', '32049']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue45997'
    versions = ['Python 3.11']

    @hyzyla
    Copy link
    Mannequin Author

    hyzyla mannequin commented Dec 6, 2021

    Class asyncio.Semaphore has dequeue in implementation for waiters of semaphore and seems like the intention of this dequeue is to maintain FIFO queue of waiters. But it doesn't work, coroutine that releases semaphore can acquire semaphore again. Below is reproducible example:

    import asyncio
    
    
    async def process(idx: int, semaphore: asyncio.Semaphore) -> None:
        while True:
            async with semaphore:
                print(f'ACQUIRE {idx}')
                await asyncio.sleep(1)
    
    
    async def main() -> None:
        semaphore = asyncio.Semaphore(5)
        await asyncio.gather(*[process(idx, semaphore) for idx in range(20)])
    
    asyncio.run(main())

    In console:

    /home/y.hyzyla/Work/asynciosemaphorebug/venv/bin/python /home/y.hyzyla/Work/asynciosemaphorebug/main.py
    ACQUIRE 0
    ACQUIRE 1
    ACQUIRE 2
    ACQUIRE 3
    ACQUIRE 4
    ACQUIRE 0
    ACQUIRE 1
    ACQUIRE 2
    ACQUIRE 3
    ACQUIRE 4
    ACQUIRE 0
    ACQUIRE 1
    ACQUIRE 2
    ACQUIRE 3
    ACQUIRE 4
    ....
    

    Ugly fix, is to add asyncio.sleep right before semaphore.

        while True:
            await asyncio.sleep(0)
            async with semaphore:
                ...

    Also, I found a comment on Stack Overflow about race condition in Semaphore implementation, but I don't know about the quality of that comment:

    https://stackoverflow.com/questions/55951233/does-pythons-asyncio-lock-acquire-maintain-order#comment112978366_55951304

    @hyzyla hyzyla mannequin added 3.8 only security fixes topic-asyncio labels Dec 6, 2021
    @AlexWaygood AlexWaygood changed the title asyncio.Semaphore waiters deqeueu doesn't work asyncio.Semaphore waiters deque doesn't work Dec 6, 2021
    @AlexWaygood AlexWaygood changed the title asyncio.Semaphore waiters deqeueu doesn't work asyncio.Semaphore waiters deque doesn't work Dec 6, 2021
    @AlexWaygood AlexWaygood added type-bug An unexpected behavior, bug, or error labels Dec 6, 2021
    @asvetlov
    Copy link
    Contributor

    asvetlov commented Dec 6, 2021

    Good point.
    Currently, asyncio lock objects don't provide a strong FIFO guarantee.

    In a tight loop, a task can re-acquire the lock just after releasing even if there are pending waiters that were scheduled earlier. It's true also for Lock, Conditional, Event, etc.

    The solution requires async release method. Since the change is not backward compatible, a new method should be added, e.g. await sem.release_and_wait() for endorsing the context switch if there are pending waiters.

    async context manager can be modified for using the new method without backward compatibility problems easily.

    A hero who can help is welcome!

    @asvetlov asvetlov removed the type-bug An unexpected behavior, bug, or error label Dec 6, 2021
    @asvetlov asvetlov changed the title asyncio.Semaphore waiters deque doesn't work asyncio.Semaphore waiters deqeueu doesn't work Dec 6, 2021
    @asvetlov asvetlov self-assigned this Dec 6, 2021
    @asvetlov asvetlov removed the type-bug An unexpected behavior, bug, or error label Dec 6, 2021
    @asvetlov asvetlov changed the title asyncio.Semaphore waiters deque doesn't work asyncio.Semaphore waiters deqeueu doesn't work Dec 6, 2021
    @asvetlov asvetlov self-assigned this Dec 6, 2021
    @asvetlov
    Copy link
    Contributor

    asvetlov commented Dec 6, 2021

    Or, maybe, there is a way to do everything without changing public API.

    The idea is: _wake_up_next can create a future which is set by *waked up task* on its acquiring. acquire method should wait for this future first before entering in `while self._value < 0:` loop.

    If the future is cancelled, _wake_up_next should be called again and waiting for the next future acquiring to be performed.

    If there is no *acquire waiting* future exists -- do everything as usual.

    All other lock objects should be modified also.

    @asvetlov asvetlov added 3.11 only security fixes type-bug An unexpected behavior, bug, or error and removed 3.8 only security fixes labels Dec 6, 2021
    @asvetlov asvetlov changed the title asyncio.Semaphore waiters deqeueu doesn't work asyncio.Semaphore waiters deque doesn't work Dec 6, 2021
    @asvetlov asvetlov added 3.11 only security fixes type-bug An unexpected behavior, bug, or error and removed 3.8 only security fixes labels Dec 6, 2021
    @asvetlov asvetlov changed the title asyncio.Semaphore waiters deqeueu doesn't work asyncio.Semaphore waiters deque doesn't work Dec 6, 2021
    @hyzyla
    Copy link
    Mannequin Author

    hyzyla mannequin commented Dec 6, 2021

    Thanks for response!

    A hero who can help is welcome!

    I will try to make PR :raising_hand

    @hyzyla hyzyla mannequin added 3.8 only security fixes and removed 3.11 only security fixes type-bug An unexpected behavior, bug, or error labels Dec 6, 2021
    @cykerway
    Copy link
    Contributor

    The issue is semaphores are currently broken in 3.10.7 and other versions. A timeline is given in the PR link.

    To understand the issue, just look at the gather in it. When gather returns, all tasks are done. All tasks interact with semaphore using async with which guarantees its release. So the semaphore has value 1 right after gather returns. But the last async with is not able to acquire this semaphore. This suggets its internal state is corrupted.

    @gvanrossum
    Copy link
    Member

    When people state "X is broken" it gets my hackles up a bit. Can you explain in words what's broken? How badly is it broken? When does user code experience it?

    I see the code you showed but I need an explanation.

    @cykerway
    Copy link
    Contributor

    Can you explain in words what's broken?

    The semaphore is broken. It gets corrupted and no longer usable.

    How badly is it broken?

    The broken semaphore blocks a task when the task should be allowed to proceed. In this sense it is completely unusable. How bad is the result depends on what the task is doing. The symptom could be no terminal output, network connection timeout, or other bad things you can expect when a task cannot proceed.

    When does user code experience it?

    What I found was when a task cancels another task at certain time. This is a race condition so I can't give exact description of the condition that triggers the problem. Someone has documented his own experience here.


    And to add a bit explanation about the code itself, this is the story in plain words: Player Zero and One share a hotel room whose door is locked and there is only one key. They arrive at the door at about the same time. Zero sleeps 0 seconds then grabs the key (at time 0). He goes into the room and sleeps for another 2 seconds. During his sleep One wakes up from his first sleep (at time 1). One tries to grab the key but he can't, because there is only one key currently held by Zero. So One waits outside on the lock. Zero wakes up (at time 2). He gets out of the room, returns the key, then murders One at the door. Now One has died and he has never got the key. A moment later Police Main comes to the door (at time 3) to investigate. He is the only one here, and there is a key, but he just cannot open the door with the key. Why? Because the lock is damaged. Now Main waits forever on the damaged lock.

    Cast:

    • Player Zero: Task process(0, ...)
    • Player One: Task process(1, ...)
    • Police Main: Task main()
    • Lock: Semaphore(1)

    @gvanrossum
    Copy link
    Member

    Presumably the problem is that some code gets interrupted by the cancellation and exits without restoring an invariant. Not surprising given the complexity of the code. But cancellation only happens in await (there is no pre-emption in asyncio) so it should not be hard to find, right?

    @cykerway
    Copy link
    Contributor

    9d59381#diff-0fee1befb15023abc0dad2623effa93a304946796929f6cb445d11a57821e737R368

    Roughly speaking it's _wakeup_scheduled set to True when Zero releases the lock, then the cancelled One simply exits without touching it. So when Main comes it sees _wakeup_scheduled == True and sleeps forever. But this alone doesn't suggest an easy fix to the existing code. The code in PR has gone some trial and error. I don't know if it's optimal but it doesn't seem to suffer either problem (OP's and mine) or introduce new regressions. You are right that these tasks are non-preemptive.

    @gvanrossum
    Copy link
    Member

    I'm beginning to wonder of the root of the problem isn't that the Semaphore implementation decided to remove the completed Future in release() instead of in acquire(). If we compare to Lock, it removes the future in acquire():

            try:
                try:
                    await fut
                finally:
                    self._waiters.remove(fut)
            except exceptions.CancelledError:
                if not self._locked:
                    self._wake_up_first()
                raise

    In its _wake_up_first() it just makes the first future in the queue done. Also, at the top of acquire(), it has a thorough check to see if it can immediately grab the lock:

            if (not self._locked and (self._waiters is None or
                    all(w.cancelled() for w in self._waiters))):
                self._locked = True
                return True

    IOW it only grabs it when nobody else is waiting.

    Adapting this approach for Semaphore would probably improve the code -- no more _wakeup_scheduled counter, no more 'first' flag.

    @kumaraditya303 What do you think? And @cykerway?

    Looking at things from this perspective, I think the Lock class is fair and does not break the lock when acquire() is cancelled.

    I'm less sure about Queue -- like Semaphore, it removes the waiter from the queue of waiters when waking up, but it has much more thorough handling of cancellation (using in except: / ... / raise). Its docs don't make a promise of fairness when multiple tasks are each blocked in get() (or in put(), for that matter -- it's completely symmetrical). So in the end I think we don't have to worry about it.

    @kumaraditya303 kumaraditya303 added 3.10 only security fixes 3.12 bugs and security fixes labels Sep 17, 2022
    @cykerway
    Copy link
    Contributor

    cykerway commented Sep 17, 2022

    That's a great observation. I didn't really pay attention to Lock before. A Lock can be implemented with Semaphore(1). So if we have a good Semaphore then we have a good Lock. The other direction is not true: A good Lock doesn't automatically give a good Semaphore. But in this case, I copied Lock to Semaphore and tried to patch it. I found the patch didn't seem a lot of work.

    The result has been pushed to #93222. It can almost pass all the tests, except that I have to change some asyncio.sleep(0) into asyncio.sleep(0.01). This irks me. I have had this issue several times before. Could anyone here please explain what is the exact difference between using 0 and 0.01 in asyncio.sleep()? And why it matters in tests? I think this implies timing is playing a role in determining test results which doesn't look good to me. I understand that sleep(0) gives control to the event loop. But is there a guarantee that Python asyncio strictly follows the asymmetric stackless coroutine model, with the event loop sitting at the root of all coroutines?

    @gvanrossum
    Copy link
    Member

    The result has been pushed to #93222. It can almost pass all the tests, except that I have to change some asyncio.sleep(0) into asyncio.sleep(0.01). This irks me. I have had this issue several times before. Could anyone here please explain what is the exact difference between using 0 and 0.01 in asyncio.sleep()? And why it matters in tests? I think this implies timing is playing a role in determining test results which doesn't look good to me. I understand that sleep(0) gives control to the event loop. But is there a guarantee that Python asyncio strictly follows the asymmetric stackless coroutine model, with the event loop sitting at the root of all coroutines?

    I hadn't seen this before I reviewed your new version; I'll look at those tests and see if I can understand why sleep(0.01) is now needed. It definitely irks me too. I think the idea is that sleep(0) allows everything that's runnable now to run, whereas sleep(tiny_amount) will effectively also run things that become runnable as a result of the first wave. I'm not sure what asymmetric stackless coroutine model refers to -- is that a textbook concept or something named in the Python docs or just a term you invented on the spot?

    @cykerway
    Copy link
    Contributor

    Hope you can post something here once you completely figure out sleep(0) vs sleep(0.01).

    The coroutine model is a categorization of coroutines across two dimensions: asymmetric/symmetric and stackful/stackless. Asymmetric means the callee coroutine always passes control to its caller on yield. Symmetric means the callee coroutine passes control to anyone it likes. Stackful means coroutine preserves its entire stack on yield (like fiber). Stackless means coroutine is implemented as a state machine. It seems Python generators, async functions, etc. fall into the asymmetric+stackless category. I believe knowing this helps understand how coroutines are scheduled in Python, which in turn helps answer the sleep(0) vs sleep(0.01) question.

    @gvanrossum
    Copy link
    Member

    I think I've figured out why some tests appeared to need sleep(0.01). The wakeup code was still broken if there were multiple waiters: if the first waiter is already awoken, it should just wake up the second one. This is a subtle difference with Lock, where it is an error to release more than once.

    Thanks for the clarification of your coroutine classification scheme -- I'm so used to Python's particular brand of coroutines that I forget the other quadrants of the design space. :-) And yes, once the event loop passes control to a particular coroutine, that coroutine runs without intervention until it next yields -- await is implemented using yield from (PEP 380) which passes control up and down a stack of coroutines (not sure if that makes it stackful).

    @kumaraditya303
    Copy link
    Contributor

    kumaraditya303 commented Sep 20, 2022

    Okay so I finally got time to look into this:

    Currently, the semaphore does not provide FIFO ordering. The was supposed to be fixed by #31910 but it didn't. The correct fix for semaphore should have the following semantics:

    • Immediately acquire the semaphore if it is not locked and there are no waiters. not self.locked() and not self._waiters
    • If there are waiters, it must switch task to allow waking up the task instead of returning and not cause starvation on the next iteration of loop.

    Here's my simplified implementation which maintains the FIFO ordering:

    import asyncio
    import collections
    
    
    class Semaphore:
        def __init__(self, value: int) -> None:
            self.value = value
            self.waiters = collections.deque()
    
        async def __aenter__(self):
            await self.acquire()
    
        async def __aexit__(self, exc_type, exc, tb):
            self.release()
    
        def locked(self) -> bool:
            return self.value == 0
    
        async def acquire(self) -> None:
            if not self.locked() and not self.waiters:
                # No need to wait as the semaphore is not locked
                # and no one is waiting
                self.value -= 1
                return
    
            # if there are waiters or the semaphore is locked
            fut = asyncio.Future()
            self.waiters.append(fut)
            try:
                await fut
            finally:
                self.waiters.remove(fut)
    
            self.value -= 1
            if not self.locked():
                # This is required for strict FIFO ordering
                # otherwise it can cause starvation on the waiting tasks
                # The next loop iteration will wake up the task and switch
                self._wakeup_next()
            return
    
        def _wakeup_next(self) -> None:
            if self.waiters:
                # Wake up the first waiter, it is removed by the waiting task
                waiter = self.waiters[0]
                if not waiter.done():
                    # This schedules the task to be woken up on next loop iteration 
                    # It requires exactly one iteration of loop, See Task.__wakeup in pure python
                    waiter.set_result(None)
    
    
        def release(self) -> None:
            self.value += 1
            self._wakeup_next()
    
    
    async def process(idx: int, semaphore: Semaphore) -> None:
        while True:
            async with semaphore:
                print(f'ACQUIRE {idx}')
                await asyncio.sleep(1)
    
    
    async def main() -> None:
        semaphore = Semaphore(5)
        await asyncio.gather(*[process(idx, semaphore) for idx in range(20)])
    
    asyncio.run(main())

    @cykerway
    Copy link
    Contributor

    You didn't handle cancelled tasks. This program would hang.

    sem = Semaphore(1)
    async def c1(): await sem.acquire()
    async def c2(): await sem.acquire()
    async def c3(): await sem.acquire()
    t1 = create_task(c1())
    t2 = create_task(c2())
    t3 = create_task(c3())
    await sleep(0)
    sem.release()
    sem.release()
    t2.cancel()
    await gather(t1, t2, t3, return_exceptions=True)

    @kumaraditya303
    Copy link
    Contributor

    kumaraditya303 commented Sep 20, 2022

    Right, but I did say "simplified implementation" and I avoided cancellation handling to make understanding of waking of task in FIFO order easier which is the main point of confusion.

    @gvanrossum
    Copy link
    Member

    But the cancellation is what makes it so challenging. :-(

    For example, the acquiring task can go through the following states (not counting the "trivial success" at the top):

    • Blocked
    • Cancelled, but task not yet running
    • Result set, but task not yet running
    • Result set, then task cancelled, but not yet running

    That last state is surprising -- after the future's result is set the future cannot be cancelled, but the task can -- and CancelledError will be thrown into the task in that case. (You can only get there by calling sem.release() and task.cancel() without sleeping in between.)

    I have been trying to come up with a different version that passes all tests, but so far without success. If nothing else, this would give me more confidence that the tests are correct.

    miss-islington pushed a commit to miss-islington/cpython that referenced this issue Sep 22, 2022
    …antee (pythonGH-93222)
    
    The main problem was that an unluckily timed task cancellation could cause
    the semaphore to be stuck. There were also doubts about strict FIFO ordering
    of tasks allowed to pass.
    
    The Semaphore implementation was rewritten to be more similar to Lock.
    Many tests for edge cases (including cancellation) were added.
    (cherry picked from commit 24e0379)
    
    Co-authored-by: Cyker Way <cykerway@gmail.com>
    miss-islington pushed a commit to miss-islington/cpython that referenced this issue Sep 22, 2022
    …antee (pythonGH-93222)
    
    The main problem was that an unluckily timed task cancellation could cause
    the semaphore to be stuck. There were also doubts about strict FIFO ordering
    of tasks allowed to pass.
    
    The Semaphore implementation was rewritten to be more similar to Lock.
    Many tests for edge cases (including cancellation) were added.
    (cherry picked from commit 24e0379)
    
    Co-authored-by: Cyker Way <cykerway@gmail.com>
    gvanrossum pushed a commit that referenced this issue Sep 22, 2022
    …93222)
    
    The main problem was that an unluckily timed task cancellation could cause
    the semaphore to be stuck. There were also doubts about strict FIFO ordering
    of tasks allowed to pass.
    
    The Semaphore implementation was rewritten to be more similar to Lock.
    Many tests for edge cases (including cancellation) were added.
    miss-islington added a commit that referenced this issue Sep 22, 2022
    …H-93222)
    
    The main problem was that an unluckily timed task cancellation could cause
    the semaphore to be stuck. There were also doubts about strict FIFO ordering
    of tasks allowed to pass.
    
    The Semaphore implementation was rewritten to be more similar to Lock.
    Many tests for edge cases (including cancellation) were added.
    (cherry picked from commit 24e0379)
    
    Co-authored-by: Cyker Way <cykerway@gmail.com>
    miss-islington added a commit that referenced this issue Sep 22, 2022
    …H-93222)
    
    The main problem was that an unluckily timed task cancellation could cause
    the semaphore to be stuck. There were also doubts about strict FIFO ordering
    of tasks allowed to pass.
    
    The Semaphore implementation was rewritten to be more similar to Lock.
    Many tests for edge cases (including cancellation) were added.
    (cherry picked from commit 24e0379)
    
    Co-authored-by: Cyker Way <cykerway@gmail.com>
    @gvanrossum
    Copy link
    Member

    I blogged a bit about this: https://neopythonic.blogspot.com/2022/10/reasoning-about-asynciosemaphore.html

    pablogsal pushed a commit that referenced this issue Oct 24, 2022
    …H-93222)
    
    The main problem was that an unluckily timed task cancellation could cause
    the semaphore to be stuck. There were also doubts about strict FIFO ordering
    of tasks allowed to pass.
    
    The Semaphore implementation was rewritten to be more similar to Lock.
    Many tests for edge cases (including cancellation) were added.
    (cherry picked from commit 24e0379)
    
    Co-authored-by: Cyker Way <cykerway@gmail.com>
    pablogsal pushed a commit that referenced this issue Oct 24, 2022
    …H-93222)
    
    The main problem was that an unluckily timed task cancellation could cause
    the semaphore to be stuck. There were also doubts about strict FIFO ordering
    of tasks allowed to pass.
    
    The Semaphore implementation was rewritten to be more similar to Lock.
    Many tests for edge cases (including cancellation) were added.
    (cherry picked from commit 24e0379)
    
    Co-authored-by: Cyker Way <cykerway@gmail.com>
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.10 only security fixes 3.11 only security fixes 3.12 bugs and security fixes topic-asyncio
    Projects
    Status: Done
    Development

    No branches or pull requests

    8 participants