This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: test_asyncio: test_async_gen_asyncio_gc_aclose_09() race condition
Type: Stage: resolved
Components: Library (Lib), Tests Versions: Python 3.8, Python 3.7
process
Status: closed Resolution: out of date
Dependencies: Superseder:
Assigned To: Nosy List: asvetlov, vstinner, xtreak, yselivanov
Priority: normal Keywords:

Created on 2019-01-17 14:04 by vstinner, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Messages (9)
msg333868 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-01-17 14:04
The test fails once on AMD64 Windows8.1 Non-Debug 3.x when the Python test suite is run in parallel, but pass if the test is run alone (when the system is more "idle").

https://buildbot.python.org/all/#/builders/12/builds/1898

======================================================================
FAIL: test_async_gen_asyncio_gc_aclose_09 (test.test_asyncgen.AsyncGenAsyncioTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\buildarea\3.x.ware-win81-release\build\lib\test\test_asyncgen.py", line 684, in test_async_gen_asyncio_gc_aclose_09
    self.assertEqual(DONE, 1)
AssertionError: 0 != 1


It can reproduce the failure on a very busy Windows using 2 terminals:

* python -m test -F -W -j4 test_asyncgen  test_asyncgen test_asyncgen test_asyncgen
* python -m test -j0 -r -u all 

The first command runs the test 4 times in parallel in a loop until if fails, the second command is just one way to stress the system.


The test is based on time and so has a race condition depending on the exact timing:

    def test_async_gen_asyncio_gc_aclose_09(self):
        DONE = 0

        async def gen():
            nonlocal DONE
            try:
                while True:
                    yield 1
            finally:
                await asyncio.sleep(0.01)
                await asyncio.sleep(0.01)
                DONE = 1

        async def run():
            g = gen()
            await g.__anext__()
            await g.__anext__()
            del g

            await asyncio.sleep(0.1)

        self.loop.run_until_complete(run())
        self.assertEqual(DONE, 1)
msg334575 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-01-30 17:23
The bug can be triggered on Linux using this patch:

diff --git a/Lib/test/test_asyncgen.py b/Lib/test/test_asyncgen.py
index 71b0968c79..5e0084dc32 100644
--- a/Lib/test/test_asyncgen.py
+++ b/Lib/test/test_asyncgen.py
@@ -668,8 +668,8 @@ class AsyncGenAsyncioTest(unittest.TestCase):
                 while True:
                     yield 1
             finally:
-                await asyncio.sleep(0.01)
-                await asyncio.sleep(0.01)
+                await asyncio.sleep(0.001)
+                await asyncio.sleep(0.001)
                 DONE = 1
 
         async def run():
@@ -678,7 +678,7 @@ class AsyncGenAsyncioTest(unittest.TestCase):
             await g.__anext__()
             del g
 
-            await asyncio.sleep(0.1)
+            await asyncio.sleep(0.005)
 
         self.loop.run_until_complete(run())
         self.assertEqual(DONE, 1)


And command:

$ ./python -m test -j8 -F test_asyncgen -m test_async_gen_asyncio_gc_aclose_09
0:00:00 load avg: 7.30 [  1] test_asyncgen passed
0:00:00 load avg: 7.30 [  2] test_asyncgen passed
0:00:00 load avg: 7.30 [  3] test_asyncgen passed
(...)
0:00:02 load avg: 7.30 [ 38] test_asyncgen passed
0:00:02 load avg: 7.30 [ 39/1] test_asyncgen failed
Task was destroyed but it is pending!
task: <Task pending name='Task-2' coro=<<async_generator_athrow without __name__>()> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7fb9ad249aa0>()]>>
test test_asyncgen failed -- Traceback (most recent call last):
  File "/home/vstinner/prog/python/master/Lib/test/test_asyncgen.py", line 684, in test_async_gen_asyncio_gc_aclose_09
    self.assertEqual(DONE, 1)
AssertionError: 0 != 1
0:00:02 load avg: 7.30 [ 40/1] test_asyncgen passed
0:00:03 load avg: 7.30 [ 41/1] test_asyncgen passed
0:00:03 load avg: 7.30 [ 42/1] test_asyncgen passed
0:00:03 load avg: 7.30 [ 43/1] test_asyncgen passed
0:00:03 load avg: 7.30 [ 44/1] test_asyncgen passed
0:00:03 load avg: 7.30 [ 45/1] test_asyncgen passed
0:00:03 load avg: 7.30 [ 46/1] test_asyncgen passed
0:00:03 load avg: 7.30 [ 47/1] test_asyncgen passed

== Tests result: FAILURE ==

46 tests OK.

1 test failed:
    test_asyncgen

Total duration: 3 sec 323 ms
Tests result: FAILURE
msg334584 - (view) Author: Karthikeyan Singaravelan (xtreak) * (Python committer) Date: 2019-01-30 17:42
Seems this used to happen rarely on Appveyor builds too in the past randomly. Past report of the same issue issue32646
msg337783 - (view) Author: Karthikeyan Singaravelan (xtreak) * (Python committer) Date: 2019-03-12 18:33
I have modified the test with help of Victor to have lower sleep delay to reproduce the bug. Since sleep calls call_later with the delay where the Timer object is created time.monotonic() + delay. I have added a print statement to see when the sleep is scheduled.

Modified test with 0.01 for run and 0.001 for finally : 

    def test_async_gen_asyncio_gc_aclose_09(self):
        DONE = 0
        import time

        async def gen():
            nonlocal DONE
            try:
                while True:
                    yield 1
            finally:
                print(time.monotonic(), "finally sleeps 1 ms")
                await asyncio.sleep(0.001)
                print(time.monotonic(), "finally sleeps 1 more ms")
                await asyncio.sleep(0.001)
                DONE = 1
                print(time.monotonic(), "finally: DONE = 1--")

        async def run():
            g = gen()
            await g.__anext__()
            await g.__anext__()
            print(time.monotonic(), "del g")
            del g

            print(time.monotonic(), "run sleeps 10 ms...")
            await asyncio.sleep(0.01)
            print(time.monotonic(), "run exit")

        self.loop.run_until_complete(run())
        self.assertEqual(DONE, 1)


./python.exe -X dev -m test -j4 -F test_asyncgen -m test_async_gen_asyncio_gc_aclose_09

On running the test I can see that sometimes sometimes DONE is set before run exit.

0:00:00 load avg: 1.87 [  1] test_asyncgen passed
0.738210762 del g
0.739222722 run sleeps 10 ms...
0.7625599120000001 0.01
0.757210842 finally sleeps 1 ms
0.76050695 0.001
0.762445572 finally sleeps 1 more ms
0.765119049 0.001
0.765694536 run exit
0.76704106 finally: DONE = 1--

sometimes DONE is set after run exit.

0:00:00 load avg: 1.87 [  2] test_asyncgen passed
0.739718657 del g
0.740647619 run sleeps 10 ms...
0.764014802 0.01
0.759427252 finally sleeps 1 ms
0.762613547 0.001
0.764573796 finally sleeps 1 more ms
0.767242878 0.001
0.767722044 run exit
0.769133638 finally: DONE = 1--


There are cases where run exit is called before second sleep is even called causing failure.

0:00:03 load avg: 1.96 [ 16/1] test_asyncgen failed
0.685123806 del g
0.685635626 run sleeps 10 ms...
0.698605025 0.01
0.69112447 finally sleeps 1 ms
0.700476131 0.001
0.701259052 run exit
0.702733814 finally sleeps 1 more ms
0.70500493 0.001
Task was destroyed but it is pending!
source_traceback: Object created at (most recent call last):

Sometimes the second sleep scheduled time and run exit time are almost the same by 1 ms.

0:00:04 load avg: 1.96 [ 18/2] test_asyncgen failed
0.701280462 del g
0.701810041 run sleeps 10 ms...
0.714906478 0.01
0.707617751 finally sleeps 1 ms
0.713339727 0.001
0.714532967 finally sleeps 1 more ms
0.7167121870000001 0.001
0.716623958 run exit
Task was destroyed but it is pending!


I am not sure at this point if I should further debug this at Lib/asyncio/ or at _asynciomodule.c regarding the control flow change related code when an object is awaited. I suspect there is a race condition here since the test seems to set DONE=1 with different control flows and also where DONE=1 is not even called. This is more easily reproducible at lower resolution of time like using 0.01 and 0.001 for this test so this feels more like the race condition is exposed at lower time levels and thus increasing the time difference between the test doesn't seem to be a good solution. 

The above test also fails in 3.6 but takes a long time before below trace occurs : 

0:00:42 load avg: 4.26 [159] test_asyncgen passed
16775.541152592 del g
16775.541369282 run sleeps 10 ms...
16775.542167311 finally sleeps 1 ms
16775.544090629 finally sleeps 1 more ms
16775.546002851 finally: DONE = 1--
16775.552762612 run exit
0:00:43 load avg: 4.26 [160/1] test_asyncgen failed
16775.529271686 del g
16775.529405 run sleeps 10 ms...
16775.529833951 finally sleeps 1 ms
16775.539453811 finally sleeps 1 more ms
16775.539972874 run exit
Task was destroyed but it is pending!
task: <Task pending coro=<async_generator_athrow()> wait_for=<Future finished result=None>>
test test_asyncgen failed -- Traceback (most recent call last):
  File "/Users/karthikeyansingaravelan/stuff/python/cpython/Lib/test/test_asyncgen.py", line 693, in test_async_gen_asyncio_gc_aclose_09
    self.assertEqual(DONE, 1)
AssertionError: 0 != 1


Yury and Andrew, your thoughts would be helpful in debugging this issue and also to check the actual behavior to be tested in this case.
msg337785 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2019-03-12 18:40
Can you change "await asyncio.sleep(0.01)" to "await asyncio.sleep(1)" and check if the test still fails?  (IOW see if there's a bug in Python's async gen implementation or this is simply caused by slow CI bots).
msg337789 - (view) Author: Karthikeyan Singaravelan (xtreak) * (Python committer) Date: 2019-03-12 18:52
I changed sleep(0.01) to sleep(1) and even after 350 runs I couldn't see any failure in test. I guess it fails only for less difference in time like 0.01 and 0.001.

$ ./python.exe -X dev -m test -j4 -F test_asyncgen -m test_async_gen_asyncio_gc_aclose_09
[snip]
0:02:46 load avg: 3.71 [356] test_asyncgen passed
0.660945396 del g
0.661572575 run sleeps 1000 ms...
1.664627785 1
0.666234736 finally sleeps 1 ms
0.667970371 0.001
0.669262596 finally sleeps 1 more ms
0.671174578 0.001
0.671972246 finally: DONE = 1--
1.668627272 run exit
msg337792 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2019-03-12 19:34
> I guess it fails only for less difference in time like 0.01 and 0.001.

Yeah, that's what I was thinking.  0.1 seconds is still too tight, a simple GC pause can skew the timers and break the test logic.

Maybe we should just increase the sleep to 0.5 or 1 second and hope that it will eliminate the race.
msg343217 - (view) Author: Karthikeyan Singaravelan (xtreak) * (Python committer) Date: 2019-05-22 16:42
This seems to occur rather commonly now such that it fails and then passes on verbose run. Example : https://dev.azure.com/Python/cpython/_build/results?buildId=43325&view=logs&j=c83831cd-3752-5cc7-2f01-8276919eb334&t=5a421c4a-0933-53d5-26b9-04b36ad165eb&l=536

Victor, do you agree on increasing the timeout for sleep to 1 for this test?
msg361569 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-02-07 10:50
I didn't see the failure recently, I close it.
History
Date User Action Args
2022-04-11 14:59:10adminsetgithub: 79941
2020-02-07 10:50:45vstinnersetstatus: open -> closed
resolution: out of date
messages: + msg361569

stage: resolved
2019-05-22 16:42:57xtreaksetmessages: + msg343217
2019-03-12 19:34:07yselivanovsetmessages: + msg337792
2019-03-12 18:52:32xtreaksetmessages: + msg337789
2019-03-12 18:40:06yselivanovsetmessages: + msg337785
2019-03-12 18:33:16xtreaksetnosy: + asvetlov, yselivanov
messages: + msg337783
2019-01-30 17:42:52xtreaksetnosy: + xtreak
messages: + msg334584
2019-01-30 17:23:17vstinnersetmessages: + msg334575
2019-01-17 14:04:24vstinnercreate