Issue 37224: [subinterpreters] test__xxsubinterpreters fails randomly

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/81405

classification

Title:	[subinterpreters] test__xxsubinterpreters fails randomly
Type:		Stage:	resolved
Components:	Subinterpreters, Tests	Versions:	Python 3.11, Python 3.10, Python 3.9

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	aeros, eric.snow, koobs, miss-islington, python-dev, shihai1991, vstinner
Priority:	normal	Keywords:	patch

Created on 2019-06-11 00:58 by vstinner, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
3.x.koobs-freebsd-564d-build-973-test-stdio.txt	koobs, 2020-06-10 05:13
add_printf_to_get_details_from_race_condition.patch	shihai1991, 2021-07-19 15:45

Pull Requests
URL	Status	Linked	Edit
PR 16293	closed	eric.snow, 2019-11-22 23:56
PR 18058	merged	aeros, 2020-01-19 04:43
PR 18318	merged	miss-islington, 2020-02-03 06:00
PR 26598	closed	shihai1991, 2021-06-08 11:12
PR 26717	closed	python-dev, 2021-06-14 10:48
PR 26733	closed	shihai1991, 2021-06-15 11:30
PR 27240	merged	shihai1991, 2021-07-19 16:01
PR 27452	merged	miss-islington, 2021-07-29 17:08
PR 27453	merged	miss-islington, 2021-07-29 17:28

Messages (51)
msg345162 - (view)	Author: STINNER Victor (vstinner) *	Date: 2019-06-11 00:58
https://buildbot.python.org/all/#/builders/224/builds/5 0:46:02 load avg: 1.53 [259/423/1] test__xxsubinterpreters failed -- running: test_httplib (1 min 6 sec), test_mmap (16 min 42 sec) beginning 6 repetitions 123456 ..Exception in thread Thread-28: Traceback (most recent call last): File "D:\buildarea\3.8.ware-win81-release.refleak\build\lib\threading.py", line 923, in _bootstrap_inner self.run() File "D:\buildarea\3.8.ware-win81-release.refleak\build\lib\threading.py", line 865, in run self._target(self._args, *self._kwargs) File "D:\buildarea\3.8.ware-win81-release.refleak\build\lib\test\test__xxsubinterpreters.py", line 51, in run interpreters.run_string(interp, dedent(f""" RuntimeError: unrecognized interpreter ID 182 test test__xxsubinterpreters failed -- Traceback (most recent call last): File "D:\buildarea\3.8.ware-win81-release.refleak\build\lib\test\test__xxsubinterpreters.py", line 492, in test_subinterpreter self.assertTrue(interpreters.is_running(interp)) AssertionError: False is not true (...) FAIL: test_still_running (test.test__xxsubinterpreters.DestroyTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "D:\buildarea\3.8.ware-win81-release.refleak\build\lib\test\test__xxsubinterpreters.py", line 765, in test_still_running interpreters.destroy(interp) AssertionError: RuntimeError not raised
msg345163 - (view)	Author: STINNER Victor (vstinner) *	Date: 2019-06-11 00:58
See bpo-33356.
msg345219 - (view)	Author: STINNER Victor (vstinner) *	Date: 2019-06-11 10:55
Fail on x86 Gentoo Refleaks 3.x: https://buildbot.python.org/all/#/builders/1/builds/618 3:15:59 load avg: 6.93 [339/423/1] test__xxsubinterpreters failed -- running: test_multiprocessing_spawn (19 min 36 sec) beginning 6 repetitions 123456 .Exception in thread Thread-17: Traceback (most recent call last): File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/threading.py", line 923, in _bootstrap_inner self.run() File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/threading.py", line 865, in run self._target(self._args, self._kwargs) File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/test/test__xxsubinterpreters.py", line 51, in run interpreters.run_string(interp, dedent(f""" RuntimeError: unrecognized interpreter ID 108 test test__xxsubinterpreters failed -- Traceback (most recent call last): File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/test/test__xxsubinterpreters.py", line 765, in test_still_running interpreters.destroy(interp) AssertionError: RuntimeError not raised (...) test_after_destroying (test.test__xxsubinterpreters.ListAllTests) ... ok test_initial (test.test__xxsubinterpreters.ListAllTests) ... ok test_SystemExit (test.test__xxsubinterpreters.RunStringTests) ... ok test_already_running (test.test__xxsubinterpreters.RunStringTests) ... spam Exception in thread Thread-31: Traceback (most recent call last): File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/threading.py", line 923, in _bootstrap_inner self.run() File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/threading.py", line 865, in run self._target(self._args, **self._kwargs) File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/test/test__xxsubinterpreters.py", line 51, in run interpreters.run_string(interp, dedent(f""" RuntimeError: unrecognized interpreter ID 190 FAIL test_bad_id (test.test__xxsubinterpreters.RunStringTests) ... ok test_bad_script (test.test__xxsubinterpreters.RunStringTests) ... ok test_bytes_for_script (test.test__xxsubinterpreters.RunStringTests) ... ok test_create_thread (test.test__xxsubinterpreters.RunStringTests) ... ok test_does_not_exist (test.test__xxsubinterpreters.RunStringTests) ... ok test_error_id (test.test__xxsubinterpreters.RunStringTests) ... ok test_execution_namespace_is_main (test.test__xxsubinterpreters.RunStringTests) ... ok test_failure (test.test__xxsubinterpreters.RunStringTests) ... ok test_fork (test.test__xxsubinterpreters.RunStringTests) ... ok test_in_thread (test.test__xxsubinterpreters.RunStringTests) ... ok test_invalid_syntax (test.test__xxsubinterpreters.RunStringTests) ... ok test_main_reused (test.test__xxsubinterpreters.RunStringTests) ... ok test_shared_overwrites (test.test__xxsubinterpreters.RunStringTests) ... ok test_shared_overwrites_default_vars (test.test__xxsubinterpreters.RunStringTests) ... ok test_still_running_at_exit (test.test__xxsubinterpreters.RunStringTests) ... skipped 'blocking forever' test_success (test.test__xxsubinterpreters.RunStringTests) ... ok test_sys_exit (test.test__xxsubinterpreters.RunStringTests) ... ok test_with_shared (test.test__xxsubinterpreters.RunStringTests) ... ok test_bytes (test.test__xxsubinterpreters.ShareableTypeTests) ... ok test_int (test.test__xxsubinterpreters.ShareableTypeTests) ... ok test_non_shareable_int (test.test__xxsubinterpreters.ShareableTypeTests) ... ok test_singletons (test.test__xxsubinterpreters.ShareableTypeTests) ... ok test_types (test.test__xxsubinterpreters.ShareableTypeTests) ... ok ====================================================================== FAIL: test_already_running (test.test__xxsubinterpreters.RunStringTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/test/test__xxsubinterpreters.py", line 855, in test_already_running interpreters.run_string(self.id, 'print("spam")') AssertionError: RuntimeError not raised
msg345415 - (view)	Author: STINNER Victor (vstinner) *	Date: 2019-06-12 21:52
Failure on AppVeyor on my PR 14029 (which is unrelated): test_subinterpreter (test.test__xxsubinterpreters.IsRunningTests) ... Exception in thread Thread-8: Traceback (most recent call last): File "C:\projects\cpython\lib\threading.py", line 923, in _bootstrap_inner self.run() File "C:\projects\cpython\lib\threading.py", line 865, in run self._target(self._args, *self._kwargs) File "C:\projects\cpython\lib\test\test__xxsubinterpreters.py", line 51, in run interpreters.run_string(interp, dedent(f""" RuntimeError: unrecognized interpreter ID 46 FAIL FAIL: test_subinterpreter (test.test__xxsubinterpreters.IsRunningTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "C:\projects\cpython\lib\test\test__xxsubinterpreters.py", line 492, in test_subinterpreter self.assertTrue(interpreters.is_running(interp)) AssertionError: False is not true test__xxsubinterpreters passed when run again.
msg347486 - (view)	Author: STINNER Victor (vstinner) *	Date: 2019-07-08 07:46
The test still fails times to times: https://buildbot.python.org/all/#/builders/80/builds/644 FAIL: test_subinterpreter (test.test__xxsubinterpreters.IsRunningTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "D:\buildarea\3.x.ware-win81-release.refleak\build\lib\test\test__xxsubinterpreters.py", line 492, in test_subinterpreter self.assertTrue(interpreters.is_running(interp)) AssertionError: False is not true
msg352508 - (view)	Author: Kyle Stanley (aeros) *	Date: 2019-09-16 05:15
Upon digging through Modules/_xxsubinterpretersmodule.c, I noticed that on line 2059, `PyInterpreterState_Delete(interp);` is commented out (https://github.com/python/cpython/blob/bf169915ecdd42329726104278eb723a7dda2736/Modules/_xxsubinterpretersmodule.c#L2059). This was done when _xxsubinterpretersmodule.c was first added by Eric Snow (https://github.com/python/cpython/blob/bf169915ecdd42329726104278eb723a7dda2736/Modules/_xxsubinterpretersmodule.c#L2059), so it seems to have been done intentionally but I don't understand why. Is this because `Py_EndInterpreter()` is supposed to shutdown the interpreter, so `PyInterpreterState_Delete()` isn't needed? If so, that still doesn't particularly explain why it was commented out. Perhaps Eric can elaborate.
msg352509 - (view)	Author: Kyle Stanley (aeros) *	Date: 2019-09-16 05:17
Note that I'm not particularly experienced with the c-api, I'm just trying to take a stab at understanding why the buildbot failure is occuring.
msg352510 - (view)	Author: Kyle Stanley (aeros) *	Date: 2019-09-16 05:20
Clarification: In the above comment when I was referring to the commit made by Eric Snow, I meant to link to https://github.com/python/cpython/commit/7f8bfc9b9a8381ddb768421b5dd5cbd970266190.
msg352514 - (view)	Author: Kyle Stanley (aeros) *	Date: 2019-09-16 06:48
Upon further reading of the documentation and some experimentation, it would definitely not make sense to call `PyInterpreterState_Delete` here (since we're only closing the sub-interpreter in the current thread), so that's not the issue. I still have no idea as to why it's commented out there, but that's besides the point. I think I may have found the potential cause of the test failure though. Looking into `test_still_running()`, it's clear that the intention is for a RuntimeError to be raised if `interpreters.destroy()` is called on a currently running interpreter: ``` def test_still_running(self): main, = interpreters.list_all() interp = interpreters.create() with _running(interp): with self.assertRaises(RuntimeError): interpreters.destroy(interp) self.assertTrue(interpreters.is_running(interp)) ``` However, within interp_destroy (https://github.com/python/cpython/blob/56a45142e70a1ccf3233d43cb60c47255252e89a/Modules/_xxsubinterpretersmodule.c#L2024), it is apparent that the RuntimeError is ONLY raised when attempting to destroy the current interpreter: ``` if (interp == current) { PyErr_SetString(PyExc_RuntimeError, "cannot destroy the current interpreter"); return NULL; } ``` When attempting to destroy a running interpreter, NULL is returned without raising the RuntimeError: ``` if (_ensure_not_running(interp) < 0) { return NULL; } ``` This was my first guess at a solution: ``` if (_ensure_not_running(interp) < 0) { PyErr_SetString(PyExc_RuntimeError, "cannot destroy a running interpreter") return NULL; } ``` But, within `_ensure_not_running()` (https://github.com/python/cpython/blob/56a45142e70a1ccf3233d43cb60c47255252e89a/Modules/_xxsubinterpretersmodule.c#L1842), a RuntimeError is supposed to be raised from: ``` if (is_running) { PyErr_Format(PyExc_RuntimeError, "interpreter already running"); return -1; } ``` Initially, I was unsure of the difference between `PyErr_SetString()` and `PyErr_Format()`, so I referred to the documentation. `PyErr_Format()` is similar, but it also returns NULL. If I'm not mistaken doesn't mean that the `return -1` within the `if (is_running)` bock is effectively ignored? If so, this would potentially explain the problem... I think the appropriate solution would be to replace `PyErr_Format()` with `PyErr_SetString()` within `_ensure_not_running()`. Also, I think it would be more useful to additionally raise the RuntimeError within `interp_destroy()` if the interpreter is running, to provide a more useful and specific error message. "cannot destroy a running interpreter" is far more useful for debugging purposes than a more generic "interpreter already running". I plan on opening a PR to make these changes in the next few days. Let me know what you think Victor.
msg352516 - (view)	Author: Kyle Stanley (aeros) *	Date: 2019-09-16 06:52
Clarification: "If I'm not mistaken doesn't mean that the `return -1` within ..." Should instead be: "If I'm not mistaken doesn't this mean that the `return -1` within ..." (I am really looking forward to moving issue over to GitHub at some point. It's nice to be able to edit comments and properly format code blocks...)
msg352592 - (view)	Author: Kyle Stanley (aeros) *	Date: 2019-09-17 00:11
Upon further consideration, I don't think this will address the issue. If the RuntimeError was not being raised, this failure would be consistent rather than intermittent. I think I have have gotten a bit mixed up, even if the return value of PyErr_Format is NULL, it would not implicitly return NULL upon being called and exit the function early (like a macro could) . I'm not experienced with programming in C, I only started learning it more recently (Python, Java, and C# have been my primary languages) when I started contributing to CPython in June. Instead, I suspect this is likely a concurrency issue, where multiple threads are trying to access the same sub-interpreter during ``interp_destroy()``. The solution will likely involve finding the correct place for a lock. I'll continue to work on this and see if I can find a solution.
msg352835 - (view)	Author: Kyle Stanley (aeros) *	Date: 2019-09-20 05:03
Is there a currently reliable way of accessing the GIL functions within the sub-interpreters, without causing deadlock issues? I was trying to follow the advice in the documentation (https://docs.python.org/3/c-api/init.html?highlight=global%20interpreter%20lock#bugs-and-caveats). "It is highly recommended that you don’t switch sub-interpreters between a pair of matching PyGILState_Ensure() and PyGILState_Release() calls." But it seemed that any attempt to use any of the PyGIL* calls within ``interp_destroy()`` in a meaningful way resulted in a deadlock, even if it was done away from the sub-interpreter switching. My next idea would be to add a conditional check to see if the current thread has ownership of the GIL, and using ``PyEval_RestoreThread()`` to acquire it if it doesn't. This would be followed by releasing the GIL with ``PyThreadState_Get()`` at the end of the function. I'll try experimenting with that idea next.
msg354658 - (view)	Author: STINNER Victor (vstinner) *	Date: 2019-10-14 21:24
I marked bpo-33356 as a duplicate of this issue: * Windows10 3.x: test_already_running() failed once, but passed when run again: ... * AMD64 FreeBSD 10-STABLE Non-Debug 3.x: ...
msg354659 - (view)	Author: STINNER Victor (vstinner) *	Date: 2019-10-14 21:26
I closed bpo-38154 as a duplicate of this issue: * AMD64 FreeBSD CURRENT Shared 3.x: ... Kyle Stanley proposed a fix: PR 16293.
msg354661 - (view)	Author: STINNER Victor (vstinner) *	Date: 2019-10-14 21:27
AMD64 FreeBSD CURRENT Shared 3.x: https://buildbot.python.org/all/#/builders/168/builds/1630 FAIL: test_still_running (test.test__xxsubinterpreters.DestroyTests) 1:07:09 load avg: 8.18 Re-running failed tests in verbose mode 1:07:09 load avg: 8.18 Re-running test__xxsubinterpreters in verbose mode FAIL: test_still_running (test.test__xxsubinterpreters.DestroyTests) FAIL: test_subinterpreter (test.test__xxsubinterpreters.IsRunningTests) FAIL: test_already_running (test.test__xxsubinterpreters.RunStringTests)
msg354662 - (view)	Author: STINNER Victor (vstinner) *	Date: 2019-10-14 21:28
AMD64 FreeBSD CURRENT Shared 3.8: https://buildbot.python.org/all/#/builders/212/builds/310 FAIL: test_still_running (test.test__xxsubinterpreters.DestroyTests) FAIL: test_run_coroutine_threadsafe_with_timeout (test.test_asyncio.test_tasks.RunCoroutineThreadsafeTests) 1:01:44 load avg: 7.72 Re-running failed tests in verbose mode 1:01:44 load avg: 7.72 Re-running test__xxsubinterpreters in verbose mode FAIL: test_still_running (test.test__xxsubinterpreters.DestroyTests) 1:02:01 load avg: 7.58 Re-running test_asyncio in verbose mode
msg354677 - (view)	Author: Kyle Stanley (aeros) *	Date: 2019-10-14 23:00
> Kyle Stanley proposed a fix: PR 16293. Note that I'm not confident about the fix I proposed in GH-16293, it was more of an idea to fix the intermittent failures more than anything. It definitely needs review from someone knowledgeable about sub-interpreters and the PyInterpreterState API. This isn't an area that I'm experienced in, but I am interested in it.
msg354876 - (view)	Author: STINNER Victor (vstinner) *	Date: 2019-10-18 11:06
The race condition can be reproduced on Linux using a lot of parallel test worker processes. For example, I reproduce it on my laptop (8 CPUs) using: $ ./python -m test test__xxsubinterpreters -F -j30 ... 0:00:26 load avg: 17.86 [ 24] test__xxsubinterpreters passed 0:00:26 load avg: 17.86 [ 25] test__xxsubinterpreters passed 0:00:26 load avg: 17.86 [ 26] test__xxsubinterpreters passed 0:00:27 load avg: 17.86 [ 27/1] test__xxsubinterpreters failed Exception in thread Thread-8: Traceback (most recent call last): File "/home/vstinner/python/master/Lib/threading.py", line 944, in _bootstrap_inner self.run() File "/home/vstinner/python/master/Lib/threading.py", line 882, in run self._target(self._args, *self._kwargs) File "/home/vstinner/python/master/Lib/test/test__xxsubinterpreters.py", line 51, in run interpreters.run_string(interp, dedent(f""" RuntimeError: unrecognized interpreter ID 46 test test__xxsubinterpreters failed -- Traceback (most recent call last): File "/home/vstinner/python/master/Lib/test/test__xxsubinterpreters.py", line 492, in test_subinterpreter self.assertTrue(interpreters.is_running(interp)) AssertionError: False is not true
msg355221 - (view)	Author: STINNER Victor (vstinner) *	Date: 2019-10-23 14:42
AMD64 FreeBSD Shared 3.x: https://buildbot.python.org/all/#/builders/371/builds/7 test_subinterpreter (test.test__xxsubinterpreters.IsRunningTests) ... Exception in thread Thread-8: Traceback (most recent call last): File "/usr/home/buildbot/python/3.x.koobs-freebsd-564d/build/Lib/threading.py", line 944, in _bootstrap_inner self.run() File "/usr/home/buildbot/python/3.x.koobs-freebsd-564d/build/Lib/threading.py", line 882, in run self._target(self._args, self._kwargs) File "/usr/home/buildbot/python/3.x.koobs-freebsd-564d/build/Lib/test/test__xxsubinterpreters.py", line 51, in run interpreters.run_string(interp, dedent(f""" RuntimeError: unrecognized interpreter ID 46 FAIL test_already_running (test.test__xxsubinterpreters.RunStringTests) ... Exception in thread Thread-9: Traceback (most recent call last): File "/usr/home/buildbot/python/3.x.koobs-freebsd-564d/build/Lib/threading.py", line 944, in _bootstrap_inner self.run() File "/usr/home/buildbot/python/3.x.koobs-freebsd-564d/build/Lib/threading.py", line 882, in run self._target(self._args, **self._kwargs) File "/usr/home/buildbot/python/3.x.koobs-freebsd-564d/build/Lib/test/test__xxsubinterpreters.py", line 51, in run interpreters.run_string(interp, dedent(f""" RuntimeError: interpreter already running FAIL ====================================================================== FAIL: test_subinterpreter (test.test__xxsubinterpreters.IsRunningTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "/usr/home/buildbot/python/3.x.koobs-freebsd-564d/build/Lib/test/test__xxsubinterpreters.py", line 492, in test_subinterpreter self.assertTrue(interpreters.is_running(interp)) AssertionError: False is not true ====================================================================== FAIL: test_already_running (test.test__xxsubinterpreters.RunStringTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "/usr/home/buildbot/python/3.x.koobs-freebsd-564d/build/Lib/test/test__xxsubinterpreters.py", line 853, in test_already_running interpreters.run_string(self.id, 'print("spam")') AssertionError: RuntimeError not raised
msg355520 - (view)	Author: STINNER Victor (vstinner) *	Date: 2019-10-28 10:52
See also bpo-33868: test__xxsubinterpreters fails randomly since at least 2018-06-15.
msg356631 - (view)	Author: STINNER Victor (vstinner) *	Date: 2019-11-14 22:27
AMD64 Windows8.1 Refleaks 3.8: https://buildbot.python.org/all/#/builders/224/builds/151 0:54:23 load avg: 5.62 [306/423/3] test__xxsubinterpreters failed -- running: test_asyncio (4 min 26 sec), test_zipfile (6 min 6 sec), test_compileall (6 min 21 sec) beginning 6 repetitions 123456 ..Exception in thread Thread-27: Traceback (most recent call last): File "D:\buildarea\3.8.ware-win81-release.refleak\build\lib\threading.py", line 932, in _bootstrap_inner self.run() File "D:\buildarea\3.8.ware-win81-release.refleak\build\lib\threading.py", line 870, in run self._target(self._args, *self._kwargs) File "D:\buildarea\3.8.ware-win81-release.refleak\build\lib\test\test__xxsubinterpreters.py", line 51, in run interpreters.run_string(interp, dedent(f""" RuntimeError: unrecognized interpreter ID 175 test test__xxsubinterpreters failed -- Traceback (most recent call last): File "D:\buildarea\3.8.ware-win81-release.refleak\build\lib\test\test__xxsubinterpreters.py", line 763, in test_still_running interpreters.destroy(interp) AssertionError: RuntimeError not raised FAIL: test_subinterpreter (test.test__xxsubinterpreters.IsRunningTests)
msg357347 - (view)	Author: Eric Snow (eric.snow) *	Date: 2019-11-22 23:57
Sorry I haven't gotten back to you sooner, Kyle. Thanks for working on this. I'm looking at your PR right now.
msg357349 - (view)	Author: Eric Snow (eric.snow) *	Date: 2019-11-23 00:17
BTW, Kyle, your problem-solving approach on this is on-track. Don't get discouraged. This stuff is tricky. :)
msg357350 - (view)	Author: Eric Snow (eric.snow) *	Date: 2019-11-23 00:30
Thus far these are the failures we've seen: * not running when we expect it to be running: * interpreters.is_running(interp) * interpreters.run_string(interp, ...) * interpreters.destroy(interp) * can't find the interpreter even though we expect it to exist * interpreters.run_string(interp, ...) * finds it running when we expect it to not be running * interpreters.run_string(interp, ...) Except for the last one (which might be a separate issue), they all look like they could be explained by the same thing: the subinterpreter stopped (or went away) prematurely. That could be related to the code in _xxsubinterpretersmodule.c or it could be the cleanup code that makes sure interpreters get cleaned up at the end of tests (e.g. running too soon). Either way I expect the fix will be in the module code and not the tests. Regarding "is_running()", notice that it relies almost entirely on "frame->f_executing". That might not be enough (or maybe the behavior there changed). That would be worth checking out. @aeros, feel free too keep investigating. I'd be glad to help you out. Otherwise I'll dive into this probably next week.
msg357361 - (view)	Author: Kyle Stanley (aeros) *	Date: 2019-11-23 06:46
> Sorry I haven't gotten back to you sooner, Kyle. Thanks for working on this. I'm looking at your PR right now. > BTW, Kyle, your problem-solving approach on this is on-track. Don't get discouraged. This stuff is tricky. :) No problem at all, and thank you for the words of encouragement. (: > Regarding "is_running()", notice that it relies almost entirely on "frame->f_executing". That might not be enough (or maybe the behavior there changed). That would be worth checking out. Hmm, that's an interesting point that I hadn't considered. > @aeros, feel free too keep investigating. I'd be glad to help you out. Otherwise I'll dive into this probably next week. Sounds good, I'll do some further digging around, particularly anywhere that interacts with PyFrameObject's `f_executing` field. I think it's possible that there's a non-obvious issue with `_is_running()`, where it works correctly 99% of the time. That seems to be a significant commonality between the different areas where the intermittent failures are occurring, they all directly or indirectly call `_is_running()`. Also, thanks again Eric for the PR review. Looking back it after Eric's analysis and having a couple of months to think it over, I don't think that GH-16293 is the correct solution. It seems unlikely that this is being caused by a lack of proper GIL acquisition, as that would likely be causing far more consistent build failures. I'm thinking that it's more likely to be an issue with either: 1) Subinterpreters occasionally disappearing due to premature cleanup (as Eric suggested) 2) _is_running() being incorrect a small percentage of the time
msg357364 - (view)	Author: Kyle Stanley (aeros) *	Date: 2019-11-23 07:58
So, I was finally able to replicate a failure in test_still_running locally, it required using a rather ridiculous number of parallel workers: $ ./python -m test test__xxsubinterpreters -j200 -F ... Exception in thread Thread-7: Traceback (most recent call last): File "/home/aeros/repos/aeros-cpython/Lib/threading.py", line 944, in _bootstrap_inner self.run() File "/home/aeros/repos/aeros-cpython/Lib/threading.py", line 882, in run self._target(self._args, *self._kwargs) File "/home/aeros/repos/aeros-cpython/Lib/test/test__xxsubinterpreters.py", line 51, in run interpreters.run_string(interp, dedent(f""" RuntimeError: unrecognized interpreter ID 39 test test__xxsubinterpreters failed -- Traceback (most recent call last): File "/home/aeros/repos/aeros-cpython/Lib/test/test__xxsubinterpreters.py", line 766, in test_still_running interpreters.destroy(interp) AssertionError: RuntimeError not raised ... == Tests result: FAILURE == 94 tests OK. 1 test failed: test__xxsubinterpreters Total duration: 1 min 49 sec Tests result: FAILURE OS: Arch Linux x86_64 Kernel: 5.3.11 CPU: Intel i5-4460 I was able to consistently reproduce the above failure using 200 parallel workers, even without `-f`. I have a few different theories that I'd like to test for fixing the failure, I'll report back if any of them yield positive results.
msg357365 - (view)	Author: Kyle Stanley (aeros) *	Date: 2019-11-23 08:01
> I was able to consistently reproduce the above failure using 200 parallel workers, even without `-f`. Oops, I didn't mean without passing `-F`, as this would result in only a single test being ran. I meant without letting it repeat multiple times, as in the test failure would consistently occur in the first batch of 200 parallel workers.
msg357378 - (view)	Author: Kyle Stanley (aeros) *	Date: 2019-11-23 12:58
> So, I was finally able to replicate a failure in test_still_running locally, it required using a rather ridiculous number of parallel workers I forgot to mention that I was able to replicate the above failure on the latest commit to the 3.8 branch. I was _unable_ to replicate the failure on the latest commit to master, even when scaling up the number of parallel workers (the max I tested was 300 workers and ~3k iterations of test__xxsubinterpreters).
msg357671 - (view)	Author: Kyle Stanley (aeros) *	Date: 2019-12-01 04:23
> Regarding "is_running()", notice that it relies almost entirely on "frame->f_executing". That might not be enough (or maybe the behavior there changed). That would be worth checking out. Based on the above hint, I was able to make some progress on a potential solution. Thanks Eric. Instead of only checking "frame->f_executing", I changed "_is_running()" to also check the "finalizing" field of PyInterpreterState. The "finalizing" field is set to 1 in "Py_EndInterpreter()", so this ensures that an interpreter in the process of being destroyed is considered "running", so that operations (such as running scripts, destroying the interpreter, etc) can't occur during finalization. I had to add a private function to the C-API in order to access "interp->finalizing" from Modules/_xxsubinterpretersmodule.c due to the struct for PyInterpreterState being internal only. The above fix seems to completely remove the test failure that occurs in "interpreters.destroy(interp)" in "test_already_running" after running it several times, but I'm able to consistently reproduce the following: Exception in thread Thread-8: Traceback (most recent call last): File "/home/aeros/repos/aeros-cpython/Lib/threading.py", line 932, in _bootstrap_inner self.run() File "/home/aeros/repos/aeros-cpython/Lib/threading.py", line 870, in run self._target(self._args, *self._kwargs) File "/home/aeros/repos/aeros-cpython/Lib/test/test__xxsubinterpreters.py", line 51, in run interpreters.run_string(interp, dedent(f""" RuntimeError: unrecognized interpreter ID 46 test test__xxsubinterpreters failed -- Traceback (most recent call last): File "/home/aeros/repos/aeros-cpython/Lib/test/test__xxsubinterpreters.py", line 492, in test_subinterpreter self.assertTrue(interpreters.is_running(interp)) AssertionError: False is not true I have a few ideas that I'd like to test out for fixing this failure, and if any of them produce positive results I'll report back. Since the failures are still consistently occurring, I have not yer revised GH-16293. I'll do that when/if I come up with a more thorough solution.
msg357672 - (view)	Author: Kyle Stanley (aeros) *	Date: 2019-12-01 04:26
> so that operations (such as running scripts, destroying the interpreter, etc) can't occur during finalization Clarification: by "destroying the interpreter" I am specifically referring to calling `interp_destroy()` after finalization has already started.
msg358358 - (view)	Author: Eric Snow (eric.snow) *	Date: 2019-12-13 21:39
On Sat, Nov 30, 2019 at 9:23 PM Kyle Stanley <report@bugs.python.org> wrote: > Based on the above hint, I was able to make some progress on a potential solution. Thanks Eric. That's great! > Instead of only checking "frame->f_executing", I changed "_is_running()" to also check the > "finalizing" field of PyInterpreterState. The "finalizing" field is set to 1 in "Py_EndInterpreter()", > so this ensures that an interpreter in the process of being destroyed is considered "running", > so that operations (such as running scripts, destroying the interpreter, etc) can't occur during > finalization. Ah, that makes sense. > I had to add a private function to the C-API in order to access "interp->finalizing" from > Modules/_xxsubinterpretersmodule.c due to the struct for PyInterpreterState being internal only. Yep, it has to use the public C-API just like any other module. The function has a "_Py" prefix and be defined in Include/cpython, right?
msg358359 - (view)	Author: Eric Snow (eric.snow) *	Date: 2019-12-13 21:42
On Sat, Nov 30, 2019 at 9:23 PM Kyle Stanley <report@bugs.python.org> wrote: > I have a few ideas that I'd like to test out for fixing this failure, and if any of them produce positive results I'll report back. Sounds good. > Since the failures are still consistently occurring, I have not yer revised GH-16293. I'll do that when/if I come up with a more thorough solution. We don't need everything to be fixed in a single PR. Feel free to create a PR just for the "finalizing" fix.
msg358373 - (view)	Author: Kyle Stanley (aeros) *	Date: 2019-12-14 03:08
> Yep, it has to use the public C-API just like any other module. The > function has a "_Py" prefix and be defined in Include/cpython, right? Yeah, I named it "_PyInterpreterIsFinalizing" and it's within Include/cpython. Definitely open to suggestions on the name though, it's basically just a private getter for interp->finalizing. > We don't need everything to be fixed in a single PR. Feel free to > create a PR just for the "finalizing" fix. Oh, awesome! In that case, I'll do some more rigorous testing before opening the PR then; I'd like to be 99.99% certain that it at least resolves the following failure: FAIL: test_still_running (test.test__xxsubinterpreters.DestroyTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/test__xxsubinterpreters.py", line 765, in test_still_running interpreters.destroy(interp) AssertionError: RuntimeError not raised Especially since it would be adding a new private C-API function, who's primary purpose is to address this specific failure. This might be a bit of a time consuming process, but I should have time in the next week or so to work on it.
msg358552 - (view)	Author: Eric Snow (eric.snow) *	Date: 2019-12-17 15:52
On Fri, Dec 13, 2019 at 8:08 PM Kyle Stanley <report@bugs.python.org> wrote: > Yeah, I named it "_PyInterpreterIsFinalizing" and it's within Include/cpython. Definitely open > to suggestions on the name though, it's basically just a private getter for interp->finalizing. For a struct-specific getter we usually end the prefix with an underscore: _PyInterpreter_IsFinalizing. Otherwise, that's the same name I would have used. :) > Oh, awesome! In that case, I'll do some more rigorous testing before opening the PR then; > [snip] > This might be a bit of a time consuming process, but I should have time in the next week > or so to work on it. No worries (or hurries). Just request a review from me when you're ready. Thanks again for working on this!
msg359699 - (view)	Author: Kyle Stanley (aeros) *	Date: 2020-01-10 00:52
> For a struct-specific getter we usually end the prefix with an underscore: _PyInterpreter_IsFinalizing. Otherwise, that's the same name I would have used. :) Good to know, thanks! > No worries (or hurries). Just request a review from me when you're ready. Thanks again for working on this! No problem, and thanks for your continued patience. I've been a bit additionally busy recently with the holidays and preparing for an upcoming technical interview w/ Amazon, so I didn't have time to thoroughly test my proposed fix. This would potentially be my first full-time job in the Software Engineering industry. As a result, I have a lot to prepare for. When I get the chance to open a PR after ensuring it at least 100% fixes the failure I was targeting, I'll be sure to add you as a reviewer. If any deadline comes up for this, feel free to move forward with implementing any part of my suggestions (or an entirely different fix that resolves the other failures at the same time). I should have time to work on this in the near future though. Thanks again for the guidance and patience with this! I've learned a lot about the C-API and C in general (which has been a weaker area for me) from my time spent working on the issue.
msg360024 - (view)	Author: Kyle Stanley (aeros) *	Date: 2020-01-15 05:26
Update: I have a bit of good news and not so great news. The good news is that I had some time to work on this again, specifically with isolating the failure in test__xxsubinterpreters.DestroyTests. Locally, I added a few temporary "@unittest.skip()" annotations to the other failing tests in an unmodified 3.8 until the only failure that would occur was the following: ====================================================================== FAIL: test_still_running (test.test__xxsubinterpreters.DestroyTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "/home/aeros/repos/aeros-cpython/Lib/test/test__xxsubinterpreters.py", line 764, in test_still_running interpreters.destroy(interp) AssertionError: RuntimeError not raised ---------------------------------------------------------------------- Once the failure was isolated, I applied my changes on top of that branch with the skipped tests to ensure my proposed change resolved the above failure. Unfortunately, it still occurred. It was somewhat disappointing, but I'm certainly glad that I waited to thoroughly test the proposed changes before opening a PR. I'll continue working on this and see if I can find another fix.
msg360025 - (view)	Author: Kyle Stanley (aeros) *	Date: 2020-01-15 05:49
I also just realized that I can run "test.test__xxsubinterpreters.DestroyTests" by itself with: ./python -m test test__xxsubinterpreters.DestroyTests -j200 -F -v For some reason, I hadn't thought of running that class of tests by itself to isolate the failure. Prior to this issue, I didn't have experience in debugging a group of intermittent failures occurring across different tests. For most bugs I've worked on so far, it was a single, clearly defined point of failure; or a behavioral issue that wasn't covered in the regression tests. But, that's certainly useful to know for future debugging and will help to improve my workflow for further investigating this issue. It's a lot more effective than adding a bunch of skip test annotations throughout the test file!
msg360027 - (view)	Author: Kyle Stanley (aeros) *	Date: 2020-01-15 06:11
> I also just realized that I can run "test.test__xxsubinterpreters.DestroyTests" by itself with: > ./python -m test test__xxsubinterpreters.DestroyTests -j200 -F -v Oops, the correct syntax is: ./python -m test test__xxsubinterpreters --match DestroyTests -j200 -F -v
msg360029 - (view)	Author: Kyle Stanley (aeros) *	Date: 2020-01-15 07:20
I just made a rather interesting discovery. Instead of specifically focusing my efforts on the logic with interp_destroy(), I decided to take a closer look at the failing unit test itself. The main test within DestroyTests that's failing is the following: ``` def test_still_running(self): main, = interpreters.list_all() interp = interpreters.create() with _running(interp): with self.assertRaises(RuntimeError): interpreters.destroy(interp) self.assertTrue(interpreters.is_running(interp)) ``` (Specifically, "self.assertRaises(RuntimeError): interpreters.destroy(interp)" is the main point of failure) In order to be 100% certain that it was an issue occurring from interpreters.destroy(), I decided to add in a bit of a "sanity check" to ensure the interpreter was actually running in the first place before destroying it (I also added some extra debugging info): ``` def test_still_running(self): main, = interpreters.list_all() interp = interpreters.create() with _running(interp): self.assertTrue(interpreters.is_running(interp), msg=f"Interp {interp} should be running before destruction.") with self.assertRaises(RuntimeError, msg=f"Should not be able to destroy interp {interp} while" " it's still running."): interpreters.destroy(interp) self.assertTrue(interpreters.is_running(interp)) ``` The results were very interesting... ``` OK 0:00:49 load avg: 135.49 [306/1] test__xxsubinterpreters failed test_all (test.test__xxsubinterpreters.DestroyTests) ... ok test_already_destroyed (test.test__xxsubinterpreters.DestroyTests) ... ok test_bad_id (test.test__xxsubinterpreters.DestroyTests) ... ok test_does_not_exist (test.test__xxsubinterpreters.DestroyTests) ... ok test_from_current (test.test__xxsubinterpreters.DestroyTests) ... ok test_from_other_thread (test.test__xxsubinterpreters.DestroyTests) ... ok test_from_sibling (test.test__xxsubinterpreters.DestroyTests) ... ok test_main (test.test__xxsubinterpreters.DestroyTests) ... ok test_one (test.test__xxsubinterpreters.DestroyTests) ... ok test_still_running (test.test__xxsubinterpreters.DestroyTests) ... FAIL ====================================================================== FAIL: test_still_running (test.test__xxsubinterpreters.DestroyTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "/home/aeros/repos/aeros-cpython/Lib/test/test__xxsubinterpreters.py", line 763, in test_still_running self.assertTrue(interpreters.is_running(interp), AssertionError: False is not true : Interp 12 should be running before destruction. ---------------------------------------------------------------------- ``` As can be seen from the results above, the interpreter is not even running in the first place before it's destroyed, so of course destroy() won't raise an RuntimeError. I think this proves that interpreters.destroy() is _not_ where we should be focusing our efforts (at least for now). Instead, we should first investigate why it's not even running at this point. I suspect the issue _might_ be a race condition within the "_running()" context manager that's preventing the interpreter from being ran, but I'll have to do some further investigation. I also ran this ~20 times to be 100% certain, and every single one of those times the point of failure was at the new assertion check I added before destroy(). Notably, a rather difficult and hard to explain side effect occurred from adding the new assertion. The average number of tests before failure increased by a significant amount. In the above test, it was able to pass 306 iterations before failing, and in one of my earlier tests it reached over 1000. That never happened before on the 3.8 branch, it would very consistently fail in the first set of parallel workers if not very soon after. I can say that with a degree certainty as well, since I've ran this set of tests a countless number of times while trying to debug the failure. But, I have no explanation for this. Do you have any potential ideas, Eric? Also, do you think it might be worth adding in the changes I made to DestroyTests.test_still_running above? It doesn't directly address the core failure occurring, but I think it improves the test significantly; both in functionality and debugging info. I would be glad to open a PR if you think the test changes might be useful, as well as make any needed adjustments to them.
msg360209 - (view)	Author: Eric Snow (eric.snow) *	Date: 2020-01-17 19:49
On Wed, Jan 15, 2020 at 12:20 AM Kyle Stanley <report@bugs.python.org> wrote: > As can be seen from the results above, the interpreter is not even running in the first place before > it's destroyed, so of course destroy() won't raise an RuntimeError. I think this proves that > interpreters.destroy() is _not_ where we should be focusing our efforts (at least for now). Instead, > we should first investigate why it's not even running at this point. Good catch. > I suspect the issue _might_ be a race condition within the "_running()" context manager that's > preventing the interpreter from being ran, but I'll have to do some further investigation. Sounds good. > Notably, a rather difficult and hard to explain side effect occurred from adding the new assertion. > [snip] > But, I have no explanation for this. Yeah, that sounds a bit strange. Keep in mind that there have been other changes in this part of the runtime code, so this might be related. Or I suppose it could be a side effect of calling is_running() (though that definitely should not have side effects). > do you think it might be worth adding in the changes I made to DestroyTests.test_still_running above? Yeah, it's a good sanity check on the assumptions made by the test. Please do open a PR and request a review from me.
msg361130 - (view)	Author: miss-islington (miss-islington)	Date: 2020-01-31 20:07
New changeset f03a8f8d5001963ad5b5b28dbd95497e9cc15596 by Kyle Stanley in branch 'master': bpo-37224: Improve test__xxsubinterpreters.DestroyTests (GH-18058) https://github.com/python/cpython/commit/f03a8f8d5001963ad5b5b28dbd95497e9cc15596
msg361353 - (view)	Author: miss-islington (miss-islington)	Date: 2020-02-04 15:31
New changeset 9a740b6c7e7a88185d79128b8a1993ac387d5091 by Miss Islington (bot) in branch '3.8': bpo-37224: Improve test__xxsubinterpreters.DestroyTests (GH-18058) https://github.com/python/cpython/commit/9a740b6c7e7a88185d79128b8a1993ac387d5091
msg361354 - (view)	Author: Eric Snow (eric.snow) *	Date: 2020-02-04 15:32
Thanks, Kyle. That helps at least a little. :)
msg361387 - (view)	Author: Kyle Stanley (aeros) *	Date: 2020-02-05 00:20
> Thanks, Kyle. That helps at least a little. :) No problem. (: I'll certainly spend some additional time investigating the main issue when I have the chance to, but in the meantime that test change should make it slightly easier to determine the point of failure.
msg363992 - (view)	Author: Kyle Stanley (aeros) *	Date: 2020-03-12 03:15
I have a few spare cycles to take another stab at this issue. I can say with some certainty that the failure in test__xxsubinterpreters.DestroyTests does not occur on the latest commit to master (3.9): ``` $ ./python -m test test__xxsubinterpreters --match DestroyTests -j200 -F -v OK 0:09:28 load avg: 80.87 [2188] test__xxsubinterpreters passed test_all (test.test__xxsubinterpreters.DestroyTests) ... ok test_already_destroyed (test.test__xxsubinterpreters.DestroyTests) ... ok test_bad_id (test.test__xxsubinterpreters.DestroyTests) ... ok test_does_not_exist (test.test__xxsubinterpreters.DestroyTests) ... ok test_from_current (test.test__xxsubinterpreters.DestroyTests) ... ok test_from_other_thread (test.test__xxsubinterpreters.DestroyTests) ... ok test_from_sibling (test.test__xxsubinterpreters.DestroyTests) ... ok test_main (test.test__xxsubinterpreters.DestroyTests) ... ok test_one (test.test__xxsubinterpreters.DestroyTests) ... ok test_still_running (test.test__xxsubinterpreters.DestroyTests) ... ok == Tests result: INTERRUPTED == Test suite interrupted by signal SIGINT. 2188 tests OK. ``` So at this point it seems to be a matter of looking through the diff between 3.8 vs 3.9 for any relevant code paths, and attempting to determine what change resolved the failure for 3.9. It might also be useful to determine if it occurred when the tests were first added in 3.8 or if some other commit introduced a subtle regression.
msg371160 - (view)	Author: Kubilay Kocak (koobs)	Date: 2020-06-10 05:13
Failure on AMD64 FreeBSD Shared 3.x (full log attached): ====================================================================== FAIL: test_subinterpreter (test.test__xxsubinterpreters.IsRunningTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "/usr/home/buildbot/python/3.x.koobs-freebsd-564d/build/Lib/test/test__xxsubinterpreters.py", line 478, in test_subinterpreter self.assertTrue(interpreters.is_running(interp)) AssertionError: False is not true ... 2 tests failed: test__xxsubinterpreters test_interpreters ... test_already_running (test.test_interpreters.TestInterpreterRun) ... spam Stderr: Exception in thread Thread-19: Traceback (most recent call last): File "/usr/home/buildbot/python/3.x.koobs-freebsd-564d/build/Lib/threading.py", line 950, in _bootstrap_inner self.run() File "/usr/home/buildbot/python/3.x.koobs-freebsd-564d/build/Lib/threading.py", line 888, in run self._target(self._args, self._kwargs) File "/usr/home/buildbot/python/3.x.koobs-freebsd-564d/build/Lib/test/test_interpreters.py", line 45, in run interp.run(dedent(f""" File "/usr/home/buildbot/python/3.x.koobs-freebsd-564d/build/Lib/test/support/interpreters.py", line 94, in run _interpreters.run_string(self._id, src_str) RuntimeError: unrecognized interpreter ID 111 FAIL ... test test_interpreters failed test_send_recv_same_interpreter (test.test_interpreters.TestSendRecv) ... ok ====================================================================== FAIL: test_still_running (test.test_interpreters.TestInterpreterDestroy) ---------------------------------------------------------------------- Traceback (most recent call last): File "/usr/home/buildbot/python/3.x.koobs-freebsd-564d/build/Lib/test/test_interpreters.py", line 309, in test_still_running interp.close() AssertionError: RuntimeError not raised ====================================================================== FAIL: test_subinterpreter (test.test_interpreters.TestInterpreterIsRunning) ---------------------------------------------------------------------- Traceback (most recent call last): File "/usr/home/buildbot/python/3.x.koobs-freebsd-564d/build/Lib/test/test_interpreters.py", line 207, in test_subinterpreter self.assertTrue(interp.is_running()) AssertionError: False is not true ====================================================================== FAIL: test_already_running (test.test_interpreters.TestInterpreterRun) ---------------------------------------------------------------------- Traceback (most recent call last): File "/usr/home/buildbot/python/3.x.koobs-freebsd-564d/build/Lib/test/test_interpreters.py", line 383, in test_already_running self.interp.run('print("spam")') AssertionError: RuntimeError not raised Stderr: Exception in thread Thread-19: Traceback (most recent call last): File "/usr/home/buildbot/python/3.x.koobs-freebsd-564d/build/Lib/threading.py", line 950, in _bootstrap_inner self.run() File "/usr/home/buildbot/python/3.x.koobs-freebsd-564d/build/Lib/threading.py", line 888, in run self._target(self._args, **self._kwargs) File "/usr/home/buildbot/python/3.x.koobs-freebsd-564d/build/Lib/test/test_interpreters.py", line 45, in run interp.run(dedent(f""" File "/usr/home/buildbot/python/3.x.koobs-freebsd-564d/build/Lib/test/support/interpreters.py", line 94, in run _interpreters.run_string(self._id, src_str) RuntimeError: unrecognized interpreter ID 111 ---------------------------------------------------------------------- Ran 40 tests in 20.251s FAILED (failures=3) 1 test failed again: test_interpreters == Tests result: FAILURE then FAILURE == ... 1 test failed: test_interpreters ... 2 re-run tests: test__xxsubinterpreters test_interpreters
msg372561 - (view)	Author: STINNER Victor (vstinner) *	Date: 2020-06-29 10:54
AMD64 Windows8.1 Refleaks 3.x: https://buildbot.python.org/all/#/builders/157/builds/218 2:51:27 load avg: 3.54 [275/423/2] test__xxsubinterpreters crashed (Exit code 3221225477) -- running: test_regrtest (2 hour 49 min), test_distutils (1 min 56 sec), test_pydoc (1 min 36 sec) beginning 6 repetitions 123456 ...Windows fatal exception: access violation Thread 0x000004dc (most recent call first): File "D:\buildarea\3.x.ware-win81-release.refleak\build\lib\test\test__xxsubinterpreters.py", line 766 in setUp File "D:\buildarea\3.x.ware-win81-release.refleak\build\lib\unittest\case.py", line 547 in _callSetUp File "D:\buildarea\3.x.ware-win81-release.refleak\build\lib\unittest\case.py", line 589 in run File "D:\buildarea\3.x.ware-win81-release.refleak\build\lib\unittest\case.py", line 653 in __call__ (...)
msg377749 - (view)	Author: STINNER Victor (vstinner) *	Date: 2020-10-01 12:36
The test still fails randomly: AMD64 Windows8.1 Refleaks 3.8 https://buildbot.python.org/all/#/builders/200/builds/48 test_subinterpreter failed twice on Refleaks which runs tests 6 times in a row. First run: 0:33:58 load avg: 4.78 [204/423/1] test__xxsubinterpreters failed -- running: test_multiprocessing_spawn (12 min 42 sec), test_io (7 min 28 sec), test_tarfile (39.2 sec) beginning 6 repetitions 123456 ...Exception in thread Thread-38: Traceback (most recent call last): File "D:\buildarea\3.8.ware-win81-release.refleak\build\lib\threading.py", line 932, in _bootstrap_inner self.run() File "D:\buildarea\3.8.ware-win81-release.refleak\build\lib\threading.py", line 870, in run self._target(self._args, *self._kwargs) File "D:\buildarea\3.8.ware-win81-release.refleak\build\lib\test\test__xxsubinterpreters.py", line 51, in run interpreters.run_string(interp, dedent(f""" RuntimeError: unrecognized interpreter ID 250 test test__xxsubinterpreters failed -- Traceback (most recent call last): File "D:\buildarea\3.8.ware-win81-release.refleak\build\lib\test\test__xxsubinterpreters.py", line 495, in test_subinterpreter self.assertTrue(interpreters.is_running(interp)) AssertionError: False is not true Second run: FAIL: test_subinterpreter (test.test__xxsubinterpreters.IsRunningTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "D:\buildarea\3.8.ware-win81-release.refleak\build\lib\test\test__xxsubinterpreters.py", line 495, in test_subinterpreter self.assertTrue(interpreters.is_running(interp)) AssertionError: False is not true
msg394974 - (view)	Author: STINNER Victor (vstinner) *	Date: 2021-06-03 02:21
Recent test_still_running() failure on AMD64 Windows8.1 Refleaks 3.10: https://buildbot.python.org/all/#/builders/638/builds/19 1:22:17 load avg: 3.00 [393/427/2] test__xxsubinterpreters failed -- running: test_pydoc (9 min 7 sec), test_venv (2 min 54 sec), test_dbm_dumb (1 min 12 sec) beginning 6 repetitions 123456 .Warning -- Uncaught thread exception: RuntimeError Exception in thread Thread-16 (run): Traceback (most recent call last): File "D:\buildarea\3.10.ware-win81-release.refleak\build\lib\threading.py", line 1006, in _bootstrap_inner self.run() File "D:\buildarea\3.10.ware-win81-release.refleak\build\lib\threading.py", line 943, in run self._target(self._args, *self._kwargs) File "D:\buildarea\3.10.ware-win81-release.refleak\build\lib\test\test__xxsubinterpreters.py", line 46, in run interpreters.run_string(interp, dedent(f""" RuntimeError: unrecognized interpreter ID 125 test test__xxsubinterpreters failed -- Traceback (most recent call last): File "D:\buildarea\3.10.ware-win81-release.refleak\build\lib\test\test__xxsubinterpreters.py", line 753, in test_still_running self.assertTrue(interpreters.is_running(interp), AssertionError: False is not true : Interp 125 should be running before destruction.
msg395021 - (view)	Author: Hai Shi (shihai1991) *	Date: 2021-06-03 16:59
OK, I try to take a look after Kyle leaves temporarily. But I haven't replicate this issue in my vm in recent days :(
msg397809 - (view)	Author: Hai Shi (shihai1991) *	Date: 2021-07-19 15:45
I use this add_printf_to_get_details_from_race_condition.patch to get some details of the running subinterpreter. The successful output of test_already_running: test_already_running (test.test__xxsubinterpreters.RunStringTests) ... In interp_create, the current state is: 0x23ee8f0 In interp_list_all, the current state is: 0x23ee8f0 # Entering _running(). Before interp runs, the current state is: 0x2489590 # Running the interp in thread. In interp_list_all, the current state is: 0x23ee8f0 # Running interpreters.list() in _running(). Before interp runs, the current state is: 0x23ee8f0 # Running the interp in test_already_running(). After interp runs, the current state is: 0x25cf0e0 # Running the interp in thread. In interp_list_all, the current state is: 0x23ee8f0 # Clearing the created interps. The failed output of test_already_running: test_already_running (test.test__xxsubinterpreters.RunStringTests) ... In interp_create, the current state is: 0x23e58f0 In interp_list_all, the current state is: 0x23e58f0 In interp_list_all, the current state is: 0x23e58f0 # Compared to the successful output of test_already_running, the interp doesn't run in time. Before interp runs, the current state is: 0x23e58f0 spam After interp runs, the current state is: 0x24fdfc0 In interp_list_all, the current state is: 0x23e58f0 FAIL ====================================================================== FAIL: test_already_running (test.test__xxsubinterpreters.RunStringTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "/home/shihai/cpython/Lib/test/test__xxsubinterpreters.py", line 835, in test_already_running with self.assertRaises(RuntimeError): AssertionError: RuntimeError not raised ---------------------------------------------------------------------- Ran 123 tests in 71.968s FAILED (failures=1, skipped=6) Warning -- Uncaught thread exception: RuntimeError Exception in thread Thread-8 (run): Traceback (most recent call last): File "/home/shihai/cpython/Lib/threading.py", line 1009, in _bootstrap_inner self.run() ^^^^^^^^^^ File "/home/shihai/cpython/Lib/threading.py", line 946, in run self._target(self._args, *self._kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/shihai/cpython/Lib/test/test__xxsubinterpreters.py", line 49, in run interpreters.run_string(interp, dedent(f""" RuntimeError: unrecognized interpreter ID 60 test test__xxsubinterpreters failed. According to above details, there are several possible explanations: 1.Some race condition break the opcode execution order. I checked the eval function. I didn't find that any condition will disrupt the execution order. 2.The running thread hasn't get the GIL in time. "RuntimeError: unrecognized interpreter ID 60" proves that the interp has been executed in _running(), but later than the test_already_running. So I creates PR 26598. But I catch the error again with PR 26598. 3.the running thread can't get the time slice of CPU. By running python with add_printf_to_get_details_from_race_condition.patch, I find that the thread created in _running() has be executed, but the interp hasn't been executed immediately in run(). At the same time, the done.set() has been executed in run(), so I guess the real reason is the thread created in _running() hasn't get the time slice of CPU to run interp in run() in multiprocess. Pls correct me if I am wrong.

History
Date	User	Action	Args
2022-04-11 14:59:16	admin	set	github: 81405
2021-07-29 17:32:05	lukasz.langa	set	status: open -> closed stage: patch review -> resolved resolution: fixed versions: + Python 3.9, Python 3.10
2021-07-29 17:28:06	miss-islington	set	pull_requests: + pull_request25981
2021-07-29 17:08:17	miss-islington	set	pull_requests: + pull_request25980
2021-07-19 16:01:23	shihai1991	set	pull_requests: + pull_request25789
2021-07-19 15:45:08	shihai1991	set	files: + add_printf_to_get_details_from_race_condition.patch messages: + msg397809
2021-06-15 11:30:26	shihai1991	set	pull_requests: + pull_request25320
2021-06-14 10:48:36	python-dev	set	nosy: + python-dev pull_requests: + pull_request25305
2021-06-08 11:12:36	shihai1991	set	pull_requests: + pull_request25182
2021-06-03 16:59:19	shihai1991	set	nosy: + shihai1991 messages: + msg395021 versions: + Python 3.11, - Python 3.8
2021-06-03 02:21:32	vstinner	set	messages: + msg394974
2020-10-01 12:36:08	vstinner	set	messages: + msg377749
2020-06-29 10:54:54	vstinner	set	nosy: + vstinner messages: + msg372561
2020-06-10 05:13:48	koobs	set	files: + 3.x.koobs-freebsd-564d-build-973-test-stdio.txt nosy: + koobs messages: + msg371160
2020-05-15 00:40:09	vstinner	set	components: + Subinterpreters title: test__xxsubinterpreters fails randomly -> [subinterpreters] test__xxsubinterpreters fails randomly
2020-03-12 03:15:55	aeros	set	messages: + msg363992
2020-02-05 00:20:20	aeros	set	messages: + msg361387
2020-02-04 15:32:52	eric.snow	set	messages: + msg361354
2020-02-04 15:31:25	miss-islington	set	messages: + msg361353
2020-02-03 06:00:52	miss-islington	set	pull_requests: + pull_request17693
2020-01-31 20:07:22	miss-islington	set	nosy: + miss-islington messages: + msg361130
2020-01-19 04:43:00	aeros	set	pull_requests: + pull_request17452
2020-01-17 19:49:57	eric.snow	set	messages: + msg360209
2020-01-15 21:31:33	vstinner	set	nosy: - vstinner
2020-01-15 07:20:48	aeros	set	messages: + msg360029
2020-01-15 06:11:52	aeros	set	messages: + msg360027
2020-01-15 05:49:30	aeros	set	messages: + msg360025
2020-01-15 05:26:09	aeros	set	messages: + msg360024
2020-01-10 00:52:37	aeros	set	messages: + msg359699
2019-12-17 15:52:50	eric.snow	set	messages: + msg358552
2019-12-14 03:08:01	aeros	set	messages: + msg358373
2019-12-13 21:42:27	eric.snow	set	messages: + msg358359
2019-12-13 21:39:20	eric.snow	set	messages: + msg358358
2019-12-01 04:26:21	aeros	set	messages: + msg357672
2019-12-01 04:23:09	aeros	set	messages: + msg357671
2019-11-23 12:58:33	aeros	set	messages: + msg357378
2019-11-23 08:01:39	aeros	set	messages: + msg357365
2019-11-23 07:58:49	aeros	set	messages: + msg357364
2019-11-23 06:46:39	aeros	set	messages: + msg357361
2019-11-23 00:30:36	eric.snow	set	messages: + msg357350
2019-11-23 00:17:20	eric.snow	set	messages: + msg357349
2019-11-22 23:57:36	eric.snow	set	messages: + msg357347
2019-11-22 23:56:40	eric.snow	set	keywords: + patch stage: patch review pull_requests: + pull_request16843
2019-11-14 22:27:18	vstinner	set	messages: + msg356631
2019-10-28 10:52:32	vstinner	set	messages: + msg355520
2019-10-23 14:42:33	vstinner	set	messages: + msg355221
2019-10-18 11:06:24	vstinner	set	messages: + msg354876
2019-10-14 23:00:08	aeros	set	messages: + msg354677
2019-10-14 21:28:12	vstinner	set	messages: + msg354662
2019-10-14 21:27:17	vstinner	set	messages: + msg354661
2019-10-14 21:26:36	vstinner	set	title: test__xxsubinterpreters failed on AMD64 Windows8.1 Refleaks 3.8 -> test__xxsubinterpreters fails randomly
2019-10-14 21:26:24	vstinner	link	issue38154 superseder
2019-10-14 21:26:18	vstinner	set	messages: + msg354659
2019-10-14 21:24:46	vstinner	set	messages: + msg354658
2019-09-20 05:03:25	aeros	set	messages: + msg352835
2019-09-17 00:11:09	aeros	set	messages: + msg352592
2019-09-16 06:52:07	aeros	set	messages: + msg352516
2019-09-16 06:48:15	aeros	set	messages: + msg352514
2019-09-16 05:20:01	aeros	set	messages: + msg352510
2019-09-16 05:17:13	aeros	set	messages: + msg352509
2019-09-16 05:15:02	aeros	set	nosy: + aeros messages: + msg352508
2019-07-08 07:46:36	vstinner	set	messages: + msg347486
2019-06-12 21:52:44	vstinner	set	messages: + msg345415
2019-06-11 10:55:23	vstinner	set	messages: + msg345219
2019-06-11 00:58:20	vstinner	set	messages: + msg345163
2019-06-11 00:58:01	vstinner	create