New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
python: Modules/gcmodule.c:379: visit_decref: Assertion `((gc)->gc.gc_refs >> (1)) != 0' failed. #64725
Comments
While trying to reproduce a race condition in asyncio, I got an assertion error from the Python garbage collector at exit. It's not easy to reproduce the issue: run attached script on Python 3.4 compiled in debug mode (to get assertions) and press CTRL+c twice. It looks like the bug occurs more often after at least 400 runs. The Future class of the asyncio may store an exception in its private Future._exception attribute. An exception stores a traceback. I don't know if it's related. Output: ...
run #347
^CTraceback (most recent call last):
File "x.py", line 73, in <module>
main()
File "x.py", line 70, in main
el.join()
File "/home/haypo/prog/python/default/Lib/threading.py", line 1061, in join
self._wait_for_tstate_lock()
File "/home/haypo/prog/python/default/Lib/threading.py", line 1077, in _wait_for_tstate_lock
elif lock.acquire(block, timeout):
run #348
KeyboardInterrupt
...
run #364
run #365
run #366
run #367
-cancelled-
run #368
^CException ignored in: <module 'threading' from '/home/haypo/prog/python/default/Lib/threading.py'>
Traceback (most recent call last):
File "/home/haypo/prog/python/default/Lib/threading.py", line 1295, in _shutdown
t.join()
File "/home/haypo/prog/python/default/Lib/threading.py", line 1061, in join
self._wait_for_tstate_lock()
File "/home/haypo/prog/python/default/Lib/threading.py", line 1077, in _wait_for_tstate_lock
elif lock.acquire(block, timeout):
KeyboardInterrupt
-cancelled-
python: Modules/gcmodule.c:379: visit_decref: Assertion `((gc)->gc.gc_refs >> (1)) != 0' failed.
Abandon (core dumped) $ ls *core
python-6564.core
$ gdb ./python -c python-6564.core Core was generated by `./python x.py'. (gdb) frame 4 (gdb) print op |
At the first CTRL+c, the main thread exit and enters Py_Finalize() which calls wait_for_thread_shutdown() (in Python: threading._shutdown()). The problem is that the second call interrupted threading._shutdown() (still in the main thread).
At this point, we can expect that only one Python thread is running because wait_for_thread_shutdown() has been called, but there are between 2 and 9 running Python threads. If wait_for_thread_shutdown() is interrupted, maybe Python should kill all other threads with pthread_kill() or something like that. Or wait_for_thread_shutdown() should maybe block SIGINT and SIGTERM signals using pthread_sigmark()? |
This sounds a bit desperate. |
Random thoughts:
|
Are you sure? I didn't see the daemon flag in ThreadPoolExecutor. |
Oh you're right, ThreadPoolExecutor does create daemon threads: class ThreadPoolExecutor(_base.Executor):
...
def _adjust_thread_count(self):
# When the executor gets lost, the weakref callback will wake up
# the worker threads.
def weakref_cb(_, q=self._work_queue):
q.put(None)
# TODO(bquinlan): Should avoid creating new threads if there are more
# idle threads than items in the work queue.
if len(self._threads) < self._max_workers:
t = threading.Thread(target=_worker,
args=(weakref.ref(self, weakref_cb),
self._work_queue))
t.daemon = True <======= HERE
t.start()
self._threads.add(t)
_threads_queues[t] = self._work_queue |
threading_shutdown_interrupted.py: simpler and reliable script to reproduce the bug. |
Charles-François wrote:
Good catch. If I reverted changes of bpo-19466, threading_shutdown_interrupted.py doesn't crash anymore. It doesn't explain how the GC header of a frame object becomes inconsistent. IMO fixing bpo-19466 was a good idea but because these changes, some bugs became more likley. Maybe we can revert bpo-19466, then try to understand and fix this GC inconstency, and later fix again bpo-19466? -- Oh by the way, threading_shutdown_interrupted.py shows also a deadlock. The main threads waits in threading._shutdown(), but this function is interrupted. Then it calls flush_std_files() which waits for a lock on stdout. The problem is that a thread was writing into stdout, but this thread is no more running and holds the lock. |
I set the priority to release blocker because it's a crash, it's a regression since Python 3.3 and it can be easily fixed (revert changes done in bpo-19466). |
revert_19466.patch: Patch to revert changes of issue bpo-19466. It restores how Python cleared threads in Python 3.3. |
Please revert. To debug, since I guess it's due to a memory corruption because objects are |
New changeset 1166b3321012 by Victor Stinner in branch 'default': |
I added assert(_Py_Finalizing == NULL || _Py_Finalizing == PyThreadState_GET()) almost everywhere, but the assertion didn't fail. It looks like only the main thread is running between the interrupted wait_for_thread_shutdown() and the final "python: Modules/gcmodule.c:379: visit_decref: ... failed" assertion in PyGC_Collect(). I tried to add directly assert(_PyGCHead_REFS(gc) != 0) in _PyGCHead_SET_REFS(), but the assertion didn't fail. I also ran Python in Valgrind: no error neither. Ok... I close this issue as won't fix because it was occur with bpo-19466 applied, but I just reverted this patch. |
New changeset 9ce58a73b6b5 by Victor Stinner in branch '3.4': |
Hello, I encounter a quite similar issue with python 3.4.0 and cx_Oracle. It segfault from time to time (hard to reproduce) on visit_decref at Modules/gcmodule.c:373. With gdb i got the following stacktrace :
#31 PyEval_EvalFrameEx (f=<optimized out>, throwflag=throwflag@entry=0) at Python/ceval.c:2829
#32 0x00000000004c5e85 in fast_function (nk=<optimized out>, na=2, n=2, pp_stack=0x7fffffffcc50, func=0x7ffff1635d90) at Python/ceval.c:4324
#33 call_function (oparg=<optimized out>, pp_stack=0x7fffffffcc50) at Python/ceval.c:4252
#34 PyEval_EvalFrameEx (f=<optimized out>, throwflag=throwflag@entry=0) at Python/ceval.c:2829
#35 0x00000000004c5e85 in fast_function (nk=<optimized out>, na=2, n=2, pp_stack=0x7fffffffce30, func=0x7ffff2513730) at Python/ceval.c:4324
#36 call_function (oparg=<optimized out>, pp_stack=0x7fffffffce30) at Python/ceval.c:4252
#37 PyEval_EvalFrameEx (f=<optimized out>, throwflag=throwflag@entry=0) at Python/ceval.c:2829
#38 0x00000000004c5e85 in fast_function (nk=<optimized out>, na=2, n=2, pp_stack=0x7fffffffd010, func=0x7ffff1633378) at Python/ceval.c:4324
#39 call_function (oparg=<optimized out>, pp_stack=0x7fffffffd010) at Python/ceval.c:4252
#40 PyEval_EvalFrameEx (f=f@entry=0xab9478, throwflag=throwflag@entry=0) at Python/ceval.c:2829
#41 0x00000000004be47b in PyEval_EvalCodeEx (_co=<optimized out>, globals=<optimized out>, locals=locals@entry=0x0, args=<optimized out>, argcount=argcount@entry=1, kws=0xab9380, kwcount=0, defs=0x7ffff2504b88, defcount=1, kwdefs=0x0,
closure=0x0) at Python/ceval.c:3578
#42 0x00000000004c3d91 in fast_function (nk=<optimized out>, na=1, n=<optimized out>, pp_stack=0x7fffffffd2a0, func=0x7ffff2513400) at Python/ceval.c:4334
#43 call_function (oparg=<optimized out>, pp_stack=0x7fffffffd2a0) at Python/ceval.c:4252
#44 PyEval_EvalFrameEx (f=<optimized out>, throwflag=throwflag@entry=0) at Python/ceval.c:2829
#45 0x00000000004c5e85 in fast_function (nk=<optimized out>, na=1, n=1, pp_stack=0x7fffffffd480, func=0x7ffff16331e0) at Python/ceval.c:4324
#46 call_function (oparg=<optimized out>, pp_stack=0x7fffffffd480) at Python/ceval.c:4252
#47 PyEval_EvalFrameEx (f=<optimized out>, throwflag=throwflag@entry=0) at Python/ceval.c:2829
#48 0x00000000004c5e85 in fast_function (nk=<optimized out>, na=0, n=0, pp_stack=0x7fffffffd660, func=0x7ffff6ed4d08) at Python/ceval.c:4324
#49 call_function (oparg=<optimized out>, pp_stack=0x7fffffffd660) at Python/ceval.c:4252
#50 PyEval_EvalFrameEx (f=f@entry=0x7ffff7eceaf8, throwflag=throwflag@entry=0) at Python/ceval.c:2829
#51 0x00000000004be47b in PyEval_EvalCodeEx (_co=_co@entry=0x7ffff6ed14b0, globals=globals@entry=0x7ffff7ef1388, locals=locals@entry=0x7ffff7ef1388, args=args@entry=0x0, argcount=argcount@entry=0, kws=kws@entry=0x0, kwcount=kwcount@entry=0,
defs=defs@entry=0x0, defcount=defcount@entry=0, kwdefs=kwdefs@entry=0x0, closure=closure@entry=0x0) at Python/ceval.c:3578
#52 0x00000000004be54b in PyEval_EvalCode (co=co@entry=0x7ffff6ed14b0, globals=globals@entry=0x7ffff7ef1388, locals=locals@entry=0x7ffff7ef1388) at Python/ceval.c:773
#53 0x00000000004ed844 in run_mod (arena=0x8fea30, flags=0x7fffffffd910, locals=0x7ffff7ef1388, globals=0x7ffff7ef1388, filename=0x7ffff7ea7a78, mod=0x97c090) at Python/pythonrun.c:2175
#54 PyRun_InteractiveOneObject (fp=fp@entry=0x7ffff74b6340 <_IO_2_1_stdin_>, filename=filename@entry=0x7ffff7ea7a78, flags=flags@entry=0x7fffffffd910) at Python/pythonrun.c:1441
#55 0x00000000004edb7e in PyRun_InteractiveLoopFlags (fp=fp@entry=0x7ffff74b6340 <_IO_2_1_stdin_>, filename_str=filename_str@entry=0x580d63 "<stdin>", flags=flags@entry=0x7fffffffd910) at Python/pythonrun.c:1320
#56 0x00000000004edcd6 in PyRun_AnyFileExFlags (fp=fp@entry=0x7ffff74b6340 <_IO_2_1_stdin_>, filename=<optimized out>, closeit=0, flags=flags@entry=0x7fffffffd910) at Python/pythonrun.c:1282
#57 0x000000000041e243 in run_file (p_cf=0x7fffffffd910, filename=0x0, fp=0x7ffff74b6340 <_IO_2_1_stdin_>) at Modules/main.c:319
#58 Py_Main (argc=argc@entry=1, argv=argv@entry=0x872010) at Modules/main.c:751
#59 0x000000000041aacc in main (argc=1, argv=<optimized out>) at ./Modules/python.c:69
(gdb) Regards |
Hi Sebastien, |
Hi Antoine, Program received signal SIGSEGV, Segmentation fault. |
Same issue with a fresh python 3.3.5: 361 if (PyObject_IS_GC(op)) { Is it a python bug, or could it be an issue with the cx_oracle extension ? |
Well, 0xb00000000 is an unlikely pointer value, especially since other dynamically-allocated pointers seem to lie in other memory areas. So it would look like there's some memory corruption here. As for whether it's a Python issue, try reproducing without cx_Oracle? If you can't, it's likely to be a bug in cx_Oracle. |
Your issue is different, compare the top frames: My trace: #0 0x0000003f3a835c59 in raise () from /lib64/libc.so.6
#1 0x0000003f3a837368 in abort () from /lib64/libc.so.6
#2 0x0000003f3a82ebb6 in __assert_fail_base () from /lib64/libc.so.6
#3 0x0000003f3a82ec62 in __assert_fail () from /lib64/libc.so.6
#4 0x000000000043ac66 in visit_decref (
op=Frame 0x7f013c001398, for file x.py, line 43 (...) at Modules/gcmodule.c:379
#5 0x00000000004336bd in tb_traverse (tb=0x7f01493a66e8, visit=0x43abb4 <visit_decref>, arg=0x0) at Python/traceback.c:64
#6 0x000000000043acdc in subtract_refs (containers=0x8f1a20 <generations+64>) at Modules/gcmodule.c:398 cx_Oracle trace: #0 0x000000000043ab98 in visit_decref (op=0xb00000000, data=0x0) at Modules/gcmodule.c:373 In my trace, visit_decref() is called on a frame and fail with an assertion error. In cx_Oracle trace, visit_decref() is called on a NULL pointer which comes from an Exception. In my experience, it's a bug in cx_Oracle. If you think that I'm wrong and that it's a bug in Python, please open a *new* issue since the trace is different. -- For your cx_Oracle, issue: #1 0x000000000048193a in BaseException_traverse (self=0x7ffff0f645f8, visit=0x43ab64 <visit_decref>, arg=0x0) at Objects/exceptions.c:97 Can you go this frame (in gdb, type "frame 1") and dump the Exception object: (gdb) print _PyObject_Dump(0x000000000048193a) Note for myself: I should write a documentation explaining how to debug Python in gdb. |
Unless C conventions changed, 0xb00000000 is not a NULL pointer :-) |
Ooops, I missed the B :-) By the way, my gdb example is wrong: you should pass self :-)
=> (gdb) print _PyObject_Dump(0x7ffff0f645f8) |
Right now, threading_shutdown_interrupted.py does crash on the master branch (commit 05e4a29). I reintroduced the bug with: commit 9ad58ac
I investigated this bug with Pablo Galindo Salgado. When the bug triggers, there are 2 strong references to a Frame object:
The bug occurs in "except ValueError as exc:" of EvilThread.coroutine(). The exception contains a traceback object which has a strong reference to the frame. The frame object contains "exc" variable (the exception) which contains a reference to the tracecback (exc.__traceback__) which contains a refererence to the frame (tb.tb_frame): there is a reference cycle. (It's not really an issue by itself). There are 2 strong references to the Frame object. Py_FinalizeEx() calls _PyThreadState_DeleteExcept() which doesn't clear the reference cycle. But PyThreadState_Clear() calls Py_CLEAR(tstate->frame) which reduces the Frame reference counter from 2 to 1, even if there are still 2 strong references to it (traceback and generator). tstate->frame is only set at 3 places:
By the way, _PyEval_EvalFrameDefault() has a bug: tstate->frame is not reset if _PyCode_InitOpcache() fails: bpo-40048. Py_FinalizeEx() calls _PyRuntimeState_SetFinalizing(runtime, tstate) and so threads which tries to take the GIL must exit immediately. Extract of take_gil(): static inline int
tstate_must_exit(PyThreadState *tstate)
{
PyThreadState *finalizing = _PyRuntimeState_GetFinalizing(&_PyRuntime);
return (finalizing != NULL && finalizing != tstate);
}
static void
take_gil(PyThreadState *tstate)
{
...
if (tstate_must_exit(tstate)) {
PyThread_exit_thread();
}
...
} The thread is exited in the middle of _PyEval_EvalFrameDefault() by take_gil() which calls PyThread_exit_thread() because the thread must exit (because the main thread destroyed its Python thread state and so accessing it would crash the thread). _PyThreadState_DeleteExcept() calls PyThreadState_Clear() which calls Py_CLEAR(tstate->frame): this case only occurs because the thread exited in the middle of _PyEval_EvalFrameDefault(). In the normal case, _PyEval_EvalFrameDefault() always resets tstate->frame to its previous value. Py_CLEAR(tstate->frame) is wrong: it's a *borrowed* reference. Pablo and me agree with PyThreadState_Clear() must not call Py_CLEAR(tstate->frame): it's borrowed reference, not a strong reference. I'm working on a PR to fix the issue. |
(This issue previously had "release blocker" priority before it was closed. Now that it is re-opened does it still need that? I'm assuming not.) |
Techicanilly, it's a 3.9 regression but I don't think that it should be hold 3.9 release. So no, it's not a blocker issue. |
I ran threading_shutdown_interrupted.py 10 times on the master branch (at commit 9b8e74c): it does no longer crash. I consider that the root issue is now fixed, so I close again the bug. Thanks Pablo for the helpful debugging session! |
Just in case, I also ran asyncio_gc.py 10x times on the master branch. I just replaced asyncio.async with asyncio.ensure_future. In short, I consider that the bug is now fixed. I interrupted the test two times with CTRL+c. 9 runs were stopped correctly. 1 run failed with: Fatal Python error: _enter_buffered_busy: could not acquire lock for <_io.BufferedWriter name='<stdout>'> at interpreter shutdown, possibly due to daemon threads But this is a different issue, unrelated to "Modules/gcmodule.c:379: visit_decref: Assertion `((gc)->gc.gc_refs >> (1)) != 0' failed". It's just that the test uses print() in threads, whereas using print() during Python finalization is not reliable. Using os.write() or avoiding print() would avoid the issue. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: