classification
Title: Assertion failure in PyThreadState_Clear
Type: crash Stage: resolved
Components: Versions: Python 3.8, Python 3.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Mark.Shannon, davin, ned.deily, pitrou, serhiy.storchaka, vstinner
Priority: release blocker Keywords: patch

Created on 2018-05-23 07:17 by serhiy.storchaka, last changed 2018-05-24 20:17 by vstinner. This issue is now closed.

Files
File name Uploaded Description Edit
bug.py vstinner, 2018-05-23 10:18
Pull Requests
URL Status Linked Edit
PR 7069 merged vstinner, 2018-05-23 10:26
PR 7074 merged miss-islington, 2018-05-23 16:01
Messages (15)
msg317363 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-05-23 07:17
I got the following output when run tests in the huntrleaks mode.

$ ./python -We -m test -R 3:3 -x test_builtin -x test_urlparse
...
1:24:54 load avg: 2.48 [225/414] test_multiprocessing_fork
beginning 6 repetitions
123456
..python: Python/pystate.c:589: PyThreadState_Clear: Assertion `tstate->exc_info->previous_item == NULL' failed.
Fatal Python error: Aborted

Current thread 0x00007fdd3e07e700 (most recent call first):
  File "/home/serhiy/py/cpython/Lib/multiprocessing/popen_fork.py", line 70 in _launch
  File "/home/serhiy/py/cpython/Lib/multiprocessing/popen_fork.py", line 20 in __init__
  File "/home/serhiy/py/cpython/Lib/multiprocessing/context.py", line 277 in _Popen
  File "/home/serhiy/py/cpython/Lib/multiprocessing/process.py", line 112 in start
  File "/home/serhiy/py/cpython/Lib/multiprocessing/pool.py", line 241 in _repopulate_pool
  File "/home/serhiy/py/cpython/Lib/multiprocessing/pool.py", line 248 in _maintain_pool
  File "/home/serhiy/py/cpython/Lib/multiprocessing/pool.py", line 412 in _handle_workers
  File "/home/serhiy/py/cpython/Lib/threading.py", line 865 in run
  File "/home/serhiy/py/cpython/Lib/threading.py", line 917 in _bootstrap_inner
  File "/home/serhiy/py/cpython/Lib/threading.py", line 885 in _bootstrap
..python: Python/pystate.c:589: PyThreadState_Clear: Assertion `tstate->exc_info->previous_item == NULL' failed.
Fatal Python error: Aborted

Current thread 0x00007fdd6cb6c700 (most recent call first):
  File "/home/serhiy/py/cpython/Lib/multiprocessing/popen_fork.py", line 70 in _launch
  File "/home/serhiy/py/cpython/Lib/multiprocessing/popen_fork.py", line 20 in __init__
  File "/home/serhiy/py/cpython/Lib/multiprocessing/context.py", line 277 in _Popen
  File "/home/serhiy/py/cpython/Lib/multiprocessing/process.py", line 112 in start
  File "/home/serhiy/py/cpython/Lib/multiprocessing/pool.py", line 241 in _repopulate_pool
  File "/home/serhiy/py/cpython/Lib/multiprocessing/pool.py", line 248 in _maintain_pool
  File "/home/serhiy/py/cpython/Lib/multiprocessing/pool.py", line 412 in _handle_workers
  File "/home/serhiy/py/cpython/Lib/threading.py", line 865 in run
  File "/home/serhiy/py/cpython/Lib/threading.py", line 917 in _bootstrap_inner
  File "/home/serhiy/py/cpython/Lib/threading.py", line 885 in _bootstrap
..
1:34:07 load avg: 1.01 [226/414] test_multiprocessing_forkserver -- test_multiprocessing_fork passed in 9 min 13 sec
...

It happens after 1.5 hours of running tests. Running the test_multiprocessing_fork test only doesn't expose it. Likely there are already opened related issues, but the assertion failure in PyThreadState_Clear is new, it was added in 3.7. It may help to identify the problem.
msg317368 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-05-23 09:36
The assertion has been added by bpo-25612:

commit ae3087c6382011c47db82fea4d05f8bbf514265d
Author: Mark Shannon <mark@hotpy.org>
Date:   Sun Oct 22 22:41:51 2017 +0100

    Move exc state to generator. Fixes bpo-25612 (#1773)
    
    Move exception state information from frame objects to coroutine (generator/thread) object where it belongs.
msg317369 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-05-23 09:44
Failing assertion:

    /* The stack of exception states should contain just this thread. */
    assert(tstate->exc_info->previous_item == NULL);

"test_multiprocessing_fork passed in 9 min 13 sec"

Oh. A child process crashed on "self.pid = os.fork()" in "/Lib/multiprocessing/popen_fork.py", line 70 in _launch", but the test still pass.

os.fork() calls PyOS_AfterFork_Child() which starts by calling _PyGILState_Reinit() which ends with "_PyThreadState_DeleteExcept(current_tstate);".

_PyThreadState_DeleteExcept() creates a "garbage" list of thread states, and then call PyThreadState_Clear() on each thread. Finally, one call to PyThreadState_Clear() fails on the assertion.

I'm not sure that the assertion takes in account the case of specific case of fork().
msg317373 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-05-23 10:04
I created bpo-33613: "test_multiprocessing_fork: test_semaphore_tracker_sigint() fails with -W error". I don't think that it's directly related.
msg317377 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-05-23 10:18
Attached bug.py reproduces the bug with a daemonic thread but without fork():

$ ./python bug.py 
ls coredpython: Python/pystate.c:589: PyThreadState_Clear: Assertion `tstate->exc_info->previous_item == NULL' failed.
Aborted (core dumped)

(You might have to run the script a few times to get the crash, it's not deterministic.)

vstinner@apu$ gdb ./python -c coredump-python.24130
GNU gdb (GDB) Fedora 8.0.1-36.fc27
(..)
Core was generated by `./python bug.py'.
Program terminated with signal SIGABRT, Aborted.

#0  0x00007f5136075660 in raise () from /lib64/libc.so.6
[Current thread is 1 (Thread 0x7f5136f64040 (LWP 24130))]

(gdb) where
#0  0x00007f5136075660 in raise () from /lib64/libc.so.6
#1  0x00007f5136076c41 in abort () from /lib64/libc.so.6
#2  0x00007f513606df7a in __assert_fail_base () from /lib64/libc.so.6
#3  0x00007f513606dff2 in __assert_fail () from /lib64/libc.so.6
#4  0x0000000000591dde in PyThreadState_Clear (tstate=0x1b7a430) at Python/pystate.c:589
#5  0x0000000000590274 in PyInterpreterState_Clear (interp=0x1b23930) at Python/pystate.c:204
#6  0x000000000058cf4c in Py_FinalizeEx () at Python/pylifecycle.c:1153
#7  0x00000000004257af in pymain_main (pymain=0x7ffe5f82f200) at Modules/main.c:2664
#8  0x0000000000425915 in _Py_UnixMain (argc=2, argv=0x7ffe5f82f448) at Modules/main.c:2697
#9  0x000000000041f1e7 in main (argc=2, argv=0x7ffe5f82f448) at ./Programs/python.c:15

(gdb) info threads
  Id   Target Id         Frame 
  1    Thread 0x7f5136f64040 (LWP 24130) 0x00007f5136075660 in raise () from /lib64/libc.so.6
* 2    Thread 0x7f512e8c6700 (LWP 24138) 0x00007f5136b56be6 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  3    Thread 0x7f512e0c5700 (LWP 24139) 0x00007f5136b56be6 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  4    Thread 0x7f512d8c4700 (LWP 24147) 0x00007f5136b56be6 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  5    Thread 0x7f512f0c7700 (LWP 24137) 0x00007f5136b56be6 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

(gdb) t 2
[Switching to thread 2 (Thread 0x7f512e8c6700 (LWP 24138))]
#0  0x00007f5136b56be6 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

(gdb) where
#0  0x00007f5136b56be6 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000053458a in PyCOND_TIMEDWAIT (cond=0xa07248 <_PyRuntime+1224>, mut=0xa07278 <_PyRuntime+1272>, us=5000) at Python/condvar.h:90
#2  0x00000000005349a3 in take_gil (tstate=0x1b87980) at Python/ceval_gil.h:208
msg317379 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-05-23 10:30
I created PR 7069 to remove the assertion. With this change, attached bug.py doesn't crash anymore, but it still logs warnings in verbose mode:

vstinner@apu$ ./python -v bug.py 
(...)
PyThreadState_Clear: warning: thread still has a frame
PyThreadState_Clear: warning: thread still has a frame
PyThreadState_Clear: warning: thread still has a generator
PyThreadState_Clear: warning: thread still has a frame
PyThreadState_Clear: warning: thread still has a generator
PyThreadState_Clear: warning: thread still has a frame

The root issue is that Python doesn't join deamonic threads. But it's by design no?

I hate daemonic threads :-)

For the specific case of a fork(), it can be possible that a thread calls os.fork() while another thread is running a generator, no?

--

Since this issue seems to be a regression and I have a PR fixing it, I propose to mark this issue as a release blocker. Sorry Ned :-(
msg317385 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-05-23 11:27
The assertion looked good to me when I reviewed this code, and I don't understand why it fails. Maybe it would be better to fix the case in which tstate->exc_info->previous_item != NULL before calling PyThreadState_Clear(). Otherwise we can leak some things.
msg317386 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2018-05-23 11:31
Yes, simply removing the assertion feels more like a copout than an actual fix.
(perhaps it *is* the right fix to the issue, but it would be nice to find out why :-))
msg317388 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-05-23 11:57
> Yes, simply removing the assertion feels more like a copout than an actual fix.

We are very close to the 3.7rc1, so I suggest to remove the assertion, just to get more time to fix the issue.

--

I agree that in a perfect world, Python should cleanup everything properly, but in the current world, daemon threads and fork are a mess full of corner cases like this one.

Python doesn't join daemonic threads at shutdown, so I don't see how Python could ensure that the state of the Python state of these daemonic threads is consistent (ready to be cleared). For fork, it seems like the issue is similar. A fork removes immediately all threads except of the thread which called fork. How would it be possible for Python to ensure that the Python state of the other destroyed threads is consistent?

tstate->exc_info is a new feature of Python 3.7, to fix bpo-25612. But I don't think that Python 3.6 is better to cleanup daemon threads at shutdown or to cleanup threads on fork. If you run the attached bug.py with python3.6 -v bug.py, I also see warnings:

PyThreadState_Clear: warning: thread still has a frame
PyThreadState_Clear: warning: thread still has a frame
PyThreadState_Clear: warning: thread still has a frame
PyThreadState_Clear: warning: thread still has a frame
msg317389 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2018-05-23 11:59
Le 23/05/2018 à 13:57, STINNER Victor a écrit :
> 
> I agree that in a perfect world, Python should cleanup everything properly, but in the current world, daemon threads and fork are a mess full of corner cases like this one.

For now, we don't know for sure that it's a cleanup issue.  What if it's
something else?

I agree that, if it's simply a cleanup issue with daemon threads, we can
remove the assertion.  But perhaps we should first diagnose the issue.
msg317418 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-05-23 16:00
New changeset b6dccf54fd3bac9c87348d96f9d6b571608c15bc by Victor Stinner in branch 'master':
bpo-33612: Remove PyThreadState_Clear() assertion (#7069)
https://github.com/python/cpython/commit/b6dccf54fd3bac9c87348d96f9d6b571608c15bc
msg317443 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2018-05-23 19:58
New changeset 508d7693bc09affd99fdaa4a321cc3da0638c8a0 by Ned Deily (Miss Islington (bot)) in branch '3.7':
bpo-33612: Remove PyThreadState_Clear() assertion (GH-7069) (GH-7074)
https://github.com/python/cpython/commit/508d7693bc09affd99fdaa4a321cc3da0638c8a0
msg317458 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-05-23 21:06
Mark Shannon, the author of the assertion, approved my PR to remove the assertion:
https://github.com/python/cpython/pull/7069#issuecomment-391342700

The initial issue, the failing assertion, has been fixed. Well, to be honest, I failed to reproduce Serhiy's use case using test_multiprocessing_fork, but I'm confident that the removal of the assertion fixed the bug ;-)
msg317595 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-05-24 17:53
Thank you for fixing this Victor!
msg317612 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-05-24 20:17
> Thank you for fixing this Victor!

You're welcome. It's always a pleasure to remove code!
History
Date User Action Args
2018-05-24 20:17:10vstinnersetmessages: + msg317612
2018-05-24 17:53:59serhiy.storchakasetmessages: + msg317595
2018-05-23 21:06:54vstinnersetstatus: open -> closed
resolution: fixed
messages: + msg317458

stage: patch review -> resolved
2018-05-23 19:58:25ned.deilysetmessages: + msg317443
2018-05-23 16:01:31miss-islingtonsetpull_requests: + pull_request6704
2018-05-23 16:00:58vstinnersetmessages: + msg317418
2018-05-23 11:59:49pitrousetmessages: + msg317389
2018-05-23 11:57:11vstinnersetmessages: + msg317388
2018-05-23 11:31:04pitrousetmessages: + msg317386
2018-05-23 11:27:45serhiy.storchakasetmessages: + msg317385
2018-05-23 10:30:09vstinnersetpriority: critical -> release blocker
nosy: + ned.deily
messages: + msg317379

2018-05-23 10:26:42vstinnersetkeywords: + patch
stage: patch review
pull_requests: + pull_request6697
2018-05-23 10:18:24vstinnersetfiles: + bug.py

messages: + msg317377
2018-05-23 10:04:39vstinnersetmessages: + msg317373
2018-05-23 09:44:59vstinnersetmessages: + msg317369
2018-05-23 09:36:46vstinnersetmessages: + msg317368
2018-05-23 08:48:57pitrousetpriority: normal -> critical
2018-05-23 07:20:10serhiy.storchakasetnosy: + vstinner
2018-05-23 07:17:40serhiy.storchakacreate