New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PyContextVar_Get(): crash due to race condition in updating tstate->id #83957
Comments
Hello everybody! We are using Python 3.7 running at CentOS 7 x64. Python is used as a library to create dynamic extensions for our app server. Some time ago we began to experience crashes in decimal module in some heavy-multithreaded scenarios. After some testing and debugging I was able to reproduce it without our own code using only pybind11 library to simplify embedding (in real app we are using boost.python). I've built python 3.8 with clang 7 and address sanitizer enabled and got error "use-after-free" with some additional data. Please find attached C++ source file, python module and ASAN output. Is it really a bug (most probably - data race) or there is something wrong with such embedding scenario? |
Before I look at the example code: Can you also reproduce this with There's a high chance though that the problem is in the c++ module. |
I've briefly looked at the zip archive. Without going much into gil_unlocker.UnlockGILAndSleep()
self.val = decimal.Decimal(1) / decimal.Decimal(7)
gil_unlocker.UnlockGILAndSleep() If you want C++ threads with a released GIL, you should use libmpdec |
Please note, that UnlockGILandSleep takes GIL back before returning. In a real production code there is a database query. In this example I emulate them with random sleep. So I don't see any problems here. |
I built your example with 3.6: git clone https://github.com/pybind/pybind11 git checkout v3.6.7 g++ -std=c++11 -pthread -Wno-unused-result -Wsign-compare -g -Og -Wall -Wextra -Wno-unused-result -Wno-unused-parameter -Wno-missing-field-initializers -I. -I./Include -I./pybind11/include -c main.cpp g++ -pthread -Xlinker -export-dynamic -o main main.o libpython3.6dm.a -lpthread -ldl -lutil -lm cp python python3 And I literally get this error (not always, it may take 10 runs or so): $ ./main
Fatal Python error: Python memory allocator called without holding the GIL Thread 0x00007f1e73fff700 (most recent call first): Thread 0x00007f1e7b836700 (most recent call first): Thread 0x00007f1e7a834700 (most recent call first): Thread 0x00007f1e7b035700 (most recent call first): Thread 0x00007f1e7d039700 (most recent call first): Thread 0x00007f1e7c838700 (most recent call first): Current thread 0x00007f1e7c037700 (most recent call first): Thread 0x00007f1e7e84f740 (most recent call first): So no, I don't think the GIL handling is correct. |
Thank you for feedback. I will try to reproduce the issue with 3.6. By the way, haven't you used gdb with python pretty-printers enabled to examine the state of the program? I've got the same error message, then I breaked the execution in debugger and tried to examine the callstack of threads, that stucked in UnlockGILandSleep. The reason for it is clear: then the debugger tries to build a callstack, some of pretty printers try to execute some python code to give a better representation of interpreter objects. The code is executed at the top of the stack of the examined thread. Since this thread explicitly released the GIL before going to sleep, these functions hit the assert about calling the memory allocator without holdng the GIL. Disabling pretty-printers makes these error messages to disappear. |
I'am unable to reproduce neither my or your issues with python 3.6. The program runs infinitely as it meant to be. Can you please give me C++ traceback from the core dump, which was created when you ran my program? |
This is 3.6.7, compiled --with-pydebug: $ ./main
Aborted (core dumped) (gdb) bt |
Note that my pybind11 is from GitHub master, it can also be a pybind11 It is interesting that you cannot reproduce your original issue with I think we need a reproducer without pybind11 though, could you |
Your callstack is very strange. At line 30 of main.cpp GIL is obviously locked: // importing module in this tread
gstate = PyGILState_Ensure();
py::module crash_test = py::module::import( "crash_test" ); <-- import
PyGILState_Release( gstate ); I suppose that there is something wrong with your setup. Maybe - wrong working directory for the main executable, which doesn't contain crash_test.py Also I've tried to revert this patch #5278 for 3.7. It makes problem to disappear, 1 hour of stable work under ASAN. So I suppose it is the source of the bug. I will try to tweak _testembed.c. |
Regarding *my* issue, it could be anything, e.g. a missing call to "Changed in version 3.7: This function is now called by Py_Initialize(), so you don’t have to call it yourself anymore." This is why we need to eliminate pybind11 so we can see what is |
I rewrote my example without pybind and eliminated C++ module (I realized that time.sleep() also releases the GIL, so we achieve the same effect). Still the same results: with python 3.7.3 app crashes with attached ASAN output, with python 3.7.3 without #5278 works just fine. To run main.cpp you should add directory with crash_test.py to PYTHONPATH. |
Also I understood the source of your crash with my initial example. Since you haven't used CMake to configure project, pybind didn't setup required macroses to enable threading support. So no issues in pybind. |
Thanks, I'm now getting the same results as you. Looking at the smaller |
I think the PR fixes the issue but I have to run longer tests still. Threads created by PyGILState_Ensure() could have a duplicate tstate->id, |
Since many duplicate tstate ids are generated in the test case This can lead to incorrect results without noticing. |
Also _pydecimal was affected. This is a modified version of Evgeny's from _pydecimal import *
from time import sleep
from random import randint
import sys
sys.setswitchinterval(0.0000001)
def usleep(x):
sleep(x/1000000.0)
class Test:
def __init__(self, threadno):
self.threadno = threadno
def GetCallback(self):
usleep(randint(0, 9));
setcontext(Context(Emax=self.threadno))
usleep(randint(0, 9))
c = getcontext()
try:
if c.Emax != self.threadno:
raise RuntimeError("GetCallback: another thread changed this thread's context")
except AttributeError:
raise RuntimeError("GetCallback: type(c)==%s and c.Emax disappeared" % type(c))
usleep(randint(0, 9))
return self.Callback
def Callback(self):
c = getcontext()
try:
c.Emax = self.threadno
except AttributeError:
raise RuntimeError("Callback: type(c)==%s and c.Emax disappeared" % type(c))
def DoTest(threadno):
return Test(threadno).GetCallback() It produces one of the exceptions or a segfault. You may have |
Setting to release blocker, but please move to deferred again |
Thanks for the heads-up and the fix, Stefan. The fix for 3.7.x was merged before the 3.7.7rc1 cutoff (by a few hours!) and the next 3.8.x release cutoff is planned for April and 3.9.0a5 is later in March, so, if you are not planning to merge any other changes for this issue, we can set the issue status to "closed" now. |
Thanks Ned, closing then. Evgeny, please reopen if you see it again (I ran the tests for about Thanks for the very instructive test case! |
I checked both my test example and real production code with your patch. I'm unable to reproduce the bug, so I think it is fixed now. Thank you! |
Thank you so much, Stefan, for looking into this. Really appreciate the help. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: