Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PyContextVar_Get(): crash due to race condition in updating tstate->id #83957

Closed
boytsovea mannequin opened this issue Feb 27, 2020 · 25 comments
Closed

PyContextVar_Get(): crash due to race condition in updating tstate->id #83957

boytsovea mannequin opened this issue Feb 27, 2020 · 25 comments
Assignees
Labels
3.7 (EOL) end of life 3.8 only security fixes 3.9 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) release-blocker type-crash A hard crash of the interpreter, possibly with a core dump

Comments

@boytsovea
Copy link
Mannequin

boytsovea mannequin commented Feb 27, 2020

BPO 39776
Nosy @vstinner, @ned-deily, @skrah, @ambv, @1st1
PRs
  • bpo-39776: Lock ++interp->tstate_next_unique_id. (GH-18746) #18746
  • [3.8] bpo-39776: Lock ++interp->tstate_next_unique_id. (GH-18746)  #18752
  • [3.7] bpo-39776: Lock ++interp->tstate_next_unique_id (GH-18746) #18753
  • Files
  • decimal_crash.zip: files to reproduce the bug
  • threaded_crash.zip
  • pydecimal_cases.zip: _pydecimal cases
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/skrah'
    closed_at = <Date 2020-03-03.16:21:32.281>
    created_at = <Date 2020-02-27.15:25:13.411>
    labels = ['interpreter-core', '3.8', '3.9', 'release-blocker', '3.7', 'type-crash']
    title = 'PyContextVar_Get(): crash due to race condition in updating tstate->id'
    updated_at = <Date 2020-03-05.00:42:46.093>
    user = 'https://bugs.python.org/boytsovea'

    bugs.python.org fields:

    activity = <Date 2020-03-05.00:42:46.093>
    actor = 'yselivanov'
    assignee = 'skrah'
    closed = True
    closed_date = <Date 2020-03-03.16:21:32.281>
    closer = 'skrah'
    components = ['Interpreter Core']
    creation = <Date 2020-02-27.15:25:13.411>
    creator = 'boytsovea'
    dependencies = []
    files = ['48923', '48930', '48945']
    hgrepos = []
    issue_num = 39776
    keywords = ['patch', '3.7regression']
    message_count = 25.0
    messages = ['362807', '362812', '362820', '362825', '362834', '362865', '362869', '362875', '362876', '362877', '362878', '362880', '362882', '362885', '363192', '363208', '363239', '363240', '363248', '363266', '363267', '363270', '363274', '363338', '363406']
    nosy_count = 6.0
    nosy_names = ['vstinner', 'ned.deily', 'skrah', 'lukasz.langa', 'yselivanov', 'boytsovea']
    pr_nums = ['18746', '18752', '18753']
    priority = 'release blocker'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'crash'
    url = 'https://bugs.python.org/issue39776'
    versions = ['Python 3.7', 'Python 3.8', 'Python 3.9']

    @boytsovea
    Copy link
    Mannequin Author

    boytsovea mannequin commented Feb 27, 2020

    Hello everybody!

    We are using Python 3.7 running at CentOS 7 x64. Python is used as a library to create dynamic extensions for our app server.

    Some time ago we began to experience crashes in decimal module in some heavy-multithreaded scenarios. After some testing and debugging I was able to reproduce it without our own code using only pybind11 library to simplify embedding (in real app we are using boost.python).

    I've built python 3.8 with clang 7 and address sanitizer enabled and got error "use-after-free" with some additional data.

    Please find attached C++ source file, python module and ASAN output. Is it really a bug (most probably - data race) or there is something wrong with such embedding scenario?

    @boytsovea boytsovea mannequin added 3.7 (EOL) end of life 3.8 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) labels Feb 27, 2020
    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Feb 27, 2020

    Before I look at the example code: Can you also reproduce this with
    Python 3.6? The threading code in _decimal was changed to a ContextVar
    in 3.7.

    There's a high chance though that the problem is in the c++ module.

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Feb 27, 2020

    I've briefly looked at the zip archive. Without going much into
    the C++ module as a whole, this should not be done:

        gil_unlocker.UnlockGILAndSleep()
        self.val = decimal.Decimal(1) / decimal.Decimal(7)
        gil_unlocker.UnlockGILAndSleep()

    If you want C++ threads with a released GIL, you should use libmpdec
    directly and not the Python module.

    @skrah skrah mannequin closed this as completed Feb 27, 2020
    @skrah skrah mannequin added the invalid label Feb 27, 2020
    @skrah skrah mannequin closed this as completed Feb 27, 2020
    @skrah skrah mannequin added the invalid label Feb 27, 2020
    @boytsovea
    Copy link
    Mannequin Author

    boytsovea mannequin commented Feb 27, 2020

    Please note, that UnlockGILandSleep takes GIL back before returning. In a real production code there is a database query. In this example I emulate them with random sleep. So I don't see any problems here.

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Feb 27, 2020

    I built your example with 3.6:

    git clone https://github.com/pybind/pybind11
    wget https://bugs.python.org/file48923/decimal_crash.zip
    unzip decimal_crash.zip

    git checkout v3.6.7
    ./configure --with-pydebug
    make

    g++ -std=c++11 -pthread -Wno-unused-result -Wsign-compare -g -Og -Wall -Wextra -Wno-unused-result -Wno-unused-parameter -Wno-missing-field-initializers -I. -I./Include -I./pybind11/include -c main.cpp

    g++ -pthread -Xlinker -export-dynamic -o main main.o libpython3.6dm.a -lpthread -ldl -lutil -lm

    cp python python3
    PATH=.:$PATH
    ./main

    And I literally get this error (not always, it may take 10 runs or so):

    $ ./main
    Fatal Python error: Python memory allocator called without holding the GIL

    Thread 0x00007f1e73fff700 (most recent call first):

    Thread 0x00007f1e7b836700 (most recent call first):

    Thread 0x00007f1e7a834700 (most recent call first):

    Thread 0x00007f1e7b035700 (most recent call first):

    Thread 0x00007f1e7d039700 (most recent call first):

    Thread 0x00007f1e7c838700 (most recent call first):

    Current thread 0x00007f1e7c037700 (most recent call first):

    Thread 0x00007f1e7e84f740 (most recent call first):
    Aborted (core dumped)

    So no, I don't think the GIL handling is correct.

    @boytsovea
    Copy link
    Mannequin Author

    boytsovea mannequin commented Feb 28, 2020

    Thank you for feedback. I will try to reproduce the issue with 3.6. By the way, haven't you used gdb with python pretty-printers enabled to examine the state of the program? I've got the same error message, then I breaked the execution in debugger and tried to examine the callstack of threads, that stucked in UnlockGILandSleep. The reason for it is clear: then the debugger tries to build a callstack, some of pretty printers try to execute some python code to give a better representation of interpreter objects. The code is executed at the top of the stack of the examined thread. Since this thread explicitly released the GIL before going to sleep, these functions hit the assert about calling the memory allocator without holdng the GIL. Disabling pretty-printers makes these error messages to disappear.

    @boytsovea
    Copy link
    Mannequin Author

    boytsovea mannequin commented Feb 28, 2020

    I'am unable to reproduce neither my or your issues with python 3.6. The program runs infinitely as it meant to be. Can you please give me C++ traceback from the core dump, which was created when you ran my program?

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Feb 28, 2020

    This is 3.6.7, compiled --with-pydebug:

    $ ./main 
    Aborted (core dumped)

    (gdb) bt
    #0 0x00007f9974077428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
    #1 0x00007f997407902a in __GI_abort () at abort.c:89
    #2 0x000000000056e2d1 in Py_FatalError (msg=msg@entry=0x62ccf8 "Python memory allocator called without holding the GIL") at Python/pylifecycle.c:1462
    #3 0x00000000004c0cec in _PyMem_DebugCheckGIL () at Objects/obmalloc.c:1963
    #4 0x00000000004c0d27 in _PyMem_DebugMalloc (ctx=0x8f7220 <_PyMem_Debug+96>, nbytes=75) at Objects/obmalloc.c:1971
    #5 0x00000000004c204e in PyObject_Malloc (size=<optimized out>) at Objects/obmalloc.c:479
    #6 0x00000000004ec12f in PyUnicode_New (size=10, maxchar=<optimized out>) at Objects/unicodeobject.c:1281
    #7 0x00000000005162f4 in _PyUnicodeWriter_PrepareInternal (writer=writer@entry=0x7f9971ca4cf0, length=length@entry=10, maxchar=<optimized out>, maxchar@entry=127) at Objects/unicodeobject.c:13565
    #8 0x000000000051af20 in PyUnicode_DecodeUTF8Stateful (s=0x61d15b "crash_test", size=10, errors=errors@entry=0x0, consumed=consumed@entry=0x0) at Objects/unicodeobject.c:5067
    #9 0x000000000051c6b0 in PyUnicode_FromString (u=<optimized out>) at Objects/unicodeobject.c:2077
    #10 0x0000000000563c1c in PyImport_ImportModule (name=<optimized out>) at Python/import.c:1266
    #11 0x00000000004531dd in pybind11::module::import (name=0x61d15b "crash_test") at ./pybind11/include/pybind11/pybind11.h:849
    #12 0x0000000000446434 in ThreadFunc () at main.cpp:30
    #13 0x000000000046a1b1 in std::_Bind_simple<void (())()>::_M_invoke<>(std::_Index_tuple<>) (this=0x10c28d8) at /usr/include/c++/5/functional:1531
    #14 0x000000000046a10a in std::_Bind_simple<void (
    ())()>::operator()() (this=0x10c28d8) at /usr/include/c++/5/functional:1520
    #15 0x000000000046a09a in std::thread::_Impl<std::_Bind_simple<void (*())()> >::_M_run() (this=0x10c28c0) at /usr/include/c++/5/thread:115
    #16 0x00007f99749e3c80 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
    #17 0x00007f99750bb6ba in start_thread (arg=0x7f9971ca5700) at pthread_create.c:333
    #18 0x00007f997414941d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Feb 28, 2020

    Note that my pybind11 is from GitHub master, it can also be a pybind11
    issue.

    It is interesting that you cannot reproduce your original issue with
    3.6, so I'm reopening this issue.

    I think we need a reproducer without pybind11 though, could you
    tweak Programs/_testembed.c (from the CPython sources) to run the
    crash script?

    @skrah skrah mannequin reopened this Feb 28, 2020
    @skrah skrah mannequin reopened this Feb 28, 2020
    @skrah skrah mannequin removed invalid labels Feb 28, 2020
    @boytsovea
    Copy link
    Mannequin Author

    boytsovea mannequin commented Feb 28, 2020

    Your callstack is very strange. At line 30 of main.cpp GIL is obviously locked:

       // importing module in this tread
       gstate = PyGILState_Ensure();
       py::module crash_test = py::module::import( "crash_test" ); <-- import
       PyGILState_Release( gstate );

    I suppose that there is something wrong with your setup. Maybe - wrong working directory for the main executable, which doesn't contain crash_test.py

    Also I've tried to revert this patch #5278 for 3.7. It makes problem to disappear, 1 hour of stable work under ASAN. So I suppose it is the source of the bug.

    I will try to tweak _testembed.c.

    @boytsovea boytsovea mannequin added invalid labels Feb 28, 2020
    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Feb 28, 2020

    Regarding *my* issue, it could be anything, e.g. a missing call to
    PyEval_InitThreads() in 3.6:

    "Changed in version 3.7: This function is now called by Py_Initialize(), so you don’t have to call it yourself anymore."

    This is why we need to eliminate pybind11 so we can see what is
    actually going on.

    @boytsovea
    Copy link
    Mannequin Author

    boytsovea mannequin commented Feb 28, 2020

    I rewrote my example without pybind and eliminated C++ module (I realized that time.sleep() also releases the GIL, so we achieve the same effect). Still the same results: with python 3.7.3 app crashes with attached ASAN output, with python 3.7.3 without #5278 works just fine.

    To run main.cpp you should add directory with crash_test.py to PYTHONPATH.

    @boytsovea
    Copy link
    Mannequin Author

    boytsovea mannequin commented Feb 28, 2020

    Also I understood the source of your crash with my initial example. Since you haven't used CMake to configure project, pybind didn't setup required macroses to enable threading support. So no issues in pybind.

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Feb 28, 2020

    With python 3.7.3 without #5278 works just fine.

    Thanks, I'm now getting the same results as you. Looking at the smaller
    test case, I also agree that it should work (as it did in 3.6).

    @skrah skrah mannequin added type-crash A hard crash of the interpreter, possibly with a core dump and removed invalid labels Feb 28, 2020
    @pitrou pitrou added 3.9 only security fixes labels Mar 1, 2020
    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Mar 2, 2020

    I think the PR fixes the issue but I have to run longer tests still.

    Threads created by PyGILState_Ensure() could have a duplicate tstate->id,
    which confused the ContextVar caching machinery.

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Mar 2, 2020

    New changeset b3b9ade by Stefan Krah in branch 'master':
    bpo-39776: Lock ++interp->tstate_next_unique_id. (GH-18746) (bpo-18746)
    b3b9ade

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Mar 3, 2020

    New changeset 5a92f42 by Stefan Krah in branch '3.8':
    bpo-39776: Lock ++interp->tstate_next_unique_id. (GH-18746) (bpo-18746) (bpo-18752)
    5a92f42

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Mar 3, 2020

    New changeset 852aee6 by Stefan Krah in branch '3.7':
    bpo-39776: Lock ++interp->tstate_next_unique_id (GH-18746)
    852aee6

    @skrah skrah mannequin changed the title Crash in decimal module in heavy-multithreaded scenario PyContextVar_Get(): crash due to race condition in updating tstate->id Mar 3, 2020
    @skrah skrah mannequin changed the title Crash in decimal module in heavy-multithreaded scenario PyContextVar_Get(): crash due to race condition in updating tstate->id Mar 3, 2020
    @skrah skrah mannequin assigned skrah Mar 3, 2020
    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Mar 3, 2020

    Since many duplicate tstate ids are generated in the test case
    before the crash happens, I guess a set of tstates with the same
    id silently uses the cached context of the "winner" tstate.

    This can lead to incorrect results without noticing.

    @skrah skrah mannequin added deferred-blocker labels Mar 3, 2020
    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Mar 3, 2020

    Also _pydecimal was affected. This is a modified version of Evgeny's
    test case:

    from _pydecimal import *
    from time import sleep
    from random import randint
    import sys
    
    
    sys.setswitchinterval(0.0000001)
    
    
    def usleep(x):
        sleep(x/1000000.0)
    
    class Test:
    
        def __init__(self, threadno):
            self.threadno = threadno
    
        def GetCallback(self):
            usleep(randint(0, 9));
            setcontext(Context(Emax=self.threadno))
            usleep(randint(0, 9))
            c = getcontext()
            try:
                if c.Emax != self.threadno:
                    raise RuntimeError("GetCallback: another thread changed this thread's context")
            except AttributeError:
                raise RuntimeError("GetCallback: type(c)==%s and c.Emax disappeared" % type(c))
            usleep(randint(0, 9))
            return self.Callback
    
        def Callback(self):
            c = getcontext()
            try:
                c.Emax = self.threadno
            except AttributeError:
                raise RuntimeError("Callback: type(c)==%s and c.Emax disappeared" % type(c))
    
    def DoTest(threadno):
        return Test(threadno).GetCallback()

    It produces one of the exceptions or a segfault. You may have
    to build in release mode to get the race.

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Mar 3, 2020

    Setting to release blocker, but please move to deferred again
    if a release is almost finished.

    @skrah skrah mannequin added release-blocker and removed deferred-blocker labels Mar 3, 2020
    @ned-deily
    Copy link
    Member

    Thanks for the heads-up and the fix, Stefan. The fix for 3.7.x was merged before the 3.7.7rc1 cutoff (by a few hours!) and the next 3.8.x release cutoff is planned for April and 3.9.0a5 is later in March, so, if you are not planning to merge any other changes for this issue, we can set the issue status to "closed" now.

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Mar 3, 2020

    Thanks Ned, closing then.

    Evgeny, please reopen if you see it again (I ran the tests for about
    20 min, way above the usual reproduction time of 1 min).

    Thanks for the very instructive test case!

    @skrah skrah mannequin closed this as completed Mar 3, 2020
    @skrah skrah mannequin closed this as completed Mar 3, 2020
    @boytsovea
    Copy link
    Mannequin Author

    boytsovea mannequin commented Mar 4, 2020

    I checked both my test example and real production code with your patch. I'm unable to reproduce the bug, so I think it is fixed now. Thank you!

    @1st1
    Copy link
    Member

    1st1 commented Mar 5, 2020

    Thank you so much, Stefan, for looking into this. Really appreciate the help.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life 3.8 only security fixes 3.9 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) release-blocker type-crash A hard crash of the interpreter, possibly with a core dump
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants