Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash in remove() weak reference callback of weakref.WeakValueDictionary at Python exit #82187

Closed
tiran opened this issue Sep 2, 2019 · 96 comments
Assignees
Labels
3.7 (EOL) end of life 3.8 only security fixes 3.9 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) type-crash A hard crash of the interpreter, possibly with a core dump

Comments

@tiran
Copy link
Member

tiran commented Sep 2, 2019

BPO 38006
Nosy @tim-one, @nascheme, @pitrou, @larryhastings, @tiran, @benjaminp, @ned-deily, @encukou, @methane, @ambv, @markshannon, @jdemeyer, @pablogsal, @miss-islington
PRs
  • bpo-38006: Avoid closure in weakref.WeakValueDictionary #15641
  • [3.8] bpo-38006: Avoid closure in weakref.WeakValueDictionary (GH-15641) #15787
  • [3.7] bpo-38006: Avoid closure in weakref.WeakValueDictionary (GH-15641) #15789
  • bpo-38006: Clear weakrefs in garbage found by the GC #16495
  • [3.8] bpo-38006: Clear weakrefs in garbage found by the GC (GH-16495) #16499
  • [3.8] bpo-33418: Restore tp_clear for function object #16502
  • bpo-38006: Add unit test for weakref clear bug #16788
  • Files
  • reproducer.tar.gz
  • gc_crash.py
  • gc_crash.patch
  • gc_disable_wr_callback.txt
  • gc_weakref_bug_demo.py
  • gc_weakref_bug_demo2.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/nascheme'
    closed_at = <Date 2021-11-08.21:45:42.669>
    created_at = <Date 2019-09-02.09:48:27.174>
    labels = ['interpreter-core', '3.7', '3.8', '3.9', 'type-crash']
    title = 'Crash in remove() weak reference callback of weakref.WeakValueDictionary at Python exit'
    updated_at = <Date 2021-11-08.21:45:42.668>
    user = 'https://github.com/tiran'

    bugs.python.org fields:

    activity = <Date 2021-11-08.21:45:42.668>
    actor = 'nascheme'
    assignee = 'nascheme'
    closed = True
    closed_date = <Date 2021-11-08.21:45:42.669>
    closer = 'nascheme'
    components = ['Interpreter Core']
    creation = <Date 2019-09-02.09:48:27.174>
    creator = 'christian.heimes'
    dependencies = []
    files = ['48586', '48587', '48588', '48630', '48633', '48637']
    hgrepos = []
    issue_num = 38006
    keywords = ['patch', '3.8regression']
    message_count = 96.0
    messages = ['350974', '350975', '350976', '350981', '350983', '350985', '350987', '350990', '350991', '350992', '350995', '350996', '350997', '350998', '350999', '351006', '351007', '351009', '351010', '351012', '351015', '351017', '351020', '351037', '351038', '351039', '351040', '351041', '351042', '351067', '351072', '351077', '351082', '351083', '351086', '351098', '351099', '351492', '351519', '351561', '351676', '353371', '353378', '353398', '353399', '353410', '353412', '353414', '353439', '353480', '353506', '353512', '353515', '353520', '353521', '353522', '353524', '353527', '353528', '353531', '353580', '353592', '353593', '353596', '353597', '353598', '353600', '353601', '353602', '353603', '353604', '353605', '353606', '353607', '353608', '353610', '353623', '353631', '353642', '353643', '353708', '353713', '353726', '353729', '353867', '353875', '353883', '353884', '353896', '353961', '354674', '354678', '354685', '354766', '405980', '405984']
    nosy_count = 14.0
    nosy_names = ['tim.peters', 'nascheme', 'pitrou', 'larry', 'christian.heimes', 'benjamin.peterson', 'ned.deily', 'petr.viktorin', 'methane', 'lukasz.langa', 'Mark.Shannon', 'jdemeyer', 'pablogsal', 'miss-islington']
    pr_nums = ['15641', '15787', '15789', '16495', '16499', '16502', '16788']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'crash'
    url = 'https://bugs.python.org/issue38006'
    versions = ['Python 2.7', 'Python 3.5', 'Python 3.6', 'Python 3.7', 'Python 3.8', 'Python 3.9']

    @tiran
    Copy link
    Member Author

    tiran commented Sep 2, 2019

    I have a case of a segfault on Fedora 32 (currently rawhide) with Python 3.8b4 and FreeIPA. The ipactl Python helper crashes with a segfault when the Python process exits. The segfault occurs in _PyFunction_Vectorcall() with a weakref related local scope function that has a NULL func_code pointer. _PyFunction_Vectorcall() does not check the code object for NULL and access the variable without check:

    Program received signal SIGSEGV, Segmentation fault.
    0x00007ffff7be354f in _PyFunction_Vectorcall (func=<function at remote 0x7fffe6ebfd30>, stack=0x7fffffffdb50, nargsf=1, kwnames=0x0) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/call.c:395
    395 Py_ssize_t nkwargs = (kwnames == NULL) ? 0 : PyTuple_GET_SIZE(kwnames);
    (gdb) bt
    #0 0x00007ffff7be354f in _PyFunction_Vectorcall (func=<function at remote 0x7fffe6ebfd30>, stack=0x7fffffffdb50, nargsf=1, kwnames=0x0) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/call.c:395
    #1 0x00007ffff7bb16b0 in _PyObject_Vectorcall (kwnames=0x0, nargsf=1, args=0x7fffffffdb50, callable=<function at remote 0x7fffe6ebfd30>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Include/cpython/abstract.h:127
    #2 _PyObject_FastCall (nargs=1, args=0x7fffffffdb50, func=<function at remote 0x7fffe6ebfd30>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Include/cpython/abstract.h:147
    #3 object_vacall (base=<optimized out>, callable=<function at remote 0x7fffe6ebfd30>, vargs=<optimized out>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/call.c:1186
    #4 0x00007ffff7bb19e1 in PyObject_CallFunctionObjArgs (callable=callable@entry=<function at remote 0x7fffe6ebfd30>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/call.c:1259
    #5 0x00007ffff7c3f7c3 in handle_callback (ref=<optimized out>, callback=<function at remote 0x7fffe6ebfd30>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/weakrefobject.c:877
    #6 0x00007ffff7c01bf7 in PyObject_ClearWeakRefs (object=<optimized out>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/weakrefobject.c:922
    #7 0x00007fffe905babc in ctypedescr_dealloc () from /usr/lib64/python3.8/site-packages/_cffi_backend.cpython-38-x86_64-linux-gnu.so
    #8 0x00007fffe9058825 in cfield_dealloc () from /usr/lib64/python3.8/site-packages/_cffi_backend.cpython-38-x86_64-linux-gnu.so
    #9 0x00007ffff7ba4c65 in _Py_DECREF (filename=<synthetic pointer>, lineno=541, op=<optimized out>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/dictobject.c:2000
    #10 _Py_XDECREF (op=<optimized out>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Include/object.h:541
    #11 free_keys_object (keys=0x55555609f590) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/dictobject.c:580
    #12 dictkeys_decref (dk=0x55555609f590) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/dictobject.c:324
    #13 dict_dealloc (mp=0x7fffe6dd5340) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/dictobject.c:1994
    #14 0x00007fffe905bb25 in ctypedescr_dealloc () from /usr/lib64/python3.8/site-packages/_cffi_backend.cpython-38-x86_64-linux-gnu.so
    #15 0x00007ffff7ba5058 in _Py_DECREF (filename=<optimized out>, lineno=541, op=<optimized out>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Include/object.h:478
    #16 _Py_XDECREF (op=<optimized out>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Include/object.h:541
    #17 tupledealloc (op=0x7fffe6e3bbb0) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/tupleobject.c:247
    #18 0x00007ffff7ba5058 in _Py_DECREF (filename=<optimized out>, lineno=541, op=<optimized out>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Include/object.h:478
    #19 _Py_XDECREF (op=<optimized out>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Include/object.h:541
    #20 tupledealloc (op=0x7fffe6e55b80) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/tupleobject.c:247
    #21 0x00007ffff7c16b76 in _Py_DECREF (filename=0x7ffff7d2e6a8 "/builddir/build/BUILD/Python-3.8.0b4/Objects/typeobject.c", lineno=1110, op=<optimized out>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Include/object.h:478
    #22 clear_slots (self=<optimized out>, type=0x55555559d0d0) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/typeobject.c:1110
    #23 subtype_dealloc (self=<KeyedRef at remote 0x7fffe6deb220>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/typeobject.c:1262
    #24 0x00007ffff7ba4c65 in _Py_DECREF (filename=<synthetic pointer>, lineno=541, op=<optimized out>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/dictobject.c:2000
    #25 _Py_XDECREF (op=<optimized out>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Include/object.h:541
    #26 free_keys_object (keys=0x555556078d30) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/dictobject.c:580
    #27 dictkeys_decref (dk=0x555556078d30) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/dictobject.c:324
    #28 dict_dealloc (mp=0x7fffe6eba6c0) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/dictobject.c:1994
    #29 0x00007ffff7b98a4b in _Py_DECREF (filename=<synthetic pointer>, lineno=541, op=<optimized out>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Include/object.h:478
    #30 _Py_XDECREF (op=<optimized out>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Include/object.h:541
    #31 cell_dealloc (op=0x7fffe6ebe2e0) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/cellobject.c:84
    #32 0x00007ffff7ba5058 in _Py_DECREF (filename=<optimized out>, lineno=541, op=<optimized out>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Include/object.h:478
    #33 _Py_XDECREF (op=<optimized out>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Include/object.h:541
    #34 tupledealloc (op=0x7fffe6ebe310) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/tupleobject.c:247
    #35 0x00007ffff7b991c2 in _Py_DECREF (filename=0x7ffff7d2e460 "/builddir/build/BUILD/Python-3.8.0b4/Objects/funcobject.c", lineno=584, op=<optimized out>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Include/object.h:478
    #36 func_clear (op=0x7fffe6ebfd30) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/funcobject.c:584
    #37 0x00007ffff7ba5f84 in delete_garbage (state=<optimized out>, state=<optimized out>, old=<optimized out>, collectable=<optimized out>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Modules/gcmodule.c:929
    #38 collect (generation=2, n_collected=0x0, n_uncollectable=0x0, nofail=1, state=0x7ffff7dd9598 <_PyRuntime+344>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Modules/gcmodule.c:1106
    #39 0x00007ffff7cac8c2 in _PyGC_CollectNoFail () at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Modules/gcmodule.c:1848
    #40 0x00007ffff7cacb74 in PyImport_Cleanup () at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Python/import.c:541
    #41 0x00007ffff7cacf56 in Py_FinalizeEx () at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Python/pylifecycle.c:1226
    #42 0x00007ffff7cad088 in Py_Exit (sts=0) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Python/pylifecycle.c:2248
    #43 0x00007ffff7cad0cf in handle_system_exit () at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Python/pythonrun.c:658
    #44 0x00007ffff7cad219 in _PyErr_PrintEx (set_sys_last_vars=1, tstate=0x55555555bfb0) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Python/pythonrun.c:755
    #45 PyErr_PrintEx (set_sys_last_vars=set_sys_last_vars@entry=1) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Python/pythonrun.c:755
    #46 0x00007ffff7cad23a in PyErr_Print () at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Python/pythonrun.c:761
    #47 0x00007ffff7b91b42 in PyRun_SimpleFileExFlags (fp=<optimized out>, filename=<optimized out>, closeit=<optimized out>, flags=0x7fffffffe178) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Python/pythonrun.c:434
    #48 0x00007ffff7cae36f in pymain_run_file (cf=0x7fffffffe178, config=0x55555555b410) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Modules/main.c:383
    #49 pymain_run_python (exitcode=0x7fffffffe170) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Modules/main.c:567
    #50 Py_RunMain () at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Modules/main.c:646
    #51 0x00007ffff7cae559 in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Modules/main.c:700
    #52 0x00007ffff7e21193 in __libc_start_main () from /lib64/libc.so.6
    #53 0x000055555555508e in _start ()

    It looks like the func object is the local remove function of weakref.WeakValueDictionary() implementation,

    cpython/Lib/weakref.py

    Lines 90 to 112 in 353053d

    class WeakValueDictionary(_collections_abc.MutableMapping):
    """Mapping class that references values weakly.
    Entries in the dictionary will be discarded when no strong
    reference to the value exists anymore
    """
    # We inherit the constructor without worrying about the input
    # dictionary; since it uses our .update() method, we get the right
    # checks (if the other dictionary is a WeakValueDictionary,
    # objects are unwrapped on the way out, and we always wrap on the
    # way in).
    def __init__(self, other=(), /, **kw):
    def remove(wr, selfref=ref(self), _atomic_removal=_remove_dead_weakref):
    self = selfref()
    if self is not None:
    if self._iterating:
    self._pending_removals.append(wr.key)
    else:
    # Atomic removal is necessary since this function
    # can be called asynchronously by the GC
    _atomic_removal(d, wr.key)
    self._remove = remove

    All function object fields except func_qualname and vectorcall are already NULL.

    (gdb) print ((PyFunctionObject*)func).func_annotations
    $19 = 0x0
    (gdb) print ((PyFunctionObject*)func).func_closure
    $20 = 0x0
    (gdb) print ((PyFunctionObject*)func).func_code
    $21 = 0x0
    (gdb) print ((PyFunctionObject*)func).func_defaults
    $22 = 0x0
    (gdb) print ((PyFunctionObject*)func).func_dict
    $23 = 0x0
    (gdb) print ((PyFunctionObject*)func).func_doc
    $24 = 0x0
    (gdb) print ((PyFunctionObject*)func).func_globals
    $25 = 0x0
    (gdb) print ((PyFunctionObject*)func).func_kwdefaults
    $26 = 0x0
    (gdb) print ((PyFunctionObject*)func).func_module
    $27 = 0x0
    (gdb) print ((PyFunctionObject*)func).func_name
    $28 = 0x0
    (gdb) print ((PyFunctionObject*)func).func_qualname
    $29 = 'WeakValueDictionary.__init__.<locals>.remove'
    (gdb) print ((PyFunctionObject*)func).func_weakreflist
    $30 = 0x0
    (gdb) print ((PyFunctionObject*)func).vectorcall
    $31 = (vectorcallfunc) 0x7ffff7be3530 <_PyFunction_Vectorcall>
    (gdb) print func.ob_refcnt
    $32 = 135
    (gdb) print func.ob_type
    $33 = (struct _typeobject *) 0x7ffff7dd4e80 <PyFunction_Type>

    The Fedora downstream bug is https://bugzilla.redhat.com/show_bug.cgi?id=1747901

    @tiran tiran added release-blocker 3.8 only security fixes 3.9 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) type-crash A hard crash of the interpreter, possibly with a core dump labels Sep 2, 2019
    @encukou
    Copy link
    Member

    encukou commented Sep 2, 2019

    I don't understand how the function ended up with func_code=NULL. That shouldn't be a valid function to call, IMO.
    Do you have any info on how the function ended up in that state?

    @tiran
    Copy link
    Member Author

    tiran commented Sep 2, 2019

    Not yet.

    My current hypothesis is the function code object is already cleaned up *somehow*.

    @vstinner
    Copy link
    Member

    vstinner commented Sep 2, 2019

    I don't understand how the function ended up with func_code=NULL. That shouldn't be a valid function to call, IMO. Do you have any info on how the function ended up in that state?

    It doesn't seem possible to create a function with func_code=NULL, nor to set func_code to NULL. func_code can be be set to NULL by func_clear() which is called by func_dealloc().

    I bet that func_clear() has been called since most func fields are set to NULL, which is consistent with:

    static int
    func_clear(PyFunctionObject *op)
    {
        Py_CLEAR(op->func_code);
        Py_CLEAR(op->func_globals);
        Py_CLEAR(op->func_module);
        Py_CLEAR(op->func_name);
        Py_CLEAR(op->func_defaults);
        Py_CLEAR(op->func_kwdefaults);
        Py_CLEAR(op->func_doc);
        Py_CLEAR(op->func_dict);
        Py_CLEAR(op->func_closure);
        Py_CLEAR(op->func_annotations);
        Py_CLEAR(op->func_qualname);
        return 0;
    }

    The question is how is it possible that a deallocated function is still accessed? It smells like a borrowed reference somewhere in the call chain.

    @vstinner
    Copy link
    Member

    vstinner commented Sep 2, 2019

    Using gdb, I checked if func_clear() can be cleared outside func_dealloc(): yes, delete_garbage() (gcmodule.c) calls type->clear(). But I'm surprised that the function would be seen as "unreachable" if it's reference counter was equal to 135:

    (gdb) print func.ob_refcnt
    $32 = 135

    @vstinner vstinner changed the title _PyFunction_Vectorcall() can segfault on process exit Crash in remove() weak reference callback of weakref.WeakValueDictionary at Python exit Sep 2, 2019
    @vstinner
    Copy link
    Member

    vstinner commented Sep 2, 2019

    (gdb) print ((PyFunctionObject*)func).func_qualname
    $29 = 'WeakValueDictionary.__init__.<locals>.remove'

    This is a temporary remove() function declared in weakref.WeakValueDictionary constructor:

    class WeakKeyDictionary(_collections_abc.MutableMapping):
        def __init__(self, dict=None):
            def remove(k, selfref=ref(self)):
                ...
            self._remove = remove
            ...
    
        def __setitem__(self, key, value):
            self.data[ref(key, self._remove)] = value
    ...
    

    Simplified gdb traceback:

    Py_FinalizeEx()
    -> PyImport_Cleanup()
    -> _PyGC_CollectNoFail()
    -> delete_garbage()
    -> func_clear(op=0x7fffe6ebfd30) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/funcobject.c:584
        Py_CLEAR(op->func_closure);
    -> tupledealloc()
    -> cell_dealloc()
    -> dict_dealloc()
    -> dictkeys_decref()
    -> cfield_dealloc () from _cffi_backend
    -> PyObject_ClearWeakRefs()
    -> _PyFunction_Vectorcall (func=<function at remote 0x7fffe6ebfd30>, ...)

    PyImport_Cleanup() calls delete_garbage() which calls func_clear(op=0x7fffe6ebfd30).

    But while the function closure is destroyed, a cffi object stored in a dict (namespace?) (stored in the function closure) is cleared, and there was a weak reference to this cffi object with a callback... which is the cleared function (Python object 0x7fffe6ebfd30)...

    --

    I don't understand why/how remove() gets a closure. I modified weakref.py to ensure that remove.__closure__ is None: test_weakref still pass.

    @vstinner
    Copy link
    Member

    vstinner commented Sep 2, 2019

    I don't understand why/how remove() gets a closure.

    Ok, I found the reason and I wrote PR 15641 to fix it.

    Previous fix of WeakValueDictionary remove() function:

    commit 9cd7e17
    Author: Łukasz Langa <lukasz@langa.pl>
    Date: Fri Feb 10 00:14:55 2017 -0800

    Fix bpo-29519: weakref spewing exceptions during interp finalization
    

    @tiran
    Copy link
    Member Author

    tiran commented Sep 2, 2019

    Your workaround solves the segfault issue for me.

    @pablogsal
    Copy link
    Member

    I think the real problem is in vectorcall. tp_clear will make sure all internal pointers are either NULL or valid (by using Py_CLEAR, which also protects against re-entrancy) and it the responsibility of the object or its users to check that the fields that are needed are valid. From the docs on tp_clear:

    The Py_CLEAR() macro should be used, because clearing references is delicate: the reference to the contained object must not be decremented until after the pointer to the contained object is set to NULL. This is because decrementing the reference count may cause the contained object to become trash, triggering a chain of reclamation activity that may include invoking arbitrary Python code (due to finalizers, or weakref callbacks, associated with the contained object). If it’s possible for such code to reference self again, it’s important that the pointer to the contained object be NULL at that time, so that self knows the contained object can no longer be used.

    Notice the "so that self knows the contained object can no longer be used". I think _PyFunction_Vectorcall is missing the check "to see of the contained object can or cannot be used". Here the contained object being everything that has being cleaned before:

        Py_CLEAR(op->func_code);
        Py_CLEAR(op->func_globals);
        Py_CLEAR(op->func_module);
        Py_CLEAR(op->func_name);
        Py_CLEAR(op->func_defaults);
        Py_CLEAR(op->func_kwdefaults);
        Py_CLEAR(op->func_doc);
        Py_CLEAR(op->func_dict);
        Py_CLEAR(op->func_closure); <----- Everything before this
        Py_CLEAR(op->func_annotations);
        Py_CLEAR(op->func_qualname);

    I think it will be enough to make a check for func_code being NULL in _PyFunction_Vectorcall or before.

    @pablogsal
    Copy link
    Member

    In particular, this was not happening before because the function type did not implement tp_clear:

    https://github.com/python/cpython/blob/3.7/Objects/funcobject.c#L615

    The new implementation of tp_clear without checks is allowing this to happen.

    @pablogsal
    Copy link
    Member

    See also https://bugs.python.org/issue33418 as the potential source of the problem.

    @encukou
    Copy link
    Member

    encukou commented Sep 2, 2019

    I'm not sure adding a check would solve this. What should be done when a function with func_code=NULL is called? "Silently do nothing" is not really an option; raising an exception wouldn't help much in this case.

    I wonder if a function's tp_clear we should clear the mutable parts (like func_closure) before the immutable ones (func_code).
    Any access to the immutable ones should have checks (and probably raise exceptions if those are NULL).

    @vstinner
    Copy link
    Member

    vstinner commented Sep 2, 2019

    I'm now able to reproduce the FreeIPA crash. In short:

    • Get a Fedora Rawhide VM (to get Python 3.8 as "python3", it's more convenient)
    • Install FreeIPA (dnf install freeipa-server)
    • Run: ipa-server-install --help
    • Python does crash at exit

    In particular, this was not happening before because the function type did not implement tp_clear:
    https://github.com/python/cpython/blob/3.7/Objects/funcobject.c#L615

    I confirm that the patch below does fix the FreeIPA crash:

    diff --git a/Objects/funcobject.c b/Objects/funcobject.c
    index df5cc2d3f5..357372c565 100644
    --- a/Objects/funcobject.c
    +++ b/Objects/funcobject.c
    @@ -669,7 +669,7 @@ PyTypeObject PyFunction_Type = {
         Py_TPFLAGS_METHOD_DESCRIPTOR,               /* tp_flags */
         func_new__doc__,                            /* tp_doc */
         (traverseproc)func_traverse,                /* tp_traverse */
    -    (inquiry)func_clear,                        /* tp_clear */
    +    (inquiry)0,                        /* tp_clear */
         0,                                          /* tp_richcompare */
         offsetof(PyFunctionObject, func_weakreflist), /* tp_weaklistoffset */
         0,                                          /* tp_iter */

    @vstinner
    Copy link
    Member

    vstinner commented Sep 2, 2019

    PyFunction_Type.tp_clear changed in bpo-33418 (previously, it was equal to 0: no clear method):

    commit 3c45240
    Author: INADA Naoki <methane@users.noreply.github.com>
    Date: Wed Jul 4 11:15:50 2018 +0900

    bpo-33418: Add tp_clear for function object (GH-8058)
    
    Without tp_clear, GC can't break cyclic reference.
    It will cause memory leak when cyclic reference is
    created intentionally.
    

    @pablogsal
    Copy link
    Member

    In https://bugs.python.org/issue33418 I proposed reverting tp_clear on function objects.

    What should be done when a function with func_code=NULL is called?

    We can set the error indicator at least. Although I agree that it seems suboptimal. At least I think we should add an assertion to check that the function called is valid so we can identify this situation easier.

    What downsides do we see raising an exception?

    Notice that this can happen without involving weak references. For example:

    You have object A and B. A is a function. The destructor of B calls A. A and B are in a cycle. When calling tp_clear on A the refs of B reaches 0, trying to call A and .... Boom!

    Either we remove tp_clear or somehow we have to protect all callers of the function. Technically, any field cleared from tp_cleared leaves the function in an inconsistent state

    @encukou
    Copy link
    Member

    encukou commented Sep 2, 2019

    What downsides do we see raising an exception?

    Yeah, on second thought, that would probably be best. We still want PR bpo-15641 as well, so the exception doesn't actually happen in practice.

    Note that _PyEval_EvalCodeWithName (the main use of func_globals) already fails with an exception if globals is NULL.

    Either we remove tp_clear or somehow we have to protect all callers of the function. Technically, any field cleared from tp_cleared leaves the function in an inconsistent state

    tp_clear has a purpose; it prevents memory leaks. That leaves protecting access to the cleared fields.

    (But we don't need to clear func_name and func_qualname as these must be strings.)

    @pablogsal
    Copy link
    Member

    The more I think about this the more I think there is something else at play. If the GC is able to follow the dependency chain, all weakrefs should have been already managed correctly and PyObject_ClearWeakRefs should have never be invoked.

    I think *something* is not informing the gc about the dependency chain

    @tiran
    Copy link
    Member Author

    tiran commented Sep 2, 2019

    It's going to be hard to figure that out. FreeIPA uses a ton of C extensions. Some are hand-written, some uses Cython, ctypes, and cffi.

    @encukou
    Copy link
    Member

    encukou commented Sep 2, 2019

    The traceback does have some ctypedescr_dealloc and cfield_dealloc from _cffi. Could you check what those are?

    @pablogsal
    Copy link
    Member

    I think the problem is that whatever is weak-referenced by the weak ref (CField_Type or similar) is not on the gc list. Because is not on the gc list, handle_weakrefs (https://github.com/python/cpython/blob/master/Modules/gcmodule.c#L1090) is not acting correctly on the weakreferences, clearing them before the tp_clear calls and therefore producing the crash.

    @pablogsal
    Copy link
    Member

    Unless we are missing something I think this may be caused by something in cffi that is not implementing gc-related functions correctly, as PyObject_ClearWeakRefs should not be called from a gc run. Given how complicated this is and even if the possible solution is to fix the problem in cffi or elsewhere, we can add an extra check in PyObject_ClearWeakRefs to NOT call callbacks that are in the process of being collected. Technically we should never reach that code, but we won't be segfaulting and the cost is a redundant-check that will be false in the normal path.

    @tiran
    Copy link
    Member Author

    tiran commented Sep 2, 2019

    The biggest CFFI based dependency of FreeIPA is cryptography, but cryptography does not use weakref in its sources. CFFI on the other hand has a WeakValueDictionary object attached to its internal typecache backend, https://bitbucket.org/cffi/cffi/src/bf80722dea36237710083315e336c81bc8571fd0/cffi/model.py#lines-572 . Perhaps that is the culprit?

    I checked the CTypeDescrObject from the stacktrace. Three out of three times it's for "unsigned long(*)(void *)". I wasn't able to get any useful information from the other fields yet.

    (gdb) up 7
    #7 0x00007fffe905babc in ctypedescr_dealloc (ct=0x7fffe6d958b0) at c/_cffi_backend.c:401
    401 PyObject_ClearWeakRefs((PyObject *) ct);
    (gdb) p (char*)ct.ct_name
    $1 = 0x7fffe6d95908 "unsigned long(*)(void *)"
    (gdb) p ct
    $2 = (CTypeDescrObject *) 0x7fffe6d958b0

    @pablogsal
    Copy link
    Member

    For the weakref to be handled correctly the ctypedescr needs to be identified correctly as part of the isolated cycle. If is not in the isolated cycle something may be missing tp_traverse or the GC flags.

    Can you check if the ctypedescr is part of GC.objects()?

    @vstinner
    Copy link
    Member

    vstinner commented Sep 2, 2019

    I investigated the FreeIPA crash.

    • Python 3.8 behaves differently because func_clear() has been implemented (bpo-33418, commit 3c45240)

    • The bug is a crash on a function call (_PyFunction_Vectorcall) because the function has been cleared (by func_clear), but there was still a weak reference using this function as its callback.

    • Note: the function is called *during* it's being cleared by func_clear().

    • The GC has a workaround for weak references part of "unreachable" objects, but its handle_weakrefs() function doesn't work because CFFI CField_Type type doesn't implement tp_traverse.

    --

    PR 15641 just hides the real bug.

    One issue is that CFFI doesn't implement correctly the GC protocol. If an object contains another object, its type must:

    • Have Py_TPFLAGS_HAVE_GC flag
    • Implement tp_traverse
    • Use PyObject_GC_Malloc() to allocate an object
    • Call PyObject_GC_Track() on created object

    Another issue is that the GC doesn't prevent the crash. Would it be possible to prevent the crash without changing the behavior (ex: still call weakref callbacks)?

    @tim-one
    Copy link
    Member

    tim-one commented Sep 30, 2019

    Łukasz, all type objects have tp_clear slots, and always did. The patch in question put something useful in the function object's tp_clear slot instead of leaving it NULL. No interface, as such, changes either way.

    @nascheme
    Copy link
    Member

    Is introducing tp_clear on functions a thing that has ABI consequences? In other words, if we take our time on this, would it land in 3.8.1 or 3.9.0?

    I think it should not have ABI consequences. However, I see the addition of tp_clear as a new feature. Leaking memory due to reference cycles is bad behavior but not a bug to be fixed in a maintenance release. Unless we require full GC protocol on every container object, we can't promise not to leak.

    Inaka's change to add func tp_clear has been in all the 3.8.0 pre-releases (I think). So, it should be pretty well exercised by now (at least, as well as the per-releases are tested).

    @ambv
    Copy link
    Contributor

    ambv commented Sep 30, 2019

    If either of you resurrects tp_clear in functions in the next 12 hours or so, I'll merge it.

    @nascheme
    Copy link
    Member

    I created #60706. I'm not exactly sure of the process of making a revert on a branch like '3.8' so hopefully it is correct. The code change is exactly what has been reverted in 3.8. The net effect will be as if the revert was never done.

    @nascheme nascheme added the 3.7 (EOL) end of life label Sep 30, 2019
    @nascheme
    Copy link
    Member

    nascheme commented Oct 1, 2019

    I worked some on trying to create a unit test this evening. Attached is my best result so far (gc_weakref_bug_demo.py). It requires that you enable the 'xx' module so we have a container object without tp_traverse. We should probably add one to _testcapi for the real unit test so we have it.

    I can make the weakref callback run from within delete_garbage(). I haven't been able to make tp_clear for the function execute before the callback and I think that is needed to replicate the crash in this bug report.

    @ambv
    Copy link
    Contributor

    ambv commented Oct 1, 2019

    New changeset b361207 by Łukasz Langa (Neil Schemenauer) in branch '3.8':
    Restore tp_clear for function object. (bpo-16502)
    b361207

    @ambv
    Copy link
    Contributor

    ambv commented Oct 1, 2019

    Thanks a million everyone, this is now hopefully solved. I'm leaving it open for Neil to experiment with his unit test (which would be amazing to have!).

    @tim-one
    Copy link
    Member

    tim-one commented Oct 1, 2019

    Neil, about this comment:

    # - ct is not yet trash (it actually is but the GC doesn't know because of
    # the missing tp_traverse method).

    I believe gc should know ct is trash. ct is in the cf list, and the latter does have tp_traverse.

    What gc won't know is that a is trash, because a is attached to ct, and ct doesn't have tp_traverse.

    It should blow up anyway :-)

    Maybe with a simpler structure it would be easier to rearrange code to nudge the callback into getting cleared before use?

    Z <- Y <- A <-> B -> WZ -> C

    where WZ is a weakref to Z with callback C, and Y doesn't implement tp_traverse. The only cycle is between A and B, which could just as well be the same object. All the other stuff hangs off that cycle.

    It's all trash, but we won't know in advance that Z is part of it.

    @nascheme
    Copy link
    Member

    nascheme commented Oct 1, 2019

    Łukasz, is there some reason you removed old versions (2.7, 3.6, etc)? The bug is present on those versions of Python and it should be trivial to backport the fix. If those branches are maintained, I think we should fix it.

    Attached is a small test program (gc_weakref_bug_demo2.py) that causes the same crash as in this bug report (SEGV inside _PyFunction_Vectorcall). Setting things up is quite tricky and it depends on the order that things are cleared in delete_garbage(). Maybe still worth making a unit test for it?

    Instead of using the 'dummy' function, you can also use types.SimpleNamespace(). Calling repr() on it after tp_clear will result in a crash or an assert failure. Those two crashes are not needed to confirm the bug. Just the fact that 'callback' runs is proof enough. So, in a unit test, the callback to just set a global to indicate failure.

    @vstinner
    Copy link
    Member

    vstinner commented Oct 2, 2019

    Oh, Neil missed "bpo-38006: " prefix in his commit. Here it is:

    commit bcda460
    Author: Neil Schemenauer <nas-github@arctrix.com>
    Date: Mon Sep 30 10:06:45 2019 -0700

    Clear weakrefs in garbage found by the GC (bpo-16495)
    
    Fix a bug due to the interaction of weakrefs and the cyclic garbage
    collector. We must clear any weakrefs in garbage in order to prevent
    their callbacks from executing and causing a crash.
    

    @ambv
    Copy link
    Contributor

    ambv commented Oct 2, 2019

    Łukasz, is there some reason you removed old versions (2.7, 3.6, etc)? The bug is present on those versions of Python and it should be trivial to backport the fix. If those branches are maintained, I think we should fix it.

    Sorry, I did that by mistake.

    @ambv ambv added the 3.7 (EOL) end of life label Oct 2, 2019
    @tim-one
    Copy link
    Member

    tim-one commented Oct 3, 2019

    Loose ends. Telegraphic because typing is hard.

    1. Docs should be changed to encourage implementing the full gc protocol for "all" containers. Spell out what can go wrong if they don't. Be upfront about that history has, at times, proved us too optimistic about that ever since weakrefs were added.

    2. Surprise finalizers (SF). All weakrefs to trash objects were already cleared. So if a SF can access trash, it must be via a chain of strong refs. But in that case, they wouldn't be trash to begin with (since they're reachable from the SF, which is a S because we believed it wasn't trash).

    So best optimistic guess is that only this can go wrong: we promise that a finalizer will run at most once (even if the object is resurrected), but a SF neither records that it has been run nor recognizes that (if so) it's already been run. IOW, a finalizer will be run at most once when it's not a surprise, but will run every time it is a SF.

    1. get_objects() exposing invalid objects. That's no good. Not due to missing tp_traverse, but that the aggregate of trash objects revealed by tp_traverses exceed the aggregate of those reclaimed by tp_clears and subsequent refcounting.

    Must not move invalids to older generation. Move to new internal (to delete_garbage) list instead. That list should be empty at delete_garbage's end. If not, I'd be happy to die with a fatal runtime error. Else, e.g.,

    • add to count of trash objects that could not be collected

    • append them to gc.garbage, each as the sole attribute of a new (say) _GCInvalid container, whose repr/str only shows the address of the invalid object. So the top-level objects in gc.garbage remain valid, but the daring can access the invalid objects indirectly. At least their type pointers should be intact.

    @vstinner
    Copy link
    Member

    vstinner commented Oct 3, 2019

    1. Docs should be changed to encourage implementing the full gc protocol for "all" containers. Spell out what can go wrong if they don't. Be upfront about that history has, at times, proved us too optimistic about that ever since weakrefs were added.

    Would it make any sense to add an opt-in option to emit a warning when a new type is created with Py_TPFLAGS_HAVE_GC but it doesn't implement tp_traverse? Maybe also emit a warning if it doesn't implement tp_clear?

    Maybe it could be a ResourceWarning emitted in development mode, when -X dev is used on the command line.

    @pitrou
    Copy link
    Member

    pitrou commented Oct 3, 2019

    Would it make any sense to add an opt-in option to emit a warning when a new type is created with Py_TPFLAGS_HAVE_GC but it doesn't implement tp_traverse?

    Or even fatal error out? There's no reason to have Py_TPFLAGS_HAVE_GC but not implement tp_traverse, AFAIR.

    @vstinner
    Copy link
    Member

    vstinner commented Oct 3, 2019

    Or even fatal error out? There's no reason to have Py_TPFLAGS_HAVE_GC but not implement tp_traverse, AFAIR.

    Well... It would prefer a smooth transition.

    Such sanity checks which mostly concerns developers perfectly fit the -X dev mode.

    I mean, calling Py_FatalError() only in development mode would be acceptable, rather than blocking the usage of C extensions which are working "perfectly well" on Python 3.7.

    @tim-one
    Copy link
    Member

    tim-one commented Oct 4, 2019

    My understanding is that the CFFI types at issue don't even have Py_TPFLAGS_HAVE_GC. They're completely invisible to gc. As Armin says in the CFFI issue report (linked to earlier), he never got the impression from the docs that he needed to implement anything related to cyclic gc.

    Since Armin is extremely capable, conscientious, and reasonable, that tells me our docs are lacking.

    It was intended at the start that the worst that could happen if a type ignored the gc protocols was that memory may leak. That story changed radically when weakrefs with callbacks were added - but nobody knew it at the time because the weakref design gave no thought to cyclic gc. It's been driven by segfaults ever since ;-)

    We're doing all we can to keep non-cooperating code "working", and will continue to do so. But who knows? The next segfault may be one we can't hack around. It's fundamentally insane to expect any gc to work perfectly when it may be blind to what the object graph _is_.

    @tim-one
    Copy link
    Member

    tim-one commented Oct 4, 2019

    BTW, the phrase "missing tp_traverse" is misleading. If an object with a NULL tp_traverse appears in a gc generation, gc will blow up the next time that generation is collected. That's always been so - gc doesn't check whether tp_traverse is NULL, it just calls it. It's tp_clear that it checks, because that one is optional.

    I don't recall any problem we've had with extensions implementing the gc protocol incorrectly or incompletely. It's this issue's problem: containers not participating in gc _at all_.

    If we had a traditional mark-sweep collector, that would be massively catastrophic. Miss a pointer and you can conclude a live object is actually trash, and tear it down while it's still in use.

    Our problem is the opposite: miss a pointer and we can conclude a trash object is actually live. At the start, the worst that _could_ do is leak memory. It's the later introduction of time-to-die finalizers (weakref callbacks at first) that created the opportunity for segfaults: the only 100% clearly safe way to run finalizers in cyclic trash is to force them to run _before_ anything at all is torn down by force (tp_clear). But to run them in advance, we have to know the relevant objects _are_ trash. Which we can't always know if containers don't always participate.

    While Neil & I haven't thought of ways that can go wrong now beyond that a "surprise finalizer" may get run any number of times, that doesn't mean far worse things can't happen - just that they'll surprise us when they do :-)

    @nascheme
    Copy link
    Member

    It's fundamentally insane to expect any gc to work perfectly when it may be blind to what the object graph _is_.

    I'm often amazed it works at all, let alone perfectly. ;-P

    This did trigger me to think of another possible problem. I setup my unit test as you suggested:

        #   Z <- Y <- A--+--> WZ -> C
        #             ^  |
        #             +--+
        # where:
        #   WZ is a weakref to Z with callback C
        #   Y doesn't implement tp_traverse
        #   A contains a reference to itself, Y and WZ
    

    But what happens if the GC doesn't see that WZ is trash? Then it will not be cleared. Hang it off Y so the GC can't find it. We can set things up so that Z is freed before WZ (e.g. WZ in a second and newer cycle). Then, the callback might still run.

    On further thought, this seems safe (but maybe broken) because of the handle_weakrefs() logic. The GC will think WZ is going to outlive Z so it will call it before doing any tp_clear calls.

    @tim-one
    Copy link
    Member

    tim-one commented Oct 15, 2019

    I'm often amazed it works at all, let alone perfectly. ;-P

    Indeed! Every time I take a break from gc and come back, I burn another hour wondering why it doesn't recycle _everything_ ;-)

    But what happens if the GC doesn't see that WZ is trash?
    Then it will not be cleared. Hang it off Y so the GC
    can't find it.

    In that case it won't think C is trash either (a weakref holds a strong reference to its callback), so C won't be tp_clear'ed - if WZ isn't in the unreachable set, C can't be either.

    We can set things up so that Z is freed before WZ (e.g.
    WZ in a second and newer cycle). Then, the callback
    might still run.

    Since C hasn't been damaged, and nothing reachable from C would have been considered trash either, I don't see a problem.

    On further thought, this seems safe (but maybe broken) because
    of the handle_weakrefs() logic. The GC will think WZ is going
    to outlive Z so it will call it before doing any tp_clear calls.

    Don't think that applies. As above, WZ isn't in the unreachable set, so gc knows nothing about it.

    Z will eventually be reclaimed via refcounting side effect, C _will_ run, and then WZ, and then C, will be reclaimed by refcounting.

    Or am I missing something?

    @tim-one
    Copy link
    Member

    tim-one commented Oct 15, 2019

    While Neil & I haven't thought of ways that can go wrong now
    beyond that a "surprise finalizer" may get run any number of
    times ...

    Speaking of which, I no longer believe that's true. Thanks to the usual layers of baffling renaming via macro after macro, I missed object.c's PyObject_CallFinalizerFromDealloc(), which both respects and sets the "am I finalized?" flag regardless of how, e.g., an object with a __del__ ends its life (and possibly resurrects).

    @nascheme
    Copy link
    Member

    New changeset 392a13b by Neil Schemenauer in branch 'master':
    bpo-38006: Add unit test for weakref clear bug (GH-16788)
    392a13b

    @pablogsal
    Copy link
    Member

    Are we missing something here?

    @nascheme
    Copy link
    Member

    nascheme commented Nov 8, 2021

    Closing since I believe the bug is fixed and there is an appropriate unit test.

    @nascheme nascheme closed this as completed Nov 8, 2021
    @nascheme nascheme self-assigned this Nov 8, 2021
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life 3.8 only security fixes 3.9 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) type-crash A hard crash of the interpreter, possibly with a core dump
    Projects
    None yet
    Development

    No branches or pull requests

    9 participants