Crash in remove() weak reference callback of weakref.WeakValueDictionary at Python exit #82187

tiran · 2019-09-02T09:48:27Z

BPO	38006
Nosy	@tim-one, @nascheme, @pitrou, @larryhastings, @tiran, @benjaminp, @ned-deily, @encukou, @methane, @ambv, @markshannon, @jdemeyer, @pablogsal, @miss-islington
PRs	bpo-38006: Avoid closure in weakref.WeakValueDictionary #15641 [3.8] bpo-38006: Avoid closure in weakref.WeakValueDictionary (GH-15641) #15787 [3.7] bpo-38006: Avoid closure in weakref.WeakValueDictionary (GH-15641) #15789 bpo-38006: Clear weakrefs in garbage found by the GC #16495 [3.8] bpo-38006: Clear weakrefs in garbage found by the GC (GH-16495) #16499 [3.8] bpo-33418: Restore tp_clear for function object #16502 bpo-38006: Add unit test for weakref clear bug #16788
Files	reproducer.tar.gz gc_crash.py gc_crash.patch gc_disable_wr_callback.txt gc_weakref_bug_demo.py gc_weakref_bug_demo2.py

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = 'https://github.com/nascheme'
closed_at = <Date 2021-11-08.21:45:42.669>
created_at = <Date 2019-09-02.09:48:27.174>
labels = ['interpreter-core', '3.7', '3.8', '3.9', 'type-crash']
title = 'Crash in remove() weak reference callback of weakref.WeakValueDictionary at Python exit'
updated_at = <Date 2021-11-08.21:45:42.668>
user = 'https://github.com/tiran'

bugs.python.org fields:

activity = <Date 2021-11-08.21:45:42.668>
actor = 'nascheme'
assignee = 'nascheme'
closed = True
closed_date = <Date 2021-11-08.21:45:42.669>
closer = 'nascheme'
components = ['Interpreter Core']
creation = <Date 2019-09-02.09:48:27.174>
creator = 'christian.heimes'
dependencies = []
files = ['48586', '48587', '48588', '48630', '48633', '48637']
hgrepos = []
issue_num = 38006
keywords = ['patch', '3.8regression']
message_count = 96.0
messages = ['350974', '350975', '350976', '350981', '350983', '350985', '350987', '350990', '350991', '350992', '350995', '350996', '350997', '350998', '350999', '351006', '351007', '351009', '351010', '351012', '351015', '351017', '351020', '351037', '351038', '351039', '351040', '351041', '351042', '351067', '351072', '351077', '351082', '351083', '351086', '351098', '351099', '351492', '351519', '351561', '351676', '353371', '353378', '353398', '353399', '353410', '353412', '353414', '353439', '353480', '353506', '353512', '353515', '353520', '353521', '353522', '353524', '353527', '353528', '353531', '353580', '353592', '353593', '353596', '353597', '353598', '353600', '353601', '353602', '353603', '353604', '353605', '353606', '353607', '353608', '353610', '353623', '353631', '353642', '353643', '353708', '353713', '353726', '353729', '353867', '353875', '353883', '353884', '353896', '353961', '354674', '354678', '354685', '354766', '405980', '405984']
nosy_count = 14.0
nosy_names = ['tim.peters', 'nascheme', 'pitrou', 'larry', 'christian.heimes', 'benjamin.peterson', 'ned.deily', 'petr.viktorin', 'methane', 'lukasz.langa', 'Mark.Shannon', 'jdemeyer', 'pablogsal', 'miss-islington']
pr_nums = ['15641', '15787', '15789', '16495', '16499', '16502', '16788']
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'crash'
url = 'https://bugs.python.org/issue38006'
versions = ['Python 2.7', 'Python 3.5', 'Python 3.6', 'Python 3.7', 'Python 3.8', 'Python 3.9']

tiran · 2019-09-02T09:48:26Z

I have a case of a segfault on Fedora 32 (currently rawhide) with Python 3.8b4 and FreeIPA. The ipactl Python helper crashes with a segfault when the Python process exits. The segfault occurs in _PyFunction_Vectorcall() with a weakref related local scope function that has a NULL func_code pointer. _PyFunction_Vectorcall() does not check the code object for NULL and access the variable without check:

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7be354f in _PyFunction_Vectorcall (func=<function at remote 0x7fffe6ebfd30>, stack=0x7fffffffdb50, nargsf=1, kwnames=0x0) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/call.c:395
395 Py_ssize_t nkwargs = (kwnames == NULL) ? 0 : PyTuple_GET_SIZE(kwnames);
(gdb) bt
#0 0x00007ffff7be354f in _PyFunction_Vectorcall (func=<function at remote 0x7fffe6ebfd30>, stack=0x7fffffffdb50, nargsf=1, kwnames=0x0) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/call.c:395
#1 0x00007ffff7bb16b0 in _PyObject_Vectorcall (kwnames=0x0, nargsf=1, args=0x7fffffffdb50, callable=<function at remote 0x7fffe6ebfd30>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Include/cpython/abstract.h:127
#2 _PyObject_FastCall (nargs=1, args=0x7fffffffdb50, func=<function at remote 0x7fffe6ebfd30>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Include/cpython/abstract.h:147
#3 object_vacall (base=<optimized out>, callable=<function at remote 0x7fffe6ebfd30>, vargs=<optimized out>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/call.c:1186
#4 0x00007ffff7bb19e1 in PyObject_CallFunctionObjArgs (callable=callable@entry=<function at remote 0x7fffe6ebfd30>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/call.c:1259
#5 0x00007ffff7c3f7c3 in handle_callback (ref=<optimized out>, callback=<function at remote 0x7fffe6ebfd30>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/weakrefobject.c:877
#6 0x00007ffff7c01bf7 in PyObject_ClearWeakRefs (object=<optimized out>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/weakrefobject.c:922
#7 0x00007fffe905babc in ctypedescr_dealloc () from /usr/lib64/python3.8/site-packages/_cffi_backend.cpython-38-x86_64-linux-gnu.so
#8 0x00007fffe9058825 in cfield_dealloc () from /usr/lib64/python3.8/site-packages/_cffi_backend.cpython-38-x86_64-linux-gnu.so
#9 0x00007ffff7ba4c65 in _Py_DECREF (filename=<synthetic pointer>, lineno=541, op=<optimized out>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/dictobject.c:2000
#10 _Py_XDECREF (op=<optimized out>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Include/object.h:541
#11 free_keys_object (keys=0x55555609f590) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/dictobject.c:580
#12 dictkeys_decref (dk=0x55555609f590) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/dictobject.c:324
#13 dict_dealloc (mp=0x7fffe6dd5340) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/dictobject.c:1994
#14 0x00007fffe905bb25 in ctypedescr_dealloc () from /usr/lib64/python3.8/site-packages/_cffi_backend.cpython-38-x86_64-linux-gnu.so
#15 0x00007ffff7ba5058 in _Py_DECREF (filename=<optimized out>, lineno=541, op=<optimized out>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Include/object.h:478
#16 _Py_XDECREF (op=<optimized out>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Include/object.h:541
#17 tupledealloc (op=0x7fffe6e3bbb0) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/tupleobject.c:247
#18 0x00007ffff7ba5058 in _Py_DECREF (filename=<optimized out>, lineno=541, op=<optimized out>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Include/object.h:478
#19 _Py_XDECREF (op=<optimized out>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Include/object.h:541
#20 tupledealloc (op=0x7fffe6e55b80) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/tupleobject.c:247
#21 0x00007ffff7c16b76 in _Py_DECREF (filename=0x7ffff7d2e6a8 "/builddir/build/BUILD/Python-3.8.0b4/Objects/typeobject.c", lineno=1110, op=<optimized out>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Include/object.h:478
#22 clear_slots (self=<optimized out>, type=0x55555559d0d0) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/typeobject.c:1110
#23 subtype_dealloc (self=<KeyedRef at remote 0x7fffe6deb220>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/typeobject.c:1262
#24 0x00007ffff7ba4c65 in _Py_DECREF (filename=<synthetic pointer>, lineno=541, op=<optimized out>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/dictobject.c:2000
#25 _Py_XDECREF (op=<optimized out>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Include/object.h:541
#26 free_keys_object (keys=0x555556078d30) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/dictobject.c:580
#27 dictkeys_decref (dk=0x555556078d30) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/dictobject.c:324
#28 dict_dealloc (mp=0x7fffe6eba6c0) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/dictobject.c:1994
#29 0x00007ffff7b98a4b in _Py_DECREF (filename=<synthetic pointer>, lineno=541, op=<optimized out>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Include/object.h:478
#30 _Py_XDECREF (op=<optimized out>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Include/object.h:541
#31 cell_dealloc (op=0x7fffe6ebe2e0) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/cellobject.c:84
#32 0x00007ffff7ba5058 in _Py_DECREF (filename=<optimized out>, lineno=541, op=<optimized out>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Include/object.h:478
#33 _Py_XDECREF (op=<optimized out>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Include/object.h:541
#34 tupledealloc (op=0x7fffe6ebe310) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/tupleobject.c:247
#35 0x00007ffff7b991c2 in _Py_DECREF (filename=0x7ffff7d2e460 "/builddir/build/BUILD/Python-3.8.0b4/Objects/funcobject.c", lineno=584, op=<optimized out>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Include/object.h:478
#36 func_clear (op=0x7fffe6ebfd30) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/funcobject.c:584
#37 0x00007ffff7ba5f84 in delete_garbage (state=<optimized out>, state=<optimized out>, old=<optimized out>, collectable=<optimized out>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Modules/gcmodule.c:929
#38 collect (generation=2, n_collected=0x0, n_uncollectable=0x0, nofail=1, state=0x7ffff7dd9598 <_PyRuntime+344>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Modules/gcmodule.c:1106
#39 0x00007ffff7cac8c2 in _PyGC_CollectNoFail () at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Modules/gcmodule.c:1848
#40 0x00007ffff7cacb74 in PyImport_Cleanup () at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Python/import.c:541
#41 0x00007ffff7cacf56 in Py_FinalizeEx () at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Python/pylifecycle.c:1226
#42 0x00007ffff7cad088 in Py_Exit (sts=0) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Python/pylifecycle.c:2248
#43 0x00007ffff7cad0cf in handle_system_exit () at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Python/pythonrun.c:658
#44 0x00007ffff7cad219 in _PyErr_PrintEx (set_sys_last_vars=1, tstate=0x55555555bfb0) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Python/pythonrun.c:755
#45 PyErr_PrintEx (set_sys_last_vars=set_sys_last_vars@entry=1) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Python/pythonrun.c:755
#46 0x00007ffff7cad23a in PyErr_Print () at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Python/pythonrun.c:761
#47 0x00007ffff7b91b42 in PyRun_SimpleFileExFlags (fp=<optimized out>, filename=<optimized out>, closeit=<optimized out>, flags=0x7fffffffe178) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Python/pythonrun.c:434
#48 0x00007ffff7cae36f in pymain_run_file (cf=0x7fffffffe178, config=0x55555555b410) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Modules/main.c:383
#49 pymain_run_python (exitcode=0x7fffffffe170) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Modules/main.c:567
#50 Py_RunMain () at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Modules/main.c:646
#51 0x00007ffff7cae559 in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Modules/main.c:700
#52 0x00007ffff7e21193 in __libc_start_main () from /lib64/libc.so.6
#53 0x000055555555508e in _start ()

It looks like the func object is the local remove function of weakref.WeakValueDictionary() implementation,

cpython/Lib/weakref.py

Lines 90 to 112 in 353053d

    
           class WeakValueDictionary(_collections_abc.MutableMapping): 
        
               """Mapping class that references values weakly. 
        
               Entries in the dictionary will be discarded when no strong 
        
               reference to the value exists anymore 
        
               """ 
        
               # We inherit the constructor without worrying about the input 
        
               # dictionary; since it uses our .update() method, we get the right 
        
               # checks (if the other dictionary is a WeakValueDictionary, 
        
               # objects are unwrapped on the way out, and we always wrap on the 
        
               # way in). 
        
               def __init__(self, other=(), /, **kw): 
        
                   def remove(wr, selfref=ref(self), _atomic_removal=_remove_dead_weakref): 
        
                       self = selfref() 
        
                       if self is not None: 
        
                           if self._iterating: 
        
                               self._pending_removals.append(wr.key) 
        
                           else: 
        
                               # Atomic removal is necessary since this function 
        
                               # can be called asynchronously by the GC 
        
                               _atomic_removal(d, wr.key) 
        
                   self._remove = remove

All function object fields except func_qualname and vectorcall are already NULL.

(gdb) print ((PyFunctionObject*)func).func_annotations
$19 = 0x0
(gdb) print ((PyFunctionObject*)func).func_closure
$20 = 0x0
(gdb) print ((PyFunctionObject*)func).func_code
$21 = 0x0
(gdb) print ((PyFunctionObject*)func).func_defaults
$22 = 0x0
(gdb) print ((PyFunctionObject*)func).func_dict
$23 = 0x0
(gdb) print ((PyFunctionObject*)func).func_doc
$24 = 0x0
(gdb) print ((PyFunctionObject*)func).func_globals
$25 = 0x0
(gdb) print ((PyFunctionObject*)func).func_kwdefaults
$26 = 0x0
(gdb) print ((PyFunctionObject*)func).func_module
$27 = 0x0
(gdb) print ((PyFunctionObject*)func).func_name
$28 = 0x0
(gdb) print ((PyFunctionObject*)func).func_qualname
$29 = 'WeakValueDictionary.__init__.<locals>.remove'
(gdb) print ((PyFunctionObject*)func).func_weakreflist
$30 = 0x0
(gdb) print ((PyFunctionObject*)func).vectorcall
$31 = (vectorcallfunc) 0x7ffff7be3530 <_PyFunction_Vectorcall>
(gdb) print func.ob_refcnt
$32 = 135
(gdb) print func.ob_type
$33 = (struct _typeobject *) 0x7ffff7dd4e80 <PyFunction_Type>

The Fedora downstream bug is https://bugzilla.redhat.com/show_bug.cgi?id=1747901

encukou · 2019-09-02T10:06:56Z

I don't understand how the function ended up with func_code=NULL. That shouldn't be a valid function to call, IMO.
Do you have any info on how the function ended up in that state?

tiran · 2019-09-02T10:11:48Z

Not yet.

My current hypothesis is the function code object is already cleaned up *somehow*.

vstinner · 2019-09-02T10:57:55Z

I don't understand how the function ended up with func_code=NULL. That shouldn't be a valid function to call, IMO. Do you have any info on how the function ended up in that state?

It doesn't seem possible to create a function with func_code=NULL, nor to set func_code to NULL. func_code can be be set to NULL by func_clear() which is called by func_dealloc().

I bet that func_clear() has been called since most func fields are set to NULL, which is consistent with:

static int
func_clear(PyFunctionObject *op)
{
    Py_CLEAR(op->func_code);
    Py_CLEAR(op->func_globals);
    Py_CLEAR(op->func_module);
    Py_CLEAR(op->func_name);
    Py_CLEAR(op->func_defaults);
    Py_CLEAR(op->func_kwdefaults);
    Py_CLEAR(op->func_doc);
    Py_CLEAR(op->func_dict);
    Py_CLEAR(op->func_closure);
    Py_CLEAR(op->func_annotations);
    Py_CLEAR(op->func_qualname);
    return 0;
}

The question is how is it possible that a deallocated function is still accessed? It smells like a borrowed reference somewhere in the call chain.

vstinner · 2019-09-02T11:08:05Z

Using gdb, I checked if func_clear() can be cleared outside func_dealloc(): yes, delete_garbage() (gcmodule.c) calls type->clear(). But I'm surprised that the function would be seen as "unreachable" if it's reference counter was equal to 135:

(gdb) print func.ob_refcnt
$32 = 135

vstinner · 2019-09-02T11:27:54Z

(gdb) print ((PyFunctionObject*)func).func_qualname
$29 = 'WeakValueDictionary.__init__.<locals>.remove'

This is a temporary remove() function declared in weakref.WeakValueDictionary constructor:

class WeakKeyDictionary(_collections_abc.MutableMapping):
    def __init__(self, dict=None):
        def remove(k, selfref=ref(self)):
            ...
        self._remove = remove
        ...

    def __setitem__(self, key, value):
        self.data[ref(key, self._remove)] = value

...

Simplified gdb traceback:

Py_FinalizeEx()
-> PyImport_Cleanup()
-> _PyGC_CollectNoFail()
-> delete_garbage()
-> func_clear(op=0x7fffe6ebfd30) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/funcobject.c:584
    Py_CLEAR(op->func_closure);
-> tupledealloc()
-> cell_dealloc()
-> dict_dealloc()
-> dictkeys_decref()
-> cfield_dealloc () from _cffi_backend
-> PyObject_ClearWeakRefs()
-> _PyFunction_Vectorcall (func=<function at remote 0x7fffe6ebfd30>, ...)

PyImport_Cleanup() calls delete_garbage() which calls func_clear(op=0x7fffe6ebfd30).

But while the function closure is destroyed, a cffi object stored in a dict (namespace?) (stored in the function closure) is cleared, and there was a weak reference to this cffi object with a callback... which is the cleared function (Python object 0x7fffe6ebfd30)...

--

I don't understand why/how remove() gets a closure. I modified weakref.py to ensure that remove.__closure__ is None: test_weakref still pass.

vstinner · 2019-09-02T11:40:33Z

I don't understand why/how remove() gets a closure.

Ok, I found the reason and I wrote PR 15641 to fix it.

Previous fix of WeakValueDictionary remove() function:

commit 9cd7e17
Author: Łukasz Langa <lukasz@langa.pl>
Date: Fri Feb 10 00:14:55 2017 -0800

Fix bpo-29519: weakref spewing exceptions during interp finalization

tiran · 2019-09-02T12:09:13Z

Your workaround solves the segfault issue for me.

pablogsal · 2019-09-02T12:34:51Z

I think the real problem is in vectorcall. tp_clear will make sure all internal pointers are either NULL or valid (by using Py_CLEAR, which also protects against re-entrancy) and it the responsibility of the object or its users to check that the fields that are needed are valid. From the docs on tp_clear:

The Py_CLEAR() macro should be used, because clearing references is delicate: the reference to the contained object must not be decremented until after the pointer to the contained object is set to NULL. This is because decrementing the reference count may cause the contained object to become trash, triggering a chain of reclamation activity that may include invoking arbitrary Python code (due to finalizers, or weakref callbacks, associated with the contained object). If it’s possible for such code to reference self again, it’s important that the pointer to the contained object be NULL at that time, so that self knows the contained object can no longer be used.

Notice the "so that self knows the contained object can no longer be used". I think _PyFunction_Vectorcall is missing the check "to see of the contained object can or cannot be used". Here the contained object being everything that has being cleaned before:

    Py_CLEAR(op->func_code);
    Py_CLEAR(op->func_globals);
    Py_CLEAR(op->func_module);
    Py_CLEAR(op->func_name);
    Py_CLEAR(op->func_defaults);
    Py_CLEAR(op->func_kwdefaults);
    Py_CLEAR(op->func_doc);
    Py_CLEAR(op->func_dict);
    Py_CLEAR(op->func_closure); <----- Everything before this
    Py_CLEAR(op->func_annotations);
    Py_CLEAR(op->func_qualname);

I think it will be enough to make a check for func_code being NULL in _PyFunction_Vectorcall or before.

pablogsal · 2019-09-02T12:41:34Z

In particular, this was not happening before because the function type did not implement tp_clear:

https://github.com/python/cpython/blob/3.7/Objects/funcobject.c#L615

The new implementation of tp_clear without checks is allowing this to happen.

pablogsal · 2019-09-02T12:46:24Z

See also https://bugs.python.org/issue33418 as the potential source of the problem.

encukou · 2019-09-02T12:55:34Z

I'm not sure adding a check would solve this. What should be done when a function with func_code=NULL is called? "Silently do nothing" is not really an option; raising an exception wouldn't help much in this case.

I wonder if a function's tp_clear we should clear the mutable parts (like func_closure) before the immutable ones (func_code).
Any access to the immutable ones should have checks (and probably raise exceptions if those are NULL).

vstinner · 2019-09-02T13:07:55Z

I'm now able to reproduce the FreeIPA crash. In short:

Get a Fedora Rawhide VM (to get Python 3.8 as "python3", it's more convenient)
Install FreeIPA (dnf install freeipa-server)
Run: ipa-server-install --help
Python does crash at exit

In particular, this was not happening before because the function type did not implement tp_clear:
https://github.com/python/cpython/blob/3.7/Objects/funcobject.c#L615

I confirm that the patch below does fix the FreeIPA crash:

diff --git a/Objects/funcobject.c b/Objects/funcobject.c
index df5cc2d3f5..357372c565 100644
--- a/Objects/funcobject.c
+++ b/Objects/funcobject.c
@@ -669,7 +669,7 @@ PyTypeObject PyFunction_Type = {
     Py_TPFLAGS_METHOD_DESCRIPTOR,               /* tp_flags */
     func_new__doc__,                            /* tp_doc */
     (traverseproc)func_traverse,                /* tp_traverse */
-    (inquiry)func_clear,                        /* tp_clear */
+    (inquiry)0,                        /* tp_clear */
     0,                                          /* tp_richcompare */
     offsetof(PyFunctionObject, func_weakreflist), /* tp_weaklistoffset */
     0,                                          /* tp_iter */

vstinner · 2019-09-02T13:09:25Z

PyFunction_Type.tp_clear changed in bpo-33418 (previously, it was equal to 0: no clear method):

commit 3c45240
Author: INADA Naoki <methane@users.noreply.github.com>
Date: Wed Jul 4 11:15:50 2018 +0900

bpo-33418: Add tp_clear for function object (GH-8058)

Without tp_clear, GC can't break cyclic reference.
It will cause memory leak when cyclic reference is
created intentionally.

pablogsal · 2019-09-02T13:15:12Z

In https://bugs.python.org/issue33418 I proposed reverting tp_clear on function objects.

What should be done when a function with func_code=NULL is called?

We can set the error indicator at least. Although I agree that it seems suboptimal. At least I think we should add an assertion to check that the function called is valid so we can identify this situation easier.

What downsides do we see raising an exception?

Notice that this can happen without involving weak references. For example:

You have object A and B. A is a function. The destructor of B calls A. A and B are in a cycle. When calling tp_clear on A the refs of B reaches 0, trying to call A and .... Boom!

Either we remove tp_clear or somehow we have to protect all callers of the function. Technically, any field cleared from tp_cleared leaves the function in an inconsistent state

encukou · 2019-09-02T14:46:17Z

What downsides do we see raising an exception?

Yeah, on second thought, that would probably be best. We still want PR bpo-15641 as well, so the exception doesn't actually happen in practice.

Note that _PyEval_EvalCodeWithName (the main use of func_globals) already fails with an exception if globals is NULL.

Either we remove tp_clear or somehow we have to protect all callers of the function. Technically, any field cleared from tp_cleared leaves the function in an inconsistent state

tp_clear has a purpose; it prevents memory leaks. That leaves protecting access to the cleared fields.

(But we don't need to clear func_name and func_qualname as these must be strings.)

pablogsal · 2019-09-02T15:06:40Z

The more I think about this the more I think there is something else at play. If the GC is able to follow the dependency chain, all weakrefs should have been already managed correctly and PyObject_ClearWeakRefs should have never be invoked.

I think *something* is not informing the gc about the dependency chain

tiran · 2019-09-02T15:11:23Z

It's going to be hard to figure that out. FreeIPA uses a ton of C extensions. Some are hand-written, some uses Cython, ctypes, and cffi.

encukou · 2019-09-02T15:13:38Z

The traceback does have some ctypedescr_dealloc and cfield_dealloc from _cffi. Could you check what those are?

pablogsal · 2019-09-02T15:15:57Z

I think the problem is that whatever is weak-referenced by the weak ref (CField_Type or similar) is not on the gc list. Because is not on the gc list, handle_weakrefs (https://github.com/python/cpython/blob/master/Modules/gcmodule.c#L1090) is not acting correctly on the weakreferences, clearing them before the tp_clear calls and therefore producing the crash.

pablogsal · 2019-09-02T15:44:02Z

Unless we are missing something I think this may be caused by something in cffi that is not implementing gc-related functions correctly, as PyObject_ClearWeakRefs should not be called from a gc run. Given how complicated this is and even if the possible solution is to fix the problem in cffi or elsewhere, we can add an extra check in PyObject_ClearWeakRefs to NOT call callbacks that are in the process of being collected. Technically we should never reach that code, but we won't be segfaulting and the cost is a redundant-check that will be false in the normal path.

tiran · 2019-09-02T16:10:40Z

The biggest CFFI based dependency of FreeIPA is cryptography, but cryptography does not use weakref in its sources. CFFI on the other hand has a WeakValueDictionary object attached to its internal typecache backend, https://bitbucket.org/cffi/cffi/src/bf80722dea36237710083315e336c81bc8571fd0/cffi/model.py#lines-572 . Perhaps that is the culprit?

I checked the CTypeDescrObject from the stacktrace. Three out of three times it's for "unsigned long(*)(void *)". I wasn't able to get any useful information from the other fields yet.

(gdb) up 7
#7 0x00007fffe905babc in ctypedescr_dealloc (ct=0x7fffe6d958b0) at c/_cffi_backend.c:401
401 PyObject_ClearWeakRefs((PyObject *) ct);
(gdb) p (char*)ct.ct_name
$1 = 0x7fffe6d95908 "unsigned long(*)(void *)"
(gdb) p ct
$2 = (CTypeDescrObject *) 0x7fffe6d958b0

pablogsal · 2019-09-02T16:23:16Z

For the weakref to be handled correctly the ctypedescr needs to be identified correctly as part of the isolated cycle. If is not in the isolated cycle something may be missing tp_traverse or the GC flags.

Can you check if the ctypedescr is part of GC.objects()?

vstinner · 2019-09-02T21:09:22Z

I investigated the FreeIPA crash.

Python 3.8 behaves differently because func_clear() has been implemented (bpo-33418, commit 3c45240)
The bug is a crash on a function call (_PyFunction_Vectorcall) because the function has been cleared (by func_clear), but there was still a weak reference using this function as its callback.
Note: the function is called *during* it's being cleared by func_clear().
The GC has a workaround for weak references part of "unreachable" objects, but its handle_weakrefs() function doesn't work because CFFI CField_Type type doesn't implement tp_traverse.

--

PR 15641 just hides the real bug.

One issue is that CFFI doesn't implement correctly the GC protocol. If an object contains another object, its type must:

Have Py_TPFLAGS_HAVE_GC flag
Implement tp_traverse
Use PyObject_GC_Malloc() to allocate an object
Call PyObject_GC_Track() on created object

Another issue is that the GC doesn't prevent the crash. Would it be possible to prevent the crash without changing the behavior (ex: still call weakref callbacks)?

tim-one · 2019-09-30T18:21:58Z

Łukasz, all type objects have tp_clear slots, and always did. The patch in question put something useful in the function object's tp_clear slot instead of leaving it NULL. No interface, as such, changes either way.

nascheme · 2019-09-30T18:26:46Z

Is introducing tp_clear on functions a thing that has ABI consequences? In other words, if we take our time on this, would it land in 3.8.1 or 3.9.0?

I think it should not have ABI consequences. However, I see the addition of tp_clear as a new feature. Leaking memory due to reference cycles is bad behavior but not a bug to be fixed in a maintenance release. Unless we require full GC protocol on every container object, we can't promise not to leak.

Inaka's change to add func tp_clear has been in all the 3.8.0 pre-releases (I think). So, it should be pretty well exercised by now (at least, as well as the per-releases are tested).

ambv · 2019-09-30T18:58:03Z

If either of you resurrects tp_clear in functions in the next 12 hours or so, I'll merge it.

nascheme · 2019-09-30T20:50:01Z

I created #60706. I'm not exactly sure of the process of making a revert on a branch like '3.8' so hopefully it is correct. The code change is exactly what has been reverted in 3.8. The net effect will be as if the revert was never done.

nascheme · 2019-10-01T03:10:16Z

I worked some on trying to create a unit test this evening. Attached is my best result so far (gc_weakref_bug_demo.py). It requires that you enable the 'xx' module so we have a container object without tp_traverse. We should probably add one to _testcapi for the real unit test so we have it.

I can make the weakref callback run from within delete_garbage(). I haven't been able to make tp_clear for the function execute before the callback and I think that is needed to replicate the crash in this bug report.

ambv · 2019-10-01T07:49:41Z

New changeset b361207 by Łukasz Langa (Neil Schemenauer) in branch '3.8':
Restore tp_clear for function object. (bpo-16502)
b361207

ambv · 2019-10-01T07:51:09Z

Thanks a million everyone, this is now hopefully solved. I'm leaving it open for Neil to experiment with his unit test (which would be amazing to have!).

tim-one · 2019-10-01T18:29:30Z

Neil, about this comment:

# - ct is not yet trash (it actually is but the GC doesn't know because of
# the missing tp_traverse method).

I believe gc should know ct is trash. ct is in the cf list, and the latter does have tp_traverse.

What gc won't know is that a is trash, because a is attached to ct, and ct doesn't have tp_traverse.

It should blow up anyway :-)

Maybe with a simpler structure it would be easier to rearrange code to nudge the callback into getting cleared before use?

Z <- Y <- A <-> B -> WZ -> C

where WZ is a weakref to Z with callback C, and Y doesn't implement tp_traverse. The only cycle is between A and B, which could just as well be the same object. All the other stuff hangs off that cycle.

It's all trash, but we won't know in advance that Z is part of it.

nascheme · 2019-10-01T20:55:23Z

Łukasz, is there some reason you removed old versions (2.7, 3.6, etc)? The bug is present on those versions of Python and it should be trivial to backport the fix. If those branches are maintained, I think we should fix it.

Attached is a small test program (gc_weakref_bug_demo2.py) that causes the same crash as in this bug report (SEGV inside _PyFunction_Vectorcall). Setting things up is quite tricky and it depends on the order that things are cleared in delete_garbage(). Maybe still worth making a unit test for it?

Instead of using the 'dummy' function, you can also use types.SimpleNamespace(). Calling repr() on it after tp_clear will result in a crash or an assert failure. Those two crashes are not needed to confirm the bug. Just the fact that 'callback' runs is proof enough. So, in a unit test, the callback to just set a global to indicate failure.

vstinner · 2019-10-02T08:23:36Z

Oh, Neil missed "bpo-38006: " prefix in his commit. Here it is:

commit bcda460
Author: Neil Schemenauer <nas-github@arctrix.com>
Date: Mon Sep 30 10:06:45 2019 -0700

Clear weakrefs in garbage found by the GC (bpo-16495)

Fix a bug due to the interaction of weakrefs and the cyclic garbage
collector. We must clear any weakrefs in garbage in order to prevent
their callbacks from executing and causing a crash.

ambv · 2019-10-02T10:21:06Z

Łukasz, is there some reason you removed old versions (2.7, 3.6, etc)? The bug is present on those versions of Python and it should be trivial to backport the fix. If those branches are maintained, I think we should fix it.

Sorry, I did that by mistake.

tim-one · 2019-10-03T17:57:54Z

Loose ends. Telegraphic because typing is hard.

Docs should be changed to encourage implementing the full gc protocol for "all" containers. Spell out what can go wrong if they don't. Be upfront about that history has, at times, proved us too optimistic about that ever since weakrefs were added.
Surprise finalizers (SF). All weakrefs to trash objects were already cleared. So if a SF can access trash, it must be via a chain of strong refs. But in that case, they wouldn't be trash to begin with (since they're reachable from the SF, which is a S because we believed it wasn't trash).

So best optimistic guess is that only this can go wrong: we promise that a finalizer will run at most once (even if the object is resurrected), but a SF neither records that it has been run nor recognizes that (if so) it's already been run. IOW, a finalizer will be run at most once when it's not a surprise, but will run every time it is a SF.

get_objects() exposing invalid objects. That's no good. Not due to missing tp_traverse, but that the aggregate of trash objects revealed by tp_traverses exceed the aggregate of those reclaimed by tp_clears and subsequent refcounting.

Must not move invalids to older generation. Move to new internal (to delete_garbage) list instead. That list should be empty at delete_garbage's end. If not, I'd be happy to die with a fatal runtime error. Else, e.g.,

add to count of trash objects that could not be collected
append them to gc.garbage, each as the sole attribute of a new (say) _GCInvalid container, whose repr/str only shows the address of the invalid object. So the top-level objects in gc.garbage remain valid, but the daring can access the invalid objects indirectly. At least their type pointers should be intact.

vstinner · 2019-10-03T20:52:20Z

Docs should be changed to encourage implementing the full gc protocol for "all" containers. Spell out what can go wrong if they don't. Be upfront about that history has, at times, proved us too optimistic about that ever since weakrefs were added.

Would it make any sense to add an opt-in option to emit a warning when a new type is created with Py_TPFLAGS_HAVE_GC but it doesn't implement tp_traverse? Maybe also emit a warning if it doesn't implement tp_clear?

Maybe it could be a ResourceWarning emitted in development mode, when -X dev is used on the command line.

pitrou · 2019-10-03T22:35:15Z

Would it make any sense to add an opt-in option to emit a warning when a new type is created with Py_TPFLAGS_HAVE_GC but it doesn't implement tp_traverse?

Or even fatal error out? There's no reason to have Py_TPFLAGS_HAVE_GC but not implement tp_traverse, AFAIR.

vstinner · 2019-10-03T22:41:55Z

Or even fatal error out? There's no reason to have Py_TPFLAGS_HAVE_GC but not implement tp_traverse, AFAIR.

Well... It would prefer a smooth transition.

Such sanity checks which mostly concerns developers perfectly fit the -X dev mode.

I mean, calling Py_FatalError() only in development mode would be acceptable, rather than blocking the usage of C extensions which are working "perfectly well" on Python 3.7.

tim-one · 2019-10-04T03:40:37Z

My understanding is that the CFFI types at issue don't even have Py_TPFLAGS_HAVE_GC. They're completely invisible to gc. As Armin says in the CFFI issue report (linked to earlier), he never got the impression from the docs that he needed to implement anything related to cyclic gc.

Since Armin is extremely capable, conscientious, and reasonable, that tells me our docs are lacking.

It was intended at the start that the worst that could happen if a type ignored the gc protocols was that memory may leak. That story changed radically when weakrefs with callbacks were added - but nobody knew it at the time because the weakref design gave no thought to cyclic gc. It's been driven by segfaults ever since ;-)

We're doing all we can to keep non-cooperating code "working", and will continue to do so. But who knows? The next segfault may be one we can't hack around. It's fundamentally insane to expect any gc to work perfectly when it may be blind to what the object graph _is_.

tim-one · 2019-10-04T17:51:10Z

BTW, the phrase "missing tp_traverse" is misleading. If an object with a NULL tp_traverse appears in a gc generation, gc will blow up the next time that generation is collected. That's always been so - gc doesn't check whether tp_traverse is NULL, it just calls it. It's tp_clear that it checks, because that one is optional.

I don't recall any problem we've had with extensions implementing the gc protocol incorrectly or incompletely. It's this issue's problem: containers not participating in gc _at all_.

If we had a traditional mark-sweep collector, that would be massively catastrophic. Miss a pointer and you can conclude a live object is actually trash, and tear it down while it's still in use.

Our problem is the opposite: miss a pointer and we can conclude a trash object is actually live. At the start, the worst that _could_ do is leak memory. It's the later introduction of time-to-die finalizers (weakref callbacks at first) that created the opportunity for segfaults: the only 100% clearly safe way to run finalizers in cyclic trash is to force them to run _before_ anything at all is torn down by force (tp_clear). But to run them in advance, we have to know the relevant objects _are_ trash. Which we can't always know if containers don't always participate.

While Neil & I haven't thought of ways that can go wrong now beyond that a "surprise finalizer" may get run any number of times, that doesn't mean far worse things can't happen - just that they'll surprise us when they do :-)

nascheme · 2019-10-14T22:44:02Z

It's fundamentally insane to expect any gc to work perfectly when it may be blind to what the object graph _is_.

I'm often amazed it works at all, let alone perfectly. ;-P

This did trigger me to think of another possible problem. I setup my unit test as you suggested:

    #   Z <- Y <- A--+--> WZ -> C
    #             ^  |
    #             +--+
    # where:
    #   WZ is a weakref to Z with callback C
    #   Y doesn't implement tp_traverse
    #   A contains a reference to itself, Y and WZ

But what happens if the GC doesn't see that WZ is trash? Then it will not be cleared. Hang it off Y so the GC can't find it. We can set things up so that Z is freed before WZ (e.g. WZ in a second and newer cycle). Then, the callback might still run.

On further thought, this seems safe (but maybe broken) because of the handle_weakrefs() logic. The GC will think WZ is going to outlive Z so it will call it before doing any tp_clear calls.

tim-one · 2019-10-15T00:19:05Z

I'm often amazed it works at all, let alone perfectly. ;-P

Indeed! Every time I take a break from gc and come back, I burn another hour wondering why it doesn't recycle _everything_ ;-)

But what happens if the GC doesn't see that WZ is trash?
Then it will not be cleared. Hang it off Y so the GC
can't find it.

In that case it won't think C is trash either (a weakref holds a strong reference to its callback), so C won't be tp_clear'ed - if WZ isn't in the unreachable set, C can't be either.

We can set things up so that Z is freed before WZ (e.g.
WZ in a second and newer cycle). Then, the callback
might still run.

Since C hasn't been damaged, and nothing reachable from C would have been considered trash either, I don't see a problem.

On further thought, this seems safe (but maybe broken) because
of the handle_weakrefs() logic. The GC will think WZ is going
to outlive Z so it will call it before doing any tp_clear calls.

Don't think that applies. As above, WZ isn't in the unreachable set, so gc knows nothing about it.

Z will eventually be reclaimed via refcounting side effect, C _will_ run, and then WZ, and then C, will be reclaimed by refcounting.

Or am I missing something?

tim-one · 2019-10-15T03:01:38Z

While Neil & I haven't thought of ways that can go wrong now
beyond that a "surprise finalizer" may get run any number of
times ...

Speaking of which, I no longer believe that's true. Thanks to the usual layers of baffling renaming via macro after macro, I missed object.c's PyObject_CallFinalizerFromDealloc(), which both respects and sets the "am I finalized?" flag regardless of how, e.g., an object with a __del__ ends its life (and possibly resurrects).

nascheme · 2019-10-16T03:56:53Z

New changeset 392a13b by Neil Schemenauer in branch 'master':
bpo-38006: Add unit test for weakref clear bug (GH-16788)
392a13b

pablogsal · 2021-11-08T18:50:26Z

Are we missing something here?

nascheme · 2021-11-08T21:45:43Z

Closing since I believe the bug is fixed and there is an appropriate unit test.

tiran added release-blocker 3.8 only security fixes 3.9 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) type-crash A hard crash of the interpreter, possibly with a core dump labels Sep 2, 2019

vstinner changed the title ~~_PyFunction_Vectorcall() can segfault on process exit~~ Crash in remove() weak reference callback of weakref.WeakValueDictionary at Python exit Sep 2, 2019

nascheme added the 3.7 (EOL) end of life label Sep 30, 2019

ambv removed 3.7 (EOL) end of life release-blocker labels Oct 1, 2019

ambv added the 3.7 (EOL) end of life label Oct 2, 2019

nascheme closed this as completed Nov 8, 2021

nascheme self-assigned this Nov 8, 2021

ezio-melotti transferred this issue from another repository Apr 10, 2022

Crash in remove() weak reference callback of weakref.WeakValueDictionary at Python exit #82187

Crash in remove() weak reference callback of weakref.WeakValueDictionary at Python exit #82187

Comments

tiran commented Sep 2, 2019

tiran commented Sep 2, 2019

encukou commented Sep 2, 2019

tiran commented Sep 2, 2019

vstinner commented Sep 2, 2019

vstinner commented Sep 2, 2019

vstinner commented Sep 2, 2019

vstinner commented Sep 2, 2019

tiran commented Sep 2, 2019

pablogsal commented Sep 2, 2019

pablogsal commented Sep 2, 2019

pablogsal commented Sep 2, 2019

encukou commented Sep 2, 2019

vstinner commented Sep 2, 2019

vstinner commented Sep 2, 2019

pablogsal commented Sep 2, 2019

encukou commented Sep 2, 2019

pablogsal commented Sep 2, 2019

tiran commented Sep 2, 2019

encukou commented Sep 2, 2019

pablogsal commented Sep 2, 2019

pablogsal commented Sep 2, 2019

tiran commented Sep 2, 2019

pablogsal commented Sep 2, 2019

vstinner commented Sep 2, 2019

tim-one commented Sep 30, 2019

nascheme commented Sep 30, 2019

ambv commented Sep 30, 2019

nascheme commented Sep 30, 2019

nascheme commented Oct 1, 2019

ambv commented Oct 1, 2019

ambv commented Oct 1, 2019

tim-one commented Oct 1, 2019

nascheme commented Oct 1, 2019

vstinner commented Oct 2, 2019

ambv commented Oct 2, 2019

tim-one commented Oct 3, 2019

vstinner commented Oct 3, 2019

pitrou commented Oct 3, 2019

vstinner commented Oct 3, 2019

tim-one commented Oct 4, 2019

tim-one commented Oct 4, 2019

nascheme commented Oct 14, 2019

tim-one commented Oct 15, 2019

tim-one commented Oct 15, 2019

nascheme commented Oct 16, 2019

pablogsal commented Nov 8, 2021

nascheme commented Nov 8, 2021