classification
Title: Crash in remove() weak reference callback of weakref.WeakValueDictionary at Python exit
Type: crash Stage: patch review
Components: Interpreter Core Versions: Python 3.9, Python 3.8, Python 3.7, Python 3.6, Python 3.5, Python 2.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Mark.Shannon, benjamin.peterson, christian.heimes, inada.naoki, jdemeyer, larry, lukasz.langa, miss-islington, nascheme, ned.deily, pablogsal, petr.viktorin, pitrou, tim.peters
Priority: normal Keywords: 3.8regression, patch

Created on 2019-09-02 09:48 by christian.heimes, last changed 2019-10-16 03:56 by nascheme.

Files
File name Uploaded Description Edit
reproducer.tar.gz vstinner, 2019-09-03 08:45
gc_crash.py vstinner, 2019-09-03 14:20
gc_crash.patch vstinner, 2019-09-03 14:20
gc_disable_wr_callback.txt nascheme, 2019-09-30 01:06
gc_weakref_bug_demo.py nascheme, 2019-10-01 03:10
gc_weakref_bug_demo2.py nascheme, 2019-10-01 20:55
Pull Requests
URL Status Linked Edit
PR 15641 merged vstinner, 2019-09-02 11:37
PR 15787 merged miss-islington, 2019-09-09 14:56
PR 15789 merged vstinner, 2019-09-09 15:00
PR 16495 merged nascheme, 2019-09-30 16:38
PR 16499 merged miss-islington, 2019-09-30 17:09
PR 16502 merged nascheme, 2019-09-30 20:45
PR 16788 merged nascheme, 2019-10-14 19:22
Messages (94)
msg350974 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2019-09-02 09:48
I have a case of a segfault on Fedora 32 (currently rawhide) with Python 3.8b4 and FreeIPA. The ipactl Python helper crashes with a segfault when the Python process exits. The segfault occurs in _PyFunction_Vectorcall() with a weakref related local scope function that has a NULL func_code pointer. _PyFunction_Vectorcall() does not check the code object for NULL and access the variable without check:

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7be354f in _PyFunction_Vectorcall (func=<function at remote 0x7fffe6ebfd30>, stack=0x7fffffffdb50, nargsf=1, kwnames=0x0) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/call.c:395
395         Py_ssize_t nkwargs = (kwnames == NULL) ? 0 : PyTuple_GET_SIZE(kwnames);
(gdb) bt
#0  0x00007ffff7be354f in _PyFunction_Vectorcall (func=<function at remote 0x7fffe6ebfd30>, stack=0x7fffffffdb50, nargsf=1, kwnames=0x0) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/call.c:395
#1  0x00007ffff7bb16b0 in _PyObject_Vectorcall (kwnames=0x0, nargsf=1, args=0x7fffffffdb50, callable=<function at remote 0x7fffe6ebfd30>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Include/cpython/abstract.h:127
#2  _PyObject_FastCall (nargs=1, args=0x7fffffffdb50, func=<function at remote 0x7fffe6ebfd30>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Include/cpython/abstract.h:147
#3  object_vacall (base=<optimized out>, callable=<function at remote 0x7fffe6ebfd30>, vargs=<optimized out>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/call.c:1186
#4  0x00007ffff7bb19e1 in PyObject_CallFunctionObjArgs (callable=callable@entry=<function at remote 0x7fffe6ebfd30>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/call.c:1259
#5  0x00007ffff7c3f7c3 in handle_callback (ref=<optimized out>, callback=<function at remote 0x7fffe6ebfd30>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/weakrefobject.c:877
#6  0x00007ffff7c01bf7 in PyObject_ClearWeakRefs (object=<optimized out>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/weakrefobject.c:922
#7  0x00007fffe905babc in ctypedescr_dealloc () from /usr/lib64/python3.8/site-packages/_cffi_backend.cpython-38-x86_64-linux-gnu.so
#8  0x00007fffe9058825 in cfield_dealloc () from /usr/lib64/python3.8/site-packages/_cffi_backend.cpython-38-x86_64-linux-gnu.so
#9  0x00007ffff7ba4c65 in _Py_DECREF (filename=<synthetic pointer>, lineno=541, op=<optimized out>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/dictobject.c:2000
#10 _Py_XDECREF (op=<optimized out>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Include/object.h:541
#11 free_keys_object (keys=0x55555609f590) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/dictobject.c:580
#12 dictkeys_decref (dk=0x55555609f590) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/dictobject.c:324
#13 dict_dealloc (mp=0x7fffe6dd5340) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/dictobject.c:1994
#14 0x00007fffe905bb25 in ctypedescr_dealloc () from /usr/lib64/python3.8/site-packages/_cffi_backend.cpython-38-x86_64-linux-gnu.so
#15 0x00007ffff7ba5058 in _Py_DECREF (filename=<optimized out>, lineno=541, op=<optimized out>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Include/object.h:478
#16 _Py_XDECREF (op=<optimized out>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Include/object.h:541
#17 tupledealloc (op=0x7fffe6e3bbb0) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/tupleobject.c:247
#18 0x00007ffff7ba5058 in _Py_DECREF (filename=<optimized out>, lineno=541, op=<optimized out>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Include/object.h:478
#19 _Py_XDECREF (op=<optimized out>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Include/object.h:541
#20 tupledealloc (op=0x7fffe6e55b80) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/tupleobject.c:247
#21 0x00007ffff7c16b76 in _Py_DECREF (filename=0x7ffff7d2e6a8 "/builddir/build/BUILD/Python-3.8.0b4/Objects/typeobject.c", lineno=1110, op=<optimized out>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Include/object.h:478
#22 clear_slots (self=<optimized out>, type=0x55555559d0d0) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/typeobject.c:1110
#23 subtype_dealloc (self=<KeyedRef at remote 0x7fffe6deb220>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/typeobject.c:1262
#24 0x00007ffff7ba4c65 in _Py_DECREF (filename=<synthetic pointer>, lineno=541, op=<optimized out>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/dictobject.c:2000
#25 _Py_XDECREF (op=<optimized out>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Include/object.h:541
#26 free_keys_object (keys=0x555556078d30) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/dictobject.c:580
#27 dictkeys_decref (dk=0x555556078d30) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/dictobject.c:324
#28 dict_dealloc (mp=0x7fffe6eba6c0) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/dictobject.c:1994
#29 0x00007ffff7b98a4b in _Py_DECREF (filename=<synthetic pointer>, lineno=541, op=<optimized out>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Include/object.h:478
#30 _Py_XDECREF (op=<optimized out>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Include/object.h:541
#31 cell_dealloc (op=0x7fffe6ebe2e0) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/cellobject.c:84
#32 0x00007ffff7ba5058 in _Py_DECREF (filename=<optimized out>, lineno=541, op=<optimized out>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Include/object.h:478
#33 _Py_XDECREF (op=<optimized out>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Include/object.h:541
#34 tupledealloc (op=0x7fffe6ebe310) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/tupleobject.c:247
#35 0x00007ffff7b991c2 in _Py_DECREF (filename=0x7ffff7d2e460 "/builddir/build/BUILD/Python-3.8.0b4/Objects/funcobject.c", lineno=584, op=<optimized out>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Include/object.h:478
#36 func_clear (op=0x7fffe6ebfd30) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/funcobject.c:584
#37 0x00007ffff7ba5f84 in delete_garbage (state=<optimized out>, state=<optimized out>, old=<optimized out>, collectable=<optimized out>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Modules/gcmodule.c:929
#38 collect (generation=2, n_collected=0x0, n_uncollectable=0x0, nofail=1, state=0x7ffff7dd9598 <_PyRuntime+344>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Modules/gcmodule.c:1106
#39 0x00007ffff7cac8c2 in _PyGC_CollectNoFail () at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Modules/gcmodule.c:1848
#40 0x00007ffff7cacb74 in PyImport_Cleanup () at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Python/import.c:541
#41 0x00007ffff7cacf56 in Py_FinalizeEx () at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Python/pylifecycle.c:1226
#42 0x00007ffff7cad088 in Py_Exit (sts=0) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Python/pylifecycle.c:2248
#43 0x00007ffff7cad0cf in handle_system_exit () at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Python/pythonrun.c:658
#44 0x00007ffff7cad219 in _PyErr_PrintEx (set_sys_last_vars=1, tstate=0x55555555bfb0) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Python/pythonrun.c:755
#45 PyErr_PrintEx (set_sys_last_vars=set_sys_last_vars@entry=1) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Python/pythonrun.c:755
#46 0x00007ffff7cad23a in PyErr_Print () at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Python/pythonrun.c:761
#47 0x00007ffff7b91b42 in PyRun_SimpleFileExFlags (fp=<optimized out>, filename=<optimized out>, closeit=<optimized out>, flags=0x7fffffffe178) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Python/pythonrun.c:434
#48 0x00007ffff7cae36f in pymain_run_file (cf=0x7fffffffe178, config=0x55555555b410) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Modules/main.c:383
#49 pymain_run_python (exitcode=0x7fffffffe170) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Modules/main.c:567
#50 Py_RunMain () at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Modules/main.c:646
#51 0x00007ffff7cae559 in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Modules/main.c:700
#52 0x00007ffff7e21193 in __libc_start_main () from /lib64/libc.so.6
#53 0x000055555555508e in _start ()

It looks like the func object is the local remove function of weakref.WeakValueDictionary() implementation, https://github.com/python/cpython/blob/353053d9ad08fea0e205e6c008b8a4350c0188e6/Lib/weakref.py#L90-L112

All function object fields except func_qualname and vectorcall are already NULL.

(gdb) print ((PyFunctionObject*)func).func_annotations 
$19 = 0x0
(gdb) print ((PyFunctionObject*)func).func_closure 
$20 = 0x0
(gdb) print ((PyFunctionObject*)func).func_code
$21 = 0x0
(gdb) print ((PyFunctionObject*)func).func_defaults 
$22 = 0x0
(gdb) print ((PyFunctionObject*)func).func_dict
$23 = 0x0
(gdb) print ((PyFunctionObject*)func).func_doc
$24 = 0x0
(gdb) print ((PyFunctionObject*)func).func_globals 
$25 = 0x0
(gdb) print ((PyFunctionObject*)func).func_kwdefaults 
$26 = 0x0
(gdb) print ((PyFunctionObject*)func).func_module 
$27 = 0x0
(gdb) print ((PyFunctionObject*)func).func_name
$28 = 0x0
(gdb) print ((PyFunctionObject*)func).func_qualname 
$29 = 'WeakValueDictionary.__init__.<locals>.remove'
(gdb) print ((PyFunctionObject*)func).func_weakreflist 
$30 = 0x0
(gdb) print ((PyFunctionObject*)func).vectorcall 
$31 = (vectorcallfunc) 0x7ffff7be3530 <_PyFunction_Vectorcall>
(gdb) print func.ob_refcnt 
$32 = 135
(gdb) print func.ob_type 
$33 = (struct _typeobject *) 0x7ffff7dd4e80 <PyFunction_Type>

The Fedora downstream bug is https://bugzilla.redhat.com/show_bug.cgi?id=1747901
msg350975 - (view) Author: Petr Viktorin (petr.viktorin) * (Python committer) Date: 2019-09-02 10:06
I don't understand how the function ended up with func_code=NULL. That shouldn't be a valid function to call, IMO.
Do you have any info on how the function ended up in that state?
msg350976 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2019-09-02 10:11
Not yet.

My current hypothesis is the function code object is already cleaned up *somehow*.
msg350981 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-02 10:57
> I don't understand how the function ended up with func_code=NULL. That shouldn't be a valid function to call, IMO. Do you have any info on how the function ended up in that state?

It doesn't seem possible to create a function with func_code=NULL, nor to set func_code to NULL. func_code can be be set to NULL by func_clear() which is called by func_dealloc().

I bet that func_clear() has been called since most func fields are set to NULL, which is consistent with:

static int
func_clear(PyFunctionObject *op)
{
    Py_CLEAR(op->func_code);
    Py_CLEAR(op->func_globals);
    Py_CLEAR(op->func_module);
    Py_CLEAR(op->func_name);
    Py_CLEAR(op->func_defaults);
    Py_CLEAR(op->func_kwdefaults);
    Py_CLEAR(op->func_doc);
    Py_CLEAR(op->func_dict);
    Py_CLEAR(op->func_closure);
    Py_CLEAR(op->func_annotations);
    Py_CLEAR(op->func_qualname);
    return 0;
}

The question is how is it possible that a deallocated function is still accessed? It smells like a borrowed reference somewhere in the call chain.
msg350983 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-02 11:08
Using gdb, I checked if func_clear() can be cleared outside func_dealloc(): yes, delete_garbage() (gcmodule.c) calls type->clear(). But I'm surprised that the function would be seen as "unreachable" if it's reference counter was equal to 135:

(gdb) print func.ob_refcnt 
$32 = 135
msg350985 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-02 11:27
(gdb) print ((PyFunctionObject*)func).func_qualname 
$29 = 'WeakValueDictionary.__init__.<locals>.remove'

This is a temporary remove() function declared in weakref.WeakValueDictionary constructor:

class WeakKeyDictionary(_collections_abc.MutableMapping):
    def __init__(self, dict=None):
        def remove(k, selfref=ref(self)):
            ...
        self._remove = remove
        ...

    def __setitem__(self, key, value):
        self.data[ref(key, self._remove)] = value

    ...

Simplified gdb traceback:

Py_FinalizeEx()
-> PyImport_Cleanup()
-> _PyGC_CollectNoFail()
-> delete_garbage()
-> func_clear(op=0x7fffe6ebfd30) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/funcobject.c:584
    Py_CLEAR(op->func_closure);
-> tupledealloc()
-> cell_dealloc()
-> dict_dealloc()
-> dictkeys_decref()
-> cfield_dealloc () from _cffi_backend
-> PyObject_ClearWeakRefs()
-> _PyFunction_Vectorcall (func=<function at remote 0x7fffe6ebfd30>, ...)

PyImport_Cleanup() calls delete_garbage() which calls func_clear(op=0x7fffe6ebfd30).

But while the function closure is destroyed, a cffi object stored in a dict (namespace?) (stored in the function closure) is cleared, and there was a weak reference to this cffi object with a callback... which is the cleared function (Python object 0x7fffe6ebfd30)...

--

I don't understand why/how remove() gets a closure. I modified weakref.py to ensure that remove.__closure__ is None: test_weakref still pass.
msg350987 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-02 11:40
> I don't understand why/how remove() gets a closure.

Ok, I found the reason and I wrote PR 15641 to fix it.


Previous fix of WeakValueDictionary remove() function:

commit 9cd7e17640a49635d1c1f8c2989578a8fc2c1de6
Author: Łukasz Langa <lukasz@langa.pl>
Date:   Fri Feb 10 00:14:55 2017 -0800

    Fix #29519: weakref spewing exceptions during interp finalization
msg350990 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2019-09-02 12:09
Your workaround solves the segfault issue for me.
msg350991 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2019-09-02 12:34
I think the real problem is in vectorcall. tp_clear will make sure all internal pointers are either NULL or valid (by using Py_CLEAR, which also protects against re-entrancy) and it the responsibility of the object or its users to check that the fields that are needed are valid. From the docs on tp_clear:

> The Py_CLEAR() macro should be used, because clearing references is delicate: the reference to the contained object must not be decremented until after the pointer to the contained object is set to NULL. This is because decrementing the reference count may cause the contained object to become trash, triggering a chain of reclamation activity that may include invoking arbitrary Python code (due to finalizers, or weakref callbacks, associated with the contained object). If it’s possible for such code to reference self again, it’s important that the pointer to the contained object be NULL at that time, so that self knows the contained object can no longer be used. 

Notice the "so that self knows the contained object can no longer be used". I think _PyFunction_Vectorcall is missing the check "to see of the contained object can or cannot be used". Here the contained object being everything that has being cleaned before:

    Py_CLEAR(op->func_code);
    Py_CLEAR(op->func_globals);
    Py_CLEAR(op->func_module);
    Py_CLEAR(op->func_name);
    Py_CLEAR(op->func_defaults);
    Py_CLEAR(op->func_kwdefaults);
    Py_CLEAR(op->func_doc);
    Py_CLEAR(op->func_dict);
    Py_CLEAR(op->func_closure); <----- Everything before this
    Py_CLEAR(op->func_annotations);
    Py_CLEAR(op->func_qualname);

I think it will be enough to make a check for func_code being NULL in _PyFunction_Vectorcall or before.
msg350992 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2019-09-02 12:41
In particular, this was not happening before because the function type did not implement tp_clear:

https://github.com/python/cpython/blob/3.7/Objects/funcobject.c#L615

The new implementation of tp_clear without checks is allowing this to happen.
msg350995 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2019-09-02 12:46
See also https://bugs.python.org/issue33418 as the potential source of the problem.
msg350996 - (view) Author: Petr Viktorin (petr.viktorin) * (Python committer) Date: 2019-09-02 12:55
I'm not sure adding a check would solve this. What should be done when a function with func_code=NULL is called? "Silently do nothing" is not really an option; raising an exception wouldn't help much in this case.

I wonder if a function's tp_clear we should clear the mutable parts (like func_closure) before the immutable ones (func_code).
Any access to the immutable ones should have checks (and probably raise exceptions if those are NULL).
msg350997 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-02 13:07
I'm now able to reproduce the FreeIPA crash. In short:

* Get a Fedora Rawhide VM (to get Python 3.8 as "python3", it's more convenient)
* Install FreeIPA (dnf install freeipa-server)
* Run: ipa-server-install --help
* Python does crash at exit


> In particular, this was not happening before because the function type did not implement tp_clear:
> https://github.com/python/cpython/blob/3.7/Objects/funcobject.c#L615

I confirm that the patch below does fix the FreeIPA crash:

diff --git a/Objects/funcobject.c b/Objects/funcobject.c
index df5cc2d3f5..357372c565 100644
--- a/Objects/funcobject.c
+++ b/Objects/funcobject.c
@@ -669,7 +669,7 @@ PyTypeObject PyFunction_Type = {
     Py_TPFLAGS_METHOD_DESCRIPTOR,               /* tp_flags */
     func_new__doc__,                            /* tp_doc */
     (traverseproc)func_traverse,                /* tp_traverse */
-    (inquiry)func_clear,                        /* tp_clear */
+    (inquiry)0,                        /* tp_clear */
     0,                                          /* tp_richcompare */
     offsetof(PyFunctionObject, func_weakreflist), /* tp_weaklistoffset */
     0,                                          /* tp_iter */
msg350998 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-02 13:09
PyFunction_Type.tp_clear changed in bpo-33418 (previously, it was equal to 0: no clear method):

commit 3c452404ae178b742967589a0bb4a5ec768d76e0
Author: INADA Naoki <methane@users.noreply.github.com>
Date:   Wed Jul 4 11:15:50 2018 +0900

    bpo-33418: Add tp_clear for function object (GH-8058)
    
    Without tp_clear, GC can't break cyclic reference.
    It will cause memory leak when cyclic reference is
    created intentionally.
msg350999 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2019-09-02 13:15
In https://bugs.python.org/issue33418 I proposed reverting tp_clear on function objects.

>What should be done when a function with func_code=NULL is called?

We can set the error indicator at least. Although I agree that it seems suboptimal. At least I think we should add an assertion to check that the function called is valid so we can identify this situation easier.

What downsides do we see raising an exception?

Notice that this can happen without involving weak references. For example:

You have object A and B. A is a function. The destructor of B calls A. A and B are in a cycle. When calling tp_clear on A the refs of B reaches 0, trying to call A and .... Boom!

Either we remove tp_clear or somehow we have to protect all callers of the function. Technically, any field cleared from tp_cleared leaves the function in an inconsistent state
msg351006 - (view) Author: Petr Viktorin (petr.viktorin) * (Python committer) Date: 2019-09-02 14:46
> What downsides do we see raising an exception?

Yeah, on second thought, that would probably be best. We still want PR #15641 as well, so the exception doesn't actually happen in practice.

Note that _PyEval_EvalCodeWithName (the main use of func_globals) already fails with an exception if globals is NULL.

> Either we remove tp_clear or somehow we have to protect all callers of the function. Technically, any field cleared from tp_cleared leaves the function in an inconsistent state

tp_clear has a purpose; it prevents memory leaks. That leaves protecting access to the cleared fields.

(But we don't need to clear func_name and func_qualname as these must be strings.)
msg351007 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2019-09-02 15:06
The more I think about this the more I think there is something else at play. If the GC is able to follow the dependency chain, all weakrefs should have been already managed correctly and PyObject_ClearWeakRefs should have never be invoked.

I think *something* is not informing the gc about the dependency chain
msg351009 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2019-09-02 15:11
It's going to be hard to figure that out. FreeIPA uses a ton of C extensions. Some are hand-written, some uses Cython, ctypes, and cffi.
msg351010 - (view) Author: Petr Viktorin (petr.viktorin) * (Python committer) Date: 2019-09-02 15:13
The traceback does have some ctypedescr_dealloc and cfield_dealloc from _cffi. Could you check what those are?
msg351012 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2019-09-02 15:15
I think the problem is that whatever is weak-referenced by the weak ref (CField_Type or similar) is not on the gc list. Because is not on the gc list, handle_weakrefs (https://github.com/python/cpython/blob/master/Modules/gcmodule.c#L1090) is not acting correctly on the weakreferences, clearing them before the tp_clear calls and therefore producing the crash.
msg351015 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2019-09-02 15:44
Unless we are missing something I think this may be caused by something in cffi that is not implementing gc-related functions correctly, as PyObject_ClearWeakRefs should not be called from a gc run. Given how complicated this is and even if the possible solution is to fix the problem in cffi or elsewhere, we can add an extra check in PyObject_ClearWeakRefs to NOT call callbacks that are in the process of being collected. Technically we should never reach that code, but we won't be segfaulting and the cost is a redundant-check that will be false in the normal path.
msg351017 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2019-09-02 16:10
The biggest CFFI based dependency of FreeIPA is cryptography, but cryptography does not use weakref in its sources. CFFI on the other hand has a WeakValueDictionary object attached to its internal typecache backend, https://bitbucket.org/cffi/cffi/src/bf80722dea36237710083315e336c81bc8571fd0/cffi/model.py#lines-572 . Perhaps that is the culprit?

I checked the CTypeDescrObject from the stacktrace. Three out of three times it's for "unsigned long(*)(void *)". I wasn't able to get any useful information from the other fields yet.

(gdb) up 7
#7  0x00007fffe905babc in ctypedescr_dealloc (ct=0x7fffe6d958b0) at c/_cffi_backend.c:401
401             PyObject_ClearWeakRefs((PyObject *) ct);
(gdb) p (char*)ct.ct_name
$1 = 0x7fffe6d95908 "unsigned long(*)(void *)"
(gdb) p ct
$2 = (CTypeDescrObject *) 0x7fffe6d958b0
msg351020 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2019-09-02 16:23
For the weakref to be handled correctly the ctypedescr needs to be identified correctly as part of the isolated cycle. If is not in the isolated cycle something may be missing tp_traverse or the GC flags. 

Can you check if the ctypedescr is part of GC.objects()?
msg351037 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-02 21:09
I investigated the FreeIPA crash.

* Python 3.8 behaves differently because func_clear() has been implemented (bpo-33418, commit 3c452404ae178b742967589a0bb4a5ec768d76e0)

* The bug is a crash on a function call (_PyFunction_Vectorcall) because the function has been cleared (by func_clear), but there was still a weak reference using this function as its callback.

* Note: the function is called *during* it's being cleared by func_clear().

* The GC has a workaround for weak references part of "unreachable" objects, but its handle_weakrefs() function doesn't work because CFFI CField_Type type doesn't implement tp_traverse.

--

PR 15641 just hides the real bug.

One issue is that CFFI doesn't implement correctly the GC protocol. If an object contains another object, its type must:

* Have Py_TPFLAGS_HAVE_GC flag
* Implement tp_traverse
* Use PyObject_GC_Malloc() to allocate an object
* Call PyObject_GC_Track() on created object

Another issue is that the GC doesn't prevent the crash. Would it be possible to prevent the crash without changing the behavior (ex: still call weakref callbacks)?
msg351038 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-02 21:23
For the record, here is an annotated traceback of a FreeIPA crash.

PyGC_list_contains() is a hack that I wrote to check if an object was in the unreachable argument of delete_garbage().


(gdb) where
#0  0x0000000000434b78 in _PyFunction_Vectorcall (func=0x7ffff45d5690, stack=0x7fffffffd170, nargsf=1, kwnames=0x0) at Objects/call.c:406

    remove() func=0x7ffff45d5690

    (gdb) p _PyObject_Dump( ((PyFunctionObject*)func)->func_qualname )
    object  : 'WeakValueDictionary.__init__.<locals>.remove'

    (gdb) p PyGC_list_contains(0x7ffff7a2bac0, func)  # was unreachable?
    $8 = 1


#1  0x0000000000433640 in _PyObject_Vectorcall (callable=0x7ffff45d5690, args=0x7fffffffd170, nargsf=1, kwnames=0x0) at ./Include/cpython/abstract.h:127
#2  0x000000000043368b in _PyObject_FastCall (func=0x7ffff45d5690, args=0x7fffffffd170, nargs=1) at ./Include/cpython/abstract.h:147
#3  0x0000000000436ccc in object_vacall (base=0x0, callable=0x7ffff45d5690, vargs=0x7fffffffd1e0) at Objects/call.c:1186
#4  0x000000000043705b in PyObject_CallFunctionObjArgs (callable=0x7ffff45d5690) at Objects/call.c:1259
#5  0x0000000000502280 in handle_callback (ref=0x7ffff447dc90, callback=0x7ffff45d5690) at Objects/weakrefobject.c:877
#6  0x0000000000502403 in PyObject_ClearWeakRefs (object=0x7ffff42c1550) at Objects/weakrefobject.c:922

    (gdb) p object->ob_type->tp_name
    $13 = 0x7ffff6be404b "_cffi_backend.CTypeDescr"
    (gdb) p PyGC_list_contains(0x7ffff7a2bac0, object)  # was unreachable?
    $9 = 0

#7  0x00007ffff6a41abc in ctypedescr_dealloc (ct=0x7ffff42c1550) at c/_cffi_backend.c:401
#8  0x000000000047f1a5 in _Py_Dealloc (op=0x7ffff42c1550) at Objects/object.c:2213
#9  0x00007ffff6a3e825 in _Py_DECREF (filename=<synthetic pointer>, lineno=703, op=<optimized out>) at /usr/include/python3.8/object.h:478
#10 cfield_dealloc (cf=0x7ffff444dc70) at c/_cffi_backend.c:703

    Py_DECREF(cf->cf_type);

    (gdb) p cf->ob_base.ob_type
    $63 = (struct _typeobject *) 0x7ffff6a62460 <CField_Type>
    (gdb) p PyGC_list_contains(0x7ffff7a2bac0, cf)  # was unreachable?
    $14 = 0

#11 0x000000000047f1a5 in _Py_Dealloc (op=0x7ffff444dc70) at Objects/object.c:2213
#12 0x0000000000463af8 in _Py_DECREF (filename=0x6987a0 "./Include/object.h", lineno=541, op=0x7ffff444dc70) at ./Include/object.h:478
#13 0x0000000000463b46 in _Py_XDECREF (op=0x7ffff444dc70) at ./Include/object.h:541
#14 0x00000000004652de in free_keys_object (keys=0x11de780) at Objects/dictobject.c:580
#15 0x0000000000464877 in dictkeys_decref (dk=0x11de780) at Objects/dictobject.c:324
#16 0x00000000004694b0 in dict_dealloc (mp=0x7ffff44a6950) at Objects/dictobject.c:1994

    {'C_Initialize': ..., 'C_CreateObject': ..., 'C_Finalize': ...}

    (gdb) p PyGC_list_contains(0x7ffff7a2bac0, mp)  # was unreachable?
    $16 = 1

#17 0x000000000047f1a5 in _Py_Dealloc (op=0x7ffff44a6950) at Objects/object.c:2213
#18 0x00007ffff6a41b25 in _Py_DECREF (filename=<synthetic pointer>, lineno=541, op=<optimized out>) at /usr/include/python3.8/object.h:478
#19 _Py_XDECREF (op=<optimized out>) at /usr/include/python3.8/object.h:541
#20 ctypedescr_dealloc (ct=0x7ffff42c12d0) at c/_cffi_backend.c:412

    Py_XDECREF(ct->ct_stuff);

    (gdb) p (char*)(((CTypeDescrObject*)ct)->ct_name)
    $38 = 0x7ffff42c1328 "struct _CK_FUNCTION_LIST"

#21 0x000000000047f1a5 in _Py_Dealloc (op=0x7ffff42c12d0) at Objects/object.c:2213
#22 0x000000000048f002 in _Py_DECREF (filename=0x6a2500 "./Include/object.h", lineno=541, op=0x7ffff42c12d0) at ./Include/object.h:478
#23 0x000000000048f02e in _Py_XDECREF (op=0x7ffff42c12d0) at ./Include/object.h:541
#24 0x000000000048fc0a in tupledealloc (op=0x7ffff451e050) at Objects/tupleobject.c:247

    (CTypeDescrObject,)

#25 0x000000000047f1a5 in _Py_Dealloc (op=0x7ffff451e050) at Objects/object.c:2213
#26 0x000000000048f002 in _Py_DECREF (filename=0x6a2500 "./Include/object.h", lineno=541, op=0x7ffff451e050) at ./Include/object.h:478
#27 0x000000000048f02e in _Py_XDECREF (op=0x7ffff451e050) at ./Include/object.h:541
#28 0x000000000048fc0a in tupledealloc (op=0x7ffff48e13c0) at Objects/tupleobject.c:247

    (str, <a tuple>)

#29 0x000000000047f1a5 in _Py_Dealloc (op=0x7ffff48e13c0) at Objects/object.c:2213
#30 0x0000000000491815 in _Py_DECREF (filename=0x6a34bf "Objects/typeobject.c", lineno=1110, op=0x7ffff48e13c0) at ./Include/object.h:478
#31 0x000000000049431d in clear_slots (type=0x8e0330, self=0x7ffff447d9f0) at Objects/typeobject.c:1110
#32 0x0000000000494833 in subtype_dealloc (self=0x7ffff447d9f0) at Objects/typeobject.c:1262

    weakref.KeyedRef

#33 0x000000000047f1a5 in _Py_Dealloc (op=0x7ffff447d9f0) at Objects/object.c:2213
#34 0x0000000000463af8 in _Py_DECREF (filename=0x6987a0 "./Include/object.h", lineno=541, op=0x7ffff447d9f0) at ./Include/object.h:478
#35 0x0000000000463b46 in _Py_XDECREF (op=0x7ffff447d9f0) at ./Include/object.h:541
#36 0x00000000004652de in free_keys_object (keys=0x1209c90) at Objects/dictobject.c:580
#37 0x0000000000464877 in dictkeys_decref (dk=0x1209c90) at Objects/dictobject.c:324
#38 0x00000000004694b0 in dict_dealloc (mp=0x7ffff45d64d0) at Objects/dictobject.c:1994

    WeakValueDictionary.data dict

#39 0x000000000047f1a5 in _Py_Dealloc (op=0x7ffff45d64d0) at Objects/object.c:2213
#40 0x000000000062bb37 in _Py_DECREF (filename=0x745d20 "./Include/object.h", lineno=541, op=0x7ffff45d64d0) at ./Include/object.h:478
#41 0x000000000062bb85 in _Py_XDECREF (op=0x7ffff45d64d0) at ./Include/object.h:541
#42 0x000000000062bf4a in cell_dealloc (op=0x7ffff45d7050) at Objects/cellobject.c:84

    remove() closure contains "d": WeakValueDictionary.data dict

#43 0x000000000047f1a5 in _Py_Dealloc (op=0x7ffff45d7050) at Objects/object.c:2213
#44 0x000000000048f002 in _Py_DECREF (filename=0x6a2500 "./Include/object.h", lineno=541, op=0x7ffff45d7050) at ./Include/object.h:478
#45 0x000000000048f02e in _Py_XDECREF (op=0x7ffff45d7050) at ./Include/object.h:541
#46 0x000000000048fc0a in tupledealloc (op=0x7ffff45d4370) at Objects/tupleobject.c:247
#47 0x000000000047f1a5 in _Py_Dealloc (op=0x7ffff45d4370) at Objects/object.c:2213
#48 0x0000000000449e1b in _Py_DECREF (filename=0x693783 "Objects/funcobject.c", lineno=584, op=0x7ffff45d4370) at ./Include/object.h:478

#49 0x000000000044b730 in func_clear (op=0x7ffff45d5690) at Objects/funcobject.c:584

    remove() func = 0x7ffff45d5690

#50 0x000000000058ebbd in delete_garbage (state=0x7fb798 <_PyRuntime+344>, collectable=0x7fffffffdc10, old=0x7fb7e0 <_PyRuntime+416>) at Modules/gcmodule.c:929

#51 0x000000000058f082 in collect (state=0x7fb798 <_PyRuntime+344>, generation=2, n_collected=0x0, n_uncollectable=0x0, nofail=1) at Modules/gcmodule.c:1106
#52 0x0000000000590861 in _PyGC_CollectNoFail () at Modules/gcmodule.c:1849
#53 0x0000000000542a1e in PyImport_Cleanup () at Python/import.c:541
#54 0x000000000055f8c3 in Py_FinalizeEx () at Python/pylifecycle.c:1226
#55 0x00000000005618d6 in Py_Exit (sts=0) at Python/pylifecycle.c:2248

#56 0x0000000000566726 in handle_system_exit () at Python/pythonrun.c:658
#57 0x000000000056673d in _PyErr_PrintEx (tstate=0x8044e0, set_sys_last_vars=1) at Python/pythonrun.c:668
#58 0x0000000000566a32 in PyErr_PrintEx (set_sys_last_vars=1) at Python/pythonrun.c:755
#59 0x0000000000566a43 in PyErr_Print () at Python/pythonrun.c:761
#60 0x000000000056602a in PyRun_SimpleFileExFlags (fp=0x800370, filename=0x7ffff7a1ff60 "/usr/sbin/ipa-server-install", closeit=1, flags=0x7fffffffe080) at Python/pythonrun.c:434
#61 0x000000000056555b in PyRun_AnyFileExFlags (fp=0x800370, filename=0x7ffff7a1ff60 "/usr/sbin/ipa-server-install", closeit=1, flags=0x7fffffffe080) at Python/pythonrun.c:86
#62 0x000000000042264a in pymain_run_file (config=0x8037e0, cf=0x7fffffffe080) at Modules/main.c:383
#63 0x0000000000422bfd in pymain_run_python (exitcode=0x7fffffffe0cc) at Modules/main.c:567
#64 0x0000000000422cee in Py_RunMain () at Modules/main.c:646
#65 0x0000000000422d68 in pymain_main (args=0x7fffffffe130) at Modules/main.c:676
#66 0x0000000000422de2 in Py_BytesMain (argc=3, argv=0x7fffffffe258) at Modules/main.c:700
#67 0x00000000004217a6 in main (argc=3, argv=0x7fffffffe258) at ./Programs/python.c:16
msg351039 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-02 21:25
> Another issue is that the GC doesn't prevent the crash. Would it be possible to prevent the crash without changing the behavior (ex: still call weakref callbacks)?

Pablo opened bpo-38009 with a PR. I'm not sure if his PR is correct or not yet.
msg351040 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-02 21:38
> Another issue is that the GC doesn't prevent the crash. Would it be possible to prevent the crash without changing the behavior (ex: still call weakref callbacks)?

I discussed with Pablo and it doesn't seem possible to prevent to clear a function if it is indirectly used as a callback for a weak reference to a object which doesn't implement tp_traverse properly (like the CFFI type).

The best that we can do is to detect the bug in C extension and emit a warning explaining the bug and suggesting a way to fix it.
msg351041 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-02 21:49
This bug is quite complex. Let me try to explain it more simply.

* Create an object A
* Create a weak reference to A with a callback CB
* A and CB are part of a reference cycle
* Removing the last references to A and CB are not enough to destroy them
* Trigger a garbage collection to break the cycle
* A and CB are seen as unreachable by the GC
* The GC "clears" CB which makes CB inconsistent (tp_clear)
* A is deleted which calls CB
* Crash on calling inconsistent CB

--

In the case of FreeIPA:

* A is a _cffi_backend.CTypeDescr instance
* CB is the remove() function defined in weakref.WeakValueDictionary constructor

The FreeIPA reference cycle is quite complex:

remove()
-> remove().__closure__ (tuple)
-> WeakValueDictionary.data (dict) = remove().__closure__[0]
-> weakref.KeyedRef
-> (<str 1>, <tuple 2>)
-> (<CTypeDescrObject 3>,) = <tuple 2>
-> <CTypeDescrObject 3>
-> {'C_Initialize': ..., 'C_CreateObject': ..., 'C_Finalize': ...} (dict)
-> <_cffi_backend.CField 4>
-> <_cffi_backend.CTypeDescr 5>
-> remove() = callback of a weak reference to <_cffi_backend.CTypeDescr 5>

<_cffi_backend.CField 4> is not seen as unreachable by the GC because CField_Type.tp_traverse is not implemented (and CField_Type doesn't use Py_TPFLAGS_HAVE_GC flag).
msg351042 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-02 21:53
> * A and CB are seen as unreachable by the GC

Oh, that's not correct:

* Only CB is seen as unreachable by the GC
* The GC "clears" CB which makes CB inconsistent (tp_clear)
* A is deleted indirectly
* Deleting A calls CB through the weak reference to A

The "A is deleted indirectly" step is also complex...
msg351067 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2019-09-03 07:07
Thanks Victor!

I have opened a bug for CFFI, https://bitbucket.org/cffi/cffi/issues/416/python-38-segfault-cfield_type-does-not
msg351072 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2019-09-03 07:56
Pablo, Victor, could you please assist Armin on the CFFI issue? As usual the situation is more complex than I anticipated.
msg351077 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-03 08:45
Attached crash.tar.gz: minimum reproducer which doesn't depend on FreeIPA anymore.

Depends on:

* cffi (_cffi_backend dynamic library)
* pycparser
* Python 3.8 (need "python3.8")

To reproduce the crash:

$ tar -xf reproducer.tar.gz 
$ cd reproducer/
$ ./bug.sh 

Output:

(...)
++ env/bin/python ./crash.py
exit
./bug.sh : ligne 7 : 14529 Segmentation fault      (core dumped)env/bin/python ./crash.py
++ echo 'exitcode: 139'
exitcode: 139
msg351082 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-03 12:21
I failed to write a reproducer from scratch. So let me share my notes here. The first point is that remove() function of WeakValueDictionary keeps WeakValueDictionary.data alive like that:

---
class NoisyDel:
    def __del__(self):
        print("dealloc data dict")

def create_closure():
    data = {0: NoisyDel()}
    def remove():
        return data
    return remove

remove = create_closure()
print("clear ")
remove = None
print("exit")
---

data is only deleted once remove is cleared.
msg351083 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-03 12:45
Second step: when func_clear() is called, the closure is cleared. Clearing the closure should call the function which is being cleared (or is already cleared).

---
import gc

class CallFunc:
    def __del__(self):
        func = self.func
        if func is None:
            return
        print("CallFunc", func)
        func("call func from CallFunc.__del__")

def create_remove():
    call_func = CallFunc()
    def remove(msg):
        x = call_func
        print(msg)
    remove.__name__ = "evil_func"
    call_func.func = remove
    return call_func, remove

call_func, remove = create_remove()
print("== clear ==")
call_func = None
remove = None
print()

print("== collect ==")
gc.collect()
print()

print("== exit ==")
---

Output (with my hacked CPython):
------------------
== clear ==

== collect ==
CallFunc <function create_remove.<locals>.remove at 0x7fab8a189eb0>
call func from CallFunc.__del__
func_clear remove() -- in delete_garbage? 1

== exit ==
------------------

call_func and remove are part of a reference cycle. A forced garbage collection breaks the cycle and removes the two objects, but they are not removed in the expected order:

* first: call_func
* then: remove

The crash requires to destroy the objects in the reverse order
msg351086 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-03 14:20
Aha! I reproduced the crash with attached gc_crash.py and gc_crash.patch:

# cd /path/to/python/source
$ git apply gc_crash.patch
$ make
$ ./python gc_crash.py
Segmentation fault (core dumped)
msg351098 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-03 19:22
gc_crash.py:

* PyObject_GC_UnTrack() + PyObject_GC_Track() is used to order objects in the GC collection 0, the objects which are involved in the reference cycle.

* BadGC2Type type is implemented in C, uses Py_TPFLAGS_HAVE_GC, implements tp_traverse, but it doesn't implement tp_clear.

* IHMO gc_crash.py is only legit code. BadGC2Type implements the GC protocol (except to tp_clear).


I'm not sure that it's the same bug than reproducer.tar.gz which uses a weak reference.

BadGC2Type is used to call a function in tp_dealloc. It should be possible to use a weak reference for that: the weak reference can use a callback. But I failed write a reproducer script using a weak reference.
msg351099 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-03 19:30
collect() of gcmodule.c:

* collect() builds an "unreachable" list which is quite important in this bug
* the bug occurs in delete_garbage() which uses the unreachable list
* weak references part of unreachable are handled before delete_garbage() in handle_weakrefs(): "Clear weakrefs and invoke callbacks as necessary".

It seems like reproducer.tar.gz and gc_crash.py crashes involve a reference cycle.

reproducer.tar.gz uses a weak reference which is *not* part of the reference cycle.
msg351492 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-09 14:56
New changeset a2af05a0d3f0da06b8d432f52efa3ecf29038532 by Victor Stinner in branch 'master':
bpo-38006: Avoid closure in weakref.WeakValueDictionary (GH-15641)
https://github.com/python/cpython/commit/a2af05a0d3f0da06b8d432f52efa3ecf29038532
msg351519 - (view) Author: miss-islington (miss-islington) Date: 2019-09-09 16:24
New changeset 78d15faf6c522619098e94be3e7f6d88a9e61123 by Miss Islington (bot) in branch '3.8':
bpo-38006: Avoid closure in weakref.WeakValueDictionary (GH-15641)
https://github.com/python/cpython/commit/78d15faf6c522619098e94be3e7f6d88a9e61123
msg351561 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-09 22:31
New changeset 23669330b7d0d5ad1a9aac40315ba4c2e765f9dd by Victor Stinner in branch '3.7':
bpo-38006: Avoid closure in weakref.WeakValueDictionary (GH-15641) (GH-15789)
https://github.com/python/cpython/commit/23669330b7d0d5ad1a9aac40315ba4c2e765f9dd
msg351676 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-10 14:32
I reverted the change which added func_clear() to get more time to investigate this bug:

New changeset ccaea525885e41c5f1e566bb68698847faaa82ca by T. Wouters (Victor Stinner) in branch '3.8':
Revert "bpo-33418: Add tp_clear for function object (GH-8058)" (GH-15826)
https://github.com/python/cpython/commit/ccaea525885e41c5f1e566bb68698847faaa82ca

This issue looks quite complex, and I prefer to wait Python 3.9 to fix it. We are stabilizing Python 3.8 to prepare 3.8.0 final release.
msg353371 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2019-09-27 14:45
Łukasz,

I haven't been able to verify that the last patches are sufficient to prevent Python from segfaulting. We need to wait until rc1 is out. I need a proper release or RPM build to run the tests on our internal test infrastructure.

You have to release rc1 with this blocker and hope for the best...
msg353378 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2019-09-27 17:00
There is this PR that avoids a hard crash in the interpreter:

https://github.com/python/cpython/pull/15645
msg353398 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-27 20:08
> I haven't been able to verify that the last patches are sufficient to prevent Python from segfaulting. We need to wait until rc1 is out. I need a proper release or RPM build to run the tests on our internal test infrastructure.

I pushed two fixes in the 3.8 branch for FreeIPA:

* I removed func_clear() which triggered the bug
* I fixed a reference cycle... in the weakref module :-)

As Pablo wrote, this issue is still open since the real deepest root issue is not fixed yet.

I was busy with random other stuff last weeks, so I didn't find time to dig into this issue again.

IMHO fixing the root issue can wait for Python 3.9 (and Python 3.8.1).

FreeIPA should now work fine on the current Python 3.8 branch.
msg353399 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-27 20:09
I remove the "release blocker" priority since the regression has been fixed in Python 3.8. The main "regression" was the addition of the func_clear() function: I removed it.
msg353410 - (view) Author: Neil Schemenauer (nascheme) * (Python committer) Date: 2019-09-27 20:51
A few comments from the "peanut gallery".  Thanks to Victor and Pablo for doing the hard work of investigating this bug.

First, it has been a long time since Tim and I first developed gcmodule.c.  So, some of my understanding may be inaccurate due to code changes or old age memory loss. ;-P

If I understand Victor's test case correctly, the problem is caused if you have an extension type that implements tp_traverse but not tp_clear and that there is also a weakref involved in the trash cycle.  In my original design of the GC, not implementing tp_clear should have been okay.  It would mean that the GC might find garbage reference cycles that it couldn't cleanup.  Those would leak memory but would not crash Python.

I'm not sure what has changed since to require that tp_clear actually successfully clears the cycle.  Tim was the origin of the code that handles weakrefs. The GC is (relatively) simple if not for handling weakrefs and finalizers.  Tim did almost all of the difficult and brain exploding work on that stuff.  There was a big re-org some years ago to handle "legacy finalizers" and PEP 442 finalizers.  That made things more complicated yet.  Maybe the requirement for a working tp_clear came into existence then?  I added Tim to the nosy list since he might have insight.

To me, it seems problematic that we would require a type to have a tp_clear method that actually breaks cycles.  For mutable objects like dicts and lists, we can have tp_clear do its thing while leaving the object in a valid state.  The tp_clear added for functions was not like that because other code expected structure fields to be non-NULL.  At least that's my understanding.

Is the behavior of tp_clear the key to this bug?  In the original design GC, calling tp_clear was not guaranteed to cause objects in the cylce to be freed.  So, after tp_clear, the object still needs to be in some kind of valid state.  That is one of the reasons why the tuple type doesn't have a tp_clear (also that they can't normally be directly be used to create cycles).

One last observation: the reference cycle GC is now being relied upon in ways that were never originally envisioned.  Originally the cyclic GC was an optional feature and you could build Python without it.  Reference cycles were supposed to exist in rare cases and only be created by user code.  In that world, having tp_clear on the function type is be unneeded.  Now, CPython is creating reference cycles itself and the GC is being relied on much more to free memory.  I'm afraid we are evolving into a complicated and low performance GC (refcount + cycle collection).  I don't have any good suggestions on a better approach.  Maintaining the C-API ties us to reference counting.  I mention it because it might be helpful to have a high-level view of the evolution of this part of CPython.
msg353412 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-27 21:14
> Is the behavior of tp_clear the key to this bug?

Once func_clear(my_func) is called, calling my_func() will crash: my_func() is unsuable.

Because of a complex dance involving borrowed references, the function is called *after* it's cleared.

Pablo's PR 15645 works around the problem by detecting this very specific case: call a function "after it's cleared" (while it's being garbage collected in practice).

Honestly, I'm still not sure how *exactly* the bug is triggered. I tried but failed to reproduce the initial cffi crash. reproducer.tar.gz is still a giant piece of code. I failed to simplify it to a few objects.

According to Pablo, gc_crash.py is a *different* (but similar) bug.

I would only be confident in approving a fix once I would be confident that I understood *exactly* how the initial bug (reproducer.tar.gz) occurred.
msg353414 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-27 21:16
> If I understand Victor's test case correctly, the problem is caused if you have an extension type that implements tp_traverse but not tp_clear and that there is also a weakref involved in the trash cycle.

I'm not sure that "implements tp_traverse but not tp_clear" is enough to trigger the issue. It's unclear to me neither if it's a good practice or not to implement tp_clear... Since it seems like implementing tp_clear for functions (func_clear) caused this regression... Or maybe it's very specific to the magic combo: weak references with callback.
msg353439 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2019-09-28 03:59
tp_clear implementations are necessary to reclaim trash cycles.  They're always best practice for objects that may be in trash cycles.   tuples are just "cute rebels" that way ;-)

Best guess is that the (some) extension isn't playing by the rules.  A weakref callback shouldn't be called at all unless the weakref is reachable - but in that case the callback function is necessarily reachable too (revealed by a weakrefobject's tp_traverse), so delete_garbage() should never have called clear() on the callback function to begin with.

Note:  I don't believe the weakref design gave any thought to cyclic gc.  The PEP didn't even mention it.  I got dragged into it when Zope started segfaulting.  They don't play well together voluntarily, but they've been forced to coexist, and Zope long ago provoked every fundamental problem I'm aware of.

That "the rules" are subtle doesn't excuse not following them ;-)
msg353480 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2019-09-29 02:49
> call_func and remove are part of a reference cycle. A forced garbage
> collection breaks the cycle and removes the two objects, but they are
> not removed in the expected order:
>
> * first: call_func
> * then: remove
>
> The crash requires to destroy the objects in the reverse order

Victor, I'm not sure what you had in mind there, so please flesh it out if the following isn't clear?

In any case "like" the one you constructed, __del__  will always run first, with everything needed wholly intact.

Because tp_clear can leave an object in a useless (or worse) state, gc runs all finalizers and all weakref callbacks before tp_clear is run on anything.  tp_clear is used only in delete_garbage, which is the last non-trivial thing gc does.  Before then, clears and deallocations happen only as side effects of refcounting semantics while running finalizers and callbacks.  So refcount precedence order is respected, and nothing will get torn down before its refcount hits 0.

At the time gc first calls tp_clear, the intent is that no non-trivial code will run again - just tp_clear implementations applying Py_CLEAR to objects' members, and deallocations happening as refcounts fall to 0.  No finalizers, no callbacks, none, ever ;-)

So what I saw your program doing wasn't just an accident relying on the order the objects happened to appear in the list.  Regardless of that order, gc forces __del__ to run before tp_clear is applied to anything, and the call_func instance is wholly intact when its __del__ runs.
msg353506 - (view) Author: Neil Schemenauer (nascheme) * (Python committer) Date: 2019-09-29 18:44
Victor:
> I'm surprised that the function would be seen as "unreachable" if
> it's reference counter was equal to 135:

If all those references are contained within that garbage set, it is
possible.


Pablo:
> For the weakref to be handled correctly the ctypedescr needs to be
> identified correctly as part of the isolated cycle. If is not in
> the isolated cycle something may be missing tp_traverse or the GC
> flags. 

I think this analysis is correct.  I did some poking around with
using 'rr'.  It is very handy to run the program in reverse and use
conditional breakpoints based on pointer addresses.

The objects involved seem to be:

CF: cfield
CT: ctypedescr
F: function
W: weakref

They have the following structure:

CF -> CT -> W -> F

The first link the chain is not found by the GC because CF does not
implement tp_traverse.  All those objects are part of garbage kept
alive by the reference cycle and the GC calls delete_garbage() on
it.  CT is not part of 'unreachable' and handle_weakrefs() does not
find W.

What I previously said about tp_clear was incorrect.  We take great
pains to avoid running non-trivial code after calling tp_clear (no
finalizers and no callbacks).  So, the func_clear method should be
safe.  Even if tp_clear doesn't succeed in breaking the cycle, there
should be no way for user Python code to access those objects that
have had tp_clear called on them.  Is gc.get_objects() a hole in
this?  Seems like you could cause crashes with it.

This bug is surprising to me in another way.  If the analysis is
correct then fully implementing the GC protocols is no longer
optional (as it was when cyclic GC was first introduced).  If the GC
cannot find garbage weakrefs handle_weakrefs() doesn't work and we
end up with crashes like this.

Maybe Pablos fix to not call the weakref callback if we are in
delete_garbage() is a good enough band-aid.  The alternative would
seem to be ensuring that tp_traverse methods exist so that every
single reference can be followed by the GC.  I imagine that would be
a huge task and there are many extension types that don't have them.
msg353512 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2019-09-29 22:16
Sorry, this is very hard for me - broke an arm near the shoulder on Tuesday, and between bouts of pain and lack of good sleep, concentration is nearly impossible.  Typing with one hand just makes it worse :-(

We must know that F is trash, else we never would have called tp_clear(F).

We must NOT know CT is trash.  If we did know that, then we would have found W too, and then:

1. If we thought W was trash, we would have suppressed the callback.

2. If we thought W was not trash, we would have invoked the callback directly, long before delete_garbage even started (i.e., F would have been intact).

So if CT _is_ trash, we didn't know it.  If/when delete_garbage broke enough cyclea that CT's refcount fell to 0, other parts of Python would have invoked W's callback as a matter of course, with F possibly already tp_clear'ed.

Neil, at a high level it's not REALLY surprisimg that gc can screw up if it can't tell the difference between trash & treasure ;-)  Long ago, though, it's possible the only real failure mode was leaking objects that were actually unreachable.

Offhand not worried about get_objects().  That returns objects from generation lists, but gc moves possible trash among temporary internal (not generation) lists as it goes along.

Am concerned that weakref callbacks may not be the only hole.  We force-run finalizers before ever calling tp_clear for the same reason we force-run weakref callbacks:  finalizers may also need to access trash objecta while they're still sane.  Having finalizers on objects in trash cycles isn't special anymore (we used to refuse to run them).  But my brain is fried again for now ;-)
msg353515 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2019-09-29 23:49
Fleshing out something I left implicit:  if there's a trash object T with a finalizer but we don't KNOW it's trash, we won't force-run its finalizer before delete_garbage starts either.  Then, really the same thing:  we may tp_clear some piece of trash T's finalizer needs before enough cycles are broken that T's finalizer gets run as a routine consequence of T's refcount falling to 0.
msg353520 - (view) Author: Neil Schemenauer (nascheme) * (Python committer) Date: 2019-09-30 01:00
I hacked up my copy of CPython to add flags to PyObject to see the
GC state of objects.  That makes it easier to know if an object is
in the 'unreachable' list or not.

> We must know that F is trash, else we never would have called tp_clear(F).

Yup.  In the crashing test program, F is in 'unreachable'.

> We must NOT know CT is trash.  If we did know that, then we would
> have found W too, and then:

Yes again.  CT is not in 'unreachable'.

> So if CT _is_ trash, we didn't know it.  If/when delete_garbage
> broke enough cyclea that CT's refcount fell to 0, other parts of
> Python would have invoked W's callback as a matter of course, with
> F possibly already tp_clear'ed.

I haven't verified this but it must be what's happening.  Note that my flags
show that W *is* in 'unreachable'.  It has to be otherwise F would not have
tp_clear called on it.

> Neil, at a high level it's not REALLY surprisimg that gc can screw
> up if it can't tell the difference between trash & treasure ;-)
> Long ago, though, it's possible the only real failure mode was
> leaking objects that were actually unreachable.

It has been so long I can't be sure the GC worked that way but I think
it did.  It would only call tp_clear on things it could really prove
were garbage.  If tp_clear didn't actually break the cycle, the only
harmful result was leaking memory.  That behavior is similar to a
conservative GC.

> Offhand not worried about get_objects().  That returns objects from
> generation lists, but gc moves possible trash among temporary
> internal (not generation) lists as it goes along.

But when delete_garbage() gets done, it moves the object to 'old'.  I
think that means gc.get_objects() called after the collection completes
can reveal objects that have had their tp_clear called (if tp_clear
didn't result in them being freed).
msg353521 - (view) Author: Neil Schemenauer (nascheme) * (Python committer) Date: 2019-09-30 01:06
Since W is in the unreachable set, we should not be executing its callback.  Would the attached rough patch (gc_disable_wr_callback.txt) be a possible fix?  When we find W inside handle_weakrefs(), we mark it as trash and will not execute the callback.

This is similar to Pablo's bpo-38009 but I think attacks the problem in a better way.  Having func_clear() being run is only one possible bad outcome of executing the callback.  We want to avoid executing *any* user level Python code if the weakref has access to objects found by the GC and which have tp_clear called on them.
msg353522 - (view) Author: Neil Schemenauer (nascheme) * (Python committer) Date: 2019-09-30 01:43
> Fleshing out something I left implicit: if there's a trash object T
> with a finalizer but we don't KNOW it's trash, we won't force-run its
> finalizer before delete_garbage starts either.  Then, really the same
> thing: we may tp_clear some piece of trash T's finalizer needs before
> enough cycles are broken that T's finalizer gets run as a routine
> consequence of T's refcount falling to 0.

Definition: a 'valid' object is one that hasn't had tp_clear called

I think the difference is that non-weakref finalizers have strong
references to objects that they can access when they run.  So, if we
haven't found them, they will keep all the objects that they refer
to alive as well (subtract_refs() cannot account for those refs).
So those objects will all be valid.

There seems a hole though.  Non-weakref finalizers could have a
weakref (without callback) to something in the garbage set.  Then,
when the finalizer runs during delete_garbage(), that finalizer code
can see non-valid objects via the weakref.  I think this can only happen if there are missing/incomplete tp_traverse methods.

We can have finalizer code running during delete_garbage().  That
code should not have access to non-valid objects.  Weakrefs seem be
a way to violate that.  handle_weakrefs() take care of some of them
but it seems there are other issues.
msg353524 - (view) Author: Neil Schemenauer (nascheme) * (Python committer) Date: 2019-09-30 02:08
> We can have finalizer code running during delete_garbage().  That
> code should not have access to non-valid objects.  Weakrefs seem be
> a way to violate that.  handle_weakrefs() take care of some of them
> but it seems there are other issues.

I see that handle_weakrefs() calls _PyWeakref_ClearRef() and that
will clear the weakref even if it doesn't have callback.  So, I
think that takes care for the hole I was worried about.  I.e. a
__del__ method could have a weakref to an non-valid object.
However, because handle_weakrefs() will find that weakref, it will
have been cleared when the __del__ method executes.
msg353527 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2019-09-30 03:39
> Note that my flags show that W *is* in 'unreachable'.  It has
> to be otherwise F would not have tp_clear called on it.

Right!  W holds a strong reference to F, so if W were thought to be reachable, F would be too.  But F isn't.


> But when delete_garbage() gets done, it moves the object to 'old'.  I
> think that means gc.get_objects() called after the collection completes
> can reveal objects that have had their tp_clear called (if tp_clear
> didn't result in them being freed).

If the tp_clear implementations aren't complete enough to break all cycles, that could happen.

In a world that took cyclic trash "very seriously", it would move them to a new internal list instead, then at the end of the function die if that list isn't empty ("we ran all tp_clears but cyclic trash still remains:  your object implementations inherently leak memory").


> I think the difference is that non-weakref finalizers have strong
> references to objects that they can access when they run. 
> So, if we haven't found them, they will keep all the objects that
> they refer to alive as well (subtract_refs() cannot account for
> those refs).  So those objects will all be valid.

That's persuasive :-)  For essentially the same reason, if a "surprise" finalizer runs during delete_garbage, it can't ressurect (or in any other way access) anything we knew was trash.


> There seems a hole though.  Non-weakref finalizers could have a
> weakref (without callback) to something in the garbage set.  Then,
> when the finalizer runs during delete_garbage(), that finalizer
> code can see non-valid objects via the weakref.  I think this can
> only happen if there are missing/incomplete tp_traverse methods.

Which is what I mean by a "surprise" finalizer:  a trash object T with a finalizer but we don't KNOW T is trash.  If we know T is trash then we force-run its finalizer before delete_garbage starts, so it can only see valid objects.


> We can have finalizer code running during delete_garbage().

Except that was never the intent (quite the contrary!).  The more people have moved to demanding that gc be a full-featured all-purpose collector, the more important it becomes that types implement what such a thing needs.  At a bare minimum, any robust gc needs to be 100% sure of the difference between trash and non-trash, and tp_traverse is the one & only way we have to discover anything about the object graph.  "Almost all" types that can participate in cycles need to implement it.

Or suffer rare shutdown segfaults that somehow nobody ever managed to stumble into before ;-)
msg353528 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2019-09-30 03:56
> I see that handle_weakrefs() calls _PyWeakref_ClearRef() and that
> will clear the weakref even if it doesn't have callback.  So, I
> think that takes care for the hole I was worried about.  I.e. a
> __del__ method could have a weakref to an non-valid object.
> However, because handle_weakrefs() will find that weakref, it will
> have been cleared when the __del__ method executes.

There's no problem with objects we _know_ are trash.  Their finalizers are all force-run before delete_garbage starts.

It's "surprise" finalizers, triggered by refcounts falling to 0 while delete_garbage is running.  They're invisible to handle_weakrefs.

However, to the extent that delete_garage is effective in deallocating objects soon after clearing them, to that extent also will invalidated objects clear weak references to them as a matter of course.












'
msg353531 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2019-09-30 04:05
Ah, nevermind my last comment - yes. handle_weakrefs will clear all weakrefs to the objects we know are trash.
msg353580 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-30 14:43
> 2019-09-30 14:41:36	lukasz.langa	set	priority: release blocker

Why setting it back to release blocker? As far as I recall, this issue is not a Python 3.8 regression. The regression which triggered this old and existing bug has been fixed, see previous comments.
msg353592 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2019-09-30 16:22
> Would the attached rough patch (gc_disable_wr_callback.txt)
> be a possible fix?  When we find W inside handle_weakrefs(),
> we mark it as trash and will not execute the callback.

It's semantically correct since we never wanted to execute a callback from a trash weakref to begin with.  The nature of the error violated that intent:  the callback from a trash weakref occurs outside gcmodule, where it has no idea that the weakref is trash.  After your patch, it would know.

How effective is it?  Well, failure modes "like this" ;-)

If a callback function is trash, delete_garbage will eventually clear it.  But if the function is trash the weakref containing it must also be trash (if it weren't trash, neither would be the reachable-from-it callback function).  Then since the containing weakref is trash, your patch will find it and prevent the callback.

So this "should" fix the original problem, even if tp_clear were restored for function objects (which it probably should be).
msg353593 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2019-09-30 16:42
Neil, how about this alternative:  leave the weakref implementation alone.  If we find a trash weakref, simply clear it instead.  That would prevent callbacks too, & would also prevent the weakref from being used to retrieve its possibly-trash-too referent.
msg353596 - (view) Author: Neil Schemenauer (nascheme) * (Python committer) Date: 2019-09-30 16:51
> Why setting it back to release blocker? As far as I recall, this issue is not
> a Python 3.8 regression. The regression which triggered this old and existing
> bug has been fixed, see previous comments.

I leave it up to our glorious release manager to decide how to proceed.  IMHO,
GH-15826 (revert the addition of tp_clear for functions) and GH-15641 (Avoid
closure in weakref.WeakValueDictionary) are two cases of not fixing the real
bug, just hiding symptoms.  However, since this bug has existed for decades, I
wouldn't fault him for taking the conservative approach and not trying to
address the real bug in 3.8.

I've created a new PR (GH-16083) that I think is the correct fix.  If that PR
is applied, I think we should also restore tp_clear for functions (revert
GH-15826).  GH-15641 can probably stay.


Tim on my rough patch:
> It's semantically correct since we never wanted to execute a callback from a
> trash weakref to begin with

Thanks for your help Tim.  I think my PR is a cleaner fix that does essentially
the same thing.  Just call _PyWeakref_ClearRef() on weakrefs that are in
'unreachable'.
msg353597 - (view) Author: Neil Schemenauer (nascheme) * (Python committer) Date: 2019-09-30 16:53
Oops, I linked to wrong PR, my proposed fix is GH-16495.  It is the same as what Tim suggests in his last comment.
msg353598 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2019-09-30 16:54
Neil, my brief msg 10 minutes before yours suggested the same thing (just clear the weakref), so it must be right ;-)
msg353600 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2019-09-30 16:59
FWIW, I agree with Neil in all respects about the release:  his patch is the best approach, plugs segfaulting holes that have been there for many years, and the earlier patches aren't needed anymore.
msg353601 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2019-09-30 17:20
> If that PR is applied, I think we should also restore tp_clear for functions (revert GH-15826).

If that's safe and easy, let's go for it. Would it help with memory usage in functions or was BPO-33418 addressed in another way since?
msg353602 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2019-09-30 17:29
It's unclear to me whether BPO-33418 was a bug or a contrived annoyance :-)

If someone believes it was worth addressing, then what it did is the only way to fix it, so should be restored now.
msg353603 - (view) Author: Neil Schemenauer (nascheme) * (Python committer) Date: 2019-09-30 17:45
> Would [func tp_clear] help with memory usage in functions or was BPO-33418 addressed in another way since?

Having a tp_clear for all container objects that can be involved in reference cycles will help the GC free memory.  BPO-33418 may be contrived but it does demonstrate a problem that could happen in "real" code.  If you have a long-running program, tracking down a memory leak due to reference cycles can be fiendishly difficult.  That's why Tim originally wrote cyclops and why I started working on cyclic GC in the first place.

It is a "quality of implementation" issue.  It is not wrong for CPython to leak memory if you create reference cycles (not all types are required to implement the GC protocol).  I expect users would be surprised if it happens since the majority have never used Python without the cyclic GC.
msg353604 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2019-09-30 17:57
Is introducing tp_clear on functions a thing that has ABI consequences? In other words, if we take our time on this, would it land in 3.8.1 or 3.9.0?

I'm kind of nervous about the rate of change in the past 48 hours.
msg353605 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2019-09-30 18:01
Yes, it's better to have tp_clear than not for a variety of reasons (including setting examples of best practice).

Best I can tell, the patch for BPO-33418 was reverted _only_ to worm around the crash in _this_ report.  That's no longer needed.  Or, if it is, we've missed something fundamental in the fix for this bug.  In which case, reverting the other patch is the only clear way to find that out quickly.  I favor getting it right.
msg353606 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2019-09-30 18:10
Now we only need a volunteer to prepare a PR to revert the revert... :>
msg353607 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2019-09-30 18:21
Łukasz, all type objects have tp_clear slots, and always did.  The patch in question put something useful in the function object's tp_clear slot instead of leaving it NULL.  No interface, as such, changes either way.
msg353608 - (view) Author: Neil Schemenauer (nascheme) * (Python committer) Date: 2019-09-30 18:26
> Is introducing tp_clear on functions a thing that has ABI consequences? In other words, if we take our time on this, would it land in 3.8.1 or 3.9.0?

I think it should not have ABI consequences.  However, I see the addition of tp_clear as a new feature.  Leaking memory due to reference cycles is bad behavior but not a bug to be fixed in a maintenance release.  Unless we require full GC protocol on every container object, we can't promise not to leak.

Inaka's change to add func tp_clear has been in all the 3.8.0 pre-releases (I think).  So, it should be pretty well exercised by now (at least, as well as the per-releases are tested).
msg353610 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2019-09-30 18:58
If either of you resurrects tp_clear in functions in the next 12 hours or so, I'll merge it.
msg353623 - (view) Author: Neil Schemenauer (nascheme) * (Python committer) Date: 2019-09-30 20:50
I created GH-16502.  I'm not exactly sure of the process of making a revert on a branch like '3.8' so hopefully it is correct.  The code change is exactly what has been reverted in 3.8.  The net effect will be as if the revert was never done.
msg353631 - (view) Author: Neil Schemenauer (nascheme) * (Python committer) Date: 2019-10-01 03:10
I worked some on trying to create a unit test this evening.  Attached is my best result so far (gc_weakref_bug_demo.py).  It requires that you enable the 'xx' module so we have a container object without tp_traverse.  We should probably add one to _testcapi for the real unit test so we have it.

I can make the weakref callback run from within delete_garbage().  I haven't been able to make tp_clear for the function execute before the callback and I think that is needed to replicate the crash in this bug report.
msg353642 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2019-10-01 07:49
New changeset b3612070b746f799901443b65725008bc035872b by Łukasz Langa (Neil Schemenauer) in branch '3.8':
Restore tp_clear for function object. (#16502)
https://github.com/python/cpython/commit/b3612070b746f799901443b65725008bc035872b
msg353643 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2019-10-01 07:51
Thanks a million everyone, this is now hopefully solved. I'm leaving it open for Neil to experiment with his unit test (which would be amazing to have!).
msg353708 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2019-10-01 18:29
Neil, about this comment:

#   - ct is not yet trash (it actually is but the GC doesn't know because of
#     the missing tp_traverse method).

I believe gc should know ct is trash.  ct is in the cf list, and the latter does have tp_traverse.

What gc won't know is that `a` is trash, because `a` is attached to ct, and ct doesn't have tp_traverse.

It should blow up anyway :-)

Maybe with a simpler structure it would be easier to rearrange code to nudge the callback into getting cleared before use?

Z <- Y <- A <-> B -> WZ -> C

where WZ is a weakref to Z with callback C, and Y doesn't implement tp_traverse.  The only cycle is between A and B, which could just as well be the same object.  All the other stuff hangs off that cycle.

It's all trash, but we won't know in advance that Z is part of it.
msg353713 - (view) Author: Neil Schemenauer (nascheme) * (Python committer) Date: 2019-10-01 20:55
Łukasz, is there some reason you removed old versions (2.7, 3.6, etc)?  The bug is present on those versions of Python and it should be trivial to backport the fix.  If those branches are maintained, I think we should fix it.

Attached is a small test program (gc_weakref_bug_demo2.py) that causes the same crash as in this bug report (SEGV inside _PyFunction_Vectorcall).  Setting things up is quite tricky and it depends on the order that things are cleared in delete_garbage().  Maybe still worth making a unit test for it?

Instead of using the 'dummy' function, you can also use types.SimpleNamespace().  Calling repr() on it after tp_clear will result in a crash or an assert failure.  Those two crashes are not needed to confirm the bug.  Just the fact that 'callback' runs is proof enough.  So, in a unit test, the callback to just set a global to indicate failure.
msg353726 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-10-02 08:23
Oh, Neil missed "bpo-38006: " prefix in his commit. Here it is:

commit bcda460baf25062ab68622b3f043f52b9db4d21d
Author: Neil Schemenauer <nas-github@arctrix.com>
Date:   Mon Sep 30 10:06:45 2019 -0700

    Clear weakrefs in garbage found by the GC (#16495)
    
    Fix a bug due to the interaction of weakrefs and the cyclic garbage
    collector. We must clear any weakrefs in garbage in order to prevent
    their callbacks from executing and causing a crash.
msg353729 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2019-10-02 10:21
> Łukasz, is there some reason you removed old versions (2.7, 3.6, etc)?  The bug is present on those versions of Python and it should be trivial to backport the fix.  If those branches are maintained, I think we should fix it.

Sorry, I did that by mistake.
msg353867 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2019-10-03 17:57
Loose ends.  Telegraphic because typing is hard.

1. Docs should be changed to encourage implementing the full gc protocol for "all" containers.  Spell out what can go wrong if they don't.  Be upfront about that history has, at times, proved us too optimistic about that ever since weakrefs were added.

2. Surprise finalizers (SF).  All weakrefs to trash objects were already cleared.  So if a SF can access trash, it must be via a chain of strong refs.  But in that case, they wouldn't be trash to begin with (since they're reachable from the SF, which is a S because we believed it wasn't trash).

So best optimistic guess is that only this can go wrong:  we promise that a finalizer will run at most once (even if the object is resurrected), but a SF neither records that it has been run nor recognizes that (if so) it's already been run.  IOW, a finalizer will be run at most once when it's not a surprise, but will run every time it is a SF.

3. get_objects() exposing invalid objects.  That's no good.  Not due to missing tp_traverse, but that the aggregate of trash objects revealed by tp_traverses exceed the aggregate of those reclaimed by tp_clears and subsequent refcounting.

Must not move invalids to older generation.  Move to new internal (to delete_garbage) list instead.  That list should be empty at delete_garbage's end.  If not, I'd be happy to die with a fatal runtime error.  Else, e.g.,

- add to count of trash objects that could not be collected

- append them to gc.garbage, each as the sole attribute of a new (say) `_GCInvalid` container, whose repr/str only shows the address of the invalid object.  So the top-level objects in gc.garbage remain valid, but the daring can access the invalid objects indirectly.  At least their type pointers should be intact.
msg353875 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-10-03 20:52
> 1. Docs should be changed to encourage implementing the full gc protocol for "all" containers.  Spell out what can go wrong if they don't.  Be upfront about that history has, at times, proved us too optimistic about that ever since weakrefs were added.

Would it make any sense to add an opt-in option to emit a warning when a new type is created with Py_TPFLAGS_HAVE_GC but it doesn't implement tp_traverse? Maybe also emit a warning if it doesn't implement tp_clear?

Maybe it could be a ResourceWarning emitted in development mode, when -X dev is used on the command line.
msg353883 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2019-10-03 22:35
> Would it make any sense to add an opt-in option to emit a warning when a new type is created with Py_TPFLAGS_HAVE_GC but it doesn't implement tp_traverse?

Or even fatal error out?  There's no reason to have Py_TPFLAGS_HAVE_GC but not implement tp_traverse, AFAIR.
msg353884 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-10-03 22:41
> Or even fatal error out?  There's no reason to have Py_TPFLAGS_HAVE_GC but not implement tp_traverse, AFAIR.

Well... It would prefer a smooth transition.

Such sanity checks which mostly concerns developers perfectly fit the -X dev mode.

I mean, calling Py_FatalError() only in development mode would be acceptable, rather than blocking the usage of C extensions which are working "perfectly well" on Python 3.7.
msg353896 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2019-10-04 03:40
My understanding is that the CFFI types at issue don't even have Py_TPFLAGS_HAVE_GC.  They're completely invisible to gc.  As Armin says in the CFFI issue report (linked to earlier), he never got the impression from the docs that he needed to implement anything related to cyclic gc.

Since Armin is extremely capable, conscientious, and reasonable, that tells me our docs are lacking.

It was intended at the start that the worst that could happen if a type ignored the gc protocols was that memory may leak.  That story changed radically when weakrefs with callbacks were added - but nobody knew it at the time because the weakref design gave no thought to cyclic gc.  It's been driven by segfaults ever since ;-)

We're doing all we can to keep non-cooperating code "working", and will continue to do so.  But who knows?  The next segfault may be one we can't hack around.  It's fundamentally insane to expect any gc to work perfectly when it may be blind to what the object graph _is_.
msg353961 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2019-10-04 17:51
BTW, the phrase "missing tp_traverse" is misleading.  If an object with a NULL tp_traverse appears in a gc generation, gc will blow up the next time that generation is collected.  That's always been so - gc doesn't check whether tp_traverse is NULL, it just calls it.  It's tp_clear that it checks, because that one is optional.

I don't recall any problem we've had with extensions implementing the gc protocol incorrectly or incompletely.  It's this issue's problem:  containers not participating in gc _at all_.

If we had a traditional mark-sweep collector, that would be massively catastrophic.  Miss a pointer and you can conclude a live object is actually trash, and tear it down while it's still in use.

Our problem is the opposite:  miss a pointer and we can conclude a trash object is actually live.  At the start, the worst that _could_ do is leak memory.  It's the later introduction of time-to-die finalizers (weakref callbacks at first) that created the opportunity for segfaults:  the only 100% clearly safe way to run finalizers in cyclic trash is to force them to run _before_ anything at all is torn down by force (tp_clear).  But to run them in advance, we have to know the relevant objects _are_ trash.  Which we can't always know if containers don't always participate.

While Neil & I haven't thought of ways that can go wrong now beyond that a "surprise finalizer" may get run any number of times, that doesn't mean far worse things can't happen - just that they'll surprise us when they do :-)
msg354674 - (view) Author: Neil Schemenauer (nascheme) * (Python committer) Date: 2019-10-14 22:44
> It's fundamentally insane to expect any gc to work perfectly when it may be blind to what the object graph _is_.

I'm often amazed it works at all, let alone perfectly. ;-P

This did trigger me to think of another possible problem.  I setup my unit test as you suggested:

        #   Z <- Y <- A--+--> WZ -> C
        #             ^  |
        #             +--+
        # where:
        #   WZ is a weakref to Z with callback C
        #   Y doesn't implement tp_traverse
        #   A contains a reference to itself, Y and WZ

But what happens if the GC doesn't see that WZ is trash?  Then it will not be cleared.  Hang it off Y so the GC can't find it.  We can set things up so that Z is freed before WZ (e.g. WZ in a second and newer cycle).  Then, the callback might still run.

On further thought, this seems safe (but maybe broken) because of the handle_weakrefs() logic.  The GC will think WZ is going to outlive Z so it will call it before doing any tp_clear calls.
msg354678 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2019-10-15 00:19
> I'm often amazed it works at all, let alone perfectly. ;-P

Indeed!  Every time I take a break from gc and come back, I burn another hour wondering why it doesn't recycle _everything_ ;-)

> But what happens if the GC doesn't see that WZ is trash?
> Then it will not be cleared.  Hang it off Y so the GC
> can't find it.

In that case it won't think C is trash either (a weakref holds a strong reference to its callback), so C won't be tp_clear'ed - if WZ isn't in the unreachable set, C can't be either.

> We can set things up so that Z is freed before WZ (e.g.
> WZ in a second and newer cycle).  Then, the callback
> might still run.

Since C hasn't been damaged, and nothing reachable from C would have been considered trash either, I don't see a problem.

> On further thought, this seems safe (but maybe broken) because
> of the handle_weakrefs() logic.  The GC will think WZ is going
> to outlive Z so it will call it before doing any tp_clear calls.

Don't think that applies.  As above, WZ isn't in the unreachable set, so gc knows nothing about it.

Z will eventually be reclaimed via refcounting side effect, C _will_ run, and then WZ, and then C, will be reclaimed by refcounting.

Or am I missing something?
msg354685 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2019-10-15 03:01
> While Neil & I haven't thought of ways that can go wrong now
> beyond that a "surprise finalizer" may get run any number of
> times ...

Speaking of which, I no longer believe that's true.  Thanks to the usual layers of baffling renaming via macro after macro, I missed object.c's PyObject_CallFinalizerFromDealloc(), which both respects and sets the "am I finalized?" flag regardless of how, e.g., an object with a __del__ ends its life (and possibly resurrects).
msg354766 - (view) Author: Neil Schemenauer (nascheme) * (Python committer) Date: 2019-10-16 03:56
New changeset 392a13bb9346331b087bcd8bb1b37072c126abee by Neil Schemenauer in branch 'master':
bpo-38006: Add unit test for weakref clear bug (GH-16788)
https://github.com/python/cpython/commit/392a13bb9346331b087bcd8bb1b37072c126abee
History
Date User Action Args
2019-10-16 03:56:53naschemesetmessages: + msg354766
2019-10-15 13:36:27vstinnersetnosy: - vstinner
2019-10-15 03:01:38tim.peterssetmessages: + msg354685
2019-10-15 00:19:05tim.peterssetmessages: + msg354678
2019-10-14 22:44:02naschemesetmessages: + msg354674
2019-10-14 19:22:30naschemesetpull_requests: + pull_request16348
2019-10-04 17:51:10tim.peterssetmessages: + msg353961
2019-10-04 03:40:38tim.peterssetmessages: + msg353896
2019-10-03 22:41:55vstinnersetmessages: + msg353884
2019-10-03 22:35:15pitrousetmessages: + msg353883
2019-10-03 20:52:20vstinnersetmessages: + msg353875
2019-10-03 17:57:54tim.peterssetmessages: + msg353867
2019-10-02 10:21:06lukasz.langasetmessages: + msg353729
versions: + Python 2.7, Python 3.5, Python 3.6, Python 3.7
2019-10-02 08:23:36vstinnersetmessages: + msg353726
2019-10-01 20:55:23naschemesetfiles: + gc_weakref_bug_demo2.py

messages: + msg353713
2019-10-01 18:29:30tim.peterssetmessages: + msg353708
2019-10-01 07:51:08lukasz.langasetpriority: release blocker -> normal

messages: + msg353643
versions: - Python 2.7, Python 3.5, Python 3.6, Python 3.7
2019-10-01 07:49:41lukasz.langasetmessages: + msg353642
2019-10-01 03:10:16naschemesetfiles: + gc_weakref_bug_demo.py

messages: + msg353631
2019-09-30 20:50:40naschemesetnosy: + larry, ned.deily, benjamin.peterson

versions: + Python 2.7, Python 3.5, Python 3.6, Python 3.7
2019-09-30 20:50:01naschemesetmessages: + msg353623
2019-09-30 20:45:00naschemesetpull_requests: + pull_request16092
2019-09-30 18:58:03lukasz.langasetmessages: + msg353610
2019-09-30 18:26:46naschemesetmessages: + msg353608
2019-09-30 18:21:58tim.peterssetmessages: + msg353607
2019-09-30 18:10:38lukasz.langasetmessages: + msg353606
2019-09-30 18:01:14tim.peterssetmessages: + msg353605
2019-09-30 17:57:21lukasz.langasetmessages: + msg353604
2019-09-30 17:45:13naschemesetmessages: + msg353603
2019-09-30 17:29:20tim.peterssetmessages: + msg353602
2019-09-30 17:20:48lukasz.langasetmessages: + msg353601
2019-09-30 17:09:23miss-islingtonsetpull_requests: + pull_request16088
2019-09-30 16:59:24tim.peterssetmessages: + msg353600
2019-09-30 16:54:26tim.peterssetmessages: + msg353598
2019-09-30 16:53:36naschemesetmessages: + msg353597
2019-09-30 16:51:20naschemesetmessages: + msg353596
2019-09-30 16:42:38tim.peterssetmessages: + msg353593
2019-09-30 16:38:28naschemesetpull_requests: + pull_request16083
2019-09-30 16:22:22tim.peterssetmessages: + msg353592
2019-09-30 14:43:20vstinnersetmessages: + msg353580
2019-09-30 14:41:36lukasz.langasetpriority: release blocker
2019-09-30 04:05:05tim.peterssetmessages: + msg353531
2019-09-30 03:56:41tim.peterssetmessages: + msg353528
2019-09-30 03:39:44tim.peterssetmessages: + msg353527
2019-09-30 02:08:05naschemesetmessages: + msg353524
2019-09-30 01:43:32naschemesetmessages: + msg353522
2019-09-30 01:06:31naschemesetfiles: + gc_disable_wr_callback.txt

messages: + msg353521
2019-09-30 01:00:51naschemesetmessages: + msg353520
2019-09-29 23:49:50tim.peterssetmessages: + msg353515
2019-09-29 22:16:50tim.peterssetmessages: + msg353512
2019-09-29 18:44:18naschemesetmessages: + msg353506
2019-09-29 02:49:18tim.peterssetmessages: + msg353480
2019-09-28 03:59:19tim.peterssetmessages: + msg353439
2019-09-27 21:16:47vstinnersetmessages: + msg353414
2019-09-27 21:14:41vstinnersetmessages: + msg353412
2019-09-27 20:51:49naschemesetnosy: + tim.peters, nascheme
messages: + msg353410
2019-09-27 20:09:01vstinnersetpriority: release blocker -> (no value)

messages: + msg353399
2019-09-27 20:08:19vstinnersetmessages: + msg353398
2019-09-27 17:00:12pablogsalsetmessages: + msg353378
2019-09-27 14:45:45christian.heimessetmessages: + msg353371
2019-09-10 14:32:57vstinnersetmessages: + msg351676
2019-09-09 22:31:25vstinnersetmessages: + msg351561
2019-09-09 16:24:19miss-islingtonsetnosy: + miss-islington
messages: + msg351519
2019-09-09 15:00:00vstinnersetpull_requests: + pull_request15442
2019-09-09 14:56:12miss-islingtonsetpull_requests: + pull_request15440
2019-09-09 14:56:03vstinnersetmessages: + msg351492
2019-09-04 10:49:03inada.naokisetnosy: + inada.naoki
2019-09-03 19:30:29vstinnersetmessages: + msg351099
2019-09-03 19:22:20vstinnersetmessages: + msg351098
2019-09-03 14:20:39vstinnersetfiles: + gc_crash.patch
2019-09-03 14:20:30vstinnersetfiles: + gc_crash.py

messages: + msg351086
2019-09-03 12:45:01vstinnersetmessages: + msg351083
2019-09-03 12:21:13vstinnersetmessages: + msg351082
2019-09-03 08:45:31vstinnersetfiles: + reproducer.tar.gz

messages: + msg351077
2019-09-03 07:56:10christian.heimessetmessages: + msg351072
2019-09-03 07:07:54christian.heimessetmessages: + msg351067
2019-09-02 21:53:39vstinnersetmessages: + msg351042
2019-09-02 21:49:58vstinnersetmessages: + msg351041
2019-09-02 21:38:36vstinnersetmessages: + msg351040
2019-09-02 21:25:34vstinnersetmessages: + msg351039
2019-09-02 21:23:30vstinnersetmessages: + msg351038
2019-09-02 21:09:22vstinnersetmessages: + msg351037
2019-09-02 19:38:25pablogsalsetmessages: - msg351022
2019-09-02 16:26:02pablogsalsetmessages: + msg351022
2019-09-02 16:23:15pablogsalsetmessages: + msg351020
2019-09-02 16:10:40christian.heimessetmessages: + msg351017
2019-09-02 15:44:02pablogsalsetmessages: + msg351015
2019-09-02 15:15:57pablogsalsetmessages: + msg351012
2019-09-02 15:13:37petr.viktorinsetmessages: + msg351010
2019-09-02 15:11:22christian.heimessetmessages: + msg351009
2019-09-02 15:06:40pablogsalsetmessages: + msg351007
2019-09-02 14:46:17petr.viktorinsetmessages: + msg351006
2019-09-02 13:15:12pablogsalsetmessages: + msg350999
2019-09-02 13:09:25vstinnersetmessages: + msg350998
2019-09-02 13:07:54vstinnersetmessages: + msg350997
2019-09-02 12:55:34petr.viktorinsetmessages: + msg350996
2019-09-02 12:46:24pablogsalsetmessages: + msg350995
2019-09-02 12:41:34pablogsalsetmessages: + msg350992
2019-09-02 12:34:51pablogsalsetmessages: + msg350991
2019-09-02 12:29:24pablogsalsetnosy: + pitrou, pablogsal
2019-09-02 12:09:12christian.heimessetmessages: + msg350990
2019-09-02 11:40:33vstinnersetmessages: + msg350987
2019-09-02 11:37:57vstinnersetkeywords: + patch
stage: needs patch -> patch review
pull_requests: + pull_request15307
2019-09-02 11:27:53vstinnersetmessages: + msg350985
2019-09-02 11:10:03vstinnersettitle: _PyFunction_Vectorcall() can segfault on process exit -> Crash in remove() weak reference callback of weakref.WeakValueDictionary at Python exit
2019-09-02 11:08:05vstinnersetmessages: + msg350983
2019-09-02 11:00:31xtreaksetnosy: + jdemeyer
2019-09-02 10:57:55vstinnersetmessages: + msg350981
2019-09-02 10:11:48christian.heimessetmessages: + msg350976
2019-09-02 10:06:56petr.viktorinsetmessages: + msg350975
2019-09-02 09:48:27christian.heimescreate