classification
Title: Add gc.enable_object_debugger(): detect corrupted Python objects in the GC
Type: enhancement Stage: patch review
Components: Interpreter Core Versions: Python 3.8
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: inada.naoki, pablogsal, serhiy.storchaka, vstinner
Priority: normal Keywords: patch

Created on 2019-03-21 11:02 by vstinner, last changed 2019-04-15 15:54 by vstinner.

Pull Requests
URL Status Linked Edit
PR 12480 open vstinner, 2019-03-21 11:14
PR 12770 merged vstinner, 2019-04-11 02:07
PR 12780 closed vstinner, 2019-04-11 09:45
PR 12782 merged vstinner, 2019-04-11 10:38
PR 12788 merged vstinner, 2019-04-11 15:47
PR 12803 merged vstinner, 2019-04-12 13:51
PR 12844 merged vstinner, 2019-04-15 15:16
Messages (16)
msg338536 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-03-21 11:02
That's the follow-up of a thread that I started on python-dev in June 2018:

  [Python-Dev] Idea: reduce GC threshold in development mode (-X dev)
  https://mail.python.org/pipermail/python-dev/2018-June/153857.html

When an application crash during a garbage collection, we are usually clueless about the cause of the crash. The crash usually occur in visit_decref() on a corrupted Python object. Sadly, not only there are too many possible reasons which can explain why a Python object is corrupted, but the crash usually occur too late after the object is corrupted. Using a smaller GC threshold can help, but it's not enough.

It would help to be able to enable a builtin checker for corrupted objects. Something that we would triggered by the GC with a threshold specified by the user and that would have zero impact on performance when it's not used.

The implementation would be to iterate on objects and ensure that they are consistent.

Attached PR is an implementation of this idea. It uses new API that I wrote recently:

* _PyObject_ASSERT()
* _PyObject_IsFreed()
* _PyType_CheckConsistency()
* _PyUnicode_CheckConsistency()
* _PyDict_CheckConsistency()

If an inconsistency is detected, _PyObject_ASSERT() will call _PyObject_Dump() to dump info about the object. This function can crash, but well, anything can crash on a memory corruption...
msg338538 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-03-21 12:53
Hum, _PyType_CheckConsistency() fails on the following assertion during Python finalization:

    ASSERT(type->tp_mro != NULL && PyTuple_Check(type->tp_mro));

Error:
---
/home/vstinner/prog/python/master/python: No module named asyncio.__main__; 'asyncio' is a package and cannot be directly executed
Objects/typeobject.c:149: _PyType_CheckConsistency: Assertion "(type->tp_mro != ((void *)0) && ((((((PyObject*)(type->tp_mro))->ob_type))->tp_flags & ((1UL << 26))) != 0))" failed
Enable tracemalloc to get the memory block allocation traceback

object  : <enum 'AddressFamily'>
type    : EnumMeta
refcount: 1
address : 0x9138f0
Fatal Python error: _PyObject_AssertFailed

Current thread 0x00007ffff7be8740 (most recent call first):
---

gdb traceback:
---
(gdb) where
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007ffff7c0f895 in __GI_abort () at abort.c:79
#2  0x000000000055bf91 in fatal_error (prefix=0x0, msg=0x68a47d "_PyObject_AssertFailed", status=-1) at Python/pylifecycle.c:2088
#3  0x000000000055bfbd in Py_FatalError (msg=0x68a47d "_PyObject_AssertFailed") at Python/pylifecycle.c:2098
#4  0x000000000047be0f in _PyObject_AssertFailed (obj=<EnumMeta() at remote 0x9138f0>, 
    expr=0x68d720 "(type->tp_mro != ((void *)0) && ((((((PyObject*)(type->tp_mro))->ob_type))->tp_flags & ((1UL << 26))) != 0))", msg=0x0, 
    file=0x68d632 "Objects/typeobject.c", line=149, function=0x690ba0 <__func__.14343> "_PyType_CheckConsistency") at Objects/object.c:2197
#5  0x000000000048ebd8 in _PyType_CheckConsistency (type=0x9138f0) at Objects/typeobject.c:149
#6  0x00000000004770a5 in _PyObject_CheckConsistency (op=<AddressFamily at remote 0x7fffea7a0870>) at Objects/object.c:35
#7  0x000000000058de3c in gc_check_object (gc=0x7fffea7a0860) at Modules/gcmodule.c:1280
#8  0x000000000058de90 in _PyGC_CheckAllObjets () at Modules/gcmodule.c:1290
#9  0x000000000058f4a0 in gc_check_object_debugger () at Modules/gcmodule.c:2007
#10 0x000000000058f7c7 in PyObject_GC_Del (op=0x7fffea7a27c0) at Modules/gcmodule.c:2109
#11 0x0000000000467188 in dict_dealloc (mp=0x7fffea7a27c0) at Objects/dictobject.c:1996
#12 0x000000000047be44 in _Py_Dealloc (op={}) at Objects/object.c:2212
#13 0x0000000000461cce in _Py_DECREF (filename=0x6855a0 "./Include/object.h", lineno=533, op={}) at ./Include/object.h:470
#14 0x0000000000461d1c in _Py_XDECREF (op={}) at ./Include/object.h:533
#15 0x0000000000462fef in free_keys_object (keys=0x9120e0) at Objects/dictobject.c:580
#16 0x000000000046284f in dictkeys_decref (dk=0x9120e0) at Objects/dictobject.c:324
#17 0x00000000004664f2 in PyDict_Clear (op={}) at Objects/dictobject.c:1722
#18 0x00000000004969b8 in type_clear (type=0x911c50) at Objects/typeobject.c:3637
#19 0x0000000000490d03 in subtype_clear (self=<EnumMeta() at remote 0x911c50>) at Objects/typeobject.c:1118
#20 0x000000000058d390 in delete_garbage (collectable=0x7fffffffcef0, old=0x7d2560 <_PyRuntime+416>) at Modules/gcmodule.c:931
#21 0x000000000058d833 in collect (generation=2, n_collected=0x0, n_uncollectable=0x0, nofail=1) at Modules/gcmodule.c:1100
#22 0x000000000058f175 in _PyGC_CollectNoFail () at Modules/gcmodule.c:1915
#23 0x000000000054759c in PyImport_Cleanup () at Python/import.c:589
#24 0x000000000055a442 in Py_FinalizeEx () at Python/pylifecycle.c:1162
#25 0x000000000055c1e5 in Py_Exit (sts=1) at Python/pylifecycle.c:2188
#26 0x00000000005677fb in handle_system_exit () at Python/pythonrun.c:642
#27 0x0000000000567821 in PyErr_PrintEx (set_sys_last_vars=1) at Python/pythonrun.c:652
#28 0x00000000005674c5 in PyErr_Print () at Python/pythonrun.c:548
#29 0x00000000004222f2 in pymain_run_module (modname=0x7db6d0 L"asyncio", set_argv0=1) at Modules/main.c:566
#30 0x0000000000422ad6 in pymain_run_python (interp=0x7da350, exitcode=0x7fffffffd254) at Modules/main.c:799
#31 0x0000000000422c6a in pymain_main (args=0x7fffffffd2a0) at Modules/main.c:877
#32 0x0000000000422d4e in _Py_UnixMain (argc=3, argv=0x7fffffffd3c8) at Modules/main.c:922
#33 0x0000000000420796 in main (argc=3, argv=0x7fffffffd3c8) at ./Programs/python.c:16
---

Maybe my assumption on tp_mro was wrong. I will remove the assertion.
msg338539 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-03-21 12:55
I'm not sure if I should include an unit test. WIP patch for that:

diff --git a/Modules/_testcapimodule.c b/Modules/_testcapimodule.c
index 350ef77163..9c0d0cf41a 100644
--- a/Modules/_testcapimodule.c
+++ b/Modules/_testcapimodule.c
@@ -4718,6 +4718,18 @@ negative_refcount(PyObject *self, PyObject *Py_UNUSED(args))
 #endif
 
 
+static PyObject *
+corrupted_object(PyObject *self, PyObject *Py_UNUSED(args))
+{
+    PyObject *obj = PyList_New(0);
+    if (obj == NULL) {
+        return NULL;
+    }
+    obj->ob_type = NULL;
+    return obj;
+}
+
+
 static PyMethodDef TestMethods[] = {
     {"raise_exception",         raise_exception,                 METH_VARARGS},
     {"raise_memoryerror",       raise_memoryerror,               METH_NOARGS},
@@ -4948,6 +4960,7 @@ static PyMethodDef TestMethods[] = {
 #ifdef Py_REF_DEBUG
     {"negative_refcount", negative_refcount, METH_NOARGS},
 #endif
+    {"corrupted_object", corrupted_object, METH_NOARGS},
     {NULL, NULL} /* sentinel */
 };
 


Tested manually using this script:
---
import gc, _testcapi, sys

gc.enable_object_debugger(1)
x = _testcapi.corrupted_object()
y = []
y = None
# Debugger should trigger here
x = None
---
msg338541 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-03-21 13:05
It is better to not use assert(foo && bar). Use instead two separate asserts: assrte(foo) and assert(bar).
msg338542 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-03-21 14:14
> It is better to not use assert(foo && bar). Use instead two separate asserts: assrte(foo) and assert(bar).

Hum, I looked at my PR and I'm not sure that I added such new assertion.

Note: "assert" on calling assert(_PyDict_CheckConsistency(mp)) is only used to remove the call in release build, but the function always return 1. The function does uses assert() or _PyObject_ASSERT() internally with a different line number and the exact failing expression.

Do you want me to enhance existing _PyDict_CheckConsistency() assertions in the same PR?
msg338603 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2019-03-22 13:32
I don't think calling APIs like _PyDict_CheckConsistency() is super useful.  Can the PR find bugs like issue33803 quickly?

I think calling tp_traverse is better.


static int
check_object(PyObject *obj, void *unused)
{
    _PyObject_ASSERT(obj, Py_REFCNT(obj) > 0);
    return 0;
}

static void
gc_check_object(PyGC_Head *gc)
{
    PyObject *op = FROM_GC(gc);
    _PyObject_ASSERT(op, Py_REFCNT(obj) > 0);
    _PyObject_ASSERT(op, _PyObject_GC_IS_TRACKED(op));
    Py_Type(op)->tp_traverse(op, (visitproc)check_object, NULL);
}
msg338612 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-03-22 15:55
> I don't think calling APIs like _PyDict_CheckConsistency() is super useful.

I looked a multiple old issues which contain "visit_decref". Most of them are really strange crashes and were closed with a message like "we don't have enough info to debug, sorry". So honestly, I'm not sure of what is the most "efficient" way to detect corrupted objects.

I guess that we need a trade-off between completeness of the checks and the performance.

gc.enable_object_debugger(1) simply makes Python completely unusable. Maybe such very bad performance makes the feature basically useless. I'm not sure at this point.

I tried to find an old bug which mentioned "visit_decref", tried to reintroduced the fixed bug, but I'm not really convinced by my experimental tests so far.

That being said, I *like* your idea of reusing tp_traverse. Not only it fits very well into the gc module (I chose to put the new feature in the gc module on purpose), but it's closer to the existing "visit_decref crash". If someone gets a crash if visit_decref() and the object debugger uses tp_traverse, object debugger *will* catch the same bug. The expectation is to be able to get it early.

--

Oh by the way, why not using lower GC thresholds? I proposed this idea, but there are multiple issues with that. It can hide the bug (objects destroyed in a different order). It can also change the behavior of the application, which is linked to my previous point (again, objects destroyed in a different order).

That's how Serhiy Storchaka proposed the design of gc.enable_object_debugger(): traverse without touching the reference counter.
https://mail.python.org/pipermail/python-dev/2018-June/153860.html

Thanks Serhiy for this nice idea ;-)
msg339910 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-04-11 00:35
I modified my PR to reuse tp_traverse.

Inada-san: would you mind to review my change?
msg339911 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-04-11 00:40
Do you think that a gc.is_object_debugger_enabled() function would be needed?

The tracemalloc module has 3 functions:

* start(), stop()
* is_tracing()

The faulthandler module has 3 functions:

* enable() / disable()
* is_enabled()
msg339916 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-04-11 02:18
bpo-33803 bug can be reintroduced using the following patch:

diff --git a/Python/hamt.c b/Python/hamt.c
index 67af04c437..67da8ec22c 100644
--- a/Python/hamt.c
+++ b/Python/hamt.c
@@ -2478,8 +2478,10 @@ hamt_alloc(void)
     if (o == NULL) {
         return NULL;
     }
+#if 0
     o->h_count = 0;
     o->h_root = NULL;
+#endif
     o->h_weakreflist = NULL;
     PyObject_GC_Track(o);
     return o;

And then run:

./python -m test -v test_context

The best would be to also be able to catch the bug in:

./python -m test -v test_asyncio

Problem: Right now, my GC object debugger implementation is way too slow to use a threshold lower than 100, whereas the bug is catched like "immediately" using gc.set_threshold(5).

Maybe my implementation should be less naive: rather than always check *all* objects tracked by the GC, have different thresholds depending on the generation? Maybe reuse GC thresholds?
msg339948 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-04-11 09:33
New changeset 2b00db68554422ec37faba2a80179a0172df6349 by Victor Stinner in branch 'master':
bpo-36389: _PyObject_IsFreed() now also detects uninitialized memory (GH-12770)
https://github.com/python/cpython/commit/2b00db68554422ec37faba2a80179a0172df6349
msg339960 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-04-11 11:01
New changeset 4c409beb4c360a73d054f37807d3daad58d1b567 by Victor Stinner in branch 'master':
bpo-36389: Change PyMem_SetupDebugHooks() constants (GH-12782)
https://github.com/python/cpython/commit/4c409beb4c360a73d054f37807d3daad58d1b567
msg339966 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-04-11 11:23
Currently, I'm using the following patch to try to detect bpo-33803 bug using my GC object debugger. It's still a work-in-progress.

diff --git a/Lib/site.py b/Lib/site.py
index ad1146332b..e184080b19 100644
--- a/Lib/site.py
+++ b/Lib/site.py
@@ -638,3 +638,8 @@ def _script():

 if __name__ == '__main__':
     _script()
+
+if 'dev' in sys._xoptions:
+    import gc
+    gc.enable_object_debugger(100)
+    #gc.set_threshold(5)
diff --git a/Python/hamt.c b/Python/hamt.c
index 67af04c437..67da8ec22c 100644
--- a/Python/hamt.c
+++ b/Python/hamt.c
@@ -2478,8 +2478,10 @@ hamt_alloc(void)
     if (o == NULL) {
         return NULL;
     }
+#if 0
     o->h_count = 0;
     o->h_root = NULL;
+#endif
     o->h_weakreflist = NULL;
     PyObject_GC_Track(o);
     return o;
msg339995 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-04-11 20:30
New changeset 9e23f0a27cb8bf6e4ea1d2aef36a91502282bbc9 by Victor Stinner in branch '3.7':
[3.7] bpo-36389: _PyObject_IsFreed() now also detects uninitialized memory (GH-12770) (GH-12788)
https://github.com/python/cpython/commit/9e23f0a27cb8bf6e4ea1d2aef36a91502282bbc9
msg340101 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-04-12 19:51
New changeset 0fc91eef34a1d9194904fa093c9fbd711af0f26c by Victor Stinner in branch 'master':
bpo-36389: Add _PyObject_CheckConsistency() function (GH-12803)
https://github.com/python/cpython/commit/0fc91eef34a1d9194904fa093c9fbd711af0f26c
msg340287 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-04-15 15:54
New changeset 0810fa79885276114d1a94e2ce61da367ebb1ffc by Victor Stinner in branch 'master':
bpo-36389: Cleanup gc.set_threshold() (GH-12844)
https://github.com/python/cpython/commit/0810fa79885276114d1a94e2ce61da367ebb1ffc
History
Date User Action Args
2019-04-15 15:54:13vstinnersetmessages: + msg340287
2019-04-15 15:16:16vstinnersetpull_requests: + pull_request12769
2019-04-12 19:51:37vstinnersetmessages: + msg340101
2019-04-12 13:51:42vstinnersetpull_requests: + pull_request12730
2019-04-11 20:30:35vstinnersetmessages: + msg339995
2019-04-11 15:47:51vstinnersetpull_requests: + pull_request12715
2019-04-11 11:23:19vstinnersetmessages: + msg339966
2019-04-11 11:01:19vstinnersetmessages: + msg339960
2019-04-11 10:38:23vstinnersetpull_requests: + pull_request12709
2019-04-11 09:45:56vstinnersetpull_requests: + pull_request12707
2019-04-11 09:33:30vstinnersetmessages: + msg339948
2019-04-11 02:18:16vstinnersetmessages: + msg339916
2019-04-11 02:07:46vstinnersetpull_requests: + pull_request12699
2019-04-11 00:40:40vstinnersetmessages: + msg339911
2019-04-11 00:35:15vstinnersetmessages: + msg339910
2019-03-22 15:55:29vstinnersetmessages: + msg338612
2019-03-22 13:32:40inada.naokisetmessages: + msg338603
2019-03-21 14:14:02vstinnersetmessages: + msg338542
2019-03-21 13:05:14serhiy.storchakasetmessages: + msg338541
2019-03-21 12:55:24vstinnersetmessages: + msg338539
2019-03-21 12:53:46vstinnersetmessages: + msg338538
2019-03-21 11:14:16vstinnersetnosy: + inada.naoki
2019-03-21 11:14:00vstinnersetkeywords: + patch
stage: patch review
pull_requests: + pull_request12432
2019-03-21 11:02:41vstinnercreate