classification
Title: Add gc.enable_object_debugger(): detect corrupted Python objects in the GC
Type: enhancement Stage: resolved
Components: Interpreter Core Versions: Python 3.8
process
Status: closed Resolution: out of date
Dependencies: Superseder:
Assigned To: Nosy List: inada.naoki, pablogsal, serhiy.storchaka, vstinner
Priority: normal Keywords: patch

Created on 2019-03-21 11:02 by vstinner, last changed 2019-10-16 01:37 by vstinner. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 12480 closed vstinner, 2019-03-21 11:14
PR 12770 merged vstinner, 2019-04-11 02:07
PR 12780 closed vstinner, 2019-04-11 09:45
PR 12782 merged vstinner, 2019-04-11 10:38
PR 12788 merged vstinner, 2019-04-11 15:47
PR 12803 merged vstinner, 2019-04-12 13:51
PR 12844 merged vstinner, 2019-04-15 15:16
PR 16612 merged vstinner, 2019-10-07 10:14
PR 16624 merged vstinner, 2019-10-07 19:57
PR 16629 merged vstinner, 2019-10-07 21:18
PR 16796 merged vstinner, 2019-10-14 22:27
PR 16816 closed vstinner, 2019-10-16 01:37
Messages (24)
msg338536 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-03-21 11:02
That's the follow-up of a thread that I started on python-dev in June 2018:

  [Python-Dev] Idea: reduce GC threshold in development mode (-X dev)
  https://mail.python.org/pipermail/python-dev/2018-June/153857.html

When an application crash during a garbage collection, we are usually clueless about the cause of the crash. The crash usually occur in visit_decref() on a corrupted Python object. Sadly, not only there are too many possible reasons which can explain why a Python object is corrupted, but the crash usually occur too late after the object is corrupted. Using a smaller GC threshold can help, but it's not enough.

It would help to be able to enable a builtin checker for corrupted objects. Something that we would triggered by the GC with a threshold specified by the user and that would have zero impact on performance when it's not used.

The implementation would be to iterate on objects and ensure that they are consistent.

Attached PR is an implementation of this idea. It uses new API that I wrote recently:

* _PyObject_ASSERT()
* _PyObject_IsFreed()
* _PyType_CheckConsistency()
* _PyUnicode_CheckConsistency()
* _PyDict_CheckConsistency()

If an inconsistency is detected, _PyObject_ASSERT() will call _PyObject_Dump() to dump info about the object. This function can crash, but well, anything can crash on a memory corruption...
msg338538 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-03-21 12:53
Hum, _PyType_CheckConsistency() fails on the following assertion during Python finalization:

    ASSERT(type->tp_mro != NULL && PyTuple_Check(type->tp_mro));

Error:
---
/home/vstinner/prog/python/master/python: No module named asyncio.__main__; 'asyncio' is a package and cannot be directly executed
Objects/typeobject.c:149: _PyType_CheckConsistency: Assertion "(type->tp_mro != ((void *)0) && ((((((PyObject*)(type->tp_mro))->ob_type))->tp_flags & ((1UL << 26))) != 0))" failed
Enable tracemalloc to get the memory block allocation traceback

object  : <enum 'AddressFamily'>
type    : EnumMeta
refcount: 1
address : 0x9138f0
Fatal Python error: _PyObject_AssertFailed

Current thread 0x00007ffff7be8740 (most recent call first):
---

gdb traceback:
---
(gdb) where
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007ffff7c0f895 in __GI_abort () at abort.c:79
#2  0x000000000055bf91 in fatal_error (prefix=0x0, msg=0x68a47d "_PyObject_AssertFailed", status=-1) at Python/pylifecycle.c:2088
#3  0x000000000055bfbd in Py_FatalError (msg=0x68a47d "_PyObject_AssertFailed") at Python/pylifecycle.c:2098
#4  0x000000000047be0f in _PyObject_AssertFailed (obj=<EnumMeta() at remote 0x9138f0>, 
    expr=0x68d720 "(type->tp_mro != ((void *)0) && ((((((PyObject*)(type->tp_mro))->ob_type))->tp_flags & ((1UL << 26))) != 0))", msg=0x0, 
    file=0x68d632 "Objects/typeobject.c", line=149, function=0x690ba0 <__func__.14343> "_PyType_CheckConsistency") at Objects/object.c:2197
#5  0x000000000048ebd8 in _PyType_CheckConsistency (type=0x9138f0) at Objects/typeobject.c:149
#6  0x00000000004770a5 in _PyObject_CheckConsistency (op=<AddressFamily at remote 0x7fffea7a0870>) at Objects/object.c:35
#7  0x000000000058de3c in gc_check_object (gc=0x7fffea7a0860) at Modules/gcmodule.c:1280
#8  0x000000000058de90 in _PyGC_CheckAllObjets () at Modules/gcmodule.c:1290
#9  0x000000000058f4a0 in gc_check_object_debugger () at Modules/gcmodule.c:2007
#10 0x000000000058f7c7 in PyObject_GC_Del (op=0x7fffea7a27c0) at Modules/gcmodule.c:2109
#11 0x0000000000467188 in dict_dealloc (mp=0x7fffea7a27c0) at Objects/dictobject.c:1996
#12 0x000000000047be44 in _Py_Dealloc (op={}) at Objects/object.c:2212
#13 0x0000000000461cce in _Py_DECREF (filename=0x6855a0 "./Include/object.h", lineno=533, op={}) at ./Include/object.h:470
#14 0x0000000000461d1c in _Py_XDECREF (op={}) at ./Include/object.h:533
#15 0x0000000000462fef in free_keys_object (keys=0x9120e0) at Objects/dictobject.c:580
#16 0x000000000046284f in dictkeys_decref (dk=0x9120e0) at Objects/dictobject.c:324
#17 0x00000000004664f2 in PyDict_Clear (op={}) at Objects/dictobject.c:1722
#18 0x00000000004969b8 in type_clear (type=0x911c50) at Objects/typeobject.c:3637
#19 0x0000000000490d03 in subtype_clear (self=<EnumMeta() at remote 0x911c50>) at Objects/typeobject.c:1118
#20 0x000000000058d390 in delete_garbage (collectable=0x7fffffffcef0, old=0x7d2560 <_PyRuntime+416>) at Modules/gcmodule.c:931
#21 0x000000000058d833 in collect (generation=2, n_collected=0x0, n_uncollectable=0x0, nofail=1) at Modules/gcmodule.c:1100
#22 0x000000000058f175 in _PyGC_CollectNoFail () at Modules/gcmodule.c:1915
#23 0x000000000054759c in PyImport_Cleanup () at Python/import.c:589
#24 0x000000000055a442 in Py_FinalizeEx () at Python/pylifecycle.c:1162
#25 0x000000000055c1e5 in Py_Exit (sts=1) at Python/pylifecycle.c:2188
#26 0x00000000005677fb in handle_system_exit () at Python/pythonrun.c:642
#27 0x0000000000567821 in PyErr_PrintEx (set_sys_last_vars=1) at Python/pythonrun.c:652
#28 0x00000000005674c5 in PyErr_Print () at Python/pythonrun.c:548
#29 0x00000000004222f2 in pymain_run_module (modname=0x7db6d0 L"asyncio", set_argv0=1) at Modules/main.c:566
#30 0x0000000000422ad6 in pymain_run_python (interp=0x7da350, exitcode=0x7fffffffd254) at Modules/main.c:799
#31 0x0000000000422c6a in pymain_main (args=0x7fffffffd2a0) at Modules/main.c:877
#32 0x0000000000422d4e in _Py_UnixMain (argc=3, argv=0x7fffffffd3c8) at Modules/main.c:922
#33 0x0000000000420796 in main (argc=3, argv=0x7fffffffd3c8) at ./Programs/python.c:16
---

Maybe my assumption on tp_mro was wrong. I will remove the assertion.
msg338539 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-03-21 12:55
I'm not sure if I should include an unit test. WIP patch for that:

diff --git a/Modules/_testcapimodule.c b/Modules/_testcapimodule.c
index 350ef77163..9c0d0cf41a 100644
--- a/Modules/_testcapimodule.c
+++ b/Modules/_testcapimodule.c
@@ -4718,6 +4718,18 @@ negative_refcount(PyObject *self, PyObject *Py_UNUSED(args))
 #endif
 
 
+static PyObject *
+corrupted_object(PyObject *self, PyObject *Py_UNUSED(args))
+{
+    PyObject *obj = PyList_New(0);
+    if (obj == NULL) {
+        return NULL;
+    }
+    obj->ob_type = NULL;
+    return obj;
+}
+
+
 static PyMethodDef TestMethods[] = {
     {"raise_exception",         raise_exception,                 METH_VARARGS},
     {"raise_memoryerror",       raise_memoryerror,               METH_NOARGS},
@@ -4948,6 +4960,7 @@ static PyMethodDef TestMethods[] = {
 #ifdef Py_REF_DEBUG
     {"negative_refcount", negative_refcount, METH_NOARGS},
 #endif
+    {"corrupted_object", corrupted_object, METH_NOARGS},
     {NULL, NULL} /* sentinel */
 };
 


Tested manually using this script:
---
import gc, _testcapi, sys

gc.enable_object_debugger(1)
x = _testcapi.corrupted_object()
y = []
y = None
# Debugger should trigger here
x = None
---
msg338541 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-03-21 13:05
It is better to not use assert(foo && bar). Use instead two separate asserts: assrte(foo) and assert(bar).
msg338542 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-03-21 14:14
> It is better to not use assert(foo && bar). Use instead two separate asserts: assrte(foo) and assert(bar).

Hum, I looked at my PR and I'm not sure that I added such new assertion.

Note: "assert" on calling assert(_PyDict_CheckConsistency(mp)) is only used to remove the call in release build, but the function always return 1. The function does uses assert() or _PyObject_ASSERT() internally with a different line number and the exact failing expression.

Do you want me to enhance existing _PyDict_CheckConsistency() assertions in the same PR?
msg338603 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2019-03-22 13:32
I don't think calling APIs like _PyDict_CheckConsistency() is super useful.  Can the PR find bugs like issue33803 quickly?

I think calling tp_traverse is better.


static int
check_object(PyObject *obj, void *unused)
{
    _PyObject_ASSERT(obj, Py_REFCNT(obj) > 0);
    return 0;
}

static void
gc_check_object(PyGC_Head *gc)
{
    PyObject *op = FROM_GC(gc);
    _PyObject_ASSERT(op, Py_REFCNT(obj) > 0);
    _PyObject_ASSERT(op, _PyObject_GC_IS_TRACKED(op));
    Py_Type(op)->tp_traverse(op, (visitproc)check_object, NULL);
}
msg338612 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-03-22 15:55
> I don't think calling APIs like _PyDict_CheckConsistency() is super useful.

I looked a multiple old issues which contain "visit_decref". Most of them are really strange crashes and were closed with a message like "we don't have enough info to debug, sorry". So honestly, I'm not sure of what is the most "efficient" way to detect corrupted objects.

I guess that we need a trade-off between completeness of the checks and the performance.

gc.enable_object_debugger(1) simply makes Python completely unusable. Maybe such very bad performance makes the feature basically useless. I'm not sure at this point.

I tried to find an old bug which mentioned "visit_decref", tried to reintroduced the fixed bug, but I'm not really convinced by my experimental tests so far.

That being said, I *like* your idea of reusing tp_traverse. Not only it fits very well into the gc module (I chose to put the new feature in the gc module on purpose), but it's closer to the existing "visit_decref crash". If someone gets a crash if visit_decref() and the object debugger uses tp_traverse, object debugger *will* catch the same bug. The expectation is to be able to get it early.

--

Oh by the way, why not using lower GC thresholds? I proposed this idea, but there are multiple issues with that. It can hide the bug (objects destroyed in a different order). It can also change the behavior of the application, which is linked to my previous point (again, objects destroyed in a different order).

That's how Serhiy Storchaka proposed the design of gc.enable_object_debugger(): traverse without touching the reference counter.
https://mail.python.org/pipermail/python-dev/2018-June/153860.html

Thanks Serhiy for this nice idea ;-)
msg339910 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-04-11 00:35
I modified my PR to reuse tp_traverse.

Inada-san: would you mind to review my change?
msg339911 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-04-11 00:40
Do you think that a gc.is_object_debugger_enabled() function would be needed?

The tracemalloc module has 3 functions:

* start(), stop()
* is_tracing()

The faulthandler module has 3 functions:

* enable() / disable()
* is_enabled()
msg339916 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-04-11 02:18
bpo-33803 bug can be reintroduced using the following patch:

diff --git a/Python/hamt.c b/Python/hamt.c
index 67af04c437..67da8ec22c 100644
--- a/Python/hamt.c
+++ b/Python/hamt.c
@@ -2478,8 +2478,10 @@ hamt_alloc(void)
     if (o == NULL) {
         return NULL;
     }
+#if 0
     o->h_count = 0;
     o->h_root = NULL;
+#endif
     o->h_weakreflist = NULL;
     PyObject_GC_Track(o);
     return o;

And then run:

./python -m test -v test_context

The best would be to also be able to catch the bug in:

./python -m test -v test_asyncio

Problem: Right now, my GC object debugger implementation is way too slow to use a threshold lower than 100, whereas the bug is catched like "immediately" using gc.set_threshold(5).

Maybe my implementation should be less naive: rather than always check *all* objects tracked by the GC, have different thresholds depending on the generation? Maybe reuse GC thresholds?
msg339948 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-04-11 09:33
New changeset 2b00db68554422ec37faba2a80179a0172df6349 by Victor Stinner in branch 'master':
bpo-36389: _PyObject_IsFreed() now also detects uninitialized memory (GH-12770)
https://github.com/python/cpython/commit/2b00db68554422ec37faba2a80179a0172df6349
msg339960 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-04-11 11:01
New changeset 4c409beb4c360a73d054f37807d3daad58d1b567 by Victor Stinner in branch 'master':
bpo-36389: Change PyMem_SetupDebugHooks() constants (GH-12782)
https://github.com/python/cpython/commit/4c409beb4c360a73d054f37807d3daad58d1b567
msg339966 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-04-11 11:23
Currently, I'm using the following patch to try to detect bpo-33803 bug using my GC object debugger. It's still a work-in-progress.

diff --git a/Lib/site.py b/Lib/site.py
index ad1146332b..e184080b19 100644
--- a/Lib/site.py
+++ b/Lib/site.py
@@ -638,3 +638,8 @@ def _script():

 if __name__ == '__main__':
     _script()
+
+if 'dev' in sys._xoptions:
+    import gc
+    gc.enable_object_debugger(100)
+    #gc.set_threshold(5)
diff --git a/Python/hamt.c b/Python/hamt.c
index 67af04c437..67da8ec22c 100644
--- a/Python/hamt.c
+++ b/Python/hamt.c
@@ -2478,8 +2478,10 @@ hamt_alloc(void)
     if (o == NULL) {
         return NULL;
     }
+#if 0
     o->h_count = 0;
     o->h_root = NULL;
+#endif
     o->h_weakreflist = NULL;
     PyObject_GC_Track(o);
     return o;
msg339995 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-04-11 20:30
New changeset 9e23f0a27cb8bf6e4ea1d2aef36a91502282bbc9 by Victor Stinner in branch '3.7':
[3.7] bpo-36389: _PyObject_IsFreed() now also detects uninitialized memory (GH-12770) (GH-12788)
https://github.com/python/cpython/commit/9e23f0a27cb8bf6e4ea1d2aef36a91502282bbc9
msg340101 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-04-12 19:51
New changeset 0fc91eef34a1d9194904fa093c9fbd711af0f26c by Victor Stinner in branch 'master':
bpo-36389: Add _PyObject_CheckConsistency() function (GH-12803)
https://github.com/python/cpython/commit/0fc91eef34a1d9194904fa093c9fbd711af0f26c
msg340287 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-04-15 15:54
New changeset 0810fa79885276114d1a94e2ce61da367ebb1ffc by Victor Stinner in branch 'master':
bpo-36389: Cleanup gc.set_threshold() (GH-12844)
https://github.com/python/cpython/commit/0810fa79885276114d1a94e2ce61da367ebb1ffc
msg347497 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-07-08 14:24
I opened this issue, because I was convinced that it would be easy to implement checks faster than gc.setthreshold(), but I failed to write efficient tests which detect the bugs that I listed above.

My approach was basically to check all objects tracked by the GC every N memory allocations (PyGC_Malloc): too slow. I tried to put thresholds per generation: it was still too slow.

Maybe recent objects should be checked often, but old objects should be checked less often. For example, only check objects in generation 0 and scan new objects, and then remember the size of generation 0. At the next check, ignore objects already checked.

I failed to find time and interest to implement this approach. I abandon this issue and my PR.

In the meanwhile, gc.set_threshold(5) can be used. It isn't too slow and is quite good to find most bugs listed in this issue.
msg347498 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-07-08 14:25
Ah by the way, this issue was mostly motivated by a customer issue, but the bug disappeared from customer's production.

Moreover, Python 3.8 now allows to use debug build without having to recompile all C extensions:
https://docs.python.org/dev/whatsnew/3.8.html#debug-build-uses-the-same-abi-as-release-build

A debug build may also help to catch more bugs.
msg354073 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-10-07 10:29
Update: I added an assertion which should help to detect some kind of bugs in debug mode:

commit d91d4de31745fc1ed4c7e6c208917827c9c472b6
Author: Victor Stinner <vstinner@redhat.com>
Date:   Mon Sep 9 17:44:59 2019 +0200

    bpo-38070: visit_decref() calls _PyObject_IsFreed() (GH-15782)
    
    In debug mode, visit_decref() now calls _PyObject_IsFreed() to ensure
    that the object is not freed. If it's freed, the program fails with
    an assertion error and Python dumps informations about the freed
    object.
msg354076 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-10-07 10:56
I created bpo-38392 "Ensure that objects entering the GC are valid".
msg354117 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-10-07 16:42
New changeset 6876257eaabdb30f27ebcbd7d2557278ce2e5705 by Victor Stinner in branch 'master':
bpo-36389: _PyObject_CheckConsistency() available in release mode (GH-16612)
https://github.com/python/cpython/commit/6876257eaabdb30f27ebcbd7d2557278ce2e5705
msg354138 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-10-07 20:31
New changeset 60ec6efd96d95476fe5e38c491491add04f026e5 by Victor Stinner in branch 'master':
bpo-36389: Fix _PyBytesWriter in release mode (GH-16624)
https://github.com/python/cpython/commit/60ec6efd96d95476fe5e38c491491add04f026e5
msg354147 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-10-07 21:44
New changeset 7775349895088a7ae68cecf0c74cf817f15e2c74 by Victor Stinner in branch 'master':
bpo-36389: Add newline to _PyObject_AssertFailed() (GH-16629)
https://github.com/python/cpython/commit/7775349895088a7ae68cecf0c74cf817f15e2c74
msg354679 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-10-15 01:06
New changeset f82ce5b1b12873b65927149a016be6a7c65e979d by Victor Stinner in branch '3.8':
[3.8] bpo-36389: Backport debug enhancements from master (GH-16796)
https://github.com/python/cpython/commit/f82ce5b1b12873b65927149a016be6a7c65e979d
History
Date User Action Args
2019-10-16 01:37:51vstinnersetpull_requests: + pull_request16367
2019-10-15 01:06:20vstinnersetmessages: + msg354679
2019-10-14 22:27:11vstinnersetpull_requests: + pull_request16352
2019-10-07 21:44:08vstinnersetmessages: + msg354147
2019-10-07 21:18:36vstinnersetpull_requests: + pull_request16215
2019-10-07 20:31:46vstinnersetmessages: + msg354138
2019-10-07 19:57:16vstinnersetpull_requests: + pull_request16211
2019-10-07 16:42:03vstinnersetmessages: + msg354117
2019-10-07 10:56:20vstinnersetmessages: + msg354076
2019-10-07 10:29:26vstinnersetmessages: + msg354073
2019-10-07 10:14:01vstinnersetpull_requests: + pull_request16200
2019-07-08 14:25:44vstinnersetmessages: + msg347498
2019-07-08 14:24:04vstinnersetstatus: open -> closed
resolution: out of date
messages: + msg347497

stage: patch review -> resolved
2019-04-15 15:54:13vstinnersetmessages: + msg340287
2019-04-15 15:16:16vstinnersetpull_requests: + pull_request12769
2019-04-12 19:51:37vstinnersetmessages: + msg340101
2019-04-12 13:51:42vstinnersetpull_requests: + pull_request12730
2019-04-11 20:30:35vstinnersetmessages: + msg339995
2019-04-11 15:47:51vstinnersetpull_requests: + pull_request12715
2019-04-11 11:23:19vstinnersetmessages: + msg339966
2019-04-11 11:01:19vstinnersetmessages: + msg339960
2019-04-11 10:38:23vstinnersetpull_requests: + pull_request12709
2019-04-11 09:45:56vstinnersetpull_requests: + pull_request12707
2019-04-11 09:33:30vstinnersetmessages: + msg339948
2019-04-11 02:18:16vstinnersetmessages: + msg339916
2019-04-11 02:07:46vstinnersetpull_requests: + pull_request12699
2019-04-11 00:40:40vstinnersetmessages: + msg339911
2019-04-11 00:35:15vstinnersetmessages: + msg339910
2019-03-22 15:55:29vstinnersetmessages: + msg338612
2019-03-22 13:32:40inada.naokisetmessages: + msg338603
2019-03-21 14:14:02vstinnersetmessages: + msg338542
2019-03-21 13:05:14serhiy.storchakasetmessages: + msg338541
2019-03-21 12:55:24vstinnersetmessages: + msg338539
2019-03-21 12:53:46vstinnersetmessages: + msg338538
2019-03-21 11:14:16vstinnersetnosy: + inada.naoki
2019-03-21 11:14:00vstinnersetkeywords: + patch
stage: patch review
pull_requests: + pull_request12432
2019-03-21 11:02:41vstinnercreate