classification
Title: Ensure that objects entering the GC are valid
Type: Stage: resolved
Components: Interpreter Core Versions: Python 3.9
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: nascheme, pablogsal, scoder, serhiy.storchaka, tim.peters, twouters, vstinner
Priority: normal Keywords: patch

Created on 2019-10-07 10:48 by vstinner, last changed 2019-10-10 07:32 by vstinner. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 16615 merged vstinner, 2019-10-07 10:54
PR 16689 merged vstinner, 2019-10-10 06:50
Messages (11)
msg354074 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-10-07 10:48
A bug in Python/hamt.c was only discovered 4 months after the code was added to Python:
* https://mail.python.org/pipermail/python-dev/2018-June/153857.html
* https://bugs.python.org/issue33803

The problem was that an object was tracked by the GC, whereas calling its traverse method crashed Python... depending when the traverse method is called exactly.

Pseudo-code:

  PyObject_GC_Track(<partially initialized object>);
  ...
  obj2 = PyObject_GC_NewVar(...);
  ...
  <finish to initialize the first object>

The problem is that PyObject_GC_NewVar() can trigger a GC collection which uses the partially initialized object, and then we get a cryptic crash in the GC (usually seen in visit_decref()).

I propose to make PyObject_GC_Track() stricter: validate that the object seems to be "valid" or "fully initialized". From the GC point of view, it means that calling tp_traverse must not crash.

Attached PR is a concrete implementation.

This issue is a follow-up of my previously failed attempt bpo-36389 "Add gc.enable_object_debugger(): detect corrupted Python objects in the GC". While bpo-36389 attempts to discoverd corrupted objects once a GC collection is triggered, this issue is restricted to objects entering the GC.

First problem: my PR breaks Modules/pyexpat.c whereas the code is not strictly a bug: in Python 3.8, it's ok to put a partially initialized object into the GC if no GC collection can be triggered before the object is fully initialized. In Modules/pyexpat.c case, the code looks safe from that point of view.

It means that such change is a backward incompatible change. IMHO it's a good step forward, since it is likely to discovered hidden bugs.
msg354075 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-10-07 10:55
Pablo, Tim, Neal: what do you think of this idea?
msg354080 - (view) Author: Thomas Wouters (twouters) * (Python committer) Date: 2019-10-07 11:21
I'm pretty sure you meant nascheme, not nnorwitz.
msg354081 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-10-07 11:35
It could be possible to do this in backward compatible way. PyObject_GC_Track() could add the object to the list of new objects (all objects are already linked in bi-linked list, so it would need to just move the object to the specified list), and PyObject_GC_NewVar() could check all new objects and clear the list of new objects (just move the head).

I am not sure it is worth to do such complication.
msg354082 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-10-07 11:42
From documentation:

.. c:function:: void PyObject_GC_Track(PyObject *op)

   Adds the object *op* to the set of container objects tracked by the
   collector.  The collector can run at unexpected times so objects must be
   valid while being tracked.  This should be called once all the fields
   followed by the :c:member:`~PyTypeObject.tp_traverse` handler become valid, usually near the
   end of the constructor.

So there was a bug in Modules/pyexpat.c.
msg354083 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-10-07 11:57
> I'm pretty sure you meant nascheme, not nnorwitz.

Oops, right :-)
msg354087 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-10-07 12:06
> It could be possible to do this in backward compatible way. PyObject_GC_Track() could add the object to the list of new objects (all objects are already linked in bi-linked list, so it would need to just move the object to the specified list), and PyObject_GC_NewVar() could check all new objects and clear the list of new objects (just move the head).

That's different than what I'm proposing here. Checking "later" is more on the bpo-36389 side: my bpo-36389 PR modified PyObject_GC_NewVar() to check that young object are valid.

Moreover, my bpo-36389 PR required threshold: checking all young objects at each PyObject_GC_NewVar() simply killed performance which made the whole approach unusable in practice. Because of the threshold, my implementation failed to detect the hamt.c bug bpo-33803: the partially initialized was already fully initialized when my "object debugger" ran. PR 16615 detects immediately the hamt.c bug bpo-33803.

I designed bpo-36389 to attempt to detect when an object is corrupted after its creation, corrupted by code running "later".

We had a customer with a bug in OpenStack Nova. Nova crashed in visit_decref() but I failed to debug the issue. It motivated me to modify the debug ABI in Python 3.8, so later it will become possible to switch from release to debug mode to attempt to debug a program using additional checks added by the debug build.

Here my intent is to ensure that objects entering the GC are valid to help to keep the GC list consistent and valid. But it doesn't prevent an object from being corrupted after its creation. We should try to find a different approach for that (maybe revisit bpo-36389).
msg354148 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-10-07 22:09
New changeset 1b1845569539db5c1a6948a5d32daea381f1e35f by Victor Stinner in branch 'master':
bpo-38392: PyObject_GC_Track() validates object in debug mode (GH-16615)
https://github.com/python/cpython/commit/1b1845569539db5c1a6948a5d32daea381f1e35f
msg354149 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-10-07 22:26
Ok, I pushed a change. Let's see how third party projects like it :-) We still have time to revert it if it causes too many damage.
msg354173 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-10-08 07:10
I am sure this will break third-party projects. But I thing that we should not revert this change, as it completely matches the documentation. It is good that it was applied so early at development cycle.

Stefan, could you please test Cython and lxml with this?
msg354318 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-10-10 07:32
New changeset a5447735c334a041ee2ffdeb5c7e13d7d4502ea2 by Victor Stinner in branch 'master':
bpo-38392: Only declare visit_validate() if Py_DEBUG is defined (GH-16689)
https://github.com/python/cpython/commit/a5447735c334a041ee2ffdeb5c7e13d7d4502ea2
History
Date User Action Args
2019-10-10 07:32:19vstinnersetmessages: + msg354318
2019-10-10 06:50:59vstinnersetpull_requests: + pull_request16275
2019-10-08 07:10:02serhiy.storchakasetnosy: + scoder
messages: + msg354173
2019-10-07 22:26:18vstinnersetstatus: open -> closed
resolution: fixed
messages: + msg354149

stage: resolved
2019-10-07 22:09:34vstinnersetmessages: + msg354148
2019-10-07 12:06:07vstinnersetmessages: + msg354087
2019-10-07 11:57:20vstinnersetmessages: + msg354083
2019-10-07 11:42:08serhiy.storchakasetmessages: + msg354082
2019-10-07 11:35:15serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg354081
2019-10-07 11:21:36twouterssetnosy: + twouters, nascheme, - nnorwitz
messages: + msg354080
2019-10-07 10:55:45vstinnersetnosy: + tim.peters, nnorwitz, pablogsal

messages: + msg354075
stage: patch review -> (no value)
2019-10-07 10:54:28vstinnersetkeywords: + patch
stage: patch review
pull_requests: + pull_request16204
2019-10-07 10:48:58vstinnercreate