Segfault with deep recursion into object().__dir__ #78111
Comments
The following small snippet of code will crash 3.6.5 with a segfault. The crash occurs at cleanup, so it won't happen in the interactive interpreter (until you exit).

o = object()
for x in range(1000000):
    o = o.__dir__
    print(x, id(o.__dir__))
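For anyone who wants to poke at this without killing their shell, here is a small harness (a sketch, not from the thread) that runs the reproducer in a child interpreter, so a shutdown-time segfault can't take down the parent session. The assumption is that the chain is deep enough to overflow the C stack on affected builds; on interpreters with the fix the child exits cleanly.

```python
# Hypothetical harness: run the reproducer in a subprocess so a segfault
# at interpreter shutdown only kills the child, not the parent session.
import subprocess
import sys

REPRODUCER = """
o = object()
for _ in range(200000):
    o = o.__dir__
"""

proc = subprocess.run([sys.executable, "-c", REPRODUCER])
print("child exit code:", proc.returncode)
```

On a fixed interpreter the exit code is 0; on affected builds the child dies with SIGSEGV (a negative return code on POSIX).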
Serhiy, is this one of the cases we decided to leave as won't fix?
The crash is likely caused by recursion during the clean-up of the object().__dir__.__dir__.... chain. The trashcan API (see Py_TRASHCAN_SAFE_BEGIN/Py_TRASHCAN_SAFE_END in Include/object.h) can help here. Those macros do some real work and have a cost; I can't predict whether that cost is substantial enough to affect benchmarks (method objects are destroyed a lot).
The trashcan mechanism is used for slot wrappers. The following code doesn't crash:

o = object()
for x in range(1000000):
    o = o.__hash__
Alistair, how did you happen upon this case?
C traceback in gdb:

Program received signal SIGSEGV, Segmentation fault.
I found this while chatting on IRC about the implementation details of objects - specifically, since objects have a dict of attributes, and dict subclasses object, it seems like every dict will have another dict inside it. (Eventually someone explained that a base object() doesn't have any state at all, so a dict doesn't have a dict of attributes, and there is no cycle.) Anyway, as I was trying to explain my question to the other users, I had intended to post this snippet and ask "why doesn't this crash Python?" Ironically, it turned out that it does crash Python.
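As an aside, the chain can be torn down safely even on affected interpreters by releasing it one bound method at a time through __self__, so no single Py_DECREF ever triggers a deep dealloc cascade. This is a sketch of that idea, not something proposed in the thread:

```python
# Sketch: build the same chain, then release it link by link via __self__.
# Each builtin method's __self__ is the object it was looked up on, so
# stepping inward drops exactly one outer method per iteration, and its
# dealloc only decrefs a reference we still hold -- no recursion.
o = object()
for _ in range(100000):
    o = o.__dir__          # each step wraps the previous object

while hasattr(o, "__self__"):
    o = o.__self__         # peel off the outermost method

print(type(o))  # <class 'object'> -- back to the original instance
```

By the time the loop ends, every method object in the chain has already been freed shallowly, so interpreter shutdown has nothing deep left to deallocate.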
@victor: thanks for confirming my analysis. The trashcan API does fix this issue, although I did have to change "_PyObject_GC_UNTRACK(m);" to "PyObject_GC_UnTrack(m);" as well. I haven't checked yet why that's necessary.
ronaldoussoren, I have tested (Debian 9) substituting "_PyObject_GC_UNTRACK(m);" with "PyObject_GC_UnTrack(m);", and although the tests pass, I can still see the segfault error.

diff --git a/Objects/methodobject.c b/Objects/methodobject.c
index cfea8cf410..cb8e279c7d 100644
--- a/Objects/methodobject.c
+++ b/Objects/methodobject.c
@@ -84,7 +84,7 @@ PyCFunction_GetFlags(PyObject *op)
 static void
 meth_dealloc(PyCFunctionObject *m)
 {
-    _PyObject_GC_UNTRACK(m);
+    PyObject_GC_UnTrack(m);
     if (m->m_weakreflist != NULL) {
         PyObject_ClearWeakRefs((PyObject*) m);
     }
Sorry, I forgot to include the trashcan API. I'll double check and update the issue.
After a couple of rounds, this is the benchmark result I got for the attached patch. It seems to be a simple, non-performance-degrading fix.

python -m perf compare_to origin.json patch.json -G --min-speed=2

Faster (9):
Benchmark hidden because not significant (47): 2to3, chaos, crypto_pyaes, deltablue, dulwich_log, go, html5lib, logging_silent, logging_simple, mako, meteor_contest, nbody, nqueens, pathlib, pickle, pickle_dict, pickle_list, pickle_pure_python, pidigits, python_startup, python_startup_no_site, regex_compile, regex_dna, regex_v8, richards, scimark_fft, scimark_lu, scimark_monte_carlo, scimark_sor, scimark_sparse_mat_mult, spectral_norm, sqlalchemy_imperative, sqlite_synth, sympy_expand, sympy_integrate, sympy_sum, sympy_str, telco, tornado_http, unpack_sequence, unpickle, unpickle_list, unpickle_pure_python, xml_etree_parse, xml_etree_iterparse, xml_etree_generate, xml_etree_process
Can someone please explain to me why this change has any impact? Does it mean that the _PyObject_GC_UNTRACK() macro has a bug? The function is quite straightforward :-)

void
PyObject_GC_Track(void *op)
{
    _PyObject_GC_TRACK(op);
}
I don’t understand this either. For some reason the patch works when using the function and not when using the macro. I haven’t looked into this yet, and am currently traveling. I should have time to look into this starting December 5th when I’m back home.
It is hard to get a noticeable difference in macrobenchmarks. I want to see results of microbenchmarks. For tuples, the trashcan mechanism is the main culprit of the performance hit when using short-lived tuples for passing arguments. For bound method objects the relative effect should be smaller, because looking up the method is more costly than building a tuple, and fewer bound method objects are created after implementing the method call optimization. But in any case we should know the maximal cost of this change before merging it.
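A rough pure-stdlib approximation of the kind of microbenchmark being asked for here (using timeit rather than pyperf; the expression and numbers are illustrative, not from the thread):

```python
# Time bound-method creation/destruction, the path meth_dealloc sits on.
# Evaluating 'o.encode' creates a builtin method object and the statement
# immediately discards it, so each loop iteration exercises meth_dealloc.
import timeit

n = 1_000_000
seconds = timeit.timeit("o.encode", setup="o = 'foo'", number=n)
print(f"{seconds / n * 1e9:.1f} ns per create/destroy")
```

Running this on both an unpatched and a patched build and comparing the per-iteration times gives a direct upper bound on the trashcan overhead, without macrobenchmark noise.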
Interestingly, this lonely and perplexing comment is left in PyObject_GC_UnTrack (the function):

void
PyObject_GC_UnTrack(void *op)
{
    /* Obscure: the Py_TRASHCAN mechanism requires that we be able to
     * call PyObject_GC_UnTrack twice on an object.
     */
    if (_PyObject_GC_IS_TRACKED(op)) {
        _PyObject_GC_UNTRACK(op);
    }
}
microbench:

$ ./python.patched -m perf timeit --compare-to ./python.master -s 'o="foo"' 'o.encode'
python.master: ..................... 29.3 ns +- 0.6 ns
python.patched: ..................... 30.9 ns +- 1.3 ns

Mean +- std dev: [python.master] 29.3 ns +- 0.6 ns -> [python.patched] 30.9 ns +- 1.3 ns: 1.05x slower (+5%)

And this is the perf report of python.patched:

33.17% _PyEval_EvalFrameDefault
Thank you, Inada. There is a large overhead for iterating. Try adding --duplicate=1000.
$ ./python.patched -m perf timeit --compare-to ./python.master -s 'o="foo"' 'o.encode' --duplicate=10000
python.master: ..................... 23.1 ns +- 0.5 ns
python.patched: ..................... 24.6 ns +- 0.5 ns

Mean +- std dev: [python.master] 23.1 ns +- 0.5 ns -> [python.patched] 24.6 ns +- 0.5 ns: 1.06x slower (+6%)

$ perf record ./python.patched -m perf timeit -s 'o="foo"' 'o.encode' --duplicate=10000
Couldn't record kernel reference relocation symbol
Symbol resolution may be skewed if relocation was used (e.g. kexec).
Check /proc/kallsyms permission or run as root.
.....................
Mean +- std dev: 24.9 ns +- 0.7 ns
[ perf record: Woken up 9 times to write data ]
[kernel.kallsyms] with build id debb7f6ce8aad7263cc6564650a88db5c8003389 not found, continuing without symbols
[ perf record: Captured and wrote 2.491 MB perf.data (61787 samples) ]
$ perf report | cat
(snip)
# Total Lost Samples: 0
#
# Samples: 61K of event 'cycles:ppp'
# Event count (approx.): 52880893198
#
# Overhead Command Shared Object Symbol
# ........ .............. ................................................ ..........................................
#
18.78% python.patched python.patched [.] _PyEval_EvalFrameDefault
14.85% python.patched python.patched [.] _PyObject_GenericGetAttrWithDict
9.93% python.patched python.patched [.] PyCFunction_NewEx
8.56% python.patched python.patched [.] _PyType_Lookup
6.61% python.patched python.patched [.] meth_dealloc
4.12% python.patched python.patched [.] PyParser_AddToken
3.36% python.patched python.patched [.] PyObject_GetAttr
3.33% python.patched python.patched [.] PyObject_GC_UnTrack
3.25% python.patched python.patched [.] method_get
2.92% python.patched python.patched [.] _Py_bytes_contains
2.13% python.patched python.patched [.] freechildren
1.41% python.patched python.patched [.] pymalloc_alloc.isra.6.part.7
I've reproduced the segfault on 3.11 on a Mac.
Correctness matters more than 1 nanosecond of performance. IMO a fix like the attached bpo-33930.patch looks like the right approach to fix the crash.
I'm thinking that the explicit call to PyObject_GC_UnTrack should be made unnecessary by integrating it into the trashcan code. That way, we avoid someone else running into this kind of bug in the future. See bpo-44881.
Thanks for the fix, Augusto Goulart and Pablo! I'm closing the issue. Pablo asked in PR 27678 whether the Python 3.8 Release Manager (Lukasz) wants a backport to 3.8.
Apparently we've got ourselves refleaks with the patches. It doesn't always reproduce, but put test_exceptions -R: in a loop on 3.10 and soon enough you'll get:

0:00:00 load avg: 4.47 Run tests sequentially
== Tests result: FAILURE ==
1 test failed:

It's much harder to reproduce on 3.9 and 3.8 because test_exceptions has fewer tests and, at least on my MBP, finishes twice as fast. I ran 3.9 in a loop for an hour and got no refleak.
Interestingly, I can't reproduce the refleaks on
I gotta run for the day. The shortest way for me to reproduce is to focus on the tests that were added in 3.10 and to shorten the repetitions a little:

$ time ./python.exe -E -Wd -m test test_exceptions -R3:3 -m 'test_attribute*' -m 'test_getattr*' -m 'test_lineno*' -m 'test_name_error*' -m 'test_recursion*' -m 'test_no_hang*'
0:00:00 load avg: 5.92 Run tests sequentially
== Tests result: FAILURE ==
1 test failed:
1 re-run test:
Total duration: 11.8 sec
(BTW, with this command, it is indeed possible to trigger the ref leak on
It seems to have been introduced by… I don't see how the code change could have introduced a leak. Maybe the new tests exposed a pre-existing leak?
Entirely possible, Irit. We had just this sort of case last week! See: https://bugs.python.org/issue44524#msg399157
I've narrowed it down a bit more and created bpo-44895.
Since the refleak investigation moved to its own issue, I'm re-closing this one for clarity. |