Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segfault with deep recursion into object().__dir__ #78111

Closed
a-j-buxton mannequin opened this issue Jun 21, 2018 · 38 comments
Closed

Segfault with deep recursion into object().__dir__ #78111

a-j-buxton mannequin opened this issue Jun 21, 2018 · 38 comments
Labels
3.9 only security fixes 3.10 only security fixes 3.11 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) type-crash A hard crash of the interpreter, possibly with a core dump

Comments

@a-j-buxton
Copy link
Mannequin

a-j-buxton mannequin commented Jun 21, 2018

BPO 33930
Nosy @nascheme, @rhettinger, @terryjreedy, @ronaldoussoren, @vstinner, @taleinat, @benjaminp, @methane, @ambv, @serhiy-storchaka, @pablogsal, @miss-islington, @augustogoulart, @iritkatriel
PRs
  • bpo-33930: Fix segfault with deep recursion when cleaning method objects #27678
  • [3.10] bpo-33930: Fix segfault with deep recursion when cleaning method objects (GH-27678) #27719
  • [3.9] bpo-33930: Fix segfault with deep recursion when cleaning method objects (GH-27678) #27720
  • [3.8] bpo-33930: Fix segfault with deep recursion when cleaning method objects (GH-27678) #27721
  • bpo-33930: Fix typo in the test name. #27733
  • [3.9] bpo-33930: Fix typo in the test name. (GH-27733) #27734
  • [3.8] bpo-33930: Fix typo in the test name. (GH-27733) #27735
  • [3.10] bpo-33930: Fix typo in the test name. #27736
  • Files
  • bpo-33930.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2021-08-13.09:44:37.226>
    created_at = <Date 2018-06-21.11:27:57.218>
    labels = ['interpreter-core', '3.10', '3.9', 'type-crash', '3.11']
    title = 'Segfault with deep recursion into object().__dir__'
    updated_at = <Date 2021-08-13.09:45:17.009>
    user = 'https://bugs.python.org/a-j-buxton'

    bugs.python.org fields:

    activity = <Date 2021-08-13.09:45:17.009>
    actor = 'miss-islington'
    assignee = 'none'
    closed = True
    closed_date = <Date 2021-08-13.09:44:37.226>
    closer = 'lukasz.langa'
    components = ['Interpreter Core']
    creation = <Date 2018-06-21.11:27:57.218>
    creator = 'a-j-buxton'
    dependencies = []
    files = ['47952']
    hgrepos = []
    issue_num = 33930
    keywords = ['patch']
    message_count = 38.0
    messages = ['320167', '320263', '320413', '320418', '320458', '320478', '320504', '320509', '329839', '329846', '330516', '330520', '330563', '330579', '330635', '330776', '330780', '330785', '399125', '399265', '399357', '399366', '399373', '399374', '399375', '399402', '399403', '399404', '399408', '399409', '399411', '399421', '399425', '399429', '399430', '399431', '399511', '399512']
    nosy_count = 15.0
    nosy_names = ['nascheme', 'rhettinger', 'terry.reedy', 'ronaldoussoren', 'vstinner', 'taleinat', 'benjamin.peterson', 'methane', 'lukasz.langa', 'serhiy.storchaka', 'pablogsal', 'miss-islington', 'augustogoulart', 'a-j-buxton', 'iritkatriel']
    pr_nums = ['27678', '27719', '27720', '27721', '27733', '27734', '27735', '27736']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'crash'
    url = 'https://bugs.python.org/issue33930'
    versions = ['Python 3.9', 'Python 3.10', 'Python 3.11']

    @a-j-buxton
    Copy link
    Mannequin Author

    a-j-buxton mannequin commented Jun 21, 2018

    The following small snippet of code will crash 3.6.5 with a segfault.

    The crash occurs at cleanup, so it won't happen in the interactive interpreter (until you exit).

    # --- code ---

    o = object()
    for x in range(1000000):
        o = o.__dir__
        print(x, id(o.__dir__))

    # --- end code ---

    @a-j-buxton a-j-buxton mannequin added interpreter-core (Objects, Python, Grammar, and Parser dirs) type-crash A hard crash of the interpreter, possibly with a core dump labels Jun 21, 2018
    @terryjreedy
    Copy link
    Member

    Serhiy, is this one of the cases we decided to leave as won't fix?

    @ronaldoussoren
    Copy link
    Contributor

    The crash is likely caused by recursion during the clean-up of the object().__dir__.__dir__.... chain.

    The trashcan API (see Py_TRASHCAN_SAFE_BEGIN/Py_TRASHCAN_SAFE_END in Include/object.h) can help here.

    Those do some real work and have a cost, I can't predict if that cost is substational enough to affect benchmarks (method objects are destroyed a lot).

    @serhiy-storchaka
    Copy link
    Member

    The trashcan mechanism is used for slot wrappers. The following code doesn't crash:

    o = object()
    for x in range(1000000):
        o = o.__hash__

    @rhettinger
    Copy link
    Contributor

    Alistair, how did you happen upon this case?

    @vstinner
    Copy link
    Member

    C traceback in gdb:

    Program received signal SIGSEGV, Segmentation fault.
    0x00000000004812a8 in meth_dealloc (
    m=<error reading variable: Cannot access memory at address 0x7fffff7fefd8>)
    at Objects/methodobject.c:86
    86 {
    (gdb) where
    #0 0x00000000004812a8 in meth_dealloc (
    m=<error reading variable: Cannot access memory at address 0x7fffff7fefd8>)
    at Objects/methodobject.c:86
    #1 0x0000000000488d9f in _Py_Dealloc (
    op=<built-in method __dir__ of builtin_function_or_method object at remote 0x7fffe3973490>) at Objects/object.c:1931
    #2 0x00000000004813c4 in meth_dealloc (m=0x7fffe3973580)
    at Objects/methodobject.c:91
    #3 0x0000000000488d9f in _Py_Dealloc (
    op=<built-in method __dir__ of builtin_function_or_method object at remote 0x7fffe3973508>) at Objects/object.c:1931
    #4 0x00000000004813c4 in meth_dealloc (m=0x7fffe39735f8)
    at Objects/methodobject.c:91
    #5 0x0000000000488d9f in _Py_Dealloc (
    op=<built-in method __dir__ of builtin_function_or_method object at remote 0x7fffe3973580>) at Objects/object.c:1931
    #6 0x00000000004813c4 in meth_dealloc (m=0x7fffe3973670)
    at Objects/methodobject.c:91
    #7 0x0000000000488d9f in _Py_Dealloc (
    op=<built-in method __dir__ of builtin_function_or_method object at remote 0x7fffe39735f8>) at Objects/object.c:1931
    #8 0x00000000004813c4 in meth_dealloc (m=0x7fffe39736e8)
    at Objects/methodobject.c:91
    (...)

    @a-j-buxton
    Copy link
    Mannequin Author

    a-j-buxton mannequin commented Jun 26, 2018

    I found this while chatting on IRC about the implementation details of objects - specifically, since objects have a dict of attributes, and dict subclasses object, it seems like every dict will have another dict inside it. (Eventually someone explained that base object() doesn't have any state at all, as a result dict doesn't have a dict of attributes, and there is no cycle.)

    Anyway, as I was trying to explain my question to the other users, I had intended to post this snippet and ask "why doesn't this crash python?" Ironically it turned out that it does crash python.

    @ronaldoussoren
    Copy link
    Contributor

    @victor: thanks for confirming my analysis.

    The trashcan API does fix this issue, although I did have to change "_PyObject_GC_UNTRACK(m);" to "PyObject_GC_UnTrack(m);" as well. I haven't checked yet why that's necessary.

    @augustogoulart
    Copy link
    Mannequin

    augustogoulart mannequin commented Nov 13, 2018

    ronaldoussoren, I have tested (Debian 9) substituting "_PyObject_GC_UNTRACK(m);" to "PyObject_GC_UnTrack(m);", and although the tests pass, I'm can still see the segfault error.

    diff --git a/Objects/methodobject.c b/Objects/methodobject.c
    index cfea8cf410..cb8e279c7d 100644
    --- a/Objects/methodobject.c
    +++ b/Objects/methodobject.c
    @@ -84,7 +84,7 @@ PyCFunction_GetFlags(PyObject *op)
     static void
     meth_dealloc(PyCFunctionObject *m)
     {
    -    _PyObject_GC_UNTRACK(m);
    +    PyObject_GC_UnTrack(m);
         if (m->m_weakreflist != NULL) {
             PyObject_ClearWeakRefs((PyObject*) m);
         }

    @augustogoulart
    Copy link
    Mannequin

    augustogoulart mannequin commented Nov 13, 2018

    Sorry, I forgot to include the trashcan API. I'll double check and update the issue.

    @augustogoulart
    Copy link
    Mannequin

    augustogoulart mannequin commented Nov 27, 2018

    After a couple of rounds, this is the benchmark result I got for the attached patch. It seems to be a simple, non-performance degrading fix.

    python -m perf compare_to origin.json patch.json -G --min-speed=2
    Slower (1):

    • sqlalchemy_declarative: 213 ms +- 4 ms -> 219 ms +- 9 ms: 1.03x slower (+3%)

    Faster (9):

    • logging_format: 15.1 us +- 0.3 us -> 14.5 us +- 0.3 us: 1.05x faster (-5%)
    • hexiom: 15.0 ms +- 0.5 ms -> 14.5 ms +- 0.2 ms: 1.03x faster (-3%)
    • json_loads: 39.6 us +- 0.7 us -> 38.5 us +- 0.5 us: 1.03x faster (-3%)
    • json_dumps: 17.2 ms +- 0.4 ms -> 16.7 ms +- 0.2 ms: 1.03x faster (-3%)
    • fannkuch: 634 ms +- 4 ms -> 619 ms +- 6 ms: 1.02x faster (-2%)
    • django_template: 183 ms +- 3 ms -> 178 ms +- 5 ms: 1.02x faster (-2%)
    • float: 153 ms +- 2 ms -> 150 ms +- 1 ms: 1.02x faster (-2%)
    • raytrace: 761 ms +- 19 ms -> 745 ms +- 8 ms: 1.02x faster (-2%)
    • regex_effbot: 3.69 ms +- 0.06 ms -> 3.61 ms +- 0.05 ms: 1.02x faster (-2%)

    Benchmark hidden because not significant (47): 2to3, chaos, crypto_pyaes, deltablue, dulwich_log, go, html5lib, logging_silent, logging_simple, mako, meteor_contest, nbody, nqueens, pathlib, pickle, pickle_dict, pickle_list, pickle_pure_python, pidigits, python_startup, python_startup_no_site, regex_compile, regex_dna, regex_v8, richards, scimark_fft, scimark_lu, scimark_monte_carlo, scimark_sor, scimark_sparse_mat_mult, spectral_norm, sqlalchemy_imperative, sqlite_synth, sympy_expand, sympy_integrate, sympy_sum, sympy_str, telco, tornado_http, unpack_sequence, unpickle, unpickle_list, unpickle_pure_python, xml_etree_parse, xml_etree_iterparse, xml_etree_generate, xml_etree_process

    @vstinner
    Copy link
    Member

    • _PyObject_GC_UNTRACK(m);
      + PyObject_GC_UnTrack(m);

    Can someone please explain me why this change has any impact? Does it mean that _PyObject_GC_UNTRACK() macro has a bug?

    The function is quite straighforward :-)

    void
    PyObject_GC_Track(void *op)
    {
        _PyObject_GC_TRACK(op);
    }

    @ronaldoussoren
    Copy link
    Contributor

    I don’t understand this either. For some reason the patch works when using the function and not when using the macro.

    I haven’t looked into this yet, and am currently traveling. I should have time to look into this starting December 5th when I’m back home.

    @serhiy-storchaka
    Copy link
    Member

    It is hard to get a noticeable difference in macrobenchmarks. I want to see results of microbenchmarks.

    For tuples, the trashcan mechanism is the main culprit of the performance hit when use short-living tuples for passing arguments. For bound method objects the relative effect should be smaller, because looking up the method is more costly than building a tuple, and less bound method objects are created after implementing the method call optimization. But in any case we should know the maximal cost of this change before merging it.

    @pablogsal
    Copy link
    Member

    Can someone please explain me why this change has any impact? Does it mean that _PyObject_GC_UNTRACK() macro has a bug?

    Interestingly, this lonely and perplexing comment is left in PyObject_GC_UnTrack (the function):

    void
    PyObject_GC_UnTrack(void *op)
    {
         /* Obscure:  the Py_TRASHCAN mechanism requires that we be able to
          * call PyObject_GC_UnTrack twice on an object.
          */
      if (_PyObject_GC_IS_TRACKED(op)) {
          _PyObject_GC_UNTRACK(op);
      }
    }

    @methane
    Copy link
    Member

    methane commented Nov 30, 2018

    microbench:

    $ ./python.patched -m perf timeit --compare-to ./python.master -s 'o="foo"' 'o.encode'
    python.master: ..................... 29.3 ns +- 0.6 ns
    python.patched: ..................... 30.9 ns +- 1.3 ns

    Mean +- std dev: [python.master] 29.3 ns +- 0.6 ns -> [python.patched] 30.9 ns +- 1.3 ns: 1.05x slower (+5%)

    And this is perf report of python.patched:

    33.17% _PyEval_EvalFrameDefault
    15.25% _PyObject_GenericGetAttrWithDict
    9.63% PyCFunction_NewEx
    8.25% _PyType_Lookup
    6.42% meth_dealloc
    3.50% _Py_bytes_contains
    3.22% PyObject_GC_UnTrack
    3.20% method_get
    2.73% PyObject_GetAttr
    2.55% repeat_next
    0.88% _Py_Dealloc
    0.81% PyObject_GenericGetAttr
    0.78% sre_ucs1_match
    0.50% lookdict_unicode_nodummy

    @serhiy-storchaka
    Copy link
    Member

    Thank you Inada. There is a large overhead for iterating. Try to add --duplicate=1000.

    @methane
    Copy link
    Member

    methane commented Nov 30, 2018

    $ ./python.patched -m perf timeit --compare-to ./python.master -s 'o="foo"' 'o.encode' --duplicate=10000
    python.master: ..................... 23.1 ns +- 0.5 ns
    python.patched: ..................... 24.6 ns +- 0.5 ns

    Mean +- std dev: [python.master] 23.1 ns +- 0.5 ns -> [python.patched] 24.6 ns +- 0.5 ns: 1.06x slower (+6%)

    $ perf record ./python.patched -m perf timeit -s 'o="foo"' 'o.encode' --duplicate=10000
    Couldn't record kernel reference relocation symbol
    Symbol resolution may be skewed if relocation was used (e.g. kexec).
    Check /proc/kallsyms permission or run as root.
    .....................
    Mean +- std dev: 24.9 ns +- 0.7 ns
    [ perf record: Woken up 9 times to write data ]
    [kernel.kallsyms] with build id debb7f6ce8aad7263cc6564650a88db5c8003389 not found, continuing without symbols
    [ perf record: Captured and wrote 2.491 MB perf.data (61787 samples) ]
    
    $ perf report | cat
    (snip)
    # Total Lost Samples: 0
    #
    # Samples: 61K of event 'cycles:ppp'
    # Event count (approx.): 52880893198
    #
    # Overhead  Command         Shared Object                                     Symbol
    # ........  ..............  ................................................  ..........................................
    #
        18.78%  python.patched  python.patched                                    [.] _PyEval_EvalFrameDefault
        14.85%  python.patched  python.patched                                    [.] _PyObject_GenericGetAttrWithDict
         9.93%  python.patched  python.patched                                    [.] PyCFunction_NewEx
         8.56%  python.patched  python.patched                                    [.] _PyType_Lookup
         6.61%  python.patched  python.patched                                    [.] meth_dealloc
         4.12%  python.patched  python.patched                                    [.] PyParser_AddToken
         3.36%  python.patched  python.patched                                    [.] PyObject_GetAttr
         3.33%  python.patched  python.patched                                    [.] PyObject_GC_UnTrack
         3.25%  python.patched  python.patched                                    [.] method_get
         2.92%  python.patched  python.patched                                    [.] _Py_bytes_contains
         2.13%  python.patched  python.patched                                    [.] freechildren
         1.41%  python.patched  python.patched                                    [.] pymalloc_alloc.isra.6.part.7

    @iritkatriel
    Copy link
    Member

    I've reproduced the segfault on 3.11 on a Mac.

    @iritkatriel iritkatriel added the 3.11 only security fixes label Aug 6, 2021
    @vstinner
    Copy link
    Member

    vstinner commented Aug 9, 2021

    Correctness matters more than performance for 1 nanosecond. IMO a fix like attached bpo-33930.patch looks like the good approach to fix the crash.

    @nascheme
    Copy link
    Member

    I'm thinking that the explicit call to PyObject_GC_UnTrack should be made unnecessary by integrating it into the trashcan code. That way, we avoid someone else running into this kind of bug in the future. See bpo-44881.

    @pablogsal
    Copy link
    Member

    New changeset bfc2d5a by Pablo Galindo Salgado in branch 'main':
    bpo-33930: Fix segfault with deep recursion when cleaning method objects (GH-27678)
    bfc2d5a

    @vstinner
    Copy link
    Member

    New changeset 15f0a45 by Miss Islington (bot) in branch '3.9':
    bpo-33930: Fix segfault with deep recursion when cleaning method objects (GH-27678) (GH-27720)
    15f0a45

    @vstinner
    Copy link
    Member

    New changeset d6d2d54 by Miss Islington (bot) in branch '3.10':
    bpo-33930: Fix segfault with deep recursion when cleaning method objects (GH-27678) (GH-27719)
    d6d2d54

    @vstinner
    Copy link
    Member

    Thanks for the fix Augusto Goulart and Pablo! I close the issue.

    Pablo asked in PR 27678 if a Python 3.8 Release Manager (Lukasz) wants a backport to 3.8.

    @ambv
    Copy link
    Contributor

    ambv commented Aug 11, 2021

    New changeset 09d814b by Łukasz Langa in branch '3.8':
    [3.8] bpo-33930: Fix segfault with deep recursion when cleaning method objects (GH-27678) (GH-27721)
    09d814b

    @ambv
    Copy link
    Contributor

    ambv commented Aug 11, 2021

    Apparently we've got ourselves refleaks with the patches.

    It doesn't always reproduce but put test_exceptions -R: in a loop in 3.10 and soon enough you'll get:

    0:00:00 load avg: 4.47 Run tests sequentially
    0:00:00 load avg: 4.47 [1/1] test_exceptions
    beginning 9 repetitions
    123456789
    .........
    test_exceptions leaked [2, 2, 2, 2] references, sum=8
    test_exceptions leaked [2, 2, 2, 2] memory blocks, sum=8
    test_exceptions failed (reference leak) in 59.6 sec

    == Tests result: FAILURE ==

    1 test failed:
    test_exceptions

    It's much harder to reproduce on 3.9 and 3.8 because test_exceptions has fewer tests and at least on my MBP finishes twice as fast. I ran 3.9 in a loop for an hour and no refleak.

    @ambv ambv added 3.9 only security fixes 3.10 only security fixes labels Aug 11, 2021
    @ambv ambv reopened this Aug 11, 2021
    @ambv
    Copy link
    Contributor

    ambv commented Aug 11, 2021

    Interestingly, I can't reproduce the refleaks on main either, only on 3.10.

    @ambv
    Copy link
    Contributor

    ambv commented Aug 11, 2021

    I gotta run for the day. The shortest way for me to reproduce is to focus on the tests that were added in 3.10 and shortening the repetitions a little:

    $ time ./python.exe -E -Wd -m test test_exceptions -R3:3 -m 'test_attribute*' -m 'test_getattr*' -m 'test_lineno*' -m 'test_name_error*' -m 'test_recursion*' -m 'test_no_hang*'

    0:00:00 load avg: 5.92 Run tests sequentially
    0:00:00 load avg: 5.92 [1/1] test_exceptions
    beginning 6 repetitions
    123456
    ......
    test_exceptions leaked [2, 2, 2] references, sum=6
    test_exceptions leaked [2, 2, 2] memory blocks, sum=6
    test_exceptions failed (reference leak)

    == Tests result: FAILURE ==

    1 test failed:
    test_exceptions

    1 re-run test:
    test_exceptions

    Total duration: 11.8 sec
    Tests result: FAILURE
    12.13 real 11.44 user 0.61 sys

    @ambv
    Copy link
    Contributor

    ambv commented Aug 11, 2021

    (BTW, with this command, it is indeed possible to trigger the ref leak on main as well, which makes it somewhat less scary.)

    @iritkatriel
    Copy link
    Member

    It seems to have been introduced by
    #27626

    I don't see how the code change could have introduced a leak. Maybe the new tests exposed a pre-existing leak?

    @ambv
    Copy link
    Contributor

    ambv commented Aug 11, 2021

    Entirely possible, Irit. We've had just this sort of case just last week! See: https://bugs.python.org/issue44524#msg399157

    @iritkatriel
    Copy link
    Member

    I've narrowed it down a bit more and created bpo-44895.

    @benjaminp
    Copy link
    Contributor

    New changeset f08e6d1 by Benjamin Peterson in branch 'main':
    bpo-33930: Fix typo in the test name. (bpo-27733)
    f08e6d1

    @benjaminp
    Copy link
    Contributor

    New changeset d7f5796 by Miss Islington (bot) in branch '3.8':
    bpo-33930: Fix typo in the test name. (GH-27735)
    d7f5796

    @miss-islington
    Copy link
    Contributor

    New changeset f7635f0 by Miss Islington (bot) in branch '3.9':
    [3.9] bpo-33930: Fix typo in the test name. (GH-27733) (GH-27734)
    f7635f0

    @ambv
    Copy link
    Contributor

    ambv commented Aug 13, 2021

    Since the refleak investigation moved to its own issue, I'm re-closing this one for clarity.

    @ambv ambv closed this as completed Aug 13, 2021
    @miss-islington
    Copy link
    Contributor

    New changeset ef36dfe by Benjamin Peterson in branch '3.10':
    [3.10] bpo-33930: Fix typo in the test name. (GH-27736)
    ef36dfe

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.9 only security fixes 3.10 only security fixes 3.11 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) type-crash A hard crash of the interpreter, possibly with a core dump
    Projects
    None yet
    Development

    No branches or pull requests