
Py_FatalError (msg=0x7f0e3b373232 "bad leading pad byte") at Python-2.7.5/Python/pythonrun.c:1689 #63043

Closed
mmokrejs mannequin opened this issue Aug 26, 2013 · 46 comments
Labels
type-crash A hard crash of the interpreter, possibly with a core dump

Comments


mmokrejs mannequin commented Aug 26, 2013

BPO 18843
Nosy @tim-one, @amauryfa, @pitrou, @vstinner, @yaseppochi, @skrah
Files
  • bad_leading_pad_byte_14028_shortened.txt: gdb stacktrace (partial)
  • python2.7_through_cython_valgrind.txt: python-2.7.5_or_cython0.19.1-valgrind-bugs.txt
  • core.emerge.8425.gdb.txt
  • stress.valgrind.stderr: stress.valgrind.stderr
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2013-12-01.20:38:49.261>
    created_at = <Date 2013-08-26.16:28:31.886>
    labels = ['invalid', 'type-crash']
    title = 'Py_FatalError (msg=0x7f0e3b373232 "bad leading pad byte") at Python-2.7.5/Python/pythonrun.c:1689'
    updated_at = <Date 2013-12-01.20:38:49.259>
    user = 'https://bugs.python.org/mmokrejs'

    bugs.python.org fields:

    activity = <Date 2013-12-01.20:38:49.259>
    actor = 'neologix'
    assignee = 'none'
    closed = True
    closed_date = <Date 2013-12-01.20:38:49.261>
    closer = 'neologix'
    components = []
    creation = <Date 2013-08-26.16:28:31.886>
    creator = 'mmokrejs'
    dependencies = []
    files = ['31474', '31490', '31511', '31532']
    hgrepos = []
    issue_num = 18843
    keywords = []
    message_count = 46.0
    messages = ['196222', '196223', '196226', '196227', '196228', '196232', '196233', '196243', '196249', '196251', '196254', '196326', '196334', '196335', '196336', '196337', '196339', '196481', '196483', '196487', '196491', '196492', '196497', '196498', '196499', '196500', '196503', '196509', '196511', '196512', '196518', '196520', '196522', '196557', '196570', '196571', '196578', '196636', '196638', '196656', '196657', '196691', '196694', '196695', '204954', '204958']
    nosy_count = 8.0
    nosy_names = ['tim.peters', 'amaury.forgeotdarc', 'mmokrejs', 'pitrou', 'vstinner', 'sjt', 'skrah', 'neologix']
    pr_nums = []
    priority = 'normal'
    resolution = 'not a bug'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'crash'
    url = 'https://bugs.python.org/issue18843'
    versions = ['Python 2.7']


    mmokrejs mannequin commented Aug 26, 2013

    Hi,
    using faulthandler and Python compiled with --with-pydebug and C*FLAGS=-ggdb, I got this stacktrace (I will attach a longer version as a file):

    (gdb) where
    #0 0x00007f0e3af8aacb in raise () from /lib64/libpthread.so.0
    #1 0x00007f0e3a0b05f6 in faulthandler_fatal_error (signum=6) at faulthandler.c:321
    #2 <signal handler called>
    #3 0x00007f0e3ac061f5 in raise () from /lib64/libc.so.6
    #4 0x00007f0e3ac0766b in abort () from /lib64/libc.so.6
    #5 0x00007f0e3b327828 in Py_FatalError (msg=0x7f0e3b373232 "bad leading pad byte") at /mnt/1TB/var/tmp/portage/dev-lang/python-2.7.5-r2/work/Python-2.7.5/Python/pythonrun.c:1689
    #6 0x00007f0e3b257dc8 in _PyObject_DebugCheckAddressApi (api=111 'o', p=0x449e6900) at /mnt/1TB/var/tmp/portage/dev-lang/python-2.7.5-r2/work/Python-2.7.5/Objects/obmalloc.c:1591
    #7 0x00007f0e3b257a6c in _PyObject_DebugFreeApi (api=111 'o', p=0x449e6900) at /mnt/1TB/var/tmp/portage/dev-lang/python-2.7.5-r2/work/Python-2.7.5/Objects/obmalloc.c:1478
    #8 0x00007f0e3b257913 in _PyObject_DebugFree (p=0x449e6900) at /mnt/1TB/var/tmp/portage/dev-lang/python-2.7.5-r2/work/Python-2.7.5/Objects/obmalloc.c:1422
    #9 0x00007f0e3b34319a in PyObject_GC_Del (op=0x449e6920) at /mnt/1TB/var/tmp/portage/dev-lang/python-2.7.5-r2/work/Python-2.7.5/Modules/gcmodule.c:1561
    #10 0x00007f0e3b275ef7 in tupledealloc (op=0x449e6920) at /mnt/1TB/var/tmp/portage/dev-lang/python-2.7.5-r2/work/Python-2.7.5/Objects/tupleobject.c:235
    #11 0x00007f0e3b255bac in _Py_Dealloc (op=(True,)) at /mnt/1TB/var/tmp/portage/dev-lang/python-2.7.5-r2/work/Python-2.7.5/Objects/object.c:2262
    #12 0x00007f0e3b246d77 in dict_dealloc (mp=0x449b5d80) at /mnt/1TB/var/tmp/portage/dev-lang/python-2.7.5-r2/work/Python-2.7.5/Objects/dictobject.c:1010
    #13 0x00007f0e3b255bac in _Py_Dealloc (op=
    {'_label': <unknown at remote 0x449dd4b8>, '_facecolors_original': (<float at remote 0xb65dd38>, <float at remote 0x5034630>, <float at remote 0xb3c6af8>), '_transform': <unknown at remote 0x449e6b50>, 'figure': <Figure(_label='', _transform=None, figure=None, _axobservers=[], images=[], texts=[], _hold=True, artists=[], _agg_filter=None, patch=<Rectangle(_label='', _transform=<BboxTransformTo(_invalid=2, _inverted=<Affine2D(_invalid=0, _inverted=None, _mtx=<numpy.ndarray at remote 0xd5bc6ec0>, _shorthand_name='', _parents=<WeakValueDictionary(_remove=<function at remote 0xbf7d060>, data={}) at remote 0x129fbdf8>) at remote 0xd5bb0300>, _boxout=<TransformedBbox(_invalid=2, _transform=<Affine2D(_invalid=2, _inverted=None, _mtx=<numpy.ndarray at remote 0xd5bad540>, _shorthand_name='', _parents=<WeakValueDictionary(_remove=<function at remote 0xc166450>, data={741848704: <KeyedRef at remote 0x40290a80>, 405520480: <KeyedRef at remote 0x29e86b10>, 322803328: <KeyedRef at remote 0x29e86570>, 3585872752: <KeyedRef ...(truncated)) at /mnt/1TB/var/tmp/portage/dev-lang/python-2.7.5-r2/work/Python-2.7.5/Objects/object.c:2262
    #14 0x00007f0e3b27ad3d in subtype_dealloc (
    self=<PathCollection(_label=<unknown at remote 0x449dd4b8>, _facecolors_original=(<float at remote 0xb65dd38>, <float at remote 0x5034630>, <float at remote 0xb3c6af8>), _transform=<unknown at remote 0x449e6b50>, figure=<Figure(_label='', _transform=None, figure=None, _axobservers=[], images=[], texts=[], _hold=True, artists=[], _agg_filter=None, patch=<Rectangle(_label='', _transform=<BboxTransformTo(_invalid=2, _inverted=<Affine2D(_invalid=0, _inverted=None, _mtx=<numpy.ndarray at remote 0xd5bc6ec0>, _shorthand_name='', _parents=<WeakValueDictionary(_remove=<function at remote 0xbf7d060>, data={}) at remote 0x129fbdf8>) at remote 0xd5bb0300>, _boxout=<TransformedBbox(_invalid=2, _transform=<Affine2D(_invalid=2, _inverted=None, _mtx=<numpy.ndarray at remote 0xd5bad540>, _shorthand_name='', _parents=<WeakValueDictionary(_remove=<function at remote 0xc166450>, data={741848704: <KeyedRef at remote 0x40290a80>, 405520480: <KeyedRef at remote 0x29e86b10>, 322803328: <KeyedRef at remote 0x29e86570>, 3585872752: <KeyedR...(truncated)) at /mnt/1TB/var/tmp/portage/dev-lang/python-2.7.5-r2/work/Python-2.7.5/Objects/typeobject.c:1015
    #15 0x00007f0e3b255bac in _Py_Dealloc (
    op=<PathCollection(_label=<unknown at remote 0x449dd4b8>, _facecolors_original=(<float at remote 0xb65dd38>, <float at remote 0x5034630>, <float at remote 0xb3c6af8>), _transform=<unknown at remote 0x449e6b50>, figure=<Figure(_label='', _transform=None, figure=None, _axobservers=[], images=[], texts=[], _hold=True, artists=[], _agg_filter=None, patch=<Rectangle(_label='', _transform=<BboxTransformTo(_invalid=2, _inverted=<Affine2D(_invalid=0, _inverted=None, _mtx=<numpy.ndarray at remote 0xd5bc6ec0>, _shorthand_name='', _parents=<WeakValueDictionary(_remove=<function at remote 0xbf7d060>, data={}) at remote 0x129fbdf8>) at remote 0xd5bb0300>, _boxout=<TransformedBbox(_invalid=2, _transform=<Affine2D(_invalid=2, _inverted=None, _mtx=<numpy.ndarray at remote 0xd5bad540>, _shorthand_name='', _parents=<WeakValueDictionary(_remove=<function at remote 0xc166450>, data={741848704: <KeyedRef at remote 0x40290a80>, 405520480: <KeyedRef at remote 0x29e86b10>, 322803328: <KeyedRef at remote 0x29e86570>, 3585872752: <KeyedR...(truncated)) at /mnt/1TB/var/tmp/portage/dev-lang/python-2.7.5-r2/work/Python-2.7.5/Objects/object.c:2262
    #16 0x00007f0e3b230841 in list_dealloc (op=0x1d1740d8) at /mnt/1TB/var/tmp/portage/dev-lang/python-2.7.5-r2/work/Python-2.7.5/Objects/listobject.c:309
    #17 0x00007f0e3b255bac in _Py_Dealloc (
    op=[<PathCollection(_label='_collection0', _facecolors_original=(<float at remote 0x5034630>, <float at remote 0xc489d80>, <float at remote 0xc568ab8>), _transform=<IdentityTransform(_invalid=1, _inverted=None, _shorthand_name='', _parents=<WeakValueDictionary(_remove=<function at remote 0xbf7d840>, data={}) at remote 0x129fb510>) at remote 0x1201cd10>, figure=<Figure(_label='', _transform=None, figure=None, _axobservers=[], images=[], texts=[], _hold=True, artists=[], _agg_filter=None, patch=<Rectangle(_label='', _transform=<BboxTransformTo(_invalid=2, _inverted=<Affine2D(_invalid=0, _inverted=None, _mtx=<numpy.ndarray at remote 0xd5bc6ec0>, _shorthand_name='', _parents=<WeakValueDictionary(_remove=<function at remote 0xbf7d060>, data={}) at remote 0x129fbdf8>) at remote 0xd5bb0300>, _boxout=<TransformedBbox(_invalid=2, _transform=<Affine2D(_invalid=2, _inverted=None, _mtx=<numpy.ndarray at remote 0xd5bad540>, _shorthand_name='', _parents=<WeakValueDictionary(_remove=<function at remote 0xc166450>, data={741848...(truncated)) at /mnt/1TB/var/tmp/portage/dev-lang/python-2.7.5-r2/work/Python-2.7.5/Objects/object.c:2262
    #18 0x00007f0e3b245789 in insertdict_by_entry (mp=0x17e2a9c0, key='collections', hash=5674278088882554420, ep=0x103a7dc0, value=[]) at /mnt/1TB/var/tmp/portage/dev-lang/python-2.7.5-r2/work/Python-2.7.5/Objects/dictobject.c:519
    #19 0x00007f0e3b245a5b in insertdict (mp=0x17e2a9c0, key='collections', hash=5674278088882554420, value=[]) at /mnt/1TB/var/tmp/portage/dev-lang/python-2.7.5-r2/work/Python-2.7.5/Objects/dictobject.c:556
    #20 0x00007f0e3b246326 in dict_set_item_by_hash_or_entry (op=0x0, key='collections', hash=5674278088882554420, ep=0x0, value=[]) at /mnt/1TB/var/tmp/portage/dev-lang/python-2.7.5-r2/work/Python-2.7.5/Objects/dictobject.c:765
    #21 0x00007f0e3b2464be in PyDict_SetItem (
    op={'transLimits': <BboxTransformFrom(_invalid=2, _boxin=<TransformedBbox(_invalid=2, _transform=<TransformWrapper(_invalid=0, transform_path_affine=<instancemethod at remote 0xbad73e0>, input_dims=2, transform_path=<instancemethod at remote 0x17f47d60>, transform=<instancemethod at remote 0xd511bee0>, transform_affine=<instancemethod at remote 0x1e0e0260>, inverted=<instancemethod at remote 0x17f47ce0>, _child=<BlendedAffine2D(_invalid=1, _y=<IdentityTransform(_invalid=1, _inverted=None, _shorthand_name='', _parents=<WeakValueDictionary(_remove=<function at remote 0x10c92648>, data={315699008: <KeyedRef at remote 0xbd63a80>}) at remote 0x350f5df8>) at remote 0x2c360bc0>, _x=<IdentityTransform(_invalid=1, _inverted=None, _shorthand_name='', _parents=<WeakValueDictionary(_remove=<function at remote 0x145ff990>, data={315699008: <KeyedRef at remote 0xbd63720>}) at remote 0xbfde2b8>) at remote 0x2c36c680>, _inverted=None, _mtx=None, _shorthand_name='', _parents=<WeakValueDictionary(_remove=<function at remote 0x10c...(truncated), key='collections', value=[]) at /mnt/1TB/var/tmp/portage/dev-lang/python-2.7.5-r2/work/Python-2.7.5/Objects/dictobject.c:818

    It was drawing a chart using matplotlib-1.2.1. Is this a Python or a matplotlib bug? I shortened the gdb stacktrace, hoping you don't need the rest.

    @mmokrejs mmokrejs mannequin added the type-crash A hard crash of the interpreter, possibly with a core dump label Aug 26, 2013

    mmokrejs mannequin commented Aug 26, 2013

    Should have included from the head of gdb output:

    Program terminated with signal 6, Aborted.

    amauryfa (Member) commented:

    This is a memory corruption. Please look at the memory before the freed address (0x449e6900); maybe there is an indication of some buffer overflow?


    mmokrejs mannequin commented Aug 26, 2013

    Would you please tell me which gdb commands I should issue for you? Thank you. BTW, I ran memtest86+ a few days ago; although this is non-ECC memory, I think the HW is fine.


    mmokrejs mannequin commented Aug 26, 2013

    Grr, I forgot to look into the file where I recorded STDERR.

    Debug memory block at address p=0x449e6900: API 'o'
    80 bytes originally requested
    The 7 pad bytes at p-7 are not all FORBIDDENBYTE (0xfb):
    at p-7: 0xfb
    at p-6: 0xfb
    at p-5: 0xfa *** OUCH
    at p-4: 0xfb
    at p-3: 0xfb
    at p-2: 0xfb
    at p-1: 0xfb
    Because memory is corrupted at the start, the count of bytes requested
    may be bogus, and checking the trailing pad bytes may segfault.
    The 8 pad bytes at tail=0x449e6950 are FORBIDDENBYTE, as expected.
    The block was made by call #8155854715 to debug malloc/realloc.
    Data at p: 00 00 00 00 00 00 00 00 ... 20 af 5b 3b 0e 7f 00 00
    Fatal Python error: bad leading pad byte
    Fatal Python error: Aborted

    Current thread 0x00007f0e3b7ec700:
    File "/usr/lib64/python2.7/site-packages/matplotlib/axes.py", line 932 in cla
    File "/usr/lib64/python2.7/site-packages/matplotlib/figure.py", line 906 in clf
    File "/usr/lib64/python2.7/site-packages/matplotlib/figure.py", line 926 in clear
    File "blah.py", line 12513 in draw_hist2d_plot

    In that function I call:
    del(_series)
    del(_ax1)
    _figure.clear()  # line 12513


    tim-one commented Aug 26, 2013

    Python's debug-mode memory allocators add some "magic values" before and after each allocated chunk of memory, and check them when the chunk is freed to make sure nobody overwrote them. In this case, someone did overwrite the byte at p-5, where p is the address of the block originally returned to the requesting program code.

    It's also suspicious that it says this block was originally allocated by the 8,155,854,715th call to a debug allocation routine: is it plausible that you (for example) allocated over 8 billion objects? This "serial number" is also stored in bytes adjacent to (and after) the allocated block, so is also vulnerable to out-of-bounds stores by rogue code.

    So we have out-of-bounds stores here both before and after the requested memory. Sorry, but it's unlikely core Python is doing this - such errors are usually due to rogue extension modules.
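
    The check described above can be illustrated with a small Python sketch. This is only a simulation of the layout Tim describes (leading pad bytes of 0xfb immediately before the user pointer p, verified on free), not CPython's actual C implementation in Objects/obmalloc.c:

```python
FORBIDDENBYTE = 0xFB  # the debug allocator's pad value

def check_leading_pad(buf, p, npad=7):
    """Return offsets (relative to p) of corrupted leading pad bytes."""
    bad = []
    for i in range(1, npad + 1):
        if buf[p - i] != FORBIDDENBYTE:
            bad.append(-i)  # -5 corresponds to "at p-5: *** OUCH"
    return bad

# Recreate the corruption from the report: an out-of-bounds store
# changed the byte at p-5 from 0xfb to 0xfa.
block = bytearray([FORBIDDENBYTE] * 7 + [0] * 80)
p = 7  # the "user" pointer starts right after the leading pad bytes
block[p - 5] = 0xFA
print(check_leading_pad(block, p))  # → [-5]
```

    An intact block yields an empty list; any non-empty result is what triggers the "bad leading pad byte" fatal error in a real debug build.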


    mmokrejs mannequin commented Aug 26, 2013

    Thank you for explaining what is going on. I called matplotlib's drawing function to draw 49308 dots and the corresponding legend items with their colors. That's all I can say. I am not a native English speaker, so I don't know what 'rogue extension modules' means.

    Do you think it is a matplotlib or numpy issue which were in the stacktrace? Or a memory module error?

    I am running analyses which take weeks. Will Python give me a clear warning/error when the object counter overflows?

    So what should I do now? I am clueless right now.


    tim-one commented Aug 26, 2013

    Memory corruption can be difficult to track down. Best thing you can do is strive to find a test case as small and fast as possible that shows the same kind of error.

    By "rogue extension module" I just mean 3rd-party C code (like, for example, matplotlib).

    I doubt it's a hardware problem. That's possible, of course, but these kinds of errors are almost always the result of errors in C code.

    The stacktrace probably isn't helpful. All we know is that memory got corrupted _sometime_ between someone asking for a block of memory and releasing it. The corruption may have happened a millisecond ago, or weeks ago (if the program ran that long) - there's no way to tell by the time the memory corruption is _detected_.

    About "object counter overflows", I'm not sure what you're asking. Python doesn't have an object counter. The "serial number" in debug-mode allocators just counts the number of times a debug-mode malloc has been called. If that overflows, it would do no harm.

    Bottom line: no matter what's to blame here, the smaller & faster a test program you can find, the more likely it is to get fixed.


    tim-one commented Aug 26, 2013

    By the way, if memory serves, compiling with --with-pydebug changes the memory layout of Python objects, so a Python compiled this way _cannot_ be used successfully with extension modules that were compiled without the same options. Did you rebuild your extensions too?
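
    As an aside, whether a running interpreter is a --with-pydebug build can be checked from Python itself. A minimal sketch (sys.gettotalrefcount() is compiled in only when Py_DEBUG is defined, so its presence is a reliable marker):

```python
import sys
import sysconfig

# Present only in --with-pydebug interpreters.
is_pydebug = hasattr(sys, "gettotalrefcount")
print("pydebug build:", is_pydebug)

# The recorded build configuration states the same fact (the value may
# be None on platforms that don't ship a Makefile, e.g. Windows).
print("Py_DEBUG =", sysconfig.get_config_var("Py_DEBUG"))
```

    Extensions built against a non-debug interpreter will report a different ABI than what this shows, which is why they must all be rebuilt.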


    mmokrejs mannequin commented Aug 26, 2013

    Yes, I have rebuilt all Python modules, but even gdb exited on startup due to the Python ABI change. I am using Gentoo Linux (https://bugs.gentoo.org/show_bug.cgi?id=482348), and unless python-updater forgot to include some package in the list of those needing to be recompiled, I am sane. And because gdb could not even start up, I *hope* those not yet recompiled would NOT work AT ALL.

    Thanks for the clarification; I thought that I might reach some internal number (the serial number, as you say) in Python and run out of some internal counters on objects. Actually, I hit these issues because I wondered why some of my application tests fail. Although all the tests crunch a lot of data, they merely do the same thing in the end: draw charts using matplotlib, which uses numpy. I have huge lists which I recently converted to generators (where possible), and now I even use imap(), izip(), ifilter() from itertools. One of the crashed tests has 153 levels in its gdb stacktrace, and a few frames from the very top already had the izip() objects. But the tests which crashed are not as huge as the others, maybe just 1/10 of their size, so I wonder why these failed.

    I think some crashes are related to my explicitly deleting a huge list in my code even before leaving a function. Or maybe to returning such lists between child/parent functions?

    Could valgrind or something else help to find out who is overwriting other objects' data? I don't have experience with using it, though.

    I think this _figure.clear() crash could be a manifestation of Python deleting a wrong object/pointer. Some ugly for loops over lists took ... I don't know how much, but in total even 26GB of RAM was reserved by the process (most of it also as resident memory). With itertools I got that down 10x.


    tim-one commented Aug 26, 2013

    Well, if you delete a giant list, and the list held the only remaining references to the objects it contained, then the memory for those objects will be freed, one object at a time. A debug build would then detect the memory corruption in those objects. But the corruption has nothing to do with deleting the list; deleting the list merely triggers the code that _detects_ the (pre-existing) corruption.

    I can just urge you again to try to find a failing test as small and fast as possible. You feel lost now precisely because you're wandering through a _mountain_ of code ;-)

    If you want to play with the debug serial numbers, you can set a breakpoint in function bumpserialno() (in Python's Objects/obmalloc.c). This is the entire function:

    static void
    bumpserialno(void)
    {
        ++serialno;
    }

    The function exists so you can easily set a breakpoint whenever serialno is increased (this function is the only place serialno is changed).

    What I _expect_ you'll find is that serialno never gets anywhere near 8155854715. If so, that just says again that the copy of serialno made when the corrupted object was created got corrupted (overwritten) by some bad C (or C++) code. It can't tell us who overwrote it, or when.


    mmokrejs mannequin commented Aug 27, 2013

    I took a crack at this from another angle. I converted my application using Cython and then compiled it with gcc. Finally, I ran valgrind over the binary. It is not finished yet, but it has already spotted plenty of hints.


    tim-one commented Aug 27, 2013

    Did you read Misc/README.valgrind (in the Python tree)? The warnings you've seen so far are probably all nonsense, and README.valgrind explains how to avoid getting them.


    mmokrejs mannequin commented Aug 27, 2013

    No, I did not know about that. Thanks, I have read it now.

    <quote>

    </quote>

    Why isn't there a configure switch for this? :(


    tim-one commented Aug 28, 2013

    I don't know why there isn't a configure switch for this - but then I've never used valgrind - LOL ;-)

    Other developers use valgrind on Python routinely, though. So it's unlikely you'll find a legitimate problem _in Python_ reported by valgrind.


    mmokrejs mannequin commented Aug 28, 2013

    I was just checking whether configure picked up my --with-pymalloc and incidentally saw:

    --with-valgrind Enable Valgrind support

    maybe Misc/README.valgrind needs revision and should explain what that does as well? ;-)

    It should also explain what one is supposed to do with Misc/valgrind-python.supp. It becomes clear once one goes to edit that file, true. ;-)


    tim-one commented Aug 28, 2013

    I opened bpo-18859 about the lack of --with-valgrind info in Misc/README.valgrind. Thanks for noticing!


    mmokrejs mannequin commented Aug 29, 2013

    I wanted to move away from the --with-pydebug build to a normal Python, and I failed with:

    # emerge dev-lang/python:2.7

    • IMPORTANT: 11 news items need reading for repository 'gentoo'.
    • Use eselect news to read news items.

    Calculating dependencies... done!
    Debug memory block at address p=0xa7f5900: API 'o'
    80 bytes originally requested
    The 7 pad bytes at p-7 are not all FORBIDDENBYTE (0xfb):
    at p-7: 0xfb
    at p-6: 0xfb
    at p-5: 0xfa *** OUCH
    at p-4: 0xfb
    at p-3: 0xfb
    at p-2: 0xfb
    at p-1: 0xfb
    Because memory is corrupted at the start, the count of bytes requested
    may be bogus, and checking the trailing pad bytes may segfault.
    The 8 pad bytes at tail=0xa7f5950 are FORBIDDENBYTE, as expected.
    The block was made by call #21242094 to debug malloc/realloc.
    Data at p: 73 00 00 00 79 00 00 00 ... 67 00 00 00 00 00 00 00
    Fatal Python error: bad leading pad byte
    Aborted (core dumped)

    Great. So, what can we gather from this now? It got overwritten within a few dozen seconds at most after I started emerge. It is in a different memory region (so no memory-module HW bug). I am attaching the resolved stacktrace just in case ...


    mmokrejs mannequin commented Aug 29, 2013

    One more note. At that time my application was still running, and it was parsing an XML file using xml.parsers.expat! That step runs in my pipeline before I render the figures (the initially reported crash case). So, matplotlib/numpy is ruled out.


    tim-one commented Aug 29, 2013

    Hmm. I don't quite know what you're doing: you said you're getting away from --with-pydebug, but these "bad leading pad byte" messages can't be generated unless Python is compiled with (at least) PYMALLOC_DEBUG defined.

    That said, my advice remains the same: whittle down the program (instead of fiddling with tools) to get a failing case as small & fast as possible. In your "few dozens of seconds at most", many _millions_ of things happen, and there's scant chance of out-thinking them.

    In the newest traceback (core.emerge.8425.gdb.txt), Python detected the corruption while running its cyclic-garbage collector. Forcing gc to run more often may catch the corruption sooner (use Python's gc.set_threshold() to force more frequent collections).
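
    For reference, a minimal sketch of the gc.set_threshold() suggestion; the exact values below are arbitrary, the point is only to make generation-0 collections much more frequent:

```python
import gc

# CPython's defaults are (700, 10, 10): collect generation 0 after
# 700 net container allocations.
print(gc.get_threshold())

# Much lower thresholds make the collector (and hence a debug build's
# corruption checks, which run as objects are traversed and freed)
# fire far more often.
gc.set_threshold(50, 5, 5)

# ... run the suspect workload here ...

gc.collect()  # a full collection can also be forced at any point
```

    The trade-off is speed: very low thresholds slow the program down considerably, which is acceptable when hunting corruption.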


    mmokrejs mannequin commented Aug 29, 2013

    Sorry, I should explain a bit more. Gentoo Linux uses a tool named emerge to manage packages. It is written in Python. So, by that command I initiated the re-compilation of Python itself, but it crashed quickly. At that moment emerge was run by the Python available on the system, which was configured using --with-pydebug.

    Meanwhile, I installed DUMA (http://duma.sourceforge.net/) and started to re-run emerge:

    # emerge duma
    # export LD_PRELOAD=/usr/lib64/libduma.so.0.0.0
    # sysctl -w vm.max_map_count=1000000
    # emerge dev-lang/python:2.7

    Thanks for your analysis, will try to stitch something.


    tim-one commented Aug 29, 2013

    Yet Another Tool ;-) Python's "small object" allocator grabs memory in chunks of 256KB from the system, and carves up the space itself. Other memory tools (like Valgrind ...) only see that Python has grabbed 256KB chunks, so can't detect anything wrong _within_ those large blocks. For that reason, compiling Python --without-pymalloc usually increases the usefulness of other tools.


    mmokrejs mannequin commented Aug 29, 2013

    Thanks, will recompile without pymalloc.

    I don't understand why there is always only 1 bit different. Could that be overwritten by another user's process, or do you believe it must be Python or one of the modules imported into it at runtime?


    tim-one commented Aug 29, 2013

    It would be a severely lame OS that allowed a process to overwrite another process's memory ;-) "Bad C or C++ code", in the process you're running, is still the best guess.

    A memory module that sometimes dropped the last bit _could_ be at fault, but I doubt it (0xfb changes to 0xfa if the last bit is cleared).

    That the address is different across the two failure examples doesn't rule out bad memory, though: the memory address an OS shows you is a "logical address", an _offset_ from the start of the physical memory the OS happened to assign to the process. Across different runs, the same physical memory location may be shown as any number of different logical addresses.

    We'll understand why it's off by 1 eventually ;-) For example, it could be C code as simple as:

        char* p = (char *)allocate_some_memory(80);
        ...
        p -= 5;
        ...
        if (something)
            --*p;

    It's extremely easy for C code to do insane stores; indeed, that's why there are so many sophisticated tools for trying to find the source of memory corruption in C/C++ code.

    Question: did the last failure come from your code, or from merely running emerge? You said your code "was parsing an XML file" at the time, but there's no trace of expat (or any other xml parsing code) in the traceback, and the traceback "bottoms out" at emerge. Maybe this has nothing at all to do with your code????


    mmokrejs mannequin commented Aug 29, 2013

    That is why I asked whether another process can interfere. So, they are isolated on Linux, good. ;-) The crash in #msg196481 is just emerge, which is written in Python; at the start it resolves some graph of package dependencies, and once it has resolved the order of packages it prints a lot of output. It did not get that far before it crashed. In sum, given what you said, my application running as another process should not have interfered with the emerge process, so that was an irrelevant comment from my side. ;)


    mmokrejs mannequin commented Aug 29, 2013


    tim-one commented Aug 29, 2013

    I sent a msg to Python-Dev, asking for a Gentoo user to try to reproduce this. Thanks!


    yaseppochi mannequin commented Aug 30, 2013

    I have a Gentoo host, but I'd like to know how the OP got a debug Python in the first place. The ebuild for python 2.7.5-r1 doesn't say anything about debug options. "How" would preferably include information about the C compiler used, etc. If there's no information, I can probably just hack --with-pydebug into a local ebuild, but I have something like 7 C compilers installed, and I'd really like a good guess at the right one. Also, did he add any optimization flags etc. when building the debug Python?

    (ebuild = the emerge configuration file that describes the build and install process for a package. In "2.7.5-r1" the "r1" indicates the first revised ebuild for the same upstream version.)


    tim-one commented Aug 30, 2013

    Thanks for chiming in, Stephen! I can't answer your questions, but agree the original poster needs to tell you exactly what he did -- else we'd just be thrashing at random. There's been enough of that already ;-)


    tim-one commented Aug 30, 2013

    Martin, can you please supply exact commands Stephen can use to try to reproduce the problem you saw running emerge? And try them yourself first, to make sure you can reproduce the problem too.

    Any detail may be important. For example, it's possible this is due to a bug in the C compiler used to compile Python, or due to some flag passed to the compiler. That's why "exact commands" are important.

    The good news: if Stephen can reproduce it, he'll probably figure out all the rest, and you'll be free to resume enjoying your life ;-)


    neologix mannequin commented Aug 30, 2013

    Just two things:

    • running under valgrind with the suppression file would help pinpoint this
    • usually, stack/heap overflows or invalid pointer dereference affect a contiguous chunk of memory: here, there's a *single bit* flipped, not even a byte
      0xfb == 0b11111011
      0xfa == 0b11111010

    This looks like a hardware issue to me.
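
    The single-bit observation above is easy to verify with a couple of lines of Python:

```python
a, b = 0xFB, 0xFA  # expected pad byte vs. the value actually found
print(format(a, "08b"), format(b, "08b"))  # 11111011 11111010
flipped = a ^ b                            # XOR isolates the differing bits
print(bin(flipped).count("1"))             # → 1, a single flipped bit
```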


    mmokrejs mannequin commented Aug 30, 2013

    Hi Stephen,
    I discussed USE=debug here (https://bugs.gentoo.org/show_bug.cgi?id=482348), and it was rejected by the portage maintainers because it is not only a debug addition but a whole ABI change. We have to live with:

    mkdir -p /etc/portage/env
    echo 'EXTRA_ECONF="--with-pydebug"' > /etc/portage/env/pydebug.conf
    echo dev-lang/python pydebug.conf >> /etc/portage/package.env

    The above is what I had originally. Yesterday I even tried:

    # cat /etc/portage/env/pydebug.conf
    EXTRA_ECONF="--with-pydebug --without-pymalloc --with-valgrind"

    but I don't know what the valgrind option really does, and whether it means:
    a) Python will run itself under valgrind; don't bother doing it yourself
    b) you don't have to bother with uncommenting the lines in valgrind-python.supp
    c) "make install" will install valgrind-python.supp, but you have to edit it on your own
    d) you may use valgrind, but don't use another tool that replaces malloc(), like Electric Fence, DUMA, etc.

    Or some combination of them? :( README.valgrind does not answer this at all.

    I let DUMA inspect an "emerge boost" run overnight, but my computer stopped responding (16GB RAM).

    I tried only gcc-4.7.3 and python-2.7.5-r2.

    CFLAGS="-ggdb -pipe -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -msse4 -mavx -maes -mpclmul -mpopcnt -march=corei7-avx"
    CXXFLAGS="${CFLAGS}"

    Per the comment from Charles-François: so you mean that this single-bit change won't be caught by valgrind, right? Why does memtest86+ not detect that? Could Python, when compiled with --with-pydebug, also print the physical hardware address of the wrong value? That would be really helpful here! Thanks.

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Aug 30, 2013

    I understand that you are building Python using "emerge". I would try to get a release from python.org, do a normal build ...

    make distclean
    ./configure --prefix=/tmp --with-pydebug --with-valgrind
    make
    make install

    ...

    then install matplotlib using

    /tmp/bin/python setup.py install

    ...

    then run (assuming you are in the build directory):

    valgrind --suppressions=Misc/valgrind-python.supp /tmp/bin/python your_program

    @neologix
    Copy link
    Mannequin

    neologix mannequin commented Aug 30, 2013

    2013/8/30 Martin Mokrejs <report@bugs.python.org>:

    Per comment from Charles-François, so you mean that this single-bit change won't be caught by valgrind, right? Why does not memtest86+ detect that? Could python when compiled with the --with-pydebug print also physical, hardware address of the wrong value? That would be really helpful here! Thanks.

    I mean that in the vast majority of cases, a single bit flip is due to
    a hardware error.
    That can be due to faulty RAM, electric noise, cosmic rays...

    Software-induced memory corruptions generally corrupt at least a byte.

    Do you reproduce the crash systematically, or is it random?

    @tim-one
    Copy link
    Member

    tim-one commented Aug 30, 2013

    Martin, would it be possible to borrow someone else's machine and try to reproduce this? If you can, that would greatly reduce the probability of this being a HW error. It would also leave us with an exact set of commands to share so others can try it on their boxes.

    It's true that SW memory corruption usually affects several contiguous bytes, but "usually" != "always" ;-)

    @mmokrejs
    Copy link
    Mannequin Author

    mmokrejs mannequin commented Aug 30, 2013

    I think it would be of tremendous help if python reported the physical memory address, though I do not know how much work that is for you. As I already said, I wonder why memtest86+ could not find an error; I even tried other tools for that. Could --with-pydebug, immediately after writing those pad bytes, try to read them back, just to make sure they are readable? Theoretically one could catch the wrong bit of memory immediately.

    It is random; I think the CPU must be fully loaded for a long while, plus something unknown has to kick in. Maybe that has to heat up the memory modules to cause the failure. My motherboard is already the third one, but I doubt it is at fault, except that the SandyBridge chip may be bad by design. It was replaced a few days ago along with the CPU cooler (the cooler was the main reason for the technician to come). I take it that replacing the motherboard was a nice offer from Dell. I would instead think that maybe the CPU is heating up too much.

    The BIOS is weird. ACPI and PCIe ASPM are exploiting Linux kernel hotplug support, and PCI developers are puzzled. I have seen SandyBridge reset e.g. its USB 2.0 controller when there was a hotplug event on some of its other root ports (an express card was inserted/ejected). But overall I don't think it is a bad deal, and it works. These occasional glitches are, I think, a software or BIOS error. Otherwise it wouldn't keep running all these days and nights. No, it is not overheated, at least not now with the replaced cooler.

    I can put in the other memory module (the one I bought the laptop from Dell with). Other than that, hmm ... I could take out the hard drive and attach it to some other computer, boot from a Live DVD ... The problem is that I need a lot of RAM for some of my stuff, and it would have to be a relatively similar CPU type (I compiled binaries with AVX). Yes, the emerge crash was definitely another case; I could hope to trigger that on even a 1 GB RAM machine, sure. I just don't have a test case that triggers it. I tried to stitch something together, but no, I probably coded something not going in the right direction. If you had the time to put together some test for me to execute with the garbage collector calls, that would probably be best.

    @tim-one
    Copy link
    Member

    tim-one commented Aug 30, 2013

    OK, it sounds to me like you do not have a reproducible test case of any kind. If that's true, this bug report isn't going anywhere :-(

    Python isn't a memory-testing program, so it would be insanely inefficient for it to (for example) read up every byte it writes, just to make sure the memory isn't bad. If you want to alter the source of _your_ Python to do that, fine! It won't go into the official distribution, but it might help you.

    For the same reason, core Python won't go into the business of sorting out various kinds of memory addresses. User-level code (including Python) works with the virtual (logical) addresses the operating system gives it. For a hint of some of the difficulties in going deeper than that, here's a note from Linus Torvalds:

    http://www.tldp.org/LDP/khg/HyperNews/get/devices/addrxlate.html
    

    About "If you would have the time to stitch down some test for me to execute with the garbage collector calls, it would be probably the best." I'm not sure what you're suggesting there. That we write a Python program that tries to trigger errors in your BIOS? LOL - not likely ;-)

    If you want to force a lot of allocations, deallocations, and gc runs, something like this will do it:

    """
    def stress():
    import gc

        class C(object):
            def __init__(self):
                self.loop = self
    
        N = 500000
        d = dict((C(), C()) for i in xrange(N))
        j = 0
        while 1:
            for i in xrange(N // 2):
                d.popitem()
            for i in xrange(N // 2):
                d[C()] = C()
            j += 1
            if j % 10 == 0:
                print j
                gc.set_debug(gc.DEBUG_STATS)
                gc.collect()
                gc.set_debug(0)
    
    stress()
    """

    Setting N larger will make it consume more RAM. The output is essentially meaningless, produced just so you know it's still running.

    @mmokrejs
    Copy link
    Mannequin Author

    mmokrejs mannequin commented Aug 31, 2013

    So with your test program I did not yet hit the error. It used 1.3 GB of RAM in the end, maybe in a good region. Since this was python-2.7.5 configured with --without-pymalloc, maybe the output from valgrind becomes more useful.

    @pitrou
    Copy link
    Member

    pitrou commented Aug 31, 2013

    Hardware failures or glitches can involve any number of parameters.
    Instead of trying to reproduce the memory corruption with other Python programs, I would suggest you run the exact same workload (including Python version and build options) on another computer (preferably with different hardware characteristics) and see if you can still trigger it.

    (bonus points if you can try on a setup with ECC RAM)

    @yaseppochi
    Copy link
    Mannequin

    yaseppochi mannequin commented Aug 31, 2013

    I did "emerge =dev-lang/python-2.7.5-r1" *twice* with the environment configuration described in msg196520, then *once* with it disabled because one of the cases you described was when you tried to revert to a non-debug Python. (Besides, I am willing to risk your crash while I'm watching for it, but not a time bomb that will go off when I'm on deadline :-). All builds succeeded and all passed the test suite. Here's how the debug build describes itself:

    == CPython 2.7.5 (default, Sep 1 2013, 00:59:02) [GCC 4.6.4]
    == Linux-3.9.0-x86_64-Dual_Core_AMD_Opteron-tm-_Processor_265-with-gentoo-2.2 little-endian

    The test suite ran uneventfully (with a few DeprecationWarnings) except for this:

    6 skips unexpected on linux2:
    test_bsddb test_bsddb3 test_tcl test_tk test_ttk_guionly
    test_ttk_textonly

    but I suppose that is expected on Gentoo. If any of those modules (bsddb, tcl, tk) are built into your Python, a problem in one of those might be the culprit.

    Oh, damn. I just reread the whole thread. For some reason I thought you were using gcc 4.6.4, but now I see you report 4.7.3. OK, building with 4.7.3 and your flags (also restoring the --with-pydebug config):

    # export CFLAGS=" -ggdb -pipe -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -msse4 -mavx -maes -mpclmul -mpopcnt"
    # export CXXFLAGS="${CFLAGS}"
    # export CC=gcc-4.7.3

    and we crash (from make output) immediately after linking ./python:

    x86_64-pc-linux-gnu-ranlib libpython2.7.a
    gcc-4.7.3 -pthread -Wl,--hash-style=gnu -Wl,-O1 -Wl,--as-needed -L. -Xlinker -export-dynamic -o python \
    Modules/python.o \
    -L. -lpython2.7 -lpthread -ldl -lutil -lm
    LD_LIBRARY_PATH=/var/tmp/portage/dev-lang/python-2.7.5-r1/work/x86_64-pc-linux-gnu: ./python -E -S -m sysconfig --generate-posix-vars
    make: *** [pybuilddir.txt] Illegal instruction

    However, I'm pretty sure this is due to my hardware not liking your -m flags, not the crash you reported. I'll try backing those flags out, but if anybody has a suggestion for the most aggressive set similar to yours, I'd appreciate it.

    But first this process is going to go sleep(25200).

    @tim-one
    Copy link
    Member

    tim-one commented Aug 31, 2013

    Someone may find the new stress.valgrind.stderr interesting, but - since I've never used valgrind - it doesn't mean much to me.

    I _expected_ you'd run the little stress program under a debug Python and without valgrind, since that's the only combination you've tried so far that showed a definite problem (the "bad leading pad byte" death, or the segfault in the other issue you filed).

    But it doesn't much matter - this is all just thrashing at random, yes? You need to find a reproducible test case, and/or try different hardware. The little stress program may or may not provoke an error under a debug-build Python, and may or may not require increasing N (to consume more RAM).

    If it does provoke an error, the next thing to try would be to write a little program that just writes 0xfb across a massive number of bytes, and then reads them all back to verify they're still 0xfb. Or write one like that now, preferably in C (it may matter how quickly the bytes are written - and it may not). But at this point you're starting to write your own memory-testing program.
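    A minimal sketch of the memory test Tim describes, filling a buffer with 0xfb (the value CPython's debug build uses for clean pad bytes) and reading it back. The buffer size here is an assumption; raise it to stress more RAM.

    ```c
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        /* 64 MiB here; raise toward your RAM size to stress more memory. */
        size_t size = (size_t)64 << 20;
        unsigned char *buf = malloc(size);
        if (buf == NULL) {
            perror("malloc");
            return 1;
        }
        /* Write 0xfb across the whole buffer ... */
        memset(buf, 0xfb, size);
        /* ... then read every byte back and check it is unchanged. */
        for (size_t i = 0; i < size; i++) {
            if (buf[i] != 0xfb) {
                fprintf(stderr, "byte changed at offset %zu: 0x%02x\n",
                        i, buf[i]);
                free(buf);
                return 1;
            }
        }
        puts("all bytes still 0xfb");
        free(buf);
        return 0;
    }
    ```

    On healthy hardware this should always print "all bytes still 0xfb"; looping it for hours (or varying the fill pattern and timing) would come closer to the conditions described in this thread.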

    In any case, there's really no evidence of an error in Python so far. Yes, Python has _detected_ a problem in some cases. But without a reproducible test case, I don't see that there's anything more we can do for you on our end - sorry.

    @yaseppochi
    Copy link
    Mannequin

    yaseppochi mannequin commented Sep 1, 2013

    OK, I backed off the aggressive CFLAGS/CXXFLAGS to " -ggdb -pipe", and ran "emerge =dev-lang/python-2.7.5-r1" *once* each with and without the 'EXTRA_ECONF="--with-pydebug"' flag. Compiled with GCC 4.7.3.

    No crash, same test results as described previously for GCC 4.6.4.

    If you have other suggestions, let me know.

    @tim-one
    Copy link
    Member

    tim-one commented Sep 1, 2013

    Thanks for that, Stephen! I don't know of anything else you could try that would be helpful. The OP doesn't appear able to reproduce his problems either, and last I heard he was off running emerge under DUMA:

    http://duma.sourceforge.net/
    

    Why? Hope springs eternal ;-)

    @yaseppochi
    Copy link
    Mannequin

    yaseppochi mannequin commented Sep 1, 2013

    Yeah, hope is a good thing. But I've spent the last 20 years debugging an X11 application based on a Lisp interpreter, so these days I save hope for fireflies, my dog, and my daughter. :-)

    To the OP:

    I don't follow Gentoo closely, but I have acquaintances who do. Between them and the occasional foray into the forums, I've gotten the impression that providing CFLAGS for optimization is associated with having hard-to-debug problems. They increase performance noticeably in only a few applications. Python being a dynamic language, function calls and even variable references can be quite inefficient anyway. So I see no good reason to compile Python with aggressive CFLAGS, because it should be used only for moderately performance-sensitive applications, as "glue code", and to provide UI. Instead, use them only for the specific applications that benefit (I suppose matplotlib *might* be one).

    Second, I tend to agree with the maintainers. The packages.env / pydebug.conf approach is the right thing for this kind of variant build.

    Third, you said you hoped to get better backtraces from --with-pydebug. That's a vain hope. Such options are intended to get better backtraces of C code from core dumps where the interpreter breaks down, not of Python code induced by Python exceptions caused by problems in user code. If you have trouble interpreting a backtrace, ask on python-list@python.org or comp.lang.python (they mirror each other, you only need one). If, after understanding the backtrace, you have an idea for a way to get a better backtrace in this case, you can suggest it on python-ideas@python.org.

    Unfortunately, reporting "this backtrace is unintelligible, please improve it" as an RFE on the tracker is likely to get the reply "You're right, but we don't know how at this time. Patches welcome!" But you could try that if all else fails.

    @mmokrejs
    Copy link
    Mannequin Author

    mmokrejs mannequin commented Dec 1, 2013

    Hi,
    I think I should report back what I found on the hardware side. While memory-testing tools like memtest86+ and others did not find any error, the built-in Dell ePSA test suite likely computes a checksum of tested memory regions. It reported some addresses/regions as failed; sadly, nobody seems to know the details of the failing tests. On repeated testing, different memory regions were reported, so I never understood whether it was a bad CPU cache or something else randomizing the observed issue. At least only one of the two SO-DIMMs caused the problems, so let's conclude it was partly broken and failing randomly. At that time it seemed the cause was either a bad CPU producing just too much heat, or the fan. The fan was replaced; max temps went down from 92 °C to 82 °C. Two months later I found more and more often that an external HDMI-connected display did not receive a signal, so even the CPU got replaced. I got another drop in max temperatures; now the max is about 70 °C. Cool!

    Back to python: the random crashes of my apps stopped after the memory module was replaced (actually the whole pair was replaced). I started to dream about the Linux kernel mirroring memory internally for failure resiliency, but there is nothing like that available.

    In summary, this lesson was hard and showed that there are no good tools to test hardware. Checksums should always be used, and bits tested for fading over time. The mirroring trick could also have uncovered a failing memory or CPU. It seems there is still a way to go to a perfect computer.

    Thanks to everybody for their efforts on this issue. Whether python takes something from this lesson is up to you.

    @neologix
    Copy link
    Mannequin

    neologix mannequin commented Dec 1, 2013

    Thanks for the feedback!

    @neologix neologix mannequin closed this as completed Dec 1, 2013
    @neologix neologix mannequin added the invalid label Dec 1, 2013
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022