Py_FatalError (msg=0x7f0e3b373232 "bad leading pad byte") at Python-2.7.5/Python/pythonrun.c:1689 #63043
Hi, here is the gdb `where` output. It was drawing a chart using matplotlib-1.2.1. Is this a Python or matplotlib bug? I shortened the gdb stacktrace hoping you don't need the rest. |
I should have included this from the head of the gdb output: Program terminated with signal 6, Aborted. |
This is a memory corruption. |
Would you please guide me on what gdb commands I should issue for you? Thank you. BTW, I ran memtest86+ a few days ago; although this is non-ECC memory, I think the HW is fine. |
Grr, forgot to look into a file where I recorded STDERR. It contains:

```
Debug memory block at address p=0x449e6900: API 'o'
Current thread 0x00007f0e3b7ec700:
```

In that function I call: |
Python's debug-mode memory allocators add some "magic values" before and after each allocated chunk of memory, and check them when the chunk is freed to make sure nobody overwrote them. In this case, someone did overwrite the byte at p-5, where p is the address of the block originally returned to the requesting program code. It's also suspicious that it says this block was originally allocated by the 8,155,854,715th call to a debug allocation routine: is it plausible that you (for example) allocated over 8 billion objects? This "serial number" is also stored in bytes adjacent to (and after) the allocated block, so is also vulnerable to out-of-bounds stores by rogue code. So we have out-of-bounds stores here both before and after the requested memory. Sorry, but it's unlikely core Python is doing this - such errors are usually due to rogue extension modules. |
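The pad-byte scheme Tim describes can be illustrated with a toy Python model (CPython's actual implementation is C code in Objects/obmalloc.c, and its real layout also records the requested size and a serial number; the function names below are mine):

```python
FORBIDDEN = 0xFB  # the "forbidden byte" pad value CPython's debug allocator writes

def debug_malloc(n):
    """Toy sketch: surround n usable bytes with 4 pad bytes on each side,
    roughly as PYMALLOC_DEBUG does (layout simplified for illustration)."""
    return bytearray([FORBIDDEN] * 4 + [0] * n + [FORBIDDEN] * 4)

def debug_free(block, n):
    """On free, verify that nobody overwrote the pads."""
    if any(b != FORBIDDEN for b in block[:4]):
        raise RuntimeError("bad leading pad byte")
    if any(b != FORBIDDEN for b in block[4 + n:]):
        raise RuntimeError("bad trailing pad byte")

blk = debug_malloc(16)
debug_free(blk, 16)    # a clean block passes the check silently
blk[3] = 0xFA          # rogue out-of-bounds store just before the returned pointer
try:
    debug_free(blk, 16)
except RuntimeError as e:
    print(e)           # bad leading pad byte
```

The key point, as Tim says, is that the check only fires when the block is freed, which may be long after the corrupting store actually happened.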
Thank you for the explanation of what is going on. I called matplotlib's drawing function to include 49308 dots and corresponding legend items with their colors. That's all I can say. I am not a native English speaker, so I don't know what 'rogue extension modules' means. Do you think it is a matplotlib or numpy issue, since they were in the stacktrace? Or a memory module error? I am running analyses which take weeks. Will Python give me a clear warning/error when the object counter overflows? So what should I do now? I am clueless right now. |
Memory corruption can be difficult to track down. Best thing you can do is strive to find a test case as small and fast as possible that shows the same kind of error. By "rogue extension module" I just mean 3rd-party C code (like, for example, matplotlib). I doubt it's a hardware problem. That's possible, of course, but these kinds of errors are almost always the result of errors in C code. The stacktrace probably isn't helpful. All we know is that memory got corrupted _sometime_ between someone asking for a block of memory and releasing it. The corruption may have happened a millisecond ago, or weeks ago (if the program ran that long) - there's no way to tell by the time the memory corruption is _detected_. About "object counter overflows", I'm not sure what you're asking. Python doesn't have an object counter. The "serial number" in debug-mode allocators just counts the number of times a debug-mode malloc has been called. If that overflows, it would do no harm. Bottom line: no matter what's to blame here, the smaller & faster a test program you can find, the more likely it is to get fixed. |
By the way, if memory serves, compiling with --with-pydebug changes the memory layout of Python objects, so a Python compiled this way _cannot_ be used successfully with extension modules that were compiled without the same options. Did you rebuild your extensions too? |
Yes, I have rebuilt all Python modules, but even gdb exited on startup due to the Python ABI change. I am using Gentoo Linux (https://bugs.gentoo.org/show_bug.cgi?id=482348) and, unless python-updater forgot to include some package in the listing of those needing to be recompiled, I am sane. And because gdb could not even start up, I *hope* those not yet recompiled would NOT work AT ALL. Thanks for the clarification; I thought that I might reach some internal number (the serial number, as you say) in Python and run out of some internal counters on objects. Actually, I hit these issues because I wondered why some of my application tests fail. Although all tests crunch a lot of data, they merely do the same thing in the end: draw charts using matplotlib, which uses numpy. I have huge lists which I recently converted to generators (where possible), and now I even use imap(), izip(), ifilter() from itertools. One of the crashed tests has 153 levels in the gdb stacktrace, and a few lines from the very top already had the izip() objects. But the tests which crashed are not as huge as the others, maybe just 1/10 of the size, so I wonder why these failed. I think some crashes are related to me explicitly deleting a huge list in my code even before leaving a function. Or maybe returning such lists between child/parent functions? Could valgrind or something else help to find who is overwriting the data of others? But I don't have experience with using it. I think this _figure.clear() crash could be a manifestation of Python deleting a wrong object/pointer. Some ugly for loops over lists took ... I don't know how much, but in total even 26GB of RAM was reserved by the process (most of it also as resident memory). With itertools I got that down 10x. |
Well, if you delete a giant list, and the list held the only references remaining to the objects the list contained, then the memory for those objects will be free'd, one object at a time. A debug build would then detect the memory corruption in those objects. But the corruption has nothing to do with deleting the list then - deleting the list would merely trigger the code that _detects_ the (pre-existing) corruption. I can just urge you again to try to find a failing test as small and fast as possible. You feel lost now precisely because you're wandering through a _mountain_ of code ;-) If you want to play with the debug serial numbers, you can set a breakpoint in function bumpserialno() (in Python's Objects/obmalloc.c). This is the entire function:

```c
static void
bumpserialno(void)
{
    ++serialno;
}
```

The function exists so you can easily set a breakpoint whenever `serialno` changes. What I _expect_ you'll find is that serialno never gets anywhere near 8155854715. If so, that just says again that the copy of serialno made when the corrupted object was created got corrupted (overwritten) by some bad C (or C++) code. It can't tell us who overwrote it, or when. |
I took a crack at it from another angle. I converted my application using cython and then used gcc. Finally, I ran valgrind over the binary. It is not finished yet but has already spotted plenty of hints. |
Did you read Misc/README.valgrind (in the Python tree)? The warnings you've seen so far are probably all nonsense, and README.valgrind explains how to avoid getting them. |
No, I did not know that. Thanks, I have read it now. <quote>
</quote> Why isn't there a configure switch for this? :( |
I don't know why there isn't a configure switch for this - but then I've never used valgrind - LOL ;-) Other developers use valgrind on Python routinely, though. So it's unlikely you'll find a legitimate problem _in Python_ reported by valgrind. |
I was just checking whether configure picked up my --with-pymalloc and incidentally saw:

```
--with-valgrind    Enable Valgrind support
```

Maybe Misc/README.valgrind needs revision and should explain what that does as well? ;-) It should also explain what one is supposed to do with Misc/valgrind-python.supp. It becomes clear once one goes to edit that file, true. ;-) |
I opened bpo-18859 about the lack of --with-valgrind info in Misc/README.valgrind. Thanks for noticing! |
I wanted to move away from the --with-pydebug build to a normal Python and I failed with:

```
# emerge dev-lang/python:2.7
Calculating dependencies... done!
```

Great. So, what can we gather from this? The byte got overwritten within a few dozen seconds at most after I started emerge. It is in a different memory region (so no memory-module HW bug). I am attaching the resolved stacktrace just in case ... |
One more note. At that time my application was running, parsing an XML file using xml.parsers.expat! That step runs in my pipeline before I render figures (the initially reported crash case). So, matplotlib/numpy is ruled out. |
Hmm. I don't quite know what you're doing: you said you're getting away from --with-pydebug, but these "bad leading pad byte" messages can't be generated unless Python is compiled with (at least) PYMALLOC_DEBUG defined. That said, my advice remains the same: whittle down the program (instead of fiddling with tools) to get a failing case as small & fast as possible. In your "few dozens of seconds at most", many _millions_ of things happen, and there's scant chance of out-thinking them. In the newest traceback (core.emerge.8425.gdb.txt), Python detected the corruption while running its cyclic-garbage collector. Forcing gc to run more often may catch the corruption sooner (use Python's gc.set_threshold() to force more frequent collections). |
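Tim's suggestion about forcing more frequent collections takes only a couple of lines; a sketch (the threshold value chosen here is illustrative, and aggressively low settings will slow the program down):

```python
import gc

# The generation-0 threshold counts allocations minus deallocations;
# when it is exceeded, a collection runs. The default is typically (700, 10, 10).
print(gc.get_threshold())

# Lower the generation-0 threshold drastically so collections run far more
# often, which may detect corruption of gc-tracked objects closer to when
# it actually happens.
gc.set_threshold(1)

# A full collection can also be forced by hand; it returns the number of
# unreachable objects it found.
print(gc.collect())
```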
Sorry, I should explain a bit more. Gentoo Linux uses a tool named emerge to handle packages. It is written in Python. So, by that command I initiated re-compilation of Python itself, but it crashed quickly. At that moment emerge was run by the Python available on the system at the time, which was configured using --with-pydebug. Meanwhile I went to install DUMA (http://duma.sourceforge.net/) and started to re-run emerge:

```
# emerge duma
```

Thanks for your analysis, I will try to stitch something together. |
Yet Another Tool ;-) Python's "small object" allocator grabs memory in chunks of 256KB from the system, and carves up the space itself. Other memory tools (like Valgrind ...) only see that Python has grabbed 256KB chunks, so can't detect anything wrong _within_ those large blocks. For that reason, compiling Python --without-pymalloc usually increases the usefulness of other tools. |
Thanks, I will recompile without pymalloc. I don't understand why it is always only 1 bit different. Could that be overwritten by another user's process, or do you believe it must be Python or one of the modules imported into it at runtime? |
It would be a severely lame OS that allowed a process to overwrite another process's memory ;-) "Bad C or C++ code", in the process you're running, is still the best guess. A memory module that sometimes dropped the last bit _could_ be at fault, but I doubt it (0xfb changes to 0xfa if the last bit is cleared). That the address is different across the two failure examples doesn't rule out bad memory, though: the memory address an OS shows you is a "logical address", an _offset_ from the start of the physical memory the OS happened to assign to the process. Across different runs, the same physical memory location may be shown as any number of different logical addresses. We'll understand why it's off by 1 eventually ;-) For example, it could be C code as simple as:

```c
char* p = (char *)allocate_some_memory(80);
...
p -= 5;
...
if (something)
    --*p;
```

It's extremely easy for C code to do insane stores; indeed, that's why there are so many sophisticated tools for trying to find the source of memory corruption in C/C++ code. Question: did the last failure come from your code, or from merely running |
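Tim's observation that 0xfb with its lowest bit cleared is exactly 0xfa is easy to verify:

```python
expected, seen = 0xFB, 0xFA
# XOR isolates the differing bits; here exactly one bit (the lowest) differs.
diff = expected ^ seen
print(hex(diff))             # 0x1
print(bin(diff).count("1"))  # 1
```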
That is why I asked whether another process can interfere. So, they are isolated on Linux, good. ;-) The crash in msg196481 is just emerge, which is written in Python; at the start it resolves a graph of package dependencies, and once it resolves the order of packages it prints a lot of output. It did not get that far before it crashed. In sum, given what you said, my application running as another process should not interfere with the emerge process, so it turns out that was an irrelevant comment on my side. ;) |
I sent a msg to Python-Dev, asking for a Gentoo user to try to reproduce this. Thanks! |
I have a gentoo host, but I'd like to know how the OP got a debug Python in the first place. The ebuild for python 2.7.5-r1 doesn't say anything about debug options. "How" would preferably include information about the C compiler used, etc. If there's no information, I can probably just hack --with-pydebug into a local ebuild, but I have something like 7 C compilers installed, and I'd really like a good guess at the right one. Also, did he add any optimization flags etc. when building the debug Python? (ebuild = the emerge configuration file that describes the build and install process for a package. In "2.7.5-r1" the "r1" indicates the first revised ebuild for the same upstream version.) |
Thanks for chiming in, Stephen! I can't answer your questions, but agree the original poster needs to tell you exactly what he did -- else we'd just be thrashing at random. There's been enough of that already ;-) |
Martin, can you please supply the exact commands Stephen can use to try to reproduce the problem you saw? Any detail may be important. For example, it's possible this is due to a bug in the C compiler used to compile Python, or due to some flag passed to the compiler. That's why "exact commands" are important. The good news: if Stephen can reproduce it, he'll probably figure out all the rest, and you'll be free to resume enjoying your life ;-) |
Just two things:
This looks like a hardware issue to me. |
Hi Stephen,

```
mkdir -p /etc/portage/env
```

The above is what I had originally. Yesterday I tried even:

```
# cat /etc/portage/env/pydebug.conf
```

but I don't know what the valgrind option really does, and whether that means: ... or some combination of them? :( Misc/README.valgrind does not answer this at all. I let DUMA inspect an "emerge boost" run overnight, but my computer stopped responding (16GB RAM). I tried only gcc-4.7.3 and python-2.7.5-r2.

```
CFLAGS="-ggdb -pipe -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -msse4 -mavx -maes -mpclmul -mpopcnt -march=corei7-avx"
```

Per the comment from Charles-François: so you mean that this single-bit change won't be caught by valgrind, right? Why does memtest86+ not detect that? Could Python, when compiled with --with-pydebug, also print the physical, hardware address of the wrong value? That would be really helpful here! Thanks. |
I understand that you are building Python using "emerge". I would try to get a release from python.org, do a normal build ... make distclean ... then install matplotlib using /tmp/bin/python setup.py ... then run (assuming you are in the build directory):

```
valgrind --suppressions=Misc/valgrind-python.supp /tmp/bin/python your_program
```
|
2013/8/30 Martin Mokrejs <report@bugs.python.org>:
I mean that in the vast majority of cases, a single bit flip is due to a hardware issue. Software-induced memory corruptions generally corrupt at least a byte. Do you reproduce the crash systematically, or is it random? |
Martin, would it be possible to borrow someone else's machine and try to reproduce this? If you can, that would greatly reduce the probability of this being a HW error. It would also leave us with an exact set of commands to share so others can try it on their boxes. It's true that SW memory corruption usually affects several contiguous bytes, but "usually" != "always" ;-) |
I think it would be of tremendous help if Python reported the physical memory address, though I do not know how much work that is for you. As I already said, I wonder why memtest86+ could not find an error; I even tried other tools for that. Could --with-pydebug, immediately after writing those pad bytes, try to read them back, just to make sure they are readable? Theoretically one could catch the wrong bit of memory immediately.

It is random. I think the CPU must be fully loaded for a long while, plus something unknown has to kick in. Maybe that has to heat up the memory modules to cause the failure. My motherboard is already the 3rd, but I doubt it is at fault, except that the SandyBridge chip is bad by design. It was replaced a few days ago along with the CPU cooler (which was the main reason for the technician to come). I take it that replacing the motherboard was a nice offer from Dell. I would instead think that maybe the CPU is heating up too much. The BIOS is weird. ACPI and PCIe ASPM are exploiting Linux kernel hotplug support and PCI developers are puzzled. I experience that SandyBridge resets e.g. its USB2.0 controller if there was a hotplug event on some of its other root ports (an ExpressCard was inserted/ejected). But overall I don't think it is a bad deal, and it works. These exceptional glitches are, I think, a software error or a BIOS error. Otherwise it just wouldn't be running all the days and nights. No, it is not overheated, at least not now with the replaced cooler.

I can put in another memory module (the one with which I bought the laptop from Dell). Other than that, hmm ... I could take out the hard drive and attach it to some other computer, boot from a Live DVD ... the problem is I need a lot of RAM for some of my stuff, and it would have to be a relatively similar CPU type (I compiled binaries with AVX). Yes, the emerge crash definitely was another case; I could hope to get that to happen even on a 1GB RAM machine, sure. I just don't have a testcase to trigger it. 
I tried to stitch something together, but no, I probably coded something not going in the right direction. If you have the time to put together some test for me to execute with the garbage collector calls, that would probably be best. |
OK, it sounds to me like you do not have a reproducible test case, of any kind. If that's true, this bug report isn't going anywhere :-( Python isn't a memory-testing program, so it would be insanely inefficient for it to (for example) read back every byte it writes, just to make sure the memory isn't bad. If you want to alter the source of _your_ Python to do that, fine! It won't go into the official distribution, but it might help you. For the same reason, core Python won't go into the business of sorting out various kinds of memory addresses. User-level code (including Python) works with the virtual (logical) addresses the operating system gives it. For a hint of some of the difficulties in going deeper than that, here's a note from Linus Torvalds:
About "If you would have the time to stitch down some test for me to execute with the garbage collector calls, it would be probably the best." I'm not sure what you're suggesting there. That we write a Python program that tries to trigger errors in your BIOS? LOL - not likely ;-) If you want to force a lot of allocations, deallocations, and gc runs, something like this will do it:

```python
import gc

class C(object):
    def __init__(self):
        self.loop = self  # make each instance part of a reference cycle

def stress():
    N = 500000
    d = dict((C(), C()) for i in xrange(N))
    j = 0
    while 1:
        for i in xrange(N // 2):
            d.popitem()
        for i in xrange(N // 2):
            d[C()] = C()
        j += 1
        if j % 10 == 0:
            print j
            gc.set_debug(gc.DEBUG_STATS)
            gc.collect()
            gc.set_debug(0)

stress()
```

Setting N larger will make it consume more RAM. The output is essentially meaningless, produced just so you know it's still running. |
So with your test program I did not yet hit the error. It used 1.3GB of RAM in the end, maybe in the good region. Meanwhile, since this was python-2.7.5 configured --without-pymalloc, maybe the output from valgrind becomes more useful. |
Hardware failures or glitches can involve any number of parameters. (bonus points if you can try on a setup with ECC RAM) |
I did "emerge =dev-lang/python-2.7.5-r1" *twice* with the environment configuration described in msg196520, then *once* with it disabled, because one of the cases you described was when you tried to revert to a non-debug Python. (Besides, I am willing to risk your crash while I'm watching for it, but not a time bomb that will go off when I'm on deadline :-)

All builds succeeded and all passed the test suite. Here's how the debug build describes itself:

```
== CPython 2.7.5 (default, Sep 1 2013, 00:59:02) [GCC 4.6.4]
```

The test suite ran uneventfully (with a few DeprecationWarnings) except for this:

```
6 skips unexpected on linux2:
```

but I suppose that is expected on Gentoo. If any of those modules (bsddb, tcl, tk) are built into your Python, a problem in one of those might be the culprit.

Oh, damn. I just reread the whole thread. For some reason I thought you were using gcc 4.6.4, but now I see you report 4.7.3. OK, build with 4.7.3 and your flags (also restore the --with-pydebug config):

```
# export CFLAGS=" -ggdb -pipe -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -msse4 -mavx -maes -mpclmul -mpopcnt"
```

and we crash (from make output) immediately after linking ./python:

```
x86_64-pc-linux-gnu-ranlib libpython2.7.a
```

However, I'm pretty sure this is due to my hardware not liking your -m flags, not the crash you reported. I'll try backing those flags out, but if anybody has a suggestion for the most aggressive set similar to yours, I'd appreciate it. But first this process is going to go sleep(25200). |
Someone may find the new stress.valgrind.stderr interesting, but - since I've never used valgrind - it doesn't mean much to me. I _expected_ you'd run the little stress program under a debug Python and without valgrind, since that's the only combination you've tried so far that showed a definite problem ("bad leading pad byte" death, or the segfault in the other issue you filed). But it doesn't much matter - this is all just thrashing at random, yes? You need to find a reproducible test case, and/or try different hardware. The little stress program may or may not provoke an error under a debug-build Python, and may or may not require increasing N (to consume more RAM). If it does provoke an error, the next thing to try would be to write a little program that just writes 0xfb across a massive number of bytes, and then reads them all to verify they're still 0xfb. Or write one like that now, and preferably in C (it may matter how quickly the bytes are written - and it may not matter). But at this point you're starting to write your own memory-testing program. In any case, there's really no evidence of an error in Python so far. Yes, Python has _detected_ a problem in some cases. But without a reproducible test case, I don't see that there's anything more we can do for you on our end - sorry. |
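Tim suggests writing the write-0xfb-then-verify program in C for speed; as an illustration of the idea only, here is the same pass sketched in Python (the function name and buffer size are mine, and a Python version is far slower than the C one Tim has in mind):

```python
def check_pattern(nbytes, pattern=0xFB):
    """Fill a buffer with `pattern`, read every byte back, and return
    the offsets (if any) that no longer hold the expected value."""
    buf = bytearray([pattern]) * nbytes
    return [i for i, b in enumerate(buf) if b != pattern]

# On healthy memory this should report no mismatches.
print(len(check_pattern(64 * 1024 * 1024)))  # 0
```

Running this in a loop for a long time, with a buffer large enough to sweep most of free RAM, is essentially a crude userspace memory test.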
OK, I backed off the aggressive CFLAGS/CXXFLAGS to " -ggdb -pipe", and ran "emerge =dev-lang/python-2.7.5-r1" *once* each with and without the 'EXTRA_ECONF="--with-pydebug"' flag. Compiled with GCC 4.7.3. No crash, same test results as described previously for GCC 4.6.4. If you have other suggestions, let me know. |
Thanks for that, Stephen! I don't know of anything else you could try that would be helpful. The OP doesn't appear able to reproduce his problems either, and last I heard he was off running
Why? Hope springs eternal ;-) |
Yeah, hope is a good thing. But I've spent the last 20 years debugging an X11 application based on a Lisp interpreter; I save hope for fireflies, my dog, and my daughter these days. :-)

To the OP: I don't follow Gentoo closely, but I have acquaintances who do. Between them and the occasional foray into the forums, I've gotten the impression that providing CFLAGS for optimization is associated with having hard-to-debug problems. They increase performance noticeably only in a few applications. Python being a dynamic language, function calls and even variable references can be quite inefficient anyway. So I see no good reason to compile Python with aggressive CFLAGS, because it should be used only for moderately performance-sensitive applications, as "glue code", and to provide UI. Instead, use them only for the specific applications that benefit (I suppose matplotlib *might* be one).

Second, I tend to agree with the maintainers. The packages.env / pydebug.conf approach is the right thing for this kind of variant build.

Third, you said you hoped to get better backtraces from --with-pydebug. That's a vain hope. Such options are intended to get better backtraces of C code from coredumps where the interpreter breaks down, not of Python code induced by Python exceptions caused by problems in user code. If you have trouble interpreting a backtrace, ask on python-list@python.org or comp.lang.python (they mirror each other; you only need one). If, after understanding the backtrace, you have an idea for a way to get a better backtrace in this case, you can suggest it on python-ideas@python.org. Unfortunately, reporting "this backtrace is unintelligible, please improve it" as an RFE on the tracker is likely to get the reply "You're right, but we don't know how at this time. Patches welcome!" But you could try that if all else fails. |
Hi,

Back to Python: the random crashes of my apps stopped after the memory module was replaced (actually the whole pair was replaced). I started to dream about the Linux kernel mirroring memory internally for failure resiliency, but there is nothing like that available. In summary, this lesson was hard and showed that there are no good tools to test hardware. Checksums should always be used, and bits should be tested for fading over time. The mirroring trick could also have uncovered failing memory or a failing CPU. It seems there is still a way to go to a perfect computer. Thanks to everybody for their efforts on this issue. Whether Python takes something from this lesson is up to you. |
Thanks for the feedback! |