classification
Title: xml.sax segfault on error
Type:
Stage:
Components: XML
Versions: Python 2.3
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: nnorwitz
Nosy List: adamsampson, facundobatista, jojoworks, moraes, nnorwitz
Priority: high
Keywords:

Created on 2004-03-11 14:14 by adamsampson, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name: testexpat.py
Uploaded: adamsampson, 2004-03-11 14:14
Description: Standalone code that demonstrates the problem
Messages (8)
msg20198 - Author: Adam Sampson (adamsampson) Date: 2004-03-11 14:14
While (mistakenly) using Mark Pilgrim's feedparser
module to parse data from
<http://www.gothamist.com/archives/news_nyc/index.php>,
Python segfaults when it should invoke an error handler
for invalid XML. The attached code demonstrates the
problem; it occurs with Python 2.2.3 and 2.3.3 on my
system. I've tried to chop the example data down as far
as possible, but reducing it any further doesn't
exhibit the problem (it's currently just above 64k,
which might be a coincidence).

The gdb traceback I get from the example is as follows:

#0  normal_updatePosition (enc=0x404a4fc0, ptr=0x40682000 <Address 0x40682000 out of bounds>, end=0x81e87e0 "a></div>\n\n<div id=\"content\">\n\n<div class=\"blog\">\n<!--\n<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\"\n        xmlns:trackback=\"http://madskills.com/public/xml/rss/module/trackback/\"\n"..., pos=0x81e7dac)
    at /120g/gar/python/python23/work/Python-2.Modules/expat/xmltok_impl.c:1745
#1  0x40484288 in XML_GetCurrentLineNumber (parser=0x81e7c18)
    at /120g/gar/python/python23/work/Python-2.Modules/expat/xmlparse.c:1605
#2  0x40481fc5 in set_error (self=0x0, code=XML_ERROR_TAG_MISMATCH)
    at /120g/gar/python/python23/work/Python-2.Modules/pyexpat.c:124
#3  0x40480ae7 in xmlparse_Parse (self=0x402fddac, args=0x0)
    at /120g/gar/python/python23/work/Python-2.Modules/pyexpat.c:888
#4  0x080fc25a in PyCFunction_Call (func=0x402faa0c, arg=0x402f338c, kw=0xfffffffb)
    at Objects/methodobject.c:108
#5  0x080aa674 in call_function (pp_stack=0xbffff03c, oparg=0)
    at Python/ceval.c:3439
#6  0x080a8a2e in eval_frame (f=0x816e45c)
    at Python/ceval.c:2116
#7  0x080a95bc in PyEval_EvalCodeEx (co=0x40303de0, globals=0xfffffffb, locals=0x0, args=0x816e5a8, argcount=2, kws=0x816a9fc, kwcount=0, defs=0x40321678, defcount=1, closure=0x0)
    at Python/ceval.c:2663
#8  0x080aa729 in fast_function (func=0xfffffffb, pp_stack=0xbffff1bc, n=2, na=0, nk=135703028)
    at Python/ceval.c:3529
#9  0x080aa56c in call_function (pp_stack=0xbffff1bc, oparg=0)
    at Python/ceval.c:3458
#10 0x080a8a2e in eval_frame (f=0x816a894)
    at Python/ceval.c:2116
#11 0x080a95bc in PyEval_EvalCodeEx (co=0x402fd2a0, globals=0xfffffffb, locals=0x0, args=0x402f3318, argcount=2, kws=0x0, kwcount=0, defs=0x0, defcount=0, closure=0x0)
    at Python/ceval.c:2663
#12 0x080fbda7 in function_call (func=0x4030617c, arg=0x402f330c, kw=0x0)
    at Objects/funcobject.c:504
#13 0x0805b899 in PyObject_Call (func=0x40682000, arg=0x0, kw=0x0)
    at Objects/abstract.c:1755
#14 0x08062288 in instancemethod_call (func=0x4030617c, arg=0x402f330c, kw=0x0)
    at Objects/classobject.c:2433
#15 0x0805b899 in PyObject_Call (func=0x40682000, arg=0x0, kw=0x0)
    at Objects/abstract.c:1755
#16 0x080aa892 in do_call (func=0x4032025c, pp_stack=0x402f330c, na=0, nk=0)
    at Python/ceval.c:3644
#17 0x080aa4f9 in call_function (pp_stack=0xbffff5fc, oparg=0)
    at Python/ceval.c:3460
#18 0x080a8a2e in eval_frame (f=0x818b414)
    at Python/ceval.c:2116
#19 0x080aa7ad in fast_function (func=0xfffffffb, pp_stack=0xbffff71c, n=2, na=0, nk=1076865996)
    at Python/ceval.c:3518
#20 0x080aa56c in call_function (pp_stack=0xbffff71c, oparg=0)
    at Python/ceval.c:3458
#21 0x080a8a2e in eval_frame (f=0x8183814)
    at Python/ceval.c:2116
#22 0x080a95bc in PyEval_EvalCodeEx (co=0x402ed2a0, globals=0xfffffffb, locals=0x0, args=0x0, argcount=0, kws=0x0, kwcount=0, defs=0x0, defcount=0, closure=0x0)
    at Python/ceval.c:2663
#23 0x080abdb9 in PyEval_EvalCode (co=0x0, globals=0x0, locals=0x0)
    at Python/ceval.c:537
#24 0x080d7d2b in run_node (n=0x402bb79c, filename=0x0, globals=0x0, locals=0x0, flags=0x0)
    at Python/pythonrun.c:1265
#25 0x080d74df in PyRun_SimpleFileExFlags (fp=0x8139050, filename=0xbffffa4d "testexpat.py", closeit=-1073743283, flags=0xbffff878)
    at Python/pythonrun.c:862
#26 0x08054dd5 in Py_Main (argc=1, argv=0xbffff8f4)
    at Modules/main.c:415
#27 0x0805492b in main (argc=0, argv=0x0)
    at Modules/python.c:23
msg20199 - Author: Mark Moraes (moraes) Date: 2004-03-15 10:22
I ran into this as well, and it turns out that 64k is
relevant. I have a simpler script that reproduces the
problem: create an unterminated character reference such as
"&#171" without the trailing semicolon and add roughly 64k
of data after it. The crash occurs if the SAX parser has an
ErrorHandler set whose fatalError() method returns normally
instead of terminating or raising the exception.

As a defensive measure, I suggest that any call to the
fatalError method be followed by a raise of the exception if
fatalError returns.
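
The suggestion above concerns the parser itself, but the same idea can be approximated from user code. The following is a minimal sketch, written for Python 3 rather than the 2.x releases discussed in this report. The names LoggingErrorHandler, ReraisingErrorHandler and parse_strict are hypothetical and not part of xml.sax. The wrapper calls an existing handler and then raises the exception itself, so the parser is never fed more data after a fatal error even if the wrapped fatalError() returns normally.

```python
import xml.sax
from xml.sax.handler import ContentHandler, ErrorHandler


class LoggingErrorHandler(ErrorHandler):
    """Lenient handler like the one in this report: fatalError() returns."""

    def __init__(self):
        self.messages = []

    def error(self, exc):
        self.messages.append(("error", exc.getMessage()))

    def warning(self, exc):
        self.messages.append(("warning", exc.getMessage()))

    def fatalError(self, exc):
        # Records the problem and returns normally instead of raising.
        self.messages.append(("fatalError", exc.getMessage()))


class ReraisingErrorHandler(ErrorHandler):
    """Hypothetical wrapper: delegate, then always raise on fatal errors."""

    def __init__(self, inner):
        self.inner = inner

    def error(self, exc):
        self.inner.error(exc)

    def warning(self, exc):
        self.inner.warning(exc)

    def fatalError(self, exc):
        self.inner.fatalError(exc)
        # The defensive step: stop parsing even if the inner handler returned.
        raise exc


def parse_strict(data, inner):
    """Parse bytes with the inner handler wrapped so fatal errors propagate."""
    xml.sax.parseString(data, ContentHandler(), ReraisingErrorHandler(inner))
```

Calling parse_strict(b"<a><b></a>", LoggingErrorHandler()) should log the mismatch through the inner handler and then raise xml.sax.SAXParseException to the caller. The default xml.sax.handler.ErrorHandler already raises in fatalError(), so the wrapper only matters for custom handlers that swallow the error.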
msg20200 - Author: Mark Moraes (moraes) Date: 2004-03-15 10:26

#! /usr/bin/env python

dhead = """<?xml version="1.0" encoding="ISO-8859-1" ?>
<item><title>&#187</title></item>
<item><title>
"""
dtail = """</title></item>
"""

import xml.sax
from cStringIO import StringIO as _StringIO

class _StrictFeedParser:
    def _err(self, errtype, exc):
        print errtype, exc.getMessage(), \
            'line', exc.getLineNumber(), \
            'column', exc.getColumnNumber()
    def fatalError(self, exc):
        self._err('fatalError', exc)
        # raise exc # avoids the problem
    def error(self, exc):
        self._err('error', exc)
    def warning(self, exc):
        self._err('warning', exc)

def parse(data):
    feedparser = _StrictFeedParser()
    saxparser = xml.sax.make_parser(["drv_libxml2"])
    saxparser.setErrorHandler(feedparser)
    source = xml.sax.xmlreader.InputSource()
    source.setByteStream(_StringIO(data))
    saxparser.parse(source)

if __name__ == '__main__':
    for i in xrange(65427,66000,1):
        print i
        parse(dhead + 'x'*i + dtail)
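
For readers who want to run the script above on a current interpreter, here is a rough Python 3 adaptation. It is a sketch under assumptions, not the original attachment: it uses the default parser from xml.sax.make_parser() instead of requesting "drv_libxml2", collects the errors in a list instead of printing them, and the names RecordingErrorHandler, HEAD and TAIL are made up for this example. On interpreters where this bug is fixed, it is expected to record the errors and return rather than crash.

```python
import io
import xml.sax
from xml.sax.handler import ErrorHandler
from xml.sax.xmlreader import InputSource

# Same document shape as the script above: an unterminated character
# reference, followed by a second element that carries the padding.
HEAD = (b'<?xml version="1.0" encoding="ISO-8859-1" ?>\n'
        b'<item><title>&#187</title></item>\n'
        b'<item><title>\n')
TAIL = b'</title></item>\n'


class RecordingErrorHandler(ErrorHandler):
    """Error handler whose fatalError() returns normally, as in the report."""

    def __init__(self):
        self.events = []

    def _record(self, kind, exc):
        self.events.append((kind, exc.getMessage(),
                            exc.getLineNumber(), exc.getColumnNumber()))

    def fatalError(self, exc):
        self._record("fatalError", exc)  # deliberately does not raise

    def error(self, exc):
        self._record("error", exc)

    def warning(self, exc):
        self._record("warning", exc)


def parse(data):
    """Parse bytes with the lenient handler and return the recorded events."""
    handler = RecordingErrorHandler()
    parser = xml.sax.make_parser()
    parser.setErrorHandler(handler)
    source = InputSource()
    source.setByteStream(io.BytesIO(data))
    parser.parse(source)
    return handler.events


if __name__ == "__main__":
    # Padding sizes taken from the loop in the original script.
    for size in range(65427, 66000):
        events = parse(HEAD + b"x" * size + TAIL)
        print(size, len(events))
```

Because the handler never raises, the parser keeps being fed after the first fatal error, which is the code path that this report is about.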
msg20201 - Author: Deleted User jojoworks (jojoworks) Date: 2004-03-19 22:45

Result of a brief scan:

When the exceptional situation happens, the pyexpat.c module
trashes parser->m_positionPtr (aliased as positionPtr) (see
Modules/expat/xmlparse.c, function
XML_GetCurrentLineNumber() and similar). When the
error handler forgets to raise an exception or exit, the
module tries to access memory through the garbage pointer
and segfaults.

It seems to be a buffer overrun bug: the pointer gets trashed
when some sort of input (the erroneous entity) is large
enough to reach it (or the variable from which the pointer
is fetched).

I cannot investigate further because gdb on Mandrake
GNU/Linux refuses to trace a dlopen()ed shared object, and I
don't understand its code well enough to "debug" it by hand.
msg20202 - Author: Deleted User jojoworks (jojoworks) Date: 2004-04-03 19:28

Now that I have worked out how to put a breakpoint into a
shared library and get GDB to stop on it, I have investigated
this bug further and found the following:

The bug is in XML_GetBuffer(), located at xmlparse.c:1498.
When this call finds that the buffer is too small, it
allocates a larger one and copies the data over. The problem
is that m_eventPtr is not adjusted to point into the new
buffer during this operation, so it still points into the
old (and now invalid) buffer.

In the case described here, the invalid pointer m_eventPtr
(invalidated when XML_GetBuffer() moved the buffer) is
passed (xmlparse.c:1606) to XmlUpdatePosition(), which
assumes that it is valid. XmlUpdatePosition() reads memory
through the pointer, hits a "memory hole" (because the
memory it points to has been freed) and segfaults.
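
The failure mode described above can be illustrated with a toy model. This is a Python sketch under stated assumptions, not expat's code: GrowableBuffer, event_pos and the rebase flag are invented for the illustration, and positions are stored as indexes rather than C pointers. In C a stale pointer refers to freed memory and can crash; in this model the stale index merely lands on the wrong byte.

```python
class GrowableBuffer:
    """Toy parse buffer that drops consumed data when it has to grow."""

    def __init__(self, capacity):
        self.data = bytearray(capacity)
        self.start = 0         # index of the first unconsumed byte
        self.end = 0           # index one past the last valid byte
        self.event_pos = None  # index of the byte where the current event began

    def ensure_room(self, extra, rebase=True):
        """Make room for `extra` bytes, moving live data to a new buffer."""
        if self.end + extra <= len(self.data):
            return
        live = self.data[self.start:self.end]
        new_data = bytearray(max(2 * len(self.data), len(live) + extra))
        new_data[:len(live)] = live
        if rebase and self.event_pos is not None:
            # The step the message says was missing: translate the saved
            # position so that it refers to the same byte in the new buffer.
            self.event_pos -= self.start
        self.data = new_data
        self.end = len(live)
        self.start = 0

    def append(self, chunk, rebase=True):
        self.ensure_room(len(chunk), rebase)
        self.data[self.end:self.end + len(chunk)] = chunk
        self.end += len(chunk)


def event_byte_after_growth(rebase):
    """Return the byte that event_pos refers to after the buffer has moved."""
    buf = GrowableBuffer(8)
    buf.append(b"abcdEFGH")
    buf.start = 4       # pretend "abcd" has already been consumed
    buf.event_pos = 5   # the current event starts at "F"
    buf.append(b"ijkl", rebase=rebase)  # does not fit, so the buffer moves
    return chr(buf.data[buf.event_pos])
```

With rebase=True the saved position still refers to "F" after the move; with rebase=False it refers to an unrelated byte, which is the analogue of reading through the stale m_eventPtr.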
msg20203 - Author: Neal Norwitz (nnorwitz) * (Python committer) Date: 2005-10-03 01:26

This doesn't crash for me on 2.5 (CVS) (2.4 seems about the
same), but valgrind reports use of uninitialized memory.
(It does crash when running under valgrind.) Either way,
there is still a problem.

Do you still see the crash?

==26881== Use of uninitialised value of size 8
==26881==    at 0x12BA4518: normal_updatePosition (xmltok_impl.c:1745)
==26881==    by 0x12B91005: XML_GetCurrentLineNumber (xmlparse.c:1782)
==26881==    by 0x12B884EA: set_error (pyexpat.c:125)
==26881==    by 0x12B8BCCD: get_parse_result (pyexpat.c:901)
==26881==    by 0x12B8BD6A: xmlparse_Parse (pyexpat.c:923)
==26881==    by 0x4E00B0: PyCFunction_Call (methodobject.c:73)
==26881==    by 0x48B2F3: call_function (ceval.c:3580)
==26881==    by 0x486F4C: PyEval_EvalFrameEx (ceval.c:2181)
==26881==    by 0x48932B: PyEval_EvalCodeEx (ceval.c:2754)
==26881==    by 0x48B76E: fast_function (ceval.c:3673)
==26881==    by 0x48B436: call_function (ceval.c:3601)
==26881==    by 0x486F4C: PyEval_EvalFrameEx (ceval.c:2181)
==26881==    by 0x48932B: PyEval_EvalCodeEx (ceval.c:2754)
==26881==    by 0x4DF7FA: function_call (funcobject.c:548)
==26881==    by 0x418DE2: PyObject_Call (abstract.c:1777)
==26881==
==26881== Invalid read of size 1
==26881==    at 0x12BA4514: normal_updatePosition (xmltok_impl.c:1745)
==26881==    by 0x12B91005: XML_GetCurrentLineNumber (xmlparse.c:1782)
==26881==    by 0x12B884EA: set_error (pyexpat.c:125)
==26881==    by 0x12B8BCCD: get_parse_result (pyexpat.c:901)
==26881==    by 0x12B8BD6A: xmlparse_Parse (pyexpat.c:923)
==26881==    by 0x4E00B0: PyCFunction_Call (methodobject.c:73)
==26881==    by 0x48B2F3: call_function (ceval.c:3580)
==26881==    by 0x486F4C: PyEval_EvalFrameEx (ceval.c:2181)
==26881==    by 0x48932B: PyEval_EvalCodeEx (ceval.c:2754)
==26881==    by 0x48B76E: fast_function (ceval.c:3673)
==26881==    by 0x48B436: call_function (ceval.c:3601)
==26881==    by 0x486F4C: PyEval_EvalFrameEx (ceval.c:2181)
==26881==    by 0x48932B: PyEval_EvalCodeEx (ceval.c:2754)
==26881==    by 0x4DF7FA: function_call (funcobject.c:548)
==26881==    by 0x418DE2: PyObject_Call (abstract.c:1777)
==26881==  Address 0x12B720C0 is 0 bytes after a block of size 131072 alloc'd
==26881==    at 0x11B19F13: malloc (vg_replace_malloc.c:149)
==26881==    by 0x12B90AB5: XML_GetBuffer (xmlparse.c:1634)
==26881==    by 0x12B906B0: XML_Parse (xmlparse.c:1528)
==26881==    by 0x12B8BD5F: xmlparse_Parse (pyexpat.c:923)
==26881==    by 0x4E00B0: PyCFunction_Call (methodobject.c:73)
==26881==    by 0x48B2F3: call_function (ceval.c:3580)
==26881==    by 0x486F4C: PyEval_EvalFrameEx (ceval.c:2181)
==26881==    by 0x48932B: PyEval_EvalCodeEx (ceval.c:2754)
==26881==    by 0x48B76E: fast_function (ceval.c:3673)
==26881==    by 0x48B436: call_function (ceval.c:3601)
==26881==    by 0x486F4C: PyEval_EvalFrameEx (ceval.c:2181)
==26881==    by 0x48932B: PyEval_EvalCodeEx (ceval.c:2754)
==26881==    by 0x4DF7FA: function_call (funcobject.c:548)
==26881==    by 0x418DE2: PyObject_Call (abstract.c:1777)
==26881==    by 0x421034: instancemethod_call (classobject.c:2447)
msg58281 - Author: Facundo Batista (facundobatista) * (Python committer) Date: 2007-12-07 20:07
There's no crash in 2.5.1 or in the trunk.

As 2.3 or 2.4 won't be fixed, do you think that this bug can be closed?

Thanks!
msg58305 - Author: Facundo Batista (facundobatista) * (Python committer) Date: 2007-12-08 23:06
2007/12/8, Adam Sampson <ats@offog.org>:

> On Fri, Dec 07, 2007 at 08:07:38PM -0000, Facundo Batista wrote:
> > There's no crash in 2.5.1 or in the trunk.
> > As 2.3 or 2.4 won't be fixed, do you think that this bug can be closed?
> 
> Yep, go for it. :)
> 
> Thanks very much,
> 
> --
> Adam Sampson <ats@offog.org>                         <http://offog.org/>
History
Date, User, Action, Args
2022-04-11 14:56:03, admin, set, github: 40024
2007-12-08 23:06:40, facundobatista, set, status: open -> closed; resolution: fixed; messages: + msg58305
2007-12-07 20:07:38, facundobatista, set, nosy: + facundobatista; messages: + msg58281
2004-03-11 14:14:12, adamsampson, create