classification
Title: expat parser throws Memory Error when parsing multiple files
Type: behavior Stage: resolved
Components: XML Versions: Python 3.5, Python 3.4, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: amaury.forgeotdarc, andybalaam, dhgutteridge, ned.deily, ocean-city, python-dev, realpolitik, willgrainger
Priority: normal Keywords: patch

Created on 2009-08-10 16:23 by realpolitik, last changed 2014-03-28 00:52 by ned.deily. This issue is now closed.

Files
File name Uploaded Description Edit
expat-error.py realpolitik, 2009-08-10 16:23
pyexpat.patch ocean-city, 2009-10-09 14:11 review
pyexpat-2.patch amaury.forgeotdarc, 2009-10-09 15:29 review
issue6676_3x.patch ned.deily, 2014-02-27 07:17 3x version review
issue6676_27.patch ned.deily, 2014-02-27 07:18 27 version
Messages (21)
msg91452 - (view) Author: Matthew (realpolitik) Date: 2009-08-10 16:23
I'm using the Expat python interface to parse multiple XML files in an
application and have found that it throws a "Memory Error" exception if
multiple calls are made to xmlparser.ParseFile(file) on the same
xmlparser object. This occurs even with a vanilla xmlparser object
created with xml.parsers.expat.ParserCreate().

Python Version: 2.6.2
Operating System: Ubuntu
msg91455 - (view) Author: Matthew (realpolitik) Date: 2009-08-10 16:55
This also occurs with Python 2.5.1 on OS X
msg93777 - (view) Author: Andy Balaam (andybalaam) Date: 2009-10-09 08:23
I am also seeing this with Python 2.5.2 on Ubuntu.
msg93779 - (view) Author: Andy Balaam (andybalaam) Date: 2009-10-09 08:26
Just in case it wasn't obvious - the workaround is to create a new
parser (with xml.parsers.expat.ParserCreate()) for every XML file you
want to parse.
msg93782 - (view) Author: Hirokazu Yamamoto (ocean-city) * (Python committer) Date: 2009-10-09 09:45
I'm not familiar with expat, but we can see what is happening more
clearly with attached adhok patch.

Traceback (most recent call last):
  File "expat-error.py", line 14, in <module>
    p.ParseFile(file)
xml.parsers.expat.ExpatError: parsing finished: line 2, column 482

It seems ParseFile() doesn't support second call. I'm not sure this is
intended behavior or not.
msg93785 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2009-10-09 12:04
The patch is good; a test would be appreciated.

The difference now is that in case of true low-memory conditions,
ExpatError("no memory") is raised instead of MemoryError.
This is acceptable IMO.

> It seems ParseFile() doesn't support second call
This is correct; the C expat library has a function XML_ParserReset()
which could be called before calling ParseFile() again, but pyexpat does
not expose it yet (see issue1208730).
msg93790 - (view) Author: Hirokazu Yamamoto (ocean-city) * (Python committer) Date: 2009-10-09 13:00
Well, I tried to write test like this.

1. Check if xml.parsers.expat.error is raised.
2. Compare *code* attribute of error object with
xml.parsers.expat.errors.XML_ERROR_FINISHED

But I noticed XML_ERROR_FINISHED is not integer but string. (!)

According to
http://docs.python.org/library/pyexpat.html#expaterror-objects

> ExpatError.code
>
>    Expat’s internal error number for the specific error. This will
>    match one of the constants defined in the errors object from
>    this module.

Is this document bug or implementation bug? Personally, I think string
'parsing finished' as error constant might be useless...
msg93791 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2009-10-09 13:12
Looks like an implementation bug to me; far too late to change it, though.

In your test, you could use
  pyexpat.ErrorString(e.code) == pyexpat.errors.XML_ERROR_FINISHED
And the docs could mention this trick.
msg93794 - (view) Author: Hirokazu Yamamoto (ocean-city) * (Python committer) Date: 2009-10-09 14:11
Here is the patch. I'm not confident with my English comment though.
msg93802 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2009-10-09 15:29
Do you know the new "context manager" feature of assertRaises? it makes
it easier to check for exceptions.
I join a new patch that uses it.
msg93803 - (view) Author: Hirokazu Yamamoto (ocean-city) * (Python committer) Date: 2009-10-09 15:44
I knew existence of that new feature, but didn't know how to use it.
msg93804 - (view) Author: Hirokazu Yamamoto (ocean-city) * (Python committer) Date: 2009-10-09 16:13
Hmm, looks useful. I think your patch is good. Only one problem is that
we cannot use this new feature in python2.6. If we use my patch in that
branch, I think there is no problem.
msg98753 - (view) Author: Will Grainger (willgrainger) Date: 2010-02-02 18:20
I don't think this is a python specific problem. I have just seen 
the same error when working with the expat library from C, and the cause
is using the same parser to read multiple files.
msg142950 - (view) Author: David H. Gutteridge (dhgutteridge) Date: 2011-08-25 00:57
The documentation should definitely be updated to clarify that a parser instance is not reusable with more than one file.  I had a look at the equivalent documentation for Perl and TCL, and Perl's implementation explicitly does not allow attempts to reuse the parser instance (which is clearly noted in the documentation), and TCL's implementation (or one of them, anyway) offers a reset call that explicitly resets the parser in preparation for another file to be submitted.
msg143242 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2011-08-30 23:25
I agree that, at a minimum, the documentation should be updated to include a warning about not reusing a parser instance.  Whether it's worth trying to plug all the holes in the expat library is another issue (see, for instance, issue12829).  David, would you be willing to propose a wording for a documentation change?
msg143243 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2011-08-30 23:34
Also, note issue1208730 proposes a feature to expose a binding for XML_ParserReset and has the start of a patch.
msg143295 - (view) Author: David H. Gutteridge (dhgutteridge) Date: 2011-09-01 04:56
Ned: My proposed wording is: "Note that only one document can be parsed by a given instance; it is not possible to reuse an instance to parse multiple files."  To provide more detail, one could also add something like: "The isfinal argument of the Parse() method is intended to allow the parsing of a single file in fragments, not the submission of multiple files."
msg211673 - (view) Author: David H. Gutteridge (dhgutteridge) Date: 2014-02-19 23:42
Updating to reflect the Python 3.4 documentation is now also relevant to this discussion. Perhaps someone could commit a change something like my suggestion in msg143295?
msg212338 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2014-02-27 07:17
Thanks for the reminder, David.  Here are patches for 3.x and 2.7 that include updated versions of the proposed pyexpat.c and test_pyexpat.py patches along with a doc update along the lines suggested by David.
msg215005 - (view) Author: Roundup Robot (python-dev) Date: 2014-03-27 23:44
New changeset 74faca1ac59c by Ned Deily in branch '2.7':
Issue #6676: Ensure a meaningful exception is raised when attempting
http://hg.python.org/cpython/rev/74faca1ac59c

New changeset 9e3fc66ee0b8 by Ned Deily in branch '3.4':
Issue #6676: Ensure a meaningful exception is raised when attempting
http://hg.python.org/cpython/rev/9e3fc66ee0b8

New changeset ee0034434e65 by Ned Deily in branch 'default':
Issue #6676: merge from 3.4
http://hg.python.org/cpython/rev/ee0034434e65
msg215011 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2014-03-28 00:52
Applied for release in 3.5.0, 3.4.1 and 2.7.7.  Thanks, everyone!
History
Date User Action Args
2014-03-28 00:52:37ned.deilysetstatus: open -> closed
versions: + Python 3.5, - Python 3.3
messages: + msg215011

resolution: fixed
stage: patch review -> resolved
2014-03-27 23:44:36python-devsetnosy: + python-dev
messages: + msg215005
2014-02-27 07:18:14ned.deilysetfiles: + issue6676_27.patch
2014-02-27 07:17:49ned.deilysetfiles: + issue6676_3x.patch

stage: patch review
messages: + msg212338
versions: - Python 3.2
2014-02-19 23:42:36dhgutteridgesetmessages: + msg211673
versions: + Python 3.4
2011-09-01 04:56:04dhgutteridgesetmessages: + msg143295
2011-08-30 23:34:09ned.deilysetmessages: + msg143243
2011-08-30 23:25:50ned.deilysetnosy: + ned.deily

messages: + msg143242
versions: + Python 3.2, Python 3.3, - Python 2.6
2011-08-30 23:20:26ned.deilylinkissue12829 superseder
2011-08-25 00:57:12dhgutteridgesetnosy: + dhgutteridge
messages: + msg142950
2010-02-02 18:20:47willgraingersetnosy: + willgrainger
messages: + msg98753
2009-10-09 16:13:10ocean-citysetmessages: + msg93804
2009-10-09 15:44:53ocean-citysetmessages: + msg93803
2009-10-09 15:29:04amaury.forgeotdarcsetfiles: + pyexpat-2.patch

messages: + msg93802
2009-10-09 14:11:34ocean-citysetfiles: - pyexpat_addhok.patch
2009-10-09 14:11:17ocean-citysetfiles: + pyexpat.patch

messages: + msg93794
2009-10-09 13:12:12amaury.forgeotdarcsetmessages: + msg93791
2009-10-09 13:00:47ocean-citysetmessages: + msg93790
2009-10-09 12:04:56amaury.forgeotdarcsetnosy: + amaury.forgeotdarc
messages: + msg93785
2009-10-09 09:45:54ocean-citysetfiles: + pyexpat_addhok.patch
versions: + Python 2.7
nosy: + ocean-city

messages: + msg93782

keywords: + patch
2009-10-09 08:26:50andybalaamsetmessages: + msg93779
2009-10-09 08:23:45andybalaamsetnosy: + andybalaam
messages: + msg93777
2009-08-10 16:55:44realpolitiksetmessages: + msg91455
2009-08-10 16:23:24realpolitikcreate