Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

expat parser throws Memory Error when parsing multiple files #50925

Closed
realpolitik mannequin opened this issue Aug 10, 2009 · 21 comments
Closed

expat parser throws Memory Error when parsing multiple files #50925

realpolitik mannequin opened this issue Aug 10, 2009 · 21 comments
Labels
topic-XML type-bug An unexpected behavior, bug, or error

Comments

@realpolitik
Copy link
Mannequin

realpolitik mannequin commented Aug 10, 2009

BPO 6676
Nosy @amauryfa, @ned-deily, @andybalaam
Files
  • expat-error.py
  • pyexpat.patch
  • pyexpat-2.patch
  • issue6676_3x.patch: 3x version
  • issue6676_27.patch: 27 version
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2014-03-28.00:52:37.968>
    created_at = <Date 2009-08-10.16:23:24.795>
    labels = ['expert-XML', 'type-bug']
    title = 'expat parser throws Memory Error when parsing multiple files'
    updated_at = <Date 2014-03-28.00:52:37.965>
    user = 'https://bugs.python.org/realpolitik'

    bugs.python.org fields:

    activity = <Date 2014-03-28.00:52:37.965>
    actor = 'ned.deily'
    assignee = 'none'
    closed = True
    closed_date = <Date 2014-03-28.00:52:37.968>
    closer = 'ned.deily'
    components = ['XML']
    creation = <Date 2009-08-10.16:23:24.795>
    creator = 'realpolitik'
    dependencies = []
    files = ['14684', '15090', '15094', '34240', '34241']
    hgrepos = []
    issue_num = 6676
    keywords = ['patch']
    message_count = 21.0
    messages = ['91452', '91455', '93777', '93779', '93782', '93785', '93790', '93791', '93794', '93802', '93803', '93804', '98753', '142950', '143242', '143243', '143295', '211673', '212338', '215005', '215011']
    nosy_count = 8.0
    nosy_names = ['amaury.forgeotdarc', 'ocean-city', 'ned.deily', 'andybalaam', 'realpolitik', 'willgrainger', 'python-dev', 'dhgutteridge']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue6676'
    versions = ['Python 2.7', 'Python 3.4', 'Python 3.5']

    @realpolitik
    Copy link
    Mannequin Author

    realpolitik mannequin commented Aug 10, 2009

    I'm using the Expat python interface to parse multiple XML files in an
    application and have found that it throws a "Memory Error" exception if
    multiple calls are made to xmlparser.ParseFile(file) on the same
    xmlparser object. This occurs even with a vanilla xmlparser object
    created with xml.parsers.expat.ParserCreate().

    Python Version: 2.6.2
    Operating System: Ubuntu

    @realpolitik realpolitik mannequin added topic-XML type-bug An unexpected behavior, bug, or error labels Aug 10, 2009
    @realpolitik
    Copy link
    Mannequin Author

    realpolitik mannequin commented Aug 10, 2009

    This also occurs with Python 2.5.1 on OS X

    @andybalaam
    Copy link
    Mannequin

    andybalaam mannequin commented Oct 9, 2009

    I am also seeing this with Python 2.5.2 on Ubuntu.

    @andybalaam
    Copy link
    Mannequin

    andybalaam mannequin commented Oct 9, 2009

    Just in case it wasn't obvious - the workaround is to create a new
    parser (with xml.parsers.expat.ParserCreate()) for every XML file you
    want to parse.

    @ocean-city
    Copy link
    Mannequin

    ocean-city mannequin commented Oct 9, 2009

    I'm not familiar with expat, but we can see what is happening more
    clearly with attached adhok patch.

    Traceback (most recent call last):
      File "expat-error.py", line 14, in <module>
        p.ParseFile(file)
    xml.parsers.expat.ExpatError: parsing finished: line 2, column 482

    It seems ParseFile() doesn't support second call. I'm not sure this is
    intended behavior or not.

    @amauryfa
    Copy link
    Member

    amauryfa commented Oct 9, 2009

    The patch is good; a test would be appreciated.

    The difference now is that in case of true low-memory conditions,
    ExpatError("no memory") is raised instead of MemoryError.
    This is acceptable IMO.

    It seems ParseFile() doesn't support second call
    This is correct; the C expat library has a function XML_ParserReset()
    which could be called before calling ParseFile() again, but pyexpat does
    not expose it yet (see bpo-1208730).

    @ocean-city
    Copy link
    Mannequin

    ocean-city mannequin commented Oct 9, 2009

    Well, I tried to write test like this.

    1. Check if xml.parsers.expat.error is raised.
    2. Compare *code* attribute of error object with
      xml.parsers.expat.errors.XML_ERROR_FINISHED

    But I noticed XML_ERROR_FINISHED is not integer but string. (!)

    According to
    http://docs.python.org/library/pyexpat.html#expaterror-objects

    ExpatError.code

    Expat’s internal error number for the specific error. This will
    match one of the constants defined in the errors object from
    this module.

    Is this document bug or implementation bug? Personally, I think string
    'parsing finished' as error constant might be useless...

    @amauryfa
    Copy link
    Member

    amauryfa commented Oct 9, 2009

    Looks like an implementation bug to me; far too late to change it, though.

    In your test, you could use
    pyexpat.ErrorString(e.code) == pyexpat.errors.XML_ERROR_FINISHED
    And the docs could mention this trick.

    @ocean-city
    Copy link
    Mannequin

    ocean-city mannequin commented Oct 9, 2009

    Here is the patch. I'm not confident with my English comment though.

    @amauryfa
    Copy link
    Member

    amauryfa commented Oct 9, 2009

    Do you know the new "context manager" feature of assertRaises? it makes
    it easier to check for exceptions.
    I join a new patch that uses it.

    @ocean-city
    Copy link
    Mannequin

    ocean-city mannequin commented Oct 9, 2009

    I knew existence of that new feature, but didn't know how to use it.

    @ocean-city
    Copy link
    Mannequin

    ocean-city mannequin commented Oct 9, 2009

    Hmm, looks useful. I think your patch is good. Only one problem is that
    we cannot use this new feature in python2.6. If we use my patch in that
    branch, I think there is no problem.

    @willgrainger
    Copy link
    Mannequin

    willgrainger mannequin commented Feb 2, 2010

    I don't think this is a python specific problem. I have just seen
    the same error when working with the expat library from C, and the cause
    is using the same parser to read multiple files.

    @dhgutteridge
    Copy link
    Mannequin

    dhgutteridge mannequin commented Aug 25, 2011

    The documentation should definitely be updated to clarify that a parser instance is not reusable with more than one file. I had a look at the equivalent documentation for Perl and TCL, and Perl's implementation explicitly does not allow attempts to reuse the parser instance (which is clearly noted in the documentation), and TCL's implementation (or one of them, anyway) offers a reset call that explicitly resets the parser in preparation for another file to be submitted.

    @ned-deily
    Copy link
    Member

    I agree that, at a minimum, the documentation should be updated to include a warning about not reusing a parser instance. Whether it's worth trying to plug all the holes in the expat library is another issue (see, for instance, bpo-12829). David, would you be willing to propose a wording for a documentation change?

    @ned-deily
    Copy link
    Member

    Also, note bpo-1208730 proposes a feature to expose a binding for XML_ParserReset and has the start of a patch.

    @dhgutteridge
    Copy link
    Mannequin

    dhgutteridge mannequin commented Sep 1, 2011

    Ned: My proposed wording is: "Note that only one document can be parsed by a given instance; it is not possible to reuse an instance to parse multiple files." To provide more detail, one could also add something like: "The isfinal argument of the Parse() method is intended to allow the parsing of a single file in fragments, not the submission of multiple files."

    @dhgutteridge
    Copy link
    Mannequin

    dhgutteridge mannequin commented Feb 19, 2014

    Updating to reflect the Python 3.4 documentation is now also relevant to this discussion. Perhaps someone could commit a change something like my suggestion in msg143295?

    @ned-deily
    Copy link
    Member

    Thanks for the reminder, David. Here are patches for 3.x and 2.7 that include updated versions of the proposed pyexpat.c and test_pyexpat.py patches along with a doc update along the lines suggested by David.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Mar 27, 2014

    New changeset 74faca1ac59c by Ned Deily in branch '2.7':
    Issue bpo-6676: Ensure a meaningful exception is raised when attempting
    http://hg.python.org/cpython/rev/74faca1ac59c

    New changeset 9e3fc66ee0b8 by Ned Deily in branch '3.4':
    Issue bpo-6676: Ensure a meaningful exception is raised when attempting
    http://hg.python.org/cpython/rev/9e3fc66ee0b8

    New changeset ee0034434e65 by Ned Deily in branch 'default':
    Issue bpo-6676: merge from 3.4
    http://hg.python.org/cpython/rev/ee0034434e65

    @ned-deily
    Copy link
    Member

    Applied for release in 3.5.0, 3.4.1 and 2.7.7. Thanks, everyone!

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    topic-XML type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants