classification
Title: improve cElementTree iterparse error handling
Type: behavior
Stage: resolved
Components: XML
Versions: Python 3.2, Python 3.3, Python 2.7
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: flox
Nosy List: amaury.forgeotdarc, effbot, flox, hniksic, python-dev
Priority: normal
Keywords: patch

Created on 2008-05-16 12:20 by hniksic, last changed 2011-11-01 22:35 by python-dev. This issue is now closed.

Files
File name Uploaded Description
patch hniksic, 2008-05-16 12:20
issue2892_etree_iterparse.diff flox, 2011-10-29 00:29
Messages (7)
msg66935 - Author: Hrvoje Nikšić (hniksic) Date: 2008-05-16 12:20
It is unfortunate that any error in the XML chunk read into the buffer
prevents the events generated before the error from being delivered.
For example, valid XML is sometimes embedded in a larger file or stream,
and it is useful to be able to ignore any text that follows the closing
root tag.

The iterparse API and expat itself allow for this, but it doesn't work
in practice: when a parsing exception occurs, iterparse doesn't deliver
the events generated before the exception.  A simple change to iterparse
fixes that.  I would like to share the change with you for possible
inclusion in a future release.  Note that this change shouldn't affect
the semantics of iterparse: the exception is still delivered to the
caller; the only difference is that the events generated by expat
before the exception are not forgotten.

I am attaching a diff between the current implementation of iterparse
and a modified one that fixes this problem.
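
The idea can be illustrated with a minimal sketch. This is not the attached patch: it drives expat directly, and the helper name iter_events_deferring_error and its chunk_size parameter are hypothetical, chosen only for this example. It assumes Python 3. Events are buffered per chunk; if expat raises, the buffered events are yielded first and the error is raised afterwards.

```python
import io
from xml.parsers import expat


def iter_events_deferring_error(source, chunk_size=16384):
    """Hypothetical helper: yield (event, tag) pairs, raising a parse error
    only after the events collected before it have been delivered."""
    pending = []
    parser = expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: pending.append(("start", name))
    parser.EndElementHandler = lambda name: pending.append(("end", name))
    while True:
        data = source.read(chunk_size)
        error = None
        try:
            # An empty read marks the end of input, so pass isfinal=True.
            parser.Parse(data, not data)
        except expat.ExpatError as exc:
            error = exc  # remember the error instead of raising right away
        # Deliver whatever expat reported before the error (if any).
        yield from pending
        del pending[:]
        if error is not None:
            raise error
        if not data:
            break


# Usage: the two events arrive first, then the error reaches the caller.
events = []
try:
    for event in iter_events_deferring_error(io.BytesIO(b"<x></x>rubbish")):
        events.append(event)
except expat.ExpatError as exc:
    failure = exc
else:
    failure = None
```

The caller still sees the exception, so code that treats a parse error as fatal behaves as before; only code that consumes events one at a time gets to act on the valid prefix.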
msg107537 - Author: Hrvoje Nikšić (hniksic) Date: 2010-06-11 08:54
Here is a small test case that demonstrates the problem, expected behavior and actual behavior:

{{{
# Python 2.7
import xml.etree.cElementTree
from StringIO import StringIO

for ev in xml.etree.cElementTree.iterparse(StringIO('<x></x>rubbish'), events=('start', 'end')):
    print ev
}}}

The above code should first print the two events (start and end), and then raise the exception.  In Python 2.7 it runs like this:

{{{
>>> for ev in xml.etree.cElementTree.iterparse(StringIO('<x></x>rubbish'), events=('start', 'end')):
...     print ev
... 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 84, in next
cElementTree.ParseError: junk after document element: line 1, column 7
}}}

Expected behavior, obtained with my patch, is that it runs like this:

{{{
>>> for ev in my_iterparse(StringIO('<x></x>rubbish'), events=('start', 'end')):
...     print ev
... 
('start', <Element 'x' at 0xb771cba8>)
('end', <Element 'x' at 0xb771cba8>)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 26, in __iter__
cElementTree.ParseError: junk after document element: line 1, column 7
}}}

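The difference is, of course, only visible when events are consumed incrementally, as when printing them.  A side-effect-free operation, such as building a list using list(iterparse(...)), would behave exactly the same before and after the change: the exception propagates and the partial list is discarded.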
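
As a rough illustration of that distinction, here is a sketch for Python 3, assuming an interpreter whose iterparse already preserves events before the error. The helper name collect_until_error is hypothetical. The loop keeps what it saw before the exception, while list() leaves nothing behind.

```python
import io
import xml.etree.ElementTree as ET


def collect_until_error(source):
    """Hypothetical helper: return the (event, tag) pairs seen before a
    parse error, together with the error itself (or None)."""
    seen = []
    try:
        for event, elem in ET.iterparse(source, events=("start", "end")):
            seen.append((event, elem.tag))
    except ET.ParseError as exc:
        return seen, exc
    return seen, None


# Incremental consumption: events delivered before the error are kept.
seen, error = collect_until_error(io.BytesIO(b"<x></x>rubbish"))

# Building a list: the exception propagates and no partial list survives.
try:
    as_list = list(ET.iterparse(io.BytesIO(b"<x></x>rubbish"),
                                events=("start", "end")))
except ET.ParseError:
    as_list = None
```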
msg107575 - Author: Fredrik Lundh (effbot) * (Python committer) Date: 2010-06-11 19:29
Note that this was fixed in upstream 1.3 (and verified by the selftests), but the fix and the test were apparently lost when that code was merged into 2.7.  Since 2.7 is supposed to ship with 1.3, this is a regression, not a feature request.

(But 2.7 is in rc, and I'm on vacation, so I guess it's a bit too late to do anything about that.  I'll leave the final decision to flox and the python-dev crowd.)
msg118041 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2010-10-05 22:51
If it's a regression, it should be fixed in some 2.7.x release.
Is there a patch somewhere?
msg146580 - Author: Florent Xicluna (flox) * (Python committer) Date: 2011-10-28 23:12
Unfortunately, I did not find the fix and the test in the upstream repository.

AFAIK, upstream should be there:
https://bitbucket.org/effbot/et-2009-provolone/src
msg146584 - Author: Florent Xicluna (flox) * (Python committer) Date: 2011-10-29 00:29
Proposed patch for 3.3.
msg146815 - Author: Roundup Robot (python-dev) Date: 2011-11-01 22:35
New changeset 23ffaf975267 by Florent Xicluna in branch '3.2':
Closes #2892: preserve iterparse events in case of SyntaxError.
http://hg.python.org/cpython/rev/23ffaf975267

New changeset ca1e2cf2947b by Florent Xicluna in branch 'default':
Merge 3.2: issue #2892
http://hg.python.org/cpython/rev/ca1e2cf2947b

New changeset e1dde980a92c by Florent Xicluna in branch '2.7':
Issue #2892: preserve iterparse events in case of SyntaxError
http://hg.python.org/cpython/rev/e1dde980a92c
History
Date User Action Args
2011-11-01 22:35:24 python-dev set status: open -> closed
    nosy: + python-dev
    messages: + msg146815
    resolution: fixed
    stage: patch review -> resolved
2011-10-29 00:29:27 flox set files: + issue2892_etree_iterparse.diff
    keywords: + patch
    messages: + msg146584
    stage: needs patch -> patch review
2011-10-28 23:12:58 flox set type: enhancement -> behavior
    messages: + msg146580
    components: + XML, - Extension Modules
    versions: + Python 3.3
2010-10-05 22:51:33 amaury.forgeotdarc set nosy: + amaury.forgeotdarc
    messages: + msg118041
    stage: needs patch
2010-06-11 19:29:43 effbot set assignee: effbot -> flox
    messages: + msg107575
    versions: + Python 2.7
2010-06-11 19:22:43 pitrou set nosy: + flox
2010-06-11 08:54:30 hniksic set messages: + msg107537
2010-06-09 22:19:08 terry.reedy set type: behavior -> enhancement
    versions: + Python 3.2, - Python 2.5
2008-05-16 13:07:30 georg.brandl set assignee: effbot
    nosy: + effbot
2008-05-16 12:20:32 hniksic create