classification
Title: json and ElementTree parsers misbehave on streams containing more than a single object
Type: enhancement Stage: resolved
Components: Library (Lib) Versions: Python 3.3
process
Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: Nosy List: Frederick.Ross, eli.bendersky, eric.araujo, ezio.melotti, pitrou, r.david.murray, rhettinger
Priority: normal Keywords:

Created on 2012-05-18 17:29 by Frederick.Ross, last changed 2012-06-08 12:31 by eli.bendersky. This issue is now closed.

Messages (9)
msg161068 - Author: Frederick Ross (Frederick.Ross) Date: 2012-05-18 17:29
When parsing something like '<a>x</a><a>y</a>' with xml.etree.ElementTree, or '{}{}' with json, these parsers throw exceptions instead of reading a single element of the kind they understand off the stream (or throwing an exception if there is no element they understand) and leaving the stream in a sane state.

So I should be able to write

import io
import xml.etree.ElementTree as et

s = io.StringIO("<a>x</a><a>y</a>")  # Python 3: StringIO lives in io
elem1 = et.parse(s)  # desired: consume only "<a>x</a>"
elem2 = et.parse(s)  # desired: continue with "<a>y</a>"

and have elem1 correspond to "<a>x</a>" and elem2 correspond to "<a>y</a>".

At the very least, if the parsers refuse to parse partial streams, they should not destroy the streams.
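
For reference, the reported behavior can be reproduced with a short script. This is a minimal sketch for Python 3; the helper name `try_parse` is made up for illustration and is not part of either library.

```python
import io
import json
import xml.etree.ElementTree as ET


def try_parse(func, arg):
    """Call func(arg); return the exception instance if parsing fails."""
    try:
        return func(arg)
    except Exception as exc:  # broad on purpose: we only want to inspect it
        return exc


# Two well-formed XML documents back to back in one stream.
xml_result = try_parse(ET.parse, io.StringIO("<a>x</a><a>y</a>"))

# Two well-formed JSON objects back to back in one string.
json_result = try_parse(json.loads, "{}{}")

print(type(xml_result).__name__, xml_result)
print(type(json_result).__name__, json_result)
```

Both calls fail rather than returning the first document, which is the behavior this report is about.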
msg161599 - Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2012-05-25 18:52
I am not sure the parsers should be lenient. One could argue that it’s the stream that is broken if it contains non-compliant XML or JSON. Can you tell us more about the use case?
msg161605 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-05-25 19:09
ElementTree supports incremental parsing with the iterparse() function, though I'm not sure it fits your use case:
http://docs.python.org/dev/library/xml.etree.elementtree.html#xml.etree.ElementTree.iterparse

As for the json module, it doesn't have such a facility.
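
For text that is already held in a string, `json.JSONDecoder.raw_decode` may cover part of this: it decodes one value starting at a given index and returns the index where that value ended. The sketch below uses it to walk a string of concatenated values. The function name `iter_json_values` is an assumption for illustration, and the sketch does not address reading incrementally from a socket.

```python
import json


def iter_json_values(text):
    """Yield each JSON value found back to back in text (hypothetical helper)."""
    decoder = json.JSONDecoder()
    pos = 0
    while True:
        # raw_decode does not skip leading whitespace, so do it by hand.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos == len(text):
            return
        # Returns the decoded value and the index just past it.
        value, pos = decoder.raw_decode(text, pos)
        yield value


values = list(iter_json_values('{}{"a": 1} [2, 3]'))
```

Malformed input still raises an error from `raw_decode`, so values yielded before the failure are kept and the caller knows how far decoding got.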
msg161607 - Author: Frederick Ross (Frederick.Ross) Date: 2012-05-25 19:26
Antoine, it's not iterative parsing; it's a sequence of XML docs or JSON objects.

Eric, the server I'm retrieving from, for real-time searches, steadily produces a stream of XML or JSON documents (each properly formed) containing new search results. However, at the moment I have to edit the stream on the fly to wrap an outer tag around it and remove any DTD in inner elements, or I can't use the XML parser. Such a workaround isn't possible with the json parser, since it has no iterative parsing mode.
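
The outer-tag workaround described above can be sketched without editing the stream by hand. This sketch assumes Python 3.4 or later (for `xml.etree.ElementTree.XMLPullParser`, which did not exist when this issue was filed) and assumes the documents carry no XML declaration or DTD. The names `iter_documents` and the `stream` wrapper tag are made up for illustration.

```python
import xml.etree.ElementTree as ET


def iter_documents(chunks):
    """Yield each top-level element from concatenated XML documents.

    chunks is any iterable of str pieces, e.g. data read off a socket.
    """
    parser = ET.XMLPullParser(events=("start", "end"))
    parser.feed("<stream>")  # synthetic root (assumption); never closed
    depth = 0
    for chunk in chunks:
        parser.feed(chunk)
        for event, elem in parser.read_events():
            if event == "start":
                depth += 1
            else:
                depth -= 1
                # depth 1 means a direct child of <stream> was just closed
                if depth == 1:
                    yield elem


docs = [(e.tag, e.text) for e in iter_documents(["<a>x</a><a>", "y</a>"])]
```

Each yielded element stays attached to the synthetic root, so a long-running consumer would probably also want to detach processed elements to keep memory bounded.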
msg161609 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-05-25 19:32
I think it is perfectly reasonable for a parser to leave the file pointer at some undefined position further into the file when it detects "extra stuff" and produces an error message. One can certainly argue that producing that error message is a feature ("detect badly formed documents").

I also think that your use case is a perfectly reasonable one, but I think a mode that supports your use case would be an enhancement.
msg161616 - Author: Frederick Ross (Frederick.Ross) Date: 2012-05-25 20:06
In the case of files, sure, it's fine. The error gives me the offset, and I can go pull it out and buffer it, and it's fine. Plus XML is strict about having only one document per file.

For streams, none of this is applicable. I can't seek in a streaming network connection. If the parser leaves it in an unusable state, then I lose everything that may follow. It makes Python unusable in certain, not very rare, cases of network programming.

I'll just add that Haskell's Parsec does this right, and should be used as an example.
msg161617 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-05-25 20:12
Well, if the stream isn't seekable, then I don't see how it can be left in any state other than the one a file is left in (read ahead by as much as the parser consumed before raising the error). So unfortunately, by our backward compatibility rules, I still think this would be a new feature.
msg161762 - Author: Eli Bendersky (eli.bendersky) * (Python committer) Date: 2012-05-28 09:41
I don't think this is an enhancement to ET, because ET was not designed to be a streaming parser, which is what is required here. ET was designed to read a whole valid XML document. There is 'iterparse', as Antoine mentioned, but it is designed to "track changes to the tree while it is being built" - mostly to save memory.

You have streaming XML parsers in Python - for example xml.sax. You can also relatively easily use xml.sax to find the end of your document and then parse the buffer with ET.
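
One way to read the "find the end of the document, then parse the buffer" suggestion is sketched below. It swaps in `xml.parsers.expat` rather than `xml.sax`, because expat exposes the byte offset at which it rejected the input (`ErrorByteIndex`). It assumes the whole buffer is already in memory as bytes and re-scans from the start of each document. `split_documents` is a made-up name.

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat
from xml.parsers.expat import errors

# Numeric code expat reports when content follows the root element.
JUNK_AFTER_DOC = errors.codes[errors.XML_ERROR_JUNK_AFTER_DOC_ELEMENT]


def split_documents(data):
    """Yield one Element per XML document found back to back in bytes data."""
    while data.strip():
        parser = expat.ParserCreate()
        try:
            parser.Parse(data, True)
        except expat.ExpatError as exc:
            if exc.code != JUNK_AFTER_DOC:
                raise  # a genuine syntax error, not a second document
            # The first document ends where expat saw the "junk".
            end = parser.ErrorByteIndex
        else:
            end = len(data)  # exactly one document was left
        yield ET.fromstring(data[:end])
        data = data[end:]


result = [(e.tag, e.text) for e in split_documents(b"<a>x</a><a>y</a>")]
```

Each document is scanned twice here (once to find its end, once by ElementTree), which seems acceptable for a sketch but would be worth avoiding for large inputs.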

I don't see how a comparison with Parsec (a parser generator/DSL library) makes sense. There are tons of such libraries for Python - just pick one.
msg162060 - Author: Eli Bendersky (eli.bendersky) * (Python committer) Date: 2012-06-01 08:40
I propose to close this issue. If the problem in json is real and someone thinks it has to be fixed, a separate issue specifically for json should be opened.
History
Date User Action Args
2012-06-08 12:31:38 eli.bendersky set status: open -> closed
resolution: wont fix
stage: resolved
2012-06-01 08:40:11 eli.bendersky set messages: + msg162060
2012-05-28 09:41:30 eli.bendersky set messages: + msg161762
2012-05-25 20:12:38 r.david.murray set messages: + msg161617
2012-05-25 20:06:38 Frederick.Ross set messages: + msg161616
2012-05-25 19:32:34 r.david.murray set versions: + Python 3.3, - Python 2.7
nosy: + r.david.murray
messages: + msg161609
type: enhancement
2012-05-25 19:26:38 Frederick.Ross set messages: + msg161607
2012-05-25 19:09:34 pitrou set messages: + msg161605
2012-05-25 18:52:01 eric.araujo set nosy: + pitrou, ezio.melotti, rhettinger, eric.araujo, eli.bendersky
messages: + msg161599
versions: - Python 2.6
2012-05-18 17:29:21 Frederick.Ross create