Issue 14852: json and ElementTree parsers misbehave on streams containing more than a single object

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/59057

classification

Title:	json and ElementTree parsers misbehave on streams containing more than a single object
Type:	enhancement	Stage:	resolved
Components:	Library (Lib)	Versions:	Python 3.3

process

Status:	closed	Resolution:	wont fix
Dependencies:		Superseder:
Assigned To:		Nosy List:	Frederick.Ross, eli.bendersky, eric.araujo, ezio.melotti, pitrou, r.david.murray, rhettinger
Priority:	normal	Keywords:

Created on 2012-05-18 17:29 by Frederick.Ross, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (9)
msg161068	Author: Frederick Ross (Frederick.Ross)	Date: 2012-05-18 17:29
When parsing something like '<a>x</a><a>y</a>' with xml.etree.ElementTree, or '{}{}' with json, these parsers throw exceptions instead of reading a single element of the kind they understand off the stream (or throwing an exception if there is no element they understand) and leaving the stream in a sane state.

So I should be able to write

    import xml.etree.ElementTree as et
    import StringIO
    s = StringIO.StringIO("<a>x</a><a>y</a>")
    elem1 = et.parse(s)
    elem2 = et.parse(s)

and have elem1 correspond to "<a>x</a>" and elem2 correspond to "<a>y</a>".

At the very least, if the parsers refuse to parse partial streams, they should not destroy the streams.
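The behavior reported above can be reproduced on Python 3 with a short sketch like the following. The helper names try_parse_xml and try_parse_json are illustrative only and are not part of either library; the original report used the Python 2 StringIO module, for which io.StringIO is substituted here.

```python
import io
import json
import xml.etree.ElementTree as ET


def try_parse_xml(text):
    """Return the error message from ET.parse, or None if parsing succeeds."""
    try:
        ET.parse(io.StringIO(text))
    except ET.ParseError as exc:
        return str(exc)
    return None


def try_parse_json(text):
    """Return the error message from json.loads, or None if parsing succeeds."""
    try:
        json.loads(text)
    except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
        return str(exc)
    return None


# Two well-formed values back to back are rejected by both parsers.
xml_error = try_parse_xml("<a>x</a><a>y</a>")
json_error = try_parse_json("{}{}")
print(xml_error)
print(json_error)
```

Both calls fail on the second value rather than returning the first one, which is the complaint made in this message.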
msg161599	Author: Éric Araujo (eric.araujo) *	Date: 2012-05-25 18:52
I am not sure the parsers should be lenient. One could argue that it’s the stream that is broken if it contains non-compliant XML or JSON. Can you tell us more about the use case?
msg161605	Author: Antoine Pitrou (pitrou) *	Date: 2012-05-25 19:09
ElementTree supports incremental parsing with the iterparse() function, but I'm not sure it fits your use case:
http://docs.python.org/dev/library/xml.etree.elementtree.html#xml.etree.ElementTree.iterparse
As for the json module, it doesn't have such a facility.
msg161607	Author: Frederick Ross (Frederick.Ross)	Date: 2012-05-25 19:26
Antoine, it's not iterative parsing, it's a sequence of XML docs or json objects.

Éric, the server I'm retrieving from, for real time searches, steadily produces a stream of (each properly formed) XML or json documents containing new search results. However, at the moment I have to edit the stream on the fly to wrap an outer tag around it and remove any DTD in inner elements, or I can't use the XML parser. Such a workaround isn't possible with the json parser, since it has no iterative parsing mode.
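The outer-tag workaround described in this message might look roughly like the sketch below. It uses xml.etree.ElementTree.XMLPullParser, which was added in Python 3.4, after this report was filed. The function name iter_documents and the wrapper tag name are hypothetical, and the sketch assumes the inner documents carry no XML declaration or DTD (those would still have to be stripped from the stream first, as the message says).

```python
import xml.etree.ElementTree as ET


def iter_documents(chunks, wrapper="stream"):
    """Yield each top-level element from an iterable of text chunks.

    A synthetic root tag is fed to the parser first, so the concatenated
    documents look like children of a single, never-closed root element.
    """
    parser = ET.XMLPullParser(events=("start", "end"))
    parser.feed("<%s>" % wrapper)  # hypothetical outer tag, never closed
    depth = 0
    for chunk in chunks:
        parser.feed(chunk)
        for event, elem in parser.read_events():
            if event == "start":
                depth += 1
            else:
                depth -= 1
                # Depth 1 is the synthetic root, so an element that closes
                # back to depth 1 is one complete inner document.
                if depth == 1:
                    yield elem


# Chunk boundaries need not line up with document boundaries.
for doc in iter_documents(["<a>x</a><a", ">y</a>"]):
    print(doc.tag, doc.text)
```

Because chunks are fed as they arrive, nothing needs to be seekable. Yielded elements remain attached to the synthetic root, so a long-running consumer would probably want to detach them once they have been handled.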
msg161609	Author: R. David Murray (r.david.murray) *	Date: 2012-05-25 19:32
I think it is perfectly reasonable for a parser to leave the file pointer in some undefined further location into the file when it detects "extra stuff" and produces an error message. One can certainly argue that producing that error message is a feature ("detect badly formed documents"). I also think that your use case is a perfectly reasonable one, but I think a mode that supports your use case would be an enhancement.
msg161616	Author: Frederick Ross (Frederick.Ross)	Date: 2012-05-25 20:06
In the case of files, sure, it's fine. The error gives me the offset, and I can go pull it out and buffer it, and it's fine. Plus XML is strict about having only one document per file. For streams, none of this is applicable. I can't seek in a streaming network connection. If the parser leaves it in an unusable state, then I lose everything that may follow. It makes Python unusable in certain, not very rare, cases of network programming. I'll just add that Haskell's Parsec does this right, and should be used as an example.
msg161617	Author: R. David Murray (r.david.murray) *	Date: 2012-05-25 20:12
Well, if the stream isn't seekable then I don't see how it can be left in any state other than the same one it leaves a file (read ahead as much as it read to generate the error). So unfortunately by our backward compatibility rules I still think this will be a new feature.
msg161762	Author: Eli Bendersky (eli.bendersky) *	Date: 2012-05-28 09:41
I don't think this is an enhancement to ET, because ET was not designed to be a streaming parser, which is what is required here. ET was designed to read a whole valid XML document. There is 'iterparse', as Antoine mentioned, but it is designed to "track changes to the tree while it is being built" - mostly to save memory. You have streaming XML parsers in Python - for example xml.sax. You can also relatively easily use xml.sax to find the end of your document and then parse the buffer with ET. I don't see how a comparison with Parsec (a parser generator/DSL library) makes sense. There are tons of such libraries for Python - just pick one.
msg162060	Author: Eli Bendersky (eli.bendersky) *	Date: 2012-06-01 08:40
I propose to close this issue. If the problem in json is real and someone thinks it has to be fixed, a separate issue specifically for json should be opened.
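For the json half of the report, one possible approach on an already-buffered string is json.JSONDecoder.raw_decode, which decodes a single value and returns the index where it stopped. The sketch below is an illustration under that assumption, not something proposed in the thread; the name iter_json_values is hypothetical. It does not by itself solve the non-seekable stream case, where the caller would still have to accumulate chunks and retry when a value is incomplete.

```python
import json


def iter_json_values(text):
    """Yield each JSON value from a string of concatenated JSON documents."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while True:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            return
        # Returns the decoded value and the index just past it.
        value, pos = decoder.raw_decode(text, pos)
        yield value


for value in iter_json_values('{}{"a": 1} [2, 3]\n'):
    print(value)
```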

History
Date	User	Action	Args
2022-04-11 14:57:30	admin	set	github: 59057
2012-06-08 12:31:38	eli.bendersky	set	status: open -> closed resolution: wont fix stage: resolved
2012-06-01 08:40:11	eli.bendersky	set	messages: + msg162060
2012-05-28 09:41:30	eli.bendersky	set	messages: + msg161762
2012-05-25 20:12:38	r.david.murray	set	messages: + msg161617
2012-05-25 20:06:38	Frederick.Ross	set	messages: + msg161616
2012-05-25 19:32:34	r.david.murray	set	versions: + Python 3.3, - Python 2.7 nosy: + r.david.murray messages: + msg161609 type: enhancement
2012-05-25 19:26:38	Frederick.Ross	set	messages: + msg161607
2012-05-25 19:09:34	pitrou	set	messages: + msg161605
2012-05-25 18:52:01	eric.araujo	set	nosy: + pitrou, ezio.melotti, rhettinger, eric.araujo, eli.bendersky messages: + msg161599 versions: - Python 2.6
2012-05-18 17:29:21	Frederick.Ross	create