classification
Title: Make ElementTree event handling more modular to allow custom targets for the non-blocking parser
Type: enhancement Stage: needs patch
Components: Library (Lib) Versions: Python 3.5
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: eli.bendersky, jcea, jkloth, scoder
Priority: normal Keywords:

Created on 2013-09-02 00:03 by eli.bendersky, last changed 2013-09-28 17:43 by scoder.

Messages (8)
msg196758 - (view) Author: Eli Bendersky (eli.bendersky) * (Python committer) Date: 2013-09-02 00:03
As issue #17741 shows in detail, there's a number of implementation problems in the current ET that make a fully non-blocking parser more difficult to implement than it should be.

The main problem is that XMLParser._setevents relies on internal implementation details of the default TreeBuilder to do some of its tasks.

This has to be decoupled from implementation details completely. An XMLParser already has a non-blocking input side (feed/close) and a callback "push" API to the target. The API has to be fully expanded to include start-ns/end-ns and any other events that are required to allow custom tree builders as targets. This would make it possible to layer XMLPullParser on top of the stock XMLParser coupled with a special target that collects "events" from the callback calls.
msg196830 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2013-09-03 09:21
> This would make it possible to layer XMLPullParser on top of the stock XMLParser coupled with a special target that collects "events" from the callback calls.

Given that we have an XMLPullParser now, I think we should not clutter the API with more classes that do not add anything more than making other classes redundant. So, -1 on adding a special target that collects events. Anyone who wants to use that feature can just use the XMLPullParser.

In the Python implementation of that parser, custom targets can already be allowed right now. Only the C implementation prevents it currently (AFAICT) as it has its own built-in TreeBuilder target.

Therefore, +1 for the general cleanup to properly allow arbitrary targets as backend.

Also, +1 for allowing start-ns and end-ns event callbacks on parser targets, although that's a different feature entirely.
msg196832 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2013-09-03 09:23
(fixing subject to properly hit bug filters)
msg197552 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2013-09-13 06:18
The way the XMLPullParser is implemented in lxml.etree now is that it simply inherits from XMLParser. This would also make sense for ElementTree, even before supporting arbitrary targets. The patch in ticket #18990 makes this simple to do.

For reference, here is the implementation in lxml.

Iterparse:

https://github.com/lxml/lxml/blob/master/src/lxml/iterparse.pxi

XMLPullParser:

https://github.com/lxml/lxml/blob/d9f7cd8d12a27cafc4d65c6e280ea36156e3b837/src/lxml/parser.pxi#L1357

SAX based parser that collects events and/or maps parser callbacks to callbacks on the target object:

https://github.com/lxml/lxml/blob/master/src/lxml/saxparser.pxi
msg197590 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2013-09-13 15:24
> Also, +1 for allowing start-ns and end-ns event callbacks on parser targets, although that's a different feature entirely.

Actually, I take that back. I can't see a use case for this feature, and it doesn't really fit with the notion of fully qualified tag names in ElementTree (which itself is a great feature). I'm -0.7 on adding this.
msg197606 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2013-09-13 16:04
I created a separate ticket #19010 for the inheritance issue.
msg198526 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2013-09-28 15:42
Closing #18990 defines the API of the new XMLPullParser that 3.4 will ship with, so this ticket becomes an enhancement for future versions.
msg198537 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2013-09-28 17:43
Copying a relevant comment by Eli from http://bugs.python.org/issue18990#msg198145 and replying inline.

"""
The way the APIs are currently defined, XMLParser and XMLPullParser are different animals. XMLParser can be considered to only have one "front" in the API - feed() and close(). You feed() until the document is done and then you close() and get the parsed tree. There's no other way to get the parsed tree (unless you use a custom builder, I guess).

On the other hand XMLPullParser has two clear "fronts" - an input front with feed() and close() and an output front with read_events(). For XMLPullParser, close() is just an input signal. The canonical way to get output from XMLPullParser is read_events(). close() has no better reason to return output than feed(). When we decided to change the method names (recall that Antoine's originals were completely different), we perhaps forgot this detail.
"""

No, we didn't.


"""
Even though XMLPullParser's method is named close(), it's *not* like XMLParser's close(). If someone is using XMLPullParser for its close() he's likely using the class incorrectly.

Just as an example: consider that in a lot of use cases the programmer will want to discard parts of the tree that's parsed iteratively (similarly to the main use case of iterparse()), because the XML itself is too huge. It's a convenient streaming API, in other words. Now, if the reader discards parts of the tree (by deleting subtrees), then returning the root from close() becomes even more meaningless, because it's no longer the root and we have no idea what it actually is.
"""

Let me repeat that this was already the case before the new class was added and that it's a feature. If the target decides to discard parts of the tree, or not build a tree at all and (say) instead count elements and return their total number on close(), then that's what the user asked for by selecting that target.

Let's agree to disagree on your conclusions, but I still can't see any advantages of making the separation between the two classes. The way I see it, making XMLPullParser inherit from XMLParser makes it very easy to explain what the difference is: the read_events() method, i.e. the additional way to receive the parse events that the combination of parser and target generate. Essentially, it's the target that does all the work here and the parser only collects the results and presents them to the user. Thus my intention to keep the parser as "stupid" as it looks from the user's side, instead of adding something new right next to it.

That being said, if ElementTree keeps them separate and decides to *never* return anything from XMLPullParser.close(), then that's sufficiently compatible with lxml.etree, so I won't object to it. lxml has a long history of extending what's there in order to make it easier to use.

As long as we can find a way to keep both libraries compatible for users, I think we should be able to both move forward.
History
Date User Action Args
2013-09-28 17:43:11scodersetmessages: + msg198537
2013-09-28 15:42:25scodersettype: behavior -> enhancement
messages: + msg198526
versions: - Python 3.4
2013-09-13 16:04:35scodersetmessages: + msg197606
2013-09-13 15:24:14scodersetmessages: + msg197590
2013-09-13 06:18:29scodersetmessages: + msg197552
2013-09-03 09:23:09scodersetmessages: + msg196832
title: Make ET event handling more modular to allow custom targets for the non-blocking parser -> Make ElementTree event handling more modular to allow custom targets for the non-blocking parser
2013-09-03 09:21:13scodersetnosy: + scoder
messages: + msg196830
2013-09-02 03:52:38jceasetnosy: + jcea
2013-09-02 01:41:01jklothsetnosy: + jkloth
2013-09-02 00:03:41eli.benderskycreate