Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make ElementTree event handling more modular to allow custom targets for the non-blocking parser #63102

Open
elibendersky mannequin opened this issue Sep 2, 2013 · 8 comments
Labels
stdlib Python modules in the Lib dir topic-XML type-feature A feature request or enhancement

Comments

@elibendersky
Copy link
Mannequin

elibendersky mannequin commented Sep 2, 2013

BPO 18902
Nosy @jcea, @scoder, @jkloth

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

Show more details

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2013-09-02.00:03:41.532>
labels = ['type-feature', 'library']
title = 'Make ElementTree event handling more modular to allow custom targets for the non-blocking parser'
updated_at = <Date 2013-09-28.17:43:11.005>
user = 'https://bugs.python.org/elibendersky'

bugs.python.org fields:

activity = <Date 2013-09-28.17:43:11.005>
actor = 'scoder'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation = <Date 2013-09-02.00:03:41.532>
creator = 'eli.bendersky'
dependencies = []
files = []
hgrepos = []
issue_num = 18902
keywords = []
message_count = 8.0
messages = ['196758', '196830', '196832', '197552', '197590', '197606', '198526', '198537']
nosy_count = 4.0
nosy_names = ['jcea', 'scoder', 'jkloth', 'eli.bendersky']
pr_nums = []
priority = 'normal'
resolution = None
stage = 'needs patch'
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue18902'
versions = ['Python 3.5']

@elibendersky
Copy link
Mannequin Author

elibendersky mannequin commented Sep 2, 2013

As issue bpo-17741 shows in detail, there's a number of implementation problems in the current ET that make a fully non-blocking parser more difficult to implement than it should be.

The main problem is that XMLParser._setevents relies on internal implementation details of the default TreeBuilder to do some of its tasks.

This has to be decoupled from implementation details completely. An XMLParser already has a non-blocking input side (feed/close) and a callback "push" API to the target. The API has to be fully expanded to include start-ns/end-ns and any other events that are required to allow custom tree builders as targets. This would make it possible to layer XMLPullParser on top of the stock XMLParser coupled with a special target that collects "events" from the callback calls.

@elibendersky elibendersky mannequin added type-bug An unexpected behavior, bug, or error stdlib Python modules in the Lib dir labels Sep 2, 2013
@scoder
Copy link
Contributor

scoder commented Sep 3, 2013

This would make it possible to layer XMLPullParser on top of the stock XMLParser coupled with a special target that collects "events" from the callback calls.

Given that we have an XMLPullParser now, I think we should not clutter the API with more classes that do not add anything more than making other classes redundant. So, -1 on adding a special target that collects events. Anyone who wants to use that feature can just use the XMLPullParser.

In the Python implementation of that parser, custom targets can already be allowed right now. Only the C implementation prevents it currently (AFAICT) as it has its own built-in TreeBuilder target.

Therefore, +1 for the general cleanup to properly allow arbitrary targets as backend.

Also, +1 for allowing start-ns and end-ns event callbacks on parser targets, although that's a different feature entirely.

@scoder
Copy link
Contributor

scoder commented Sep 3, 2013

(fixing subject to properly hit bug filters)

@scoder scoder changed the title Make ET event handling more modular to allow custom targets for the non-blocking parser Make ElementTree event handling more modular to allow custom targets for the non-blocking parser Sep 3, 2013
@scoder
Copy link
Contributor

scoder commented Sep 13, 2013

The way the XMLPullParser is implemented in lxml.etree now is that it simply inherits from XMLParser. This would also make sense for ElementTree, even before supporting arbitrary targets. The patch in ticket bpo-18990 makes this simple to do.

For reference, here is the implementation in lxml.

Iterparse:

https://github.com/lxml/lxml/blob/master/src/lxml/iterparse.pxi

XMLPullParser:

https://github.com/lxml/lxml/blob/d9f7cd8d12a27cafc4d65c6e280ea36156e3b837/src/lxml/parser.pxi#L1357

SAX based parser that collects events and/or maps parser callbacks to callbacks on the target object:

https://github.com/lxml/lxml/blob/master/src/lxml/saxparser.pxi

@scoder
Copy link
Contributor

scoder commented Sep 13, 2013

Also, +1 for allowing start-ns and end-ns event callbacks on parser targets, although that's a different feature entirely.

Actually, I take that back. I can't see a use case for this feature, and it doesn't really fit with the notion of fully qualified tag names in ElementTree (which itself is a great feature). I'm -0.7 on adding this.

@scoder
Copy link
Contributor

scoder commented Sep 13, 2013

I created a separate ticket bpo-19010 for the inheritance issue.

@scoder
Copy link
Contributor

scoder commented Sep 28, 2013

Closing bpo-18990 defines the API of the new XMLPullParser that 3.4 will ship with, so this ticket becomes an enhancement for future versions.

@scoder scoder added type-feature A feature request or enhancement and removed type-bug An unexpected behavior, bug, or error labels Sep 28, 2013
@scoder
Copy link
Contributor

scoder commented Sep 28, 2013

Copying a relevant comment by Eli from http://bugs.python.org/issue18990#msg198145 and replying inline.

"""
The way the APIs are currently defined, XMLParser and XMLPullParser are different animals. XMLParser can be considered to only have one "front" in the API - feed() and close(). You feed() until the document is done and then you close() and get the parsed tree. There's no other way to get the parsed tree (unless you use a custom builder, I guess).

On the other hand XMLPullParser has two clear "fronts" - an input front with feed() and close() and an output front with read_events(). For XMLPullParser, close() is just an input signal. The canonical way to get output from XMLPullParser is read_events(). close() has no better reason to return output than feed(). When we decided to change the method names (recall that Antoine's originals were completely different), we perhaps forgot this detail.
"""

No, we didn't.

"""
Even though XMLPullParser's method is named close(), it's *not* like XMLParser's close(). If someone is using XMLPullParser for its close() he's likely using the class incorrectly.

Just as an example: consider that in a lot of use cases the programmer will want to discard parts of the tree that's parsed iteratively (similarly to the main use case of iterparse()), because the XML itself is too huge. It's a convenient streaming API, in other words. Now, if the reader discards parts of the tree (by deleting subtrees), then returning the root from close() becomes even more meaningless, because it's no longer the root and we have no idea what it actually is.
"""

Let me repeat that this was already the case before the new class was added and that it's a feature. If the target decides to discard parts of the tree, or not build a tree at all and (say) instead count elements and return their total number on close(), then that's what the user asked for by selecting that target.

Let's agree to disagree on your conclusions, but I still can't see any advantages of making the separation between the two classes. The way I see it, making XMLPullParser inherit from XMLParser makes it very easy to explain what the difference is: the read_events() method, i.e. the additional way to receive the parse events that the combination of parser and target generate. Essentially, it's the target that does all the work here and the parser only collects the results and presents them to the user. Thus my intention to keep the parser as "stupid" as it looks from the user's side, instead of adding something new right next to it.

That being said, if ElementTree keeps them separate and decides to *never* return anything from XMLPullParser.close(), then that's sufficiently compatible with lxml.etree, so I won't object to it. lxml has a long history of extending what's there in order to make it easier to use.

As long as we can find a way to keep both libraries compatible for users, I think we should be able to both move forward.

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stdlib Python modules in the Lib dir topic-XML type-feature A feature request or enhancement
Projects
None yet
Development

No branches or pull requests

2 participants