Issue 17741: event-driven XML parser

iterparse() is a blocking operation. It is not really suitable for event-driven applications (e.g. non-blocking I/O). Here is a patch adding a IncrementalParser class.
Note that basing iterparse() on IncrementalParser and removing the API discrepancy between the Python and C modules helps make the etree code smaller.
Thanks Antoine. This looks interesting - I'm somewhat swamped ATM but will try to review the patch in the next few days.

Incidentally, since it is a new feature would it be worthwhile to discuss it on python-ideas?
I've posted on the tulip mailing-list already:
https://groups.google.com/forum/?fromgroups=#!topic/python-tulip/SNOnS27Bctc
Antoine, the patch LGTM. There's some more cleaning that needs to be done in surrounding code, but I can do that later. Also I should probably update the documentation with a bit more details.

Just add a NEWS entry when you commit.

Thanks for working on this.
New changeset f903cf864191 by Antoine Pitrou in branch 'default':
Issue #17741: Add ElementTree.IncrementalParser, an event-driven parser for non-blocking applications.
http://hg.python.org/cpython/rev/f903cf864191
Thank you, Eli. Now committed :)
Here is a patch that cleans up the current implementation to avoid making the result of iterparse() an IncrementalParser (with all of its new API).

Please note that the tulip mailing list is not an appropriate place to discuss additions to the XML libraries, and ElementTree in particular.
Copying the discussion between Antoine and me from python-dev:

>> IMO it should mimic the interface of the TreeBuilder, which
>> calls the data reception method "feed()" and the termination method
>> "close()". There is no reason to add yet another set of methods names
>> just to do what others do already.
>
> Well, the difference here is that after calling eof_received() you can
> still (and should) call events() once to get the last events. I think
> it would be weird if you could still do something useful with the object
> after calling close().
>
> Also, the method names are not invented, they mimick the PEP 3156
> stream protocols:
> http://www.python.org/dev/peps/pep-3156/#stream-protocols

I see your point about close(). I assume your reasoning was to make the IncrementalParser an arbitrary stream end-point. However, it doesn't really make all that much sense to connect an arbitrary data source to it, as the source wouldn't know that, in addition to passing in data, it would also have to ask for events from time to time. I mean, you could do it, but then it would just fill up the memory with parser events and loose the actual advantages of incremental parsing. So, in a way, the whole point of the class is to *not* be an arbitrary stream end-point.

Anyway, given that there isn't really the One Obvious Way to do it, maybe you should just add a docstring to the class (ahem), reference the stream protocol as the base for its API, and then rename it to IncrementalStreamParser. That would at least make it clear why it doesn't really fit with the rest of the module API (which was designed some decade before PEP 3156) and instead uses its own naming scheme.
Actually, let me take that last paragraph back. There is an Obvious Way to do it, and that's the feed() and close() methods. They are the existing and established ElementTree interface for incremental parsing. The fact that close() doesn't clean up all state is IMHO a minor issue. The state will be cleaned up automatically once the iteration terminates, and that's the normal behaviour of iterators. So it actually fits both protocols quite nicely.

I'd just leave the stream protocol out completely here. For one, the implementation isn't complete anyway (the connection_*() methods don't make much sense), and as I said, it's not very useful to consider the parser a general end-point to that protocol.

I'd also suggest returning the iterator over the remaining events from close(), just like the TreeBuilder returns the tree. That covers the (less common) use case of first parsing everything and then processing the events. I'll add another patch that does that.
Thinking about the original patch some more - I wonder why it doesn't use a wrapper for TreeBuilder to process the events. Antoine, is there a reason why you had to add this _setevents() method to the XMLParser, instead of making the IncrementalParser an IncrementalTreeBuilder?
Oh, and could we reopen this ticket, please?
> Antoine, is there a reason why you had to add this _setevents() method
> to the XMLParser, instead of making the IncrementalParser an
> IncrementalTreeBuilder?

The point is not to build a tree of potentially unbounded size (think XMPP). The point is to yield events in a non-blocking way (iterparse() is blocking, which makes it useless for non-blocking applications).

An IncrementalTreeBuilder wouldn't have much point IMO. It is not more costly to accumulate a string and parse it at the end, than to progressively build a growing XML tree.
About the patch: I think changing the API names now that alpha1 has been released is a bit gratuitous. I would like to keep the encapsulation part but ditch the naming changes.
> The point is not to build a tree of potentially unbounded size (think
> XMPP). The point is to yield events in a non-blocking way (iterparse()
> is blocking, which makes it useless for non-blocking applications).

Ok, but that's the only difference. Instead of getting the events from the parser, you could equally well get them from the TreeBuilder, also in a non-blocking way.

Sticking this functionality into a parser target object has the advantage that the parser interface wouldn't have to change. So, instead of introducing an entirely new parser interface, we'd just add a class that can be used as a parser target and provides an additional events() method. That's a substantially less invasive API change.

> An IncrementalTreeBuilder wouldn't have much point IMO. It is not more
> costly to accumulate a string and parse it at the end, than to
> progressively build a growing XML tree.

I don't think I understand what you mean.
> About the patch: I think changing the API names now that alpha1 has been
> released is a bit gratuitous.

Sorry for being late, but I can't see it being my fault.

A change in an alpha release is still way better than a change after a final release. Eventually, lxml will have to go on par here, so it's better to get it right now, before any harm is done.
> > The point is not to build a tree of potentially unbounded size
> > (think
> > XMPP). The point is to yield events in a non-blocking way
> > (iterparse()
> > is blocking, which makes it useless for non-blocking applications).
> 
> Ok, but that's the only difference. Instead of getting the events
> from the parser, you could equally well get them from the
> TreeBuilder, also in a non-blocking way.

But your TreeBuilder is also growing a tree internally, and therefore
eating more and more memory, right?

I don't see the point of stuffing different kinds of functionality
inside a single class. It makes the intended use less obvious, and
errors more likely. Right now, IncrementalParser has a simple API,
and there's a single way to use it.
> A change in an alpha release is still way better than a change after
> a final release.

But worse than no change at all. Arguing about API naming is a bit futile,
*especially* when the ship has sailed.
But IncrementalParser uses an XMLParser internally, which in turn uses a TreeBuilder internally. So what exactly is your point?
> But worse than no change at all. Arguing about API naming is a bit
> futile, *especially* when the ship has sailed.

<rant>It's easy to say that as a core developer with commit rights who can simply hide changes in a low frequented bug tracker without notifying those who have to know about these changes.</rant> It's pure luck I noticed this change during the alpha release cycle.

I'm also not arguing about naming. I'm questioning the design, trying to get it into a shape that fits the existing APIs. Why do we need two incremental parsing interfaces in one and the same library that use completely different method names, although doing otherwise exactly the same? Why should the type of the *result* of the parsing process change the method name that you need to call to *insert* data into the parser?
> But IncrementalParser uses an XMLParser internally, which in turn
> uses a TreeBuilder internally. So what exactly is your point?

Well, I would rather like to understand yours. Whatever
IncrementalParser uses internally needn't impact what API it
exposes to the user.

(as a matter of fact, iterparse() doesn't expose a
TreeBuilder-like API, either)
> Well, I would rather like to understand yours.

My point is that the IncrementalParser uses a TreeBuilder that builds an XML tree in the back. So I'm wondering why you are saying that it doesn't build a tree.

> Whatever IncrementalParser uses internally needn't impact what API it
> exposes to the user.

And in fact, we don't even need an IncrementalParser, because XMLParser already *has* an incremental parsing interface. All that's missing is the bit that collects and exposes the events. And my point is that we shouldn't duplicate the existing *data entry* interface for that, especially not under different names for identical functionality, but only add an interface to access those events as *output*, i.e. to add the bit of the API that's actually missing.

As an analogon, what would you say if I asked for adding a new, separate Mapping interface to Python that uses the method name "set_value()" instead of "__setitem__()" because I want it to read data from the hard drive and not from memory?
> <rant>It's easy to say that as a core developer with commit rights
> who can simply hide changes in a low frequented bug tracker without
> notifying those who have to know about these changes.</rant> It's
> pure luck I noticed this change during the alpha release cycle.

It is the rule for most stdlib improvements that they go directly
through the bug tracker. Most core developers and outsiders
would feel swamped by the traffic if all feature additions went
through the mailing-list.

I'm honestly baffled that you think I am trying to "hide" things.
Why do you think I would feel guilty about proposing an addition,
or try to sneak things inside xml.etree?

(yes, we could theoretically run polls for every addition
we propose, collect and discuss the results, and iterate several
times until the outcome is successful; I don't think any of us
has the bandwidth to do that, which is why that practice is
only used for game-making changes (i.e. PEP material))

> I'm also not arguing about naming. I'm questioning the design, trying
> to get it into a shape that fits the existing APIs. Why do we need
> two incremental parsing interfaces in one and the same library that
> use completely different method names, although doing otherwise
> exactly the same?

Well, unless I'm missing something, TreeBuilder doesn't do parsing,
it takes already parsed data: it has a start() method to open a tag,
and a data() method to add raw text inside that tag. IncrementalParser,
OTOH, has a data_received() method to which you pass a piece of
non-parsed XML string.

The other incremental parsing API is actually iterparse(). The
reason I proposed IncrementalParser is that iterparse() is useless
for non-blocking applications. IncrementalParser produces the same
kind of output as iterparse(), but control of when to feed it data
is transferred to the user of the API.
> TreeBuilder doesn't do parsing, it takes already parsed data: it has a
> start() method to open a tag, and a data() method to add raw text
> inside that tag.

That is correct. However, the XMLParser has a feed() method that sends new data into the parser, and a close() method that tells the parser that it's done. So there already is an incremental parsing interface, and your change is duplicating that interface under a different name. Specifically, IncrementalParser has exactly the same interface as XMLParser when it comes to entering data, but uses different method names for it. This is a Bad Design because it introduces an unnecessary inconsistency in the API.

However, what you are trying to change is not the way data is *entered* into the parser. What you are after is to change the way the *parsed* data is *presented* to the user. That is not the responsibility of the parser, it's the responsibility of the TreeBuilder.

In your code, the TreeBuilder builds a tree, and the new interface *additionally* collects parse events in a list. So, the right way to do it would be to change the parser *target* to do both, i.e. to build a tree and collect events at the same time.
> > Well, I would rather like to understand yours.
> 
> My point is that the IncrementalParser uses a TreeBuilder that builds
> an XML tree in the back. So I'm wondering why you are saying that it
> doesn't build a tree.

Unless I'm reading it wrong, when _setevents() is called, the internal
hooks are rewired to populate the events list, rather than call the
corresponding TreeBuilder methods. So, yes, there's a TreeBuilder
somewhere, but it stands unused.

(the _setevents() method already existed on the C impl, by the way.
I added it to the Python impl to make things more regular and avoid
two separate iterparse() implementations)

> And my
> point is that we shouldn't duplicate the existing *data entry*
> interface for that, especially not under different names for
> identical functionality, but only add an interface to access those
> events as *output*, i.e. to add the bit of the API that's actually
> missing.

What difference would that make? In the end, you mustn't mix
event-driven/non-blocking and cumulative/blocking styles of
programming, so having two separate APIs doesn't strike me as
a problem (it may be a good thing actually, for clarity reasons).
> Unless I'm reading it wrong, when _setevents() is called, the internal
> hooks are rewired to populate the events list, rather than call the
> corresponding TreeBuilder methods. So, yes, there's a TreeBuilder
> somewhere, but it stands unused.

Yes, you *are* reading it wrong. For example, the "start" callback calls self._start_list, which in turn calls self.target.start(), thus calling into the TreeBuilder. That's the thing that constructs the elements that the IncrementalParser collects in its events list.
> <rant>It's easy to say that as a core developer with commit rights who > can simply hide changes in a low frequented bug tracker without
> notifying those who have to know about these changes.</rant> It's pure > luck I noticed this change during the alpha release cycle.

What *are* you talking about? Who is trying to hide anything? Don't you see the history of the bug? Antoine opened it very... openly... and we discussed it. Also, committing to pre-alpha trunk is not a sailed ship - if there are problems, things can be easily fixed.

Please stop complaining and learn to use the tools properly. Here's one tip: http://mail.python.org/mailman/listinfo/python-bugs-list - subscribe to it and use your mail client's filtering to see new bugs on topics that interest you. It isn't very hard.

Another tip: http://mail.python.org/mailman/listinfo/python-checkins - subscribe to it and use your mail client's filtering to see new checkins on topics that interest you. It isn't hard either.

Tip #3: ask to be added to the experts list for ET. No one would object to that, and it would increase the likelihood that people add you to the nosy list directly for ET-related issues (particularly added features).

Now back to a productive discussion please...
> ask to be added to the experts list for ET

Already done, see the corresponding python-dev thread.

> Now back to a productive discussion please...

I think we already are. Keep reading through the rest of the posts.
On Fri, Aug 9, 2013 at 8:44 AM, Stefan Behnel <report@bugs.python.org>wrote:

>
> Stefan Behnel added the comment:
>
> > ask to be added to the experts list for ET
>
> Already done, see the corresponding python-dev thread.
>

Done. 747970502b23
> > Unless I'm reading it wrong, when _setevents() is called, the
> > internal
> > hooks are rewired to populate the events list, rather than call the
> > corresponding TreeBuilder methods. So, yes, there's a TreeBuilder
> > somewhere, but it stands unused.
> 
> Yes, you *are* reading it wrong. For example, the "start" callback
> calls self._start_list, which in turn calls self.target.start(),
> thus calling into the TreeBuilder. That's the thing that constructs
> the elements that the IncrementalParser collects in its events list.

Ah, indeed, my bad. That said, using a custom TreeBuilder-alike
would necessitate changes in the C implementation of XMLParser.
(which, I suppose, is why the _setevents hack exists in the first
place)
Ok, finally. ;)

Can we agree on discarding the current implementation for now and then rewriting it based on a tree builder instead of a parser wrapper?

"Because we'd need to change internal code" is not a good argument for adding a new API, IMHO.
What about this idea: instead of changing the internal C implementation, we could provide a simple parser target wrapper class (say, "EventBuilder") that the parser would simply recognise with isinstance(). Then it would unpack the wrapped target and configure itself as it currently does. So you'd use it like this:

    target = EventBuilder(TreeBuilder())
    parser = XMLParser(target=target)

    parser.feed(some_xml_data)
    print list(target.events())

I mean, it doesn't really matter if the implementation is a fake, as long as the API is right. And the only API that the EventBuilder has is its events() method that returns the iterator.

The EventBuilder implementation would then be

    class EventBuilder:
        def __init__(self, target=None):
            if target is None:
                target = TreeBuilder()
            self._target = target
            self._events = []

        def events(self):
            # existing code for distructive iteration over self._events goes here
            ...

and the rest would be done by the XMLParser() constructor, i.e. it would register the "_events" list in the expat callbacks. What do you think?
> Can we agree on discarding the current implementation for now and then
> rewriting it based on a tree builder instead of a parser wrapper?

Only if it actually brings something (feature-wise, performance-wise,
maintenance-wise, whatever) that the current implementation doesn't
have.

Otherwise, I'd rather check in the simple inheritance-to-composition
change.

I'm sorry that you noticed this after the alpha. I would have had a
lower threshold for changes if you had manifested when I proposed the
feature.
On Fri, Aug 9, 2013 at 12:34 PM, Antoine Pitrou <report@bugs.python.org>wrote:

>
> Antoine Pitrou added the comment:
>
> > Can we agree on discarding the current implementation for now and then
> > rewriting it based on a tree builder instead of a parser wrapper?
>
> Only if it actually brings something (feature-wise, performance-wise,
> maintenance-wise, whatever) that the current implementation doesn't
> have.
>
> Otherwise, I'd rather check in the simple inheritance-to-composition
> change.
>
> I'm sorry that you noticed this after the alpha. I would have had a
> lower threshold for changes if you had manifested when I proposed the
> feature.
>

Antoine, I'm not yet very well versed in the release process, but ISTM that
if:

1. We know that a better API can be created
2. The implementation is not overly difficult

It makes sense to do the change even though Alpha 1 was already released.
Alphas are alphas, after all :-) I suppose that we can consult with Larry
and others if this is the only doubt holding the change back.
Le vendredi 09 août 2013 à 20:34 +0000, Eli Bendersky a écrit :
> Antoine, I'm not yet very well versed in the release process, but ISTM that
> if:
> 
> 1. We know that a better API can be created
> 2. The implementation is not overly difficult
> 
> It makes sense to do the change even though Alpha 1 was already released.
> Alphas are alphas, after all :-) I suppose that we can consult with Larry
> and others if this is the only doubt holding the change back.

I agree, but this assumes there is a "better API". I would like to see
an API proposal and a patch before I give an opinion :-)
> I agree, but this assumes there is a "better API". I would like to see
> an API proposal and a patch before I give an opinion :-)

Well, I have already proposed an API throughout this thread, and then given a usage example and an almost complete sketch of the implementation. Please comment.

I really can't accept the current wart for lxml, that's why I'm asking to not have it in ET either. This would be the first time I can remember that ET has a wrong API that lxml won't copy.

I was asking for removal now, because if we can't manage to come up with a new implementation quickly enough and keep the code in, we'll end up with a quirk that we'll have to keep alive basically forever, in addition to whatever clean API we add later. Time's currently running against us. If we back out the current code, definitely before alpha 2, it'll start running for us again. I clearly prefer that situation.
I gave the implementation a try and attached an incomplete patch. Some tests are failing.

It turns out that it's not entirely easy to do this. As Antoine noticed, the hack in the C implementation of the TreeBuilder makes it tricky to integrate with other targets, specifically when intercepting events that the tree builder needs . That's a more general bug and it currently gets in the way.

Anyway, here's how it should eventually look like.
Ok, so what are we going to do for the next alpha?
New changeset 815f328a6578 by Antoine Pitrou in branch 'default':
Issue #17741: use composition, rather than inheritance, for xml.etree.iterparse's result class.
http://hg.python.org/cpython/rev/815f328a6578
I've committed the inheritance removal patch. I personally don't think there's anything more to do, since the TreeEventBuilder proposal stumbled on some implementation issues.
I was asking for the current implementation to be removed until we have a working implementation that hurts neither the API nor the module design.
> I was asking for the current implementation to be removed until we
> have a working implementation that hurts neither the API nor the
> module design.

It would help if we could keep the discussion on rational terms.
I don't think I understand what you mean.

In any case, it's not to late to remove the implementation. There was only one alpha release so far that included it, so it can't really break any existing code that relies on it. The longer we wait, the more damage will be done. That's why I am asking for removal now.

I would prefer not to rush in a replacement. The current state shows where that brings us. So, I request the removal of the current code, then we can calmly come up with a good implementation of the feature in question. If that can't be done for 3.4, well, then we have at least avoided adding something that will need to be deprecated and replaced later on. If it can be done for 3.4, the better.
> In any case, it's not to late to remove the implementation.

As far as I can say, there's no reason to remove the implementation
except for your distaste of the method names.
To wrap things up:
- the feature is desireable
- the API is reasonable
- the implementation is functional and properly tested

Really, if you can't come up with an actual blocker, I'll ask you to leave this issue at rest.
Could we please keep the discussion on rational terms?

It's not just the method names. The problem is that you are duplicating an existing class (the XMLParser) for no good reason, instead of putting the feature where it belongs: *behind* the XMLParser. Even worse, you are duplicating the class and giving it a *different* API.

This goes completely against the design of the existing API and counters the reusability of the existing code.

I'm not questioning that the feature is desirable. I'm saying that the design is wrong.
> The problem is that you are duplicating an existing class (the
> XMLParser) for no good reason

No. The logic was already in _IterParseIterator, just wrapped up
differently.

And the proof is that the original commit, while adding a feature,
*removed* code from xml.etree:

$ hg di -c f903cf864191 --stat Lib/xml/etree/
 Lib/xml/etree/ElementTree.py |  209
+++++++++++++++++++++++++++++---------------------------------
 1 files changed, 97 insertions(+), 112 deletions(-)

> This goes completely against the design of the existing API and
> counters the reusability of the existing code.

No it doesn't. It irritates you, for some reason, but your feelings are
not a valid reason to revert a useful feature. And this issue is not for
you to repeat the same stubborn line of argument again and again. I'm
extremely bored, and I won't care about your comments anymore.
Given that it seems to be hard to come to a consensus in this ticket, I've asked for removal of the code on python-dev.

http://mail.python.org/pipermail/python-dev/2013-August/128095.html
Using tulip-inspired method names (when tulip hasn't landed) to duplicate existing data input functionality (feed() and close()) seems a rather dubious design decision to me.

Given how popular lxml.etree is as an alternative to the standard library's etree implementation, we shouldn't dismiss Stefan's design concerns lightly.
Reopening this as per discussion on python-dev. I haven't reverted anything at this point, as subsequent changes mean a simple "hg backout" is no longer sufficient.
I like the idea of having .events() in a special target passed to a XMLParser, which already has feed() and close(). I read through Stefan's proposal and there are a couple of issues I wanted to raise:

1. Why have the "event builder" wrap a tree builder? Can't it just be a separate target?
2. Instead of the stock XMLParser recognizing the event builder via isinstance and giving it special treatment, would it not be better to still have a separate class that implements the XMLParser interface (it can derive from it or just use duck typing)?

Note that (2) brings us closer to Antoine's original design, albeit with different method names and somewhat different underlying implementation.

Also, Stefan, could you clearly explain the specific implementation issue you ran into with your partial patch. You mentioned an unrelated bug that can/should be solved before such an implementation can work. Can you provide more details?
"""
1. Why have the "event builder" wrap a tree builder? Can't it just be a separate target?
"""

You need a TreeBuilder in order to build the tree for the events.

If you want to use a different target than a TreeBuilder, you're free to do so. That's the idea of wrapping.

"""
2. Instead of the stock XMLParser recognizing the event builder via isinstance and giving it special treatment, would it not be better to still have a separate class that implements the XMLParser interface (it can derive from it or just use duck typing)?
"""

Why would you want to restrict it to an XMLParser? If there was an HTMLParser (which exists in lxml.etree), then how would you make the two interact?

The isinstance() check was only a quick way to get the API working without changing the implementation in the back too much. The actual functionality should (eventually) be moved into the target itself. Currently, it is implemented internally in the parser inside of the C module. That's the main problem with the current implementation that needs fixing. See issue #17902 (which is not a documentation issue but a real issue, BTW).
Looking at this more, the Parser/Builder abstraction in ET leaks badly, especially when it comes to incremental parsing. This manifests in (at least) two ways:

1. The parser accepted by iterparse has to use the ET-provided TreeBuilder, because the C implementation of TreeBuilder has undocumented fields and hooks used directly by the parser.
2. Even though iterparse & IncrementalParser accept a "parser" argument, it cannot be just any parser. This parser (in addition to using the ET TreeBuilder) must implement an undocumented and obscure _setevents method; this makes the original intent of the abstraction - hooking up arbitrary parsers to the interface, hard to follow.

To *really* fix this, all these capabilities have to be exposed as APIs. TreeBuilder has to support an explicit API for collecting and reporting events. XMLParser has to call into this API and either not have _setevents at all or have something public and documented. Note also that event hookup in the parser makes sense in a way, because we don't want events to be fired when not needed (for performance). So we don't want any parser to hook up all events for firing, and only incremental tree builders to collect them. So things aren't quite *so* simple. In addition, since iterparse (an existing, stable API of the module) expects a parser argument, it won't be easy creating an IncrementalTreeBuilder, because how would we let people using custom parsers with iterparse not change any code and yet keep things working?

--

Since the above presents considerable changes in the internals, to get *really* right - we do have a problem for the upcoming release because I'm not sure if anyone has the time or will to do it. However, the feature is interesting and useful, and the code is cleaner than it used to be. So I propose the following compromise that can serve us for the 3.4 release:

IncrementalParser stays, but its methods will be renamed to feed & close, which is consistent with the current XML parsing APIs (including xml.sax.xmlreader.IncrementalParser). In a future release, we may decide to create a IncrementalTreeBuilder anyway but then IncrementalParser can be layered on top of it and offered as a convenience.

--

To conclude, I don't think that when looked at as exposing an implementation detail of iterparse, IncrementalParser is so horrible. iterparse is already its own kind of thing somewhat alien to ET, but it's very useful and important. IncrementalParser is just replacing iterparse's "parse whole source" behavior with incremental feeds followed by a close. This is similar to the APIs xml.sax.xmlreader is exposing, so there is a consistency here. Solving "all the implementation problems" would certainly be great, but unfortunately no one seems to have time for this at the moment.
I attached a patch that removes the IncrementalParser class and merges its functionality into the _IterParseIterator. It thus retains most of the refactoring without adding new functionality and/or APIs.

I did not take a look if anything else from later changes needs to be reverted, neither did I update the documentation. I guess the newly added doc sections can just be deleted.

One thing that is still dubious is the requirement for the _setevents() method on the parser instance. That was already a flaw before Antoine's change, although it only hit if the C parser was used. It now affects both the Python parser and the C parser. It's not a regression because code that relies on it not being used is broken when it is being used. However, we should be aware that we are promoting a quirk to a visible API here. I don't see it as a real problem, given that it's an internal API (and thus an implementation detail), that's why I left it in. It should eventually be replaced by a proper setup based on the parser target.
Thanks for the effort, Stefan, but I still think IncrementalParser is worth keeping. It provides functionality that doesn't currently exist in ET, and IMHO this functionality is both important and useful.

Renaming its methods to feed & close kills the API naming inconsistency vs. XMLParser and SAX's IncrementalParser. As for future directions, having it does not add additional constrains to those that already exist because of iterparse accepting a "custom" XMLParser creates compatibility problems already.

For the latter, I think I can also beef up the documentation a bit to be more explicit (such as requiring the XMLParser provided to iterparse to derive from ET.XMLParser; otherwise, it just won't have the required _setevents).
"""
TreeBuilder has to support an explicit API for collecting and reporting events. XMLParser has to call into this API and either not have _setevents at all or have something public and documented. Note also that event hookup in the parser makes sense in a way, because we don't want events to be fired when not needed (for performance). So we don't want any parser to hook up all events for firing, and only incremental tree builders to collect them.
"""

That's why I wanted to introduce the wrapper class. The idea was to extend the supported methods on the target object by those needed for iterparse(), specifically start_ns() and end_ns(). That way, the event callback setup would be just natural.

"""
In addition, since iterparse (an existing, stable API of the module) expects a parser argument, it won't be easy creating an IncrementalTreeBuilder, because how would we let people using custom parsers with iterparse not change any code and yet keep things working?
"""

Yep, that's a problem in the original design.

One way to fix it would be to allow changing the parser target after the creation of the parser, but before it actually started parsing. That could be an entirely internal feature that would then allow injecting the wrapper between the parser and the target after the fact. 

Given that the current support for using different parsers in iterparse() is pretty much non-existant, requiring the applicable parsers to support this feature would not be a regression, IMHO.
> I still think IncrementalParser is worth keeping.

If you want to keep it at all cost, I think we should at least hide it behind a function (as with iterparse()). If it's implemented as a class, chances are that people will start relying on internals by inheriting from it. And we already know that those internals should eventually be removed completely. Eventually, that function should become a two-liner or so.
> Stefan Behnel added the comment:
>
> > I still think IncrementalParser is worth keeping.
>
> If you want to keep it at all cost, I think we should at least hide it
> behind a function (as with iterparse()). If it's implemented as a class,
> chances are that people will start relying on internals by inheriting from
> it. And we already know that those internals should eventually be removed
> completely. Eventually, that function should become a two-liner or so.
>

I'm not sure I see how this is different from any class exposed by the
stdlib. Users can inherit it, rely on internals and then we can never
change its implementation? Surely this isn't true. We don't make guarantees
w.r.t. implementation and internals, only interfaces. Yes, Python lets you
monkey-patch anything, but this doesn't mean we can't build stable APIs in
Python and change implementation in the future.

So the only thing we should focus on is the *external* API of this class.
With only feed/close/events exposed, it seems logical that these can be
maintained even when the implementation changes.

Besides, such an incremental parser is a natural match for a class because
it has state and several methods that collectively act on that state.
Redesigning it as a function would be awkward. That said, if you have a
specific suggestion in mind that would be as natural to use in user code,
feel free to offer it.
Actually, let me revise my rpevious comment. I think we should fake the new interface for now by adding a TreeEventBuilder that requires having its own TreeBuilder internally, instead of wrapping an arbitrary target. That way, we can avoid having to clean up the existing dependency of the C parser on the C-level TreeBuilder implementation.

Once this cleanup gets done, we can add the support for wrapping user defined targets as a feature.

So, basically, I propose to take the route that my TreeEventBuilder patch went, just with a simplified implementation. I'll see if I can cut down the patch to those essentials.

Would you be ok with something that uses the isinstance() hack in the parser?
Stefan, I don't mind looking at a working patch; however, it's not clear to me how you intend to solve the iterparse-accepting-a-custom-XMLParser problem. As for faking the new API, I don't know if that's a good idea because we're not yet sure what that new API is. I agree with your earlier observation that we must begin by refactoring the internals (in the C implementation) and defining cleaner APIs for existing objects. Once we do that, we may have better ideas about the ideal API for incremental parsing / event building. But that's a large job.

Again, fully working patches will be considered. Lacking those, I'm inclined to go with the interim solution proposed in http://bugs.python.org/issue17741#msg196133
> fully working patches will be considered

Let me remind you that it's not me who wants this feature so badly.

> As for faking the new API, I don't know if that's a good idea because we're not yet sure what that new API is.

If that's your concern, then I suggest not adding the feature at all, as long as we don't know if we can (or should) keep up the IncrementalParser facade with the revised implementation.

For example, can it accept a user defined parser instance as input or not? Can it accept a user defined parser target as input? Can it wrap it or would it maybe have to inherit from a TreeBuilder? Should it be a class or a helper function? I don't see how the interface you proposed leaves less questions open than what I am proposing.
> > As for faking the new API, I don't know if that's a good idea because
> we're not yet sure what that new API is.
>
> If that's your concern, then I suggest not adding the feature at all, as
> long as we don't know if we can (or should) keep up the IncrementalParser
> facade with the revised implementation.
>
> For example, can it accept a user defined parser instance as input or not?
> Can it accept a user defined parser target as input? Can it wrap it or
> would it maybe have to inherit from a TreeBuilder? Should it be a class or
> a helper function? I don't see how the interface you proposed leaves less
> questions open than what I am proposing.
>

This is an existing API within ET:

  xml.etree.ElementTree.iterparse(source, events=None, parser=None)

The parser argument is limited: it can't use a custom TreeBuilder. It also
must provide _setevents, so either inherit from XMLParser or just provide
that method in another way [I'll try to improve the documentation to make
all of this explicit]. Whatever we say in this issue, iterparse is here to
stay.

  class xml.etree.ElementTree.IncrementalParser(events=None, parser=None)

Is the new class. Its interface combines iterparse and XMLParser - so it is
subject to the same constraints. It also serves as a basis of iterparse's
implementation.

If the internals are changed, the constrains we'll be subject to with
iterparse will be the same for IncrementalParse; no more. For example, the
solution can be substituting the target of the given parser to the "event
builder" target. Since we force the target to be TreeBuilder now, it should
not really break user code because that part isn't custom. But whatever the
solution is, it will be the same solution for both IncrementalParser and
iterparse.

Also, even if the new approach is implemented in the next release,
IncrementalParser can stay as a simple synonym to
XMLParser(target=EventBuilder(...)).
> Also, even if the new approach is implemented in the next release,
IncrementalParser can stay as a simple synonym to
XMLParser(target=EventBuilder(...)).

No it can't. According to your signature, it accepts a parser instance as input. So it can't just create a new one internally.

"""
This is an existing API within ET:

  xml.etree.ElementTree.iterparse(source, events=None, parser=None)

The parser argument is limited: it can't use a custom TreeBuilder. It also
must provide _setevents, so either inherit from XMLParser or just provide
that method in another way [I'll try to improve the documentation to make
all of this explicit]. Whatever we say in this issue, iterparse is here to
stay.
"""

I agree. However, I can't see why this should constrain us when adding a new API. Sure, the current implementation puts certain restrictions on the input. But I see that as an advantage, not as a constraint. It's currently broken in what it accepts as input, so we are free to change that aspect later on. And we can now try to get it right for the new API that we add.

Also, AFAICT, passing a parser into iterparse() is a very rare thing to do, if only because it doesn't really work, so the likelyhood of breaking code is very limited (I'm not saying we should, just thinking about the impact). Note that lxml.etree doesn't currently support this at all.

Just to throw some more ideas into the wild, what do you think about making this an inherent feature of the XMLParser class itself? That can be done now, without major changes to the underlying code. It could be enabled by a new keyword argument "collect_events=None" that you could pass into the constructor in order to define the set of events. The "events()" method would always be available, but would only yield events if there actually are any. Otherwise, it would just terminate and be done. So, unless you pass it a set of event names at init time, the parser wouldn't collect events and you wouldn't get any out of it. Sounds like a clean interface to me.

Collecting the events would work in the same way as it currently works, i.e. it would call the target's start() method, and whatever that returns will be added as value to the events as a tuple ("start", target_return_value). That means that it's up to the parser target to determine what the event iteration returns. Can be elements, can be None, can be something else.

As for backwards compatibility with iterparse(), I don't see a problem with officially documenting iterparse() as requiring an XMLParser if a parser is provided, and as reconfiguring that parser to its needs. It already does that anyway and it currently doesn't work for anything but an XMLParser.

Also, on a related note, we might (!) consider the addition of the _setevents() method a bug in CPython 3.3. It didn't exist in ElementTree before it was merged with cElementTree. Instead, the expected interface was (apparently) that the parser that is passed in has a "_parser" attribute that holds an Expat parser, and that iterparse() could reconfigure according for itself. Both are not public APIs and even differed between ET and cET, so I don't see any harm being done, but it would be good to have one way to do it. OTOH, if that way is _setevents(), then be it. The advantage would be that this interface is at least independent of Expat and could potentially be implemented by other types of parsers.
I actually have another idea. Since we all agree that passing a custom "parser" to iterparse is dubious, IncrementalParser does not have to do that at all. This will make it a much more future-proof API. So its signature can be:

  class xml.etree.ElementTree.IncrementalParser(events=None)

The only remaining question is how will iterparse then work based on IncrementalParser (since iterparse does accept a parser). iterparse can just set the parser on IncrementalParser after creating it - it's an internal contract that does not have to be exposed.

This will be better than the current approach in terms of future-proofing.
I had a chat with Nick about this, and a variation of the proposal in http://bugs.python.org/msg196133 seems acceptable to us both. To summarize, here's what goes into 3.4.

* The class will be named EventParser.
Rationale: The Incremental in IncrementalParser is ambiguous, because it's not clear which side is incremental - input or output. In addition, XMLParser is already "incremental" in the sense that it has a feed() method. EventParser is (although not a perfect name) clearer w.r.t what it does - produces events from the parsing. Another (milder) reason is to avoid confusing duplication with xml.sax.xmlreader.IncrementalParser, which is somewhat different. Yet another is that if we eventually figure out how to implement this in a parser target, a great name for that target would be EventBuilder, which is a logical building block for EventParser.

* It will *not* accept a "parser" argument in the constructor.
Rationale: the parser argument of iterparse is broken anyway. This will make it much easier to modify the implementation of EventParser in the future when the C internals are fixed w.r.t problems mentioned in this issue.

* iterparse's "parser" argument will be deprecated, and the documentation will be more detailed w.r.t to the limitations on its current "parser" argument (the limitations are there in the code, but they're not fully documented).

* EventParser's input-side methods will be named feed and close, to be more consistent with existing XML APIs (both XMLParser and xml.sax.xmlreader.IncrementalParser).

* EventParser's output-side method will be called "read_events".
Rationale: "read_" helps convey that the consumption of events is destructive. A possible future addition of a complementary, non-destructive API - "peek_events" will be considered.

* Documentation will be modified to clarify exactly how EventParser differs from XMLParser and iterparse.

Also, new issue(s) will be created to address fixing specific implementation problems.

----

Nick, feel free to add/correct if I misunderstood or forgot something.
Sounds good to me :)
Hmm, did you look at my last comment at all? It solves both the technical issues and the API issues very nicely and avoids any problems of potential future changes. Let me quickly explain why.

The feature in question depends on two existing parts of the API: the event generation of the parser, and the return values of the parser target (e.g. a tree builder). So there are really only three places where this feature makes sense, both technically and API-wise.

1) in the target
2) in the parser
3) between parser and target

Note how a separate class is ruled out right from the start by the fact that the feature lives somehwere between parser and target. It's an inherent part of the existing design already (and of the implementation, BTW), so I don't see how adding a separate thing to control it makes any sense.

1) is impossible because the target is user provided and we do not control it
2) works fine because the parser controls both the call to the target and its return value
3) would be nice (and was my original favourite) but is hard to do with the current implementation and requires further changes to the API of parser targets

So 2) is the choice that remains.

> It will *not* accept a "parser" argument in the constructor.

Yes, I had also arrived at that point yesterday. However, I then reconsidered why there has to be a separate class in the first place. Basically, if you can't allow users to control the parser because the feature is already part of the parser, then why expose it as something that appears to be independent? Adding it to the existing API is very clean and exposes the feature where users would look for it.

> EventParser's output-side method will be called "read_events".

Makes sense to me. I had also considered that problem but didn't come up with a good name yet. "read_events" reads better than plain "events".

> The class will be named EventParser.

Obviously because it's parsing Events, as opposed to the XMLParser, which parses XML, or the HTMLParser, which parses HTML, right?
BTW, I also like how short and clean iterparse() becomes when you move this feature into the parser. It's basically just a convenience function that does read(), feed(), and yield-from. Plus the usual bit of bolerplate code, obviously.
Eli's summary left out an exchange between us that happened after he'd already written the summary - he pointed out the same problem with the EventParser name that you noticed: it's really an alternative XMLParser that exposes read_events(), rather than an event parser. So a completely different name like "EventScanner" or "EventStream" may be more appropriate. (I quite like EventStream, since it further suggests the destructive nature of read_events()).

As for why we think it's worth keeping this as a separate API, it's really about turning XMLParser's "push" API for events (where the events are pushed into the target object by the parser calling the appropriate methods), into an iterparse style pull API where the events can be retrieved via calls to read_events().

The back end implementation for the event streaming API *should* be a custom target object combined with a regular XMLParser object, but there are implementation issues currently preventing that.

We also discussed adding the event streaming interface directly to XMLParser, but I agreed with Eli that it's worth keeping that base API simple, especially since in the long run we want the new class to just be a convenience API for combining XMLParser and a custom target object, even if it can't be implemented that way right now.
> iterparse's "parser" argument will be deprecated

No need to do that. Update the docs, yes, but otherwise keep the possibility to improve the implementation later on, without going through a deprecation + dedeprecation cycle. That would just confuse users IMHO.
Since parsers don't support changing the target after creation, I think it makes sense to deprecate passing in a parser *instance*, and instead require passing in a callback that accepts the target to use and *returns* an appropriate parser object. The "parser can only use the default TreeBuilder as a target." line in the iterparse docs is a sign that this is a more appropriate API - if iterparse needs a particular kind of target, the API should be designed so iterparse *provides* that target instead of saying "please don't use a custom target type".

However, I don't think it makes sense to provide such a callback based API until it is actually possible to implement the streaming event API using a custom target object with XMLParser.
I don't see adding one method to XMLParser as a design problem. In fact, it's even a good design on the technical side, because if ET ever gains an HTMLParser, then the implementation of this feature would be highly dependent on the underlying parser, i.e. it would be very different from the implementation in XMLParser. It might not even be supported at all. Moving the API into the parser gives us that choice quite naturally.

It also solves the naming issue. Why add another thing to the module that needs explanation (and even an explanation why it's there and how it's different from what else is there), when we can just add the feature to what's there and be done?
> ... instead require passing in a callback that accepts the target ...

That could be the parser class then, for example, except that there may be other options to set as well. Plus, it would not actually allow iterparse to wrap a user provided target. So, in fact, that approach doesn't match with the design of introducing a target wrapper.
> it's really about turning XMLParser's "push" API for events (where the events are pushed into the target object by the parser calling the appropriate methods), into an iterparse style pull API where the events can be retrieved via calls to read_events().

Sorry, but this is ignoring the fact that the target is an *inherent* part of this feature. In fact, the push API of XMLParser forms an integral part of the design. The events that are being collected are a mixture of what the XMLParser produces and what the target produces. You can't think this feature without these two parts.
> in the long run we want the new class to just be a convenience API for combining XMLParser and a custom target object, even if it can't be implemented that way right now.

Just to be clear: I changed my opinion on this one and I no longer think that it is a good idea. The negative impact on what's there currently is just too large, specifically the user visible deprecation and the change to the target object API.
On Sun, Aug 25, 2013 at 10:40 PM, Stefan Behnel <report@bugs.python.org>wrote:

>
> Stefan Behnel added the comment:
>
> Hmm, did you look at my last comment at all? It solves both the technical
> issues and the API issues very nicely and avoids any problems of potential
> future changes. Let me quickly explain why.
>
> The feature in question depends on two existing parts of the API: the
> event generation of the parser, and the return values of the parser target
> (e.g. a tree builder). So there are really only three places where this
> feature makes sense, both technically and API-wise.
>
> 1) in the target
> 2) in the parser
> 3) between parser and target
>
> Note how a separate class is ruled out right from the start by the fact
> that the feature lives somehwere between parser and target. It's an
> inherent part of the existing design already (and of the implementation,
> BTW), so I don't see how adding a separate thing to control it makes any
> sense.
>
> 1) is impossible because the target is user provided and we do not control
> it
> 2) works fine because the parser controls both the call to the target and
> its return value
> 3) would be nice (and was my original favourite) but is hard to do with
> the current implementation and requires further changes to the API of
> parser targets
>
> So 2) is the choice that remains.
>
>
I think folding it all into XMLParser is a bad idea. XMLParser is a fairly
simple API and I don't want to complicate it. But more importantly,
XMLParser knows nothing about Elements, at least in the direct API of
today. The one constructing Elements is the target. The "read_events"
method proposed for the new class (currently IncrementalParser.events)
already returns Elements, having used a TreeBuilder to build them.
XMLParser emits start/end/data calls into the target, but these only carry
tag names, attributes and chunks of data. The hierarchical element
construction is done by TreeBuilder.

What I actually think would be better for the long term is to add new
target invocations in XMLParser - start-ns and end-ns. So XMLParser would
just keep *parsing*, leaving the interpretation of the parsed data to the
target. Today's TreeBuilder is free to ignore these calls. A custom
"EventCollectingTreeBuilder" can collect an event list, having all the
information at its disposal. Thus, XMLParser would remain what it is today
(minus the _setevents hack) - a router for pyexpat events.

These discussions of the future API are interesting, but what's more
important today is to have an API for IncrementalParser (using this name
before a new one is agreed upon) that doesn't block future implementation
changes. And I believe the API proposed here fits the bill.

> > The class will be named EventParser.
>
> Obviously because it's parsing Events, as opposed to the XMLParser, which
> parses XML, or the HTMLParser, which parses HTML, right?
>

The name is not perfect, and proposals for a better one are welcome. FWIW,
since it already lives in the xml.etree namespace, "XML" does not
necessarily have to be part of the name. So, some alternatives:

* EventStreamer - proposed by Nick. I have to admit I don't feel good with
it, because I still want to be crystal clear it's a *parser* we're talking
about.
* EventBasedParser
* EventCollectingParser
* NonblockingParser
* ... other ideas?
> XMLParser knows nothing about Elements, at least in the direct API of today. The one constructing Elements is the target.

Absolutely. And I'm 100% for keeping that distinction exactly as it is.

> The "read_events" method proposed for the new class (currently IncrementalParser.events) already returns Elements, having used a TreeBuilder to build them.

More precisely, it only returns Elements *iff* the TreeBuilder builds them. If it does not, then it returns something else. By moving the desired functionality into the parser, we don't even need to change anything about the interface between the parser and the target object. We still can, though, if you want to extend the interface with start-ns and end-ns events (and I'm ok with that, but it's a different feature). We do not loose that option. But the cool thing is that we don't have to do this now, and that iterparse just keeps working as it is and can be fixed later. No deprecation needed.

So we can easily agree on the goals of keeping the interface of the XMLParser simple and not teaching it about Elements. But we still disagree about the conclusions. My conclusion is that the API is substantially simpler if we do *not* add an entire new class that just duplicates existing APIs, but keep the parser as the thing that generates parse events. Be they in the form of callbacks or in the form of event tuples (that have the same name as the callbacks, BTW). The cool feature is that you can use either of the two interfaces or even hook into one to control the other (once the C parser is fixed), without having to learn the distinction between an XMLParser and a WhateverNewParser that also just parses XML.

> I still want to be crystal clear it's a *parser* we're talking about

You have to decide what you want. IMHO, there is no use in putting a new parser next to the existing XMLParser if both are there for parsing XML. That is just unnecessarily confusing. If you want it to be a parser, use the XMLParser.

I guess there's no other way to convince you than by coding up my proposal. It seems to be hard to properly explain it without seeing it at work.
Here is a proof-of-concept patch that integrates the functionality of the IncrementalParser into the XMLParser. I ended up reusing most of Antoines implementation and test suite. In case he'll look back into this ticket at some point, I'll put a "thank you" here.

The patch certainly needs more cleanup, but it shows where things are going and passes all tests.
(I still wonder why I'm the one writing all the patches here when Eli is the one who actually wants this feature ...)
BTW, maybe "read_events()" still isn't the ideal method name to put on a parser.
>
> (I still wonder why I'm the one writing all the patches here when Eli is
> the one who actually wants this feature ...)
>

Well, obviously because you're the only real programmer around here and the
rest of us are just a bunch of hand-wavy morons.

Seriously, Stefan, lose the attitude. Your unpleasant approach to
conducting a conversation completely obscures your technical contributions
and capabilities. Antoine got fed up a while ago, and I'm also nearing that
point.

With that off my chest, I will look at this new patch, hopefully tomorrow.
Stefan Behnel wrote:
> ... I still wonder why I'm the one writing all the patches ...

I imagine for the same reasons that I offered to write Enum:  I had definite ideas about how it should be, it is sometimes easier to explain with working code than with prose, and I wanted to contribute to Python.

I have to agree with Eli, though.  Wading through the accusations to get to the ideas is not helping your cause.
On Mon, Aug 26, 2013 at 12:35 PM, Stefan Behnel <report@bugs.python.org>wrote:

>
> Stefan Behnel added the comment:
>
> Here is a proof-of-concept patch that integrates the functionality of the
> IncrementalParser into the XMLParser. I ended up reusing most of Antoines
> implementation and test suite. In case he'll look back into this ticket at
> some point, I'll put a "thank you" here.
>
> The patch certainly needs more cleanup, but it shows where things are
> going and passes all tests.
>

OK, I looked at the patch. So I think I understood your verbal proposal
correctly earlier. And as I said then, I don't like it. I believe Nick has
summarized nicely why heaping everything into XMLParser is a bad idea, but
since you went the extra mile and made a patch, I will try to provide a
more expanded rationale.

Putting _setevents aside for the moment, XMLParser is a clean and simple
API. Its output is only "push" (by calling callbacks on the target). It
doesn't deal with Elements at all. It decomposes the input XML and produces
push events. That's it. It doesn't even have to interact with the target to
a larger extent than pushing callback calls to it.

You propose to add a completely different mode of operation to it,
controlled by a constructor flag. In "collect_events mode", XMLParser
undergoes a transformation. It becomes a pull API. It starts interacting
with the target more deeply by using it to produce elements for events.
Coupled with "weird" targets, its semantics become totally unclear.

So it's a single class that acts completely differently based on an
argument. This is very unpythonic. But things get worse. XMLParser is not
supposed to be an isolated entity. Users should be able to implement custom
parsers that provide its API. When the API is simple - feed/close and some
callbacks, replacing/ehnancing/faking it is easy. When the API is complex
-- two different modes of operation -- it's considerably more difficult.

That's the gist of it. Sticking to the current design of separation of
concerns between the parser and the target/treebuilder is IMO the better
way looking forward. In the mean-time, a helper class for asynchronous
parsing is useful and timely. By restricting its API, changing its
underlying implementation will be easy in the future.
> Putting _setevents aside for the moment,

Agreed, obviously.

> XMLParser is a clean and simple API. Its output is only "push" (by calling callbacks on the target). It doesn't deal with Elements at all.

We already agreed on that, too. Keep things simple.

> It decomposes the input XML and produces push events. That's it. It doesn't even have to interact with the target to a larger extent than pushing callback calls to it.

It does, though. That's essentially how iterparse works. I.e. the API between parser and target that I'm using here is already there. And has been for years. And I think that the feature in question shows that this was a particularly good design. Thank you Fredrik!

> You propose to add a completely different mode of operation to it, controlled by a constructor flag. In "collect_events mode", XMLParser undergoes a transformation.

We clearly disagree here.

> Coupled with "weird" targets, its semantics become totally unclear.

Try to come up with a patch for your own proposal that avoids that. You will notice that you will be using the existing protocol, thus the target will continue to be able to do what it wants. If it's a weird target, then it's a weird target because the user said so. But it's the target's very own concern to do expected things (build a tree) or do weird things (whatever an example for this is).

Except when you disallow using user defined targets, but then, that's an arbitrary restriction that ignores what is there already.

As long as you want to keep up the existing separation of concern between parser and target (and we agreed on that being a good thing), and as long as you don't impose arbitrary (and arbitrarily complex) restrictions on user code, you will not be able to prevent the target from doing "weird" things that impact the events being collected by whatever interface you choose for this. And why should you?

> XMLParser is not supposed to be an isolated entity. Users should be able to implement custom parsers that provide its API. When the API is simple - feed/close and some callbacks, replacing/ehnancing/faking it is easy. When the API is complex -- two different modes of operation -- it's considerably more difficult.

That's only true when you use it for incremental event parsing. If you don't, that part of the API doesn't have to bother you. If you want to implement a custom parser that doesn't support event collection, then simply don't support the "collect_events" keyword argument and you're done. Implement read_events() by returning an empty iterator, if you like, that's up to you. If you want to write a target that prevents parsers from keeping tree fragments alive in memory, simply return None from your callback methods and you're done. If you want to write a target that changes the Elements that the parser collects, then you can do that. Yes, those are actually features. And you could already do most of this before.

Remember that it's the users using the parser and providing (or implementing) the target. We should let them.

I'm sorry if I sound offensive, but most of what I read as counter arguments in this ticket so far was either incomplete or (sometimes) even plain wrong. Some seemed pretty close to FUD, but maybe that's just me (and I'm guilty here too, I guess - sorry for that). Now, how could this discussion possibly give the me impression that everyone has the same level of understanding? Even your comment about "weird" targets, right above, after going through all of this discussion, makes me wonder if you really know how all of this works.

Please take a closer look at the original code before Antoine changed it. Play with the parser targets a bit. That will help you understand this issue better.

Also, if you want to come up with a patch that implements approach 3), i.e. a class between parser and target, then please do. I tried it (the incomplete and partially fake patch is attached to this ticket), and my take on it is that it's too complex and requires some rather non-trivial changes, including user visible ones. So, based on that experience, I decided that it would be better to keep things simple and integrate the feature more closely with what is there anyway. I think it might help if you went through the same experience.
The whole point of the new API is not to replace XMLParser, but to
provide a convenience API to set up a particular combination of an
XMLParser with a particular kind of custom target. It just happens
that actually *implementing it* that way doesn't work right with the
current code. It is *not* necessary (or even appropriate) for it to
serve as a new building block, but as a convenience API to compose
*existing* building blocks in a particular way (or should be, except
that composition doesn't currently work right).

Once the underlying deficiencies are fixed, then the event capturing
target can be exposed directly, but in the meantime, the *composed*
version can be exposed by relying on internal APIs.

We could call the new class XMLParserWithEventCapturingTarget, but
that's a little clumsy :)
> The whole point of the new API is not to replace XMLParser, but to provide a convenience API to set up a particular combination of an XMLParser with a particular kind of custom target.

Ok, but I'm saying that we don't need that. It's all there, it all comes together at the interfaces of the parser. The two-way communication between parser and target already exists and is used by iterparse().

So, what I'm advocating is that we should not complicate the module interface with a new class (and eventually two classes) at all. Instead, we should just expose the *existing* interface of the parser in two ways, once as callbacks and once as collected events. I find that much easier to explain than any of the other proposals I've seen so far.

This is really not about doing it this way because it's technically too difficult to do differently. It's about keeping things simple for users and well integrated with what's there.

BTW, Eli asked for working code before we discuss. I've provided it. Now I would like to see working code for the proposed three-level design in order to make sure it actually works and feels right. I've already said that I've tried it and it didn't feel right. We can continue to discuss hot air for another month or two, or we can stick to discussing real code and real APIs.
I'm sorry if I sound offensive, but most of what I read as counter
> arguments in this ticket so far was either incomplete or (sometimes) even
> plain wrong. Some seemed pretty close to FUD, but maybe that's just me (and
> I'm guilty here too, I guess - sorry for that). Now, how could this
> discussion possibly give the me impression that everyone has the same level
> of understanding? Even your comment about "weird" targets, right above,
> after going through all of this discussion, makes me wonder if you really
> know how all of this works.
>
> Please take a closer look at the original code before Antoine changed it.
> Play with the parser targets a bit. That will help you understand this
> issue better.
>

Ah, awesome. Well, you can't say I didn't warn you, can you?

I am done discussing this issue with you, Stefan.
Eli, I agree that we've put way more than enough time into the discussion by now. We all know each other's arguments and failed to convince each other.

Please come up with working code that shows that the approach you are advocating for the final implementation is doable and also helpful from an API user's point of view.
Stefan, your proposed merged design isn't going to happen. Two alternative
ways of using the one class is far more complicated to explain to users
than exposing a helper that composes the existing API components to address
a particular use case (particularly when the helper is hiding the fact that
the theoretical underlying composition doesn't currently work the way it
should according to the public APIs).

As a higher level helper, users that are comfortable with and prefer the
lower level API are then clearly free to ignore. This recasting of the
nature of the API as a composition of the lower level components is a
*direct* response to your concern about it being a capability that the
lower level API should support directly.

You clearly don't like that design choice, but this isn't a democracy.
Arguing further on this point will just be ranting for the sake of ranting,
rather than having any chance of changing Eli's mind as module maintainer.

As far as naming goes, I still think it's better to drop "Parser" from the
name entirely. Call it "XMLEventReader" or something like that. The point
is to use it when you want to get the events out without blocking when no
events have been triggered and don't want to set up your own class to
receive event notifications through the target push API. The docs should
make it clear that this API is "pull only" - if people need callbacks or
other push triggers, then that's still what custom targets are for.
Sorry, Stefan, I missed your last comment before posting mine. It appears
you had already reached the same conclusion I had regarding further high
level design discussion being pointless :)
W.r.t. the summary in http://bugs.python.org/msg196177, after some further chat with Nick, a name that sounds good for us for the class is XMLPullParser. It has both XML and Parser in it, so the intention is clear. On the other hand, Pull works well for analogy with the existing pull API in xml.dom.pulldom, *and* suggests that the push API is inactive and gets redirected to a pull API (thanks Nick for this insight). 

The method names remain as suggested there.
> the push API is inactive and gets redirected to a pull API

Given that this is aimed to become a redundant 'convenience' wrapper around something else at a point, I assume that you are aware that the above is just an arbitrary restriction due to the fact that the parser does not accept arbitrary targets as input.

Also, the refrence to pulldom is a bad one, because pulldom reads data as you request it ("pull"). This class creates events based on the data you *push* into it. So it's really a push API from the POV of the user, not so much a pull API.

Anyway, it should be possible to emulate both this class and the pie-in-the-sky three-level API based on what I suggested, so I can't see this addition being a problem. I may even end up making the same visual split in lxml.etree deliberately, in order to make it clear that this feature only applies to the push parser API, i.e. when using feed() and close(), as opposed to what parse() and fromstring() use. That would mean that there'd also be a corresponding class for the HTMLParser then, and both would inherit from the base parsers and thus obviously (and conveniently) accept arbitrary parser targets.

I think I'm ok with adding that class. I liked Greg Ewing's suggestion of calling it AsyncParser, though, i.e. AsyncXMLParser. It makes it obvious what the use case is and avoids any references to push or pull that will always confuse some users, based on what part of the feature they are focussing on in their code.
This patch implements the renaming and updates the documentation.
Any comments regarding my naming suggestion?

Calling it a "push" parser is just too ambiguous.
Erm, "pull" parser, but you see what I mean.
New changeset 8fd72b1bb262 by Eli Bendersky in branch 'default':
Issue #17741: Rename IncrementalParser and its methods.
http://hg.python.org/cpython/rev/8fd72b1bb262
This issue has become too large and discusses a few different things; hence I'm inclined to close it as fixed and open a new one as a placeholder for discussing the new design of the internals.

I'll do this in a couple of days if there are no objections.
New changeset 1faaec66c73d by Eli Bendersky in branch 'default':
Fix XMLPullParser documentation to say "non-blocking" instead of "asynchronous".
http://hg.python.org/cpython/rev/1faaec66c73d
I'm closing this issue as fixed because the feature was implemented. As discussed, opened a new issue (#18902) to discuss re-designing some parts of the current code that would allow a nicer implementation of this feature. Anyone interested in the subject - feel free to subscribe yourself there.
While refactoring the iterparse() implementation in lxml to support this new interface, I noticed that the close() method of the XMLPullParser does not behave like the close() method of the XMLParser. Instead of setting some .root attribute on the parser instance, the method should return the root element that the tree builder generated.
Looks like we missed the alpha2 release for the close() API fix. I recommend not letting yet another deadline go by.
Created separate issue18990 to keep this one closed as is.
The way the XMLPullParser is implemented in lxml.etree now is that it simply inherits from XMLParser. This would also make sense for ElementTree, even before supporting arbitrary targets. The patch in ticket #18990 makes this simple to do.

For reference, here is the implementation in lxml.

Iterparse:

https://github.com/lxml/lxml/blob/master/src/lxml/iterparse.pxi

XMLPullParser:

https://github.com/lxml/lxml/blob/d9f7cd8d12a27cafc4d65c6e280ea36156e3b837/src/lxml/parser.pxi#L1357

SAX based parser that collects events and/or maps parser callbacks to callbacks on the target object:

https://github.com/lxml/lxml/blob/master/src/lxml/saxparser.pxi
As far as I can tell from scanning this ticket, there was no agreement on deprecating the parser argument to iterparse, but it has been documented as deprecated with no documentation about what to replace it with, nor any DeprecationWarning in the code that I can see.  Should I remove the deprecation documentation and someone can figure out what to do about it in 3.5, or was the decision just not recorded here (or I missed it)?  (It doesn't look like there's a programatic deprecation for the html argument either.)
New changeset 31e6adf5bfba by R David Murray in branch 'default':
whatsnew: deprecation of ElementTree XMLParser *html* and iterparse *parser*.
http://hg.python.org/cpython/rev/31e6adf5bfba
My latest status is that a decision on the future of the "parser" argument is still pending. See #20219.

It's correctly deprecated in the sense that passing any previously existing parser isn't going to be supported anymore, but passing an XMLPullParser (which is new in Py3.4 and thus can't have been used in existing code) now makes perfect sense (IMHO). So, in a way, it might end up as a sort of argument recycling rather than argument deletion.
Yeah, I'd agree with Stefan here - documented deprecation is reasonable at this point (so that new users avoid it for now), but we may still undeprecate it later if we decide it makes sense to do so.
I think the current status of whatsnew is OK for this, then.
Someone should add programatic deprecation for the html argument, though, since people need to switch to keyword-based calls to XMLParser to prepare for that going away.

Title:	event-driven XML parser
Type:	enhancement	Stage:	resolved
Components:	Library (Lib), XML	Versions:	Python 3.4

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	eli.bendersky, ethan.furman, flox, jcea, jkloth, ncoghlan, python-dev, r.david.murray, scoder
Priority:	normal	Keywords:	patch

Files
File name	Uploaded	Description	Edit
etree_incparser.patch	pitrou, 2013-04-15 20:49		review
iterparse_cleanup.patch	scoder, 2013-08-09 10:15	make _IterParseIterator use an IncrementalParser instead of being one	review
iterparse_api_cleanup.patch	scoder, 2013-08-09 11:42	new patch that also renames the methods to the normal ET names	review
iterparse_TreeEventBuilder.patch	scoder, 2013-08-10 07:38	new complete patch that replaces the IncrementalParser by a TreeEventBuilder	review
remove_IncrementalParser.patch	scoder, 2013-08-25 14:21		review
integrate_IncrementalParser_into_XMLParser.patch	scoder, 2013-08-26 19:35		review
issue17741.api-renames.1.patch	eli.bendersky, 2013-08-28 14:25

History
Date	User	Action	Args
2022-04-11 14:57:44	admin	set	github: 61941
2014-03-11 13:18:02	r.david.murray	set	messages: + msg213147
2014-03-11 13:16:28	r.david.murray	set	messages: + msg213146
2014-03-11 11:22:31	ncoghlan	set	messages: + msg213139
2014-03-11 07:20:28	scoder	set	messages: + msg213123
2014-03-10 22:11:27	python-dev	set	messages: + msg213104
2014-03-10 22:05:14	r.david.murray	set	nosy: + r.david.murray messages: + msg213092
2013-09-13 06:23:49	scoder	set	messages: + msg197553
2013-09-09 19:33:03	scoder	set	messages: + msg197384
2013-09-09 14:53:45	scoder	set	messages: + msg197365
2013-09-08 18:26:14	scoder	set	messages: + msg197309
2013-09-02 00:05:25	eli.bendersky	set	status: open -> closed resolution: fixed messages: + msg196759 stage: needs patch -> resolved
2013-08-31 14:37:47	python-dev	set	messages: + msg196639
2013-08-30 12:53:50	eli.bendersky	set	messages: + msg196533
2013-08-30 12:51:41	python-dev	set	messages: + msg196532
2013-08-28 16:54:19	scoder	set	messages: + msg196389
2013-08-28 16:53:51	scoder	set	messages: + msg196388
2013-08-28 14:25:21	eli.bendersky	set	files: + issue17741.api-renames.1.patch messages: + msg196384
2013-08-28 04:39:49	scoder	set	messages: + msg196347
2013-08-27 23:15:20	eli.bendersky	set	messages: + msg196327
2013-08-27 22:28:34	ncoghlan	set	messages: + msg196324
2013-08-27 22:06:21	ncoghlan	set	messages: + msg196322
2013-08-27 14:22:23	scoder	set	messages: + msg196295
2013-08-27 13:51:49	eli.bendersky	set	messages: + msg196290
2013-08-27 10:37:21	scoder	set	messages: + msg196274
2013-08-27 09:25:57	ncoghlan	set	messages: + msg196270
2013-08-27 07:15:52	scoder	set	messages: + msg196267
2013-08-27 04:01:47	eli.bendersky	set	messages: + msg196262
2013-08-26 21:31:01	ethan.furman	set	messages: + msg196250
2013-08-26 20:46:21	ethan.furman	set	nosy: + ethan.furman
2013-08-26 20:19:55	eli.bendersky	set	messages: + msg196244
2013-08-26 19:43:02	scoder	set	messages: + msg196238
2013-08-26 19:36:41	scoder	set	messages: + msg196237
2013-08-26 19:35:06	scoder	set	files: + integrate_IncrementalParser_into_XMLParser.patch messages: + msg196236
2013-08-26 17:42:29	scoder	set	messages: + msg196231
2013-08-26 16:03:48	eli.bendersky	set	messages: + msg196220
2013-08-26 09:14:26	scoder	set	messages: + msg196193
2013-08-26 08:40:18	scoder	set	messages: + msg196191
2013-08-26 08:36:01	scoder	set	messages: + msg196190
2013-08-26 08:31:45	scoder	set	messages: + msg196189
2013-08-26 07:58:05	ncoghlan	set	messages: + msg196188
2013-08-26 07:50:41	scoder	set	messages: + msg196187
2013-08-26 07:49:29	ncoghlan	set	messages: + msg196186
2013-08-26 07:47:16	scoder	set	messages: + msg196185
2013-08-26 05:40:44	scoder	set	messages: + msg196180
2013-08-26 03:39:36	ncoghlan	set	messages: + msg196178
2013-08-26 03:33:57	eli.bendersky	set	messages: + msg196177
2013-08-25 22:10:21	eli.bendersky	set	messages: + msg196168
2013-08-25 21:35:33	jkloth	set	nosy: + jkloth
2013-08-25 17:50:11	scoder	set	messages: + msg196155
2013-08-25 16:13:10	eli.bendersky	set	messages: + msg196149
2013-08-25 15:35:35	scoder	set	messages: + msg196148
2013-08-25 15:23:37	eli.bendersky	set	messages: + msg196146
2013-08-25 15:16:07	scoder	set	messages: + msg196143
2013-08-25 14:52:31	eli.bendersky	set	messages: + msg196139
2013-08-25 14:45:49	scoder	set	messages: + msg196137
2013-08-25 14:37:25	scoder	set	messages: + msg196136
2013-08-25 14:35:54	eli.bendersky	set	messages: + msg196135
2013-08-25 14:21:33	scoder	set	files: + remove_IncrementalParser.patch messages: + msg196134
2013-08-25 13:55:51	eli.bendersky	set	messages: + msg196133
2013-08-25 10:04:42	scoder	set	messages: + msg196116
2013-08-24 23:07:31	eli.bendersky	set	messages: + msg196105
2013-08-24 14:32:11	ncoghlan	set	status: closed -> open resolution: fixed -> (no value) messages: + msg196079 stage: resolved -> needs patch
2013-08-24 06:26:19	ncoghlan	set	nosy: + ncoghlan messages: + msg196064
2013-08-23 23:01:47	scoder	set	messages: + msg196053
2013-08-23 22:00:03	pitrou	set	nosy: - pitrou
2013-08-23 21:59:34	pitrou	set	messages: + msg196046
2013-08-23 21:28:39	scoder	set	messages: + msg196035
2013-08-23 21:23:59	pitrou	set	messages: + msg196033
2013-08-23 21:21:19	pitrou	set	messages: + msg196032
2013-08-23 21:20:17	scoder	set	messages: + msg196031
2013-08-23 21:11:31	pitrou	set	messages: + msg196028
2013-08-23 21:10:31	scoder	set	messages: + msg196027
2013-08-23 21:06:08	pitrou	set	status: open -> closed messages: + msg196026
2013-08-23 21:04:43	python-dev	set	messages: + msg196025
2013-08-23 06:41:47	scoder	set	messages: + msg195945
2013-08-10 07:38:24	scoder	set	files: + iterparse_TreeEventBuilder.patch messages: + msg194786
2013-08-10 04:23:08	scoder	set	messages: + msg194784
2013-08-09 20:38:15	pitrou	set	messages: + msg194772
2013-08-09 20:34:08	eli.bendersky	set	messages: + msg194770
2013-08-09 19:34:43	pitrou	set	messages: + msg194758
2013-08-09 18:11:54	scoder	set	messages: + msg194754
2013-08-09 16:52:25	scoder	set	messages: + msg194752
2013-08-09 15:53:23	pitrou	set	messages: + msg194747
2013-08-09 15:51:32	eli.bendersky	set	messages: + msg194746
2013-08-09 15:44:46	scoder	set	messages: + msg194745
2013-08-09 15:41:39	eli.bendersky	set	messages: + msg194744
2013-08-09 15:30:35	scoder	set	messages: + msg194743
2013-08-09 15:14:39	pitrou	set	messages: + msg194742
2013-08-09 15:12:46	scoder	set	messages: + msg194741
2013-08-09 15:02:41	pitrou	set	messages: + msg194740
2013-08-09 14:59:46	scoder	set	messages: + msg194739
2013-08-09 14:47:06	pitrou	set	messages: + msg194738
2013-08-09 14:44:15	scoder	set	messages: + msg194737
2013-08-09 14:32:34	scoder	set	messages: + msg194736
2013-08-09 14:31:21	pitrou	set	messages: + msg194735
2013-08-09 14:28:22	pitrou	set	messages: + msg194734
2013-08-09 14:14:59	scoder	set	messages: + msg194733
2013-08-09 14:09:38	scoder	set	messages: + msg194732
2013-08-09 13:53:09	pitrou	set	status: closed -> open
2013-08-09 13:53:04	pitrou	set	messages: + msg194731
2013-08-09 13:51:04	pitrou	set	messages: + msg194730
2013-08-09 13:38:00	scoder	set	messages: + msg194729
2013-08-09 13:37:30	scoder	set	messages: + msg194728
2013-08-09 11:42:32	scoder	set	files: + iterparse_api_cleanup.patch
2013-08-09 11:32:06	scoder	set	messages: + msg194724
2013-08-09 11:14:59	scoder	set	messages: + msg194723
2013-08-09 10:19:11	scoder	set	components: + XML
2013-08-09 10:15:57	scoder	set	files: + iterparse_cleanup.patch nosy: + scoder messages: + msg194722
2013-04-18 17:42:10	pitrou	set	status: open -> closed resolution: fixed messages: + msg187273 stage: patch review -> resolved
2013-04-18 17:37:14	python-dev	set	nosy: + python-dev messages: + msg187271
2013-04-18 12:45:19	eli.bendersky	set	messages: + msg187244
2013-04-16 09:05:28	pitrou	set	messages: + msg187058
2013-04-16 01:24:10	eli.bendersky	set	messages: + msg187042
2013-04-16 00:52:49	jcea	set	nosy: + jcea
2013-04-15 20:51:38	pitrou	set	messages: + msg187030
2013-04-15 20:49:42	pitrou	create