Author eli.bendersky
Recipients eli.bendersky, flox, jcea, ncoghlan, python-dev, scoder
Date 2013-08-25.13:55:49
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1377438951.42.0.297954043001.issue17741@psf.upfronthosting.co.za>
In-reply-to
Content
Looking at this more, the Parser/Builder abstraction in ET leaks badly, especially when it comes to incremental parsing. This manifests in (at least) two ways:

1. The parser accepted by iterparse has to use the ET-provided TreeBuilder, because the C implementation of TreeBuilder has undocumented fields and hooks used directly by the parser.
2. Even though iterparse & IncrementalParser accept a "parser" argument, it cannot be just any parser. This parser (in addition to using the ET TreeBuilder) must implement an undocumented and obscure _setevents method; this makes the original intent of the abstraction - hooking up arbitrary parsers to the interface, hard to follow.

To *really* fix this, all these capabilities have to be exposed as APIs. TreeBuilder has to support an explicit API for collecting and reporting events. XMLParser has to call into this API and either not have _setevents at all or have something public and documented. Note also that event hookup in the parser makes sense in a way, because we don't want events to be fired when not needed (for performance). So we don't want any parser to hook up all events for firing, and only incremental tree builders to collect them. So things aren't quite *so* simple. In addition, since iterparse (an existing, stable API of the module) expects a parser argument, it won't be easy creating an IncrementalTreeBuilder, because how would we let people using custom parsers with iterparse not change any code and yet keep things working?

--

Since the above presents considerable changes in the internals, to get *really* right - we do have a problem for the upcoming release because I'm not sure if anyone has the time or will to do it. However, the feature is interesting and useful, and the code is cleaner than it used to be. So I propose the following compromise that can serve us for the 3.4 release:

IncrementalParser stays, but its methods will be renamed to feed & close, which is consistent with the current XML parsing APIs (including xml.sax.xmlreader.IncrementalParser). In a future release, we may decide to create a IncrementalTreeBuilder anyway but then IncrementalParser can be layered on top of it and offered as a convenience.

--

To conclude, I don't think that when looked at as exposing an implementation detail of iterparse, IncrementalParser is so horrible. iterparse is already its own kind of thing somewhat alien to ET, but it's very useful and important. IncrementalParser is just replacing iterparse's "parse whole source" behavior with incremental feeds followed by a close. This is similar to the APIs xml.sax.xmlreader is exposing, so there is a consistency here. Solving "all the implementation problems" would certainly be great, but unfortunately no one seems to have time for this at the moment.
History
Date User Action Args
2013-08-25 13:55:51eli.benderskysetrecipients: + eli.bendersky, jcea, ncoghlan, scoder, flox, python-dev
2013-08-25 13:55:51eli.benderskysetmessageid: <1377438951.42.0.297954043001.issue17741@psf.upfronthosting.co.za>
2013-08-25 13:55:51eli.benderskylinkissue17741 messages
2013-08-25 13:55:49eli.benderskycreate