Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

event-driven XML parser #61941

Closed
pitrou opened this issue Apr 15, 2013 · 109 comments
Closed

event-driven XML parser #61941

pitrou opened this issue Apr 15, 2013 · 109 comments
Labels
stdlib Python modules in the Lib dir topic-XML type-feature A feature request or enhancement

Comments

@pitrou
Copy link
Member

pitrou commented Apr 15, 2013

BPO 17741
Nosy @jcea, @ncoghlan, @scoder, @jkloth, @bitdancer, @florentx, @ethanfurman
Files
  • etree_incparser.patch
  • iterparse_cleanup.patch: make _IterParseIterator use an IncrementalParser instead of being one
  • iterparse_api_cleanup.patch: new patch that also renames the methods to the normal ET names
  • iterparse_TreeEventBuilder.patch: new complete patch that replaces the IncrementalParser by a TreeEventBuilder
  • remove_IncrementalParser.patch
  • integrate_IncrementalParser_into_XMLParser.patch
  • issue17741.api-renames.1.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2013-09-02.00:05:25.756>
    created_at = <Date 2013-04-15.20:49:42.016>
    labels = ['expert-XML', 'type-feature', 'library']
    title = 'event-driven XML parser'
    updated_at = <Date 2014-03-11.13:18:02.443>
    user = 'https://github.com/pitrou'

    bugs.python.org fields:

    activity = <Date 2014-03-11.13:18:02.443>
    actor = 'r.david.murray'
    assignee = 'none'
    closed = True
    closed_date = <Date 2013-09-02.00:05:25.756>
    closer = 'eli.bendersky'
    components = ['Library (Lib)', 'XML']
    creation = <Date 2013-04-15.20:49:42.016>
    creator = 'pitrou'
    dependencies = []
    files = ['29874', '31205', '31206', '31211', '31460', '31475', '31497']
    hgrepos = []
    issue_num = 17741
    keywords = ['patch']
    message_count = 109.0
    messages = ['187029', '187030', '187042', '187058', '187244', '187271', '187273', '194722', '194723', '194724', '194728', '194729', '194730', '194731', '194732', '194733', '194734', '194735', '194736', '194737', '194738', '194739', '194740', '194741', '194742', '194743', '194744', '194745', '194746', '194747', '194752', '194754', '194758', '194770', '194772', '194784', '194786', '195945', '196025', '196026', '196027', '196028', '196031', '196032', '196033', '196035', '196046', '196053', '196064', '196079', '196105', '196116', '196133', '196134', '196135', '196136', '196137', '196139', '196143', '196146', '196148', '196149', '196155', '196168', '196177', '196178', '196180', '196185', '196186', '196187', '196188', '196189', '196190', '196191', '196193', '196220', '196231', '196236', '196237', '196238', '196244', '196250', '196262', '196267', '196270', '196274', '196290', '196295', '196322', '196324', '196327', '196347', '196384', '196388', '196389', '196532', '196533', '196639', '196759', '197309', '197365', '197384', '197553', '213092', '213104', '213123', '213139', '213146', '213147']
    nosy_count = 9.0
    nosy_names = ['jcea', 'ncoghlan', 'scoder', 'jkloth', 'r.david.murray', 'eli.bendersky', 'flox', 'ethan.furman', 'python-dev']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue17741'
    versions = ['Python 3.4']

    @pitrou
    Copy link
    Member Author

    pitrou commented Apr 15, 2013

    iterparse() is a blocking operation. It is not really suitable for event-driven applications (e.g. non-blocking I/O). Here is a patch adding a IncrementalParser class.

    @pitrou pitrou added stdlib Python modules in the Lib dir type-feature A feature request or enhancement labels Apr 15, 2013
    @pitrou
    Copy link
    Member Author

    pitrou commented Apr 15, 2013

    Note that basing iterparse() on IncrementalParser and removing the API discrepancy between the Python and C modules helps make the etree code smaller.

    @elibendersky
    Copy link
    Mannequin

    elibendersky mannequin commented Apr 16, 2013

    Thanks Antoine. This looks interesting - I'm somewhat swamped ATM but will try to review the patch in the next few days.

    Incidentally, since it is a new feature would it be worthwhile to discuss it on python-ideas?

    @pitrou
    Copy link
    Member Author

    pitrou commented Apr 16, 2013

    I've posted on the tulip mailing-list already:
    https://groups.google.com/forum/?fromgroups=#!topic/python-tulip/SNOnS27Bctc

    @elibendersky
    Copy link
    Mannequin

    elibendersky mannequin commented Apr 18, 2013

    Antoine, the patch LGTM. There's some more cleaning that needs to be done in surrounding code, but I can do that later. Also I should probably update the documentation with a bit more details.

    Just add a NEWS entry when you commit.

    Thanks for working on this.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Apr 18, 2013

    New changeset f903cf864191 by Antoine Pitrou in branch 'default':
    Issue bpo-17741: Add ElementTree.IncrementalParser, an event-driven parser for non-blocking applications.
    http://hg.python.org/cpython/rev/f903cf864191

    @pitrou
    Copy link
    Member Author

    pitrou commented Apr 18, 2013

    Thank you, Eli. Now committed :)

    @pitrou pitrou closed this as completed Apr 18, 2013
    @scoder
    Copy link
    Contributor

    scoder commented Aug 9, 2013

    Here is a patch that cleans up the current implementation to avoid making the result of iterparse() an IncrementalParser (with all of its new API).

    Please note that the tulip mailing list is not an appropriate place to discuss additions to the XML libraries, and ElementTree in particular.

    @scoder
    Copy link
    Contributor

    scoder commented Aug 9, 2013

    Copying the discussion between Antoine and me from python-dev:

    > IMO it should mimic the interface of the TreeBuilder, which
    > calls the data reception method "feed()" and the termination method
    > "close()". There is no reason to add yet another set of methods names
    > just to do what others do already.

    Well, the difference here is that after calling eof_received() you can
    still (and should) call events() once to get the last events. I think
    it would be weird if you could still do something useful with the object
    after calling close().

    Also, the method names are not invented, they mimick the PEP-3156
    stream protocols:
    http://www.python.org/dev/peps/pep-3156/#stream-protocols

    I see your point about close(). I assume your reasoning was to make the IncrementalParser an arbitrary stream end-point. However, it doesn't really make all that much sense to connect an arbitrary data source to it, as the source wouldn't know that, in addition to passing in data, it would also have to ask for events from time to time. I mean, you could do it, but then it would just fill up the memory with parser events and loose the actual advantages of incremental parsing. So, in a way, the whole point of the class is to *not* be an arbitrary stream end-point.

    Anyway, given that there isn't really the One Obvious Way to do it, maybe you should just add a docstring to the class (ahem), reference the stream protocol as the base for its API, and then rename it to IncrementalStreamParser. That would at least make it clear why it doesn't really fit with the rest of the module API (which was designed some decade before PEP-3156) and instead uses its own naming scheme.

    @scoder
    Copy link
    Contributor

    scoder commented Aug 9, 2013

    Actually, let me take that last paragraph back. There is an Obvious Way to do it, and that's the feed() and close() methods. They are the existing and established ElementTree interface for incremental parsing. The fact that close() doesn't clean up all state is IMHO a minor issue. The state will be cleaned up automatically once the iteration terminates, and that's the normal behaviour of iterators. So it actually fits both protocols quite nicely.

    I'd just leave the stream protocol out completely here. For one, the implementation isn't complete anyway (the connection_*() methods don't make much sense), and as I said, it's not very useful to consider the parser a general end-point to that protocol.

    I'd also suggest returning the iterator over the remaining events from close(), just like the TreeBuilder returns the tree. That covers the (less common) use case of first parsing everything and then processing the events. I'll add another patch that does that.

    @scoder
    Copy link
    Contributor

    scoder commented Aug 9, 2013

    Thinking about the original patch some more - I wonder why it doesn't use a wrapper for TreeBuilder to process the events. Antoine, is there a reason why you had to add this _setevents() method to the XMLParser, instead of making the IncrementalParser an IncrementalTreeBuilder?

    @scoder
    Copy link
    Contributor

    scoder commented Aug 9, 2013

    Oh, and could we reopen this ticket, please?

    @pitrou
    Copy link
    Member Author

    pitrou commented Aug 9, 2013

    Antoine, is there a reason why you had to add this _setevents() method
    to the XMLParser, instead of making the IncrementalParser an
    IncrementalTreeBuilder?

    The point is not to build a tree of potentially unbounded size (think XMPP). The point is to yield events in a non-blocking way (iterparse() is blocking, which makes it useless for non-blocking applications).

    An IncrementalTreeBuilder wouldn't have much point IMO. It is not more costly to accumulate a string and parse it at the end, than to progressively build a growing XML tree.

    @pitrou
    Copy link
    Member Author

    pitrou commented Aug 9, 2013

    About the patch: I think changing the API names now that alpha1 has been released is a bit gratuitous. I would like to keep the encapsulation part but ditch the naming changes.

    @pitrou pitrou reopened this Aug 9, 2013
    @scoder
    Copy link
    Contributor

    scoder commented Aug 9, 2013

    The point is not to build a tree of potentially unbounded size (think
    XMPP). The point is to yield events in a non-blocking way (iterparse()
    is blocking, which makes it useless for non-blocking applications).

    Ok, but that's the only difference. Instead of getting the events from the parser, you could equally well get them from the TreeBuilder, also in a non-blocking way.

    Sticking this functionality into a parser target object has the advantage that the parser interface wouldn't have to change. So, instead of introducing an entirely new parser interface, we'd just add a class that can be used as a parser target and provides an additional events() method. That's a substantially less invasive API change.

    An IncrementalTreeBuilder wouldn't have much point IMO. It is not more
    costly to accumulate a string and parse it at the end, than to
    progressively build a growing XML tree.

    I don't think I understand what you mean.

    @scoder
    Copy link
    Contributor

    scoder commented Aug 9, 2013

    About the patch: I think changing the API names now that alpha1 has been
    released is a bit gratuitous.

    Sorry for being late, but I can't see it being my fault.

    A change in an alpha release is still way better than a change after a final release. Eventually, lxml will have to go on par here, so it's better to get it right now, before any harm is done.

    @pitrou
    Copy link
    Member Author

    pitrou commented Aug 9, 2013

    > The point is not to build a tree of potentially unbounded size
    > (think
    > XMPP). The point is to yield events in a non-blocking way
    > (iterparse()
    > is blocking, which makes it useless for non-blocking applications).

    Ok, but that's the only difference. Instead of getting the events
    from the parser, you could equally well get them from the
    TreeBuilder, also in a non-blocking way.

    But your TreeBuilder is also growing a tree internally, and therefore
    eating more and more memory, right?

    I don't see the point of stuffing different kinds of functionality
    inside a single class. It makes the intended use less obvious, and
    errors more likely. Right now, IncrementalParser has a simple API,
    and there's a single way to use it.

    @pitrou
    Copy link
    Member Author

    pitrou commented Aug 9, 2013

    A change in an alpha release is still way better than a change after
    a final release.

    But worse than no change at all. Arguing about API naming is a bit futile,
    *especially* when the ship has sailed.

    @scoder
    Copy link
    Contributor

    scoder commented Aug 9, 2013

    But IncrementalParser uses an XMLParser internally, which in turn uses a TreeBuilder internally. So what exactly is your point?

    @scoder
    Copy link
    Contributor

    scoder commented Aug 9, 2013

    But worse than no change at all. Arguing about API naming is a bit
    futile, *especially* when the ship has sailed.

    <rant>It's easy to say that as a core developer with commit rights who can simply hide changes in a low frequented bug tracker without notifying those who have to know about these changes.</rant> It's pure luck I noticed this change during the alpha release cycle.

    I'm also not arguing about naming. I'm questioning the design, trying to get it into a shape that fits the existing APIs. Why do we need two incremental parsing interfaces in one and the same library that use completely different method names, although doing otherwise exactly the same? Why should the type of the *result* of the parsing process change the method name that you need to call to *insert* data into the parser?

    @pitrou
    Copy link
    Member Author

    pitrou commented Aug 9, 2013

    But IncrementalParser uses an XMLParser internally, which in turn
    uses a TreeBuilder internally. So what exactly is your point?

    Well, I would rather like to understand yours. Whatever
    IncrementalParser uses internally needn't impact what API it
    exposes to the user.

    (as a matter of fact, iterparse() doesn't expose a
    TreeBuilder-like API, either)

    @scoder
    Copy link
    Contributor

    scoder commented Aug 9, 2013

    Well, I would rather like to understand yours.

    My point is that the IncrementalParser uses a TreeBuilder that builds an XML tree in the back. So I'm wondering why you are saying that it doesn't build a tree.

    Whatever IncrementalParser uses internally needn't impact what API it
    exposes to the user.

    And in fact, we don't even need an IncrementalParser, because XMLParser already *has* an incremental parsing interface. All that's missing is the bit that collects and exposes the events. And my point is that we shouldn't duplicate the existing *data entry* interface for that, especially not under different names for identical functionality, but only add an interface to access those events as *output*, i.e. to add the bit of the API that's actually missing.

    As an analogon, what would you say if I asked for adding a new, separate Mapping interface to Python that uses the method name "set_value()" instead of "__setitem__()" because I want it to read data from the hard drive and not from memory?

    @pitrou
    Copy link
    Member Author

    pitrou commented Aug 9, 2013

    <rant>It's easy to say that as a core developer with commit rights
    who can simply hide changes in a low frequented bug tracker without
    notifying those who have to know about these changes.</rant> It's
    pure luck I noticed this change during the alpha release cycle.

    It is the rule for most stdlib improvements that they go directly
    through the bug tracker. Most core developers and outsiders
    would feel swamped by the traffic if all feature additions went
    through the mailing-list.

    I'm honestly baffled that you think I am trying to "hide" things.
    Why do you think I would feel guilty about proposing an addition,
    or try to sneak things inside xml.etree?

    (yes, we could theoretically run polls for every addition
    we propose, collect and discuss the results, and iterate several
    times until the outcome is successful; I don't think any of us
    has the bandwidth to do that, which is why that practice is
    only used for game-making changes (i.e. PEP material))

    I'm also not arguing about naming. I'm questioning the design, trying
    to get it into a shape that fits the existing APIs. Why do we need
    two incremental parsing interfaces in one and the same library that
    use completely different method names, although doing otherwise
    exactly the same?

    Well, unless I'm missing something, TreeBuilder doesn't do parsing,
    it takes already parsed data: it has a start() method to open a tag,
    and a data() method to add raw text inside that tag. IncrementalParser,
    OTOH, has a data_received() method to which you pass a piece of
    non-parsed XML string.

    The other incremental parsing API is actually iterparse(). The
    reason I proposed IncrementalParser is that iterparse() is useless
    for non-blocking applications. IncrementalParser produces the same
    kind of output as iterparse(), but control of when to feed it data
    is transferred to the user of the API.

    @scoder
    Copy link
    Contributor

    scoder commented Aug 9, 2013

    TreeBuilder doesn't do parsing, it takes already parsed data: it has a
    start() method to open a tag, and a data() method to add raw text
    inside that tag.

    That is correct. However, the XMLParser has a feed() method that sends new data into the parser, and a close() method that tells the parser that it's done. So there already is an incremental parsing interface, and your change is duplicating that interface under a different name. Specifically, IncrementalParser has exactly the same interface as XMLParser when it comes to entering data, but uses different method names for it. This is a Bad Design because it introduces an unnecessary inconsistency in the API.

    However, what you are trying to change is not the way data is *entered* into the parser. What you are after is to change the way the *parsed* data is *presented* to the user. That is not the responsibility of the parser, it's the responsibility of the TreeBuilder.

    In your code, the TreeBuilder builds a tree, and the new interface *additionally* collects parse events in a list. So, the right way to do it would be to change the parser *target* to do both, i.e. to build a tree and collect events at the same time.

    @pitrou
    Copy link
    Member Author

    pitrou commented Aug 9, 2013

    > Well, I would rather like to understand yours.

    My point is that the IncrementalParser uses a TreeBuilder that builds
    an XML tree in the back. So I'm wondering why you are saying that it
    doesn't build a tree.

    Unless I'm reading it wrong, when _setevents() is called, the internal
    hooks are rewired to populate the events list, rather than call the
    corresponding TreeBuilder methods. So, yes, there's a TreeBuilder
    somewhere, but it stands unused.

    (the _setevents() method already existed on the C impl, by the way.
    I added it to the Python impl to make things more regular and avoid
    two separate iterparse() implementations)

    And my
    point is that we shouldn't duplicate the existing *data entry*
    interface for that, especially not under different names for
    identical functionality, but only add an interface to access those
    events as *output*, i.e. to add the bit of the API that's actually
    missing.

    What difference would that make? In the end, you mustn't mix
    event-driven/non-blocking and cumulative/blocking styles of
    programming, so having two separate APIs doesn't strike me as
    a problem (it may be a good thing actually, for clarity reasons).

    @ethanfurman
    Copy link
    Member

    Stefan Behnel wrote:

    ... I still wonder why I'm the one writing all the patches ...

    I imagine for the same reasons that I offered to write Enum: I had definite ideas about how it should be, it is sometimes easier to explain with working code than with prose, and I wanted to contribute to Python.

    I have to agree with Eli, though. Wading through the accusations to get to the ideas is not helping your cause.

    @elibendersky
    Copy link
    Mannequin

    elibendersky mannequin commented Aug 27, 2013

    On Mon, Aug 26, 2013 at 12:35 PM, Stefan Behnel <report@bugs.python.org>wrote:

    Stefan Behnel added the comment:

    Here is a proof-of-concept patch that integrates the functionality of the
    IncrementalParser into the XMLParser. I ended up reusing most of Antoines
    implementation and test suite. In case he'll look back into this ticket at
    some point, I'll put a "thank you" here.

    The patch certainly needs more cleanup, but it shows where things are
    going and passes all tests.

    OK, I looked at the patch. So I think I understood your verbal proposal
    correctly earlier. And as I said then, I don't like it. I believe Nick has
    summarized nicely why heaping everything into XMLParser is a bad idea, but
    since you went the extra mile and made a patch, I will try to provide a
    more expanded rationale.

    Putting _setevents aside for the moment, XMLParser is a clean and simple
    API. Its output is only "push" (by calling callbacks on the target). It
    doesn't deal with Elements at all. It decomposes the input XML and produces
    push events. That's it. It doesn't even have to interact with the target to
    a larger extent than pushing callback calls to it.

    You propose to add a completely different mode of operation to it,
    controlled by a constructor flag. In "collect_events mode", XMLParser
    undergoes a transformation. It becomes a pull API. It starts interacting
    with the target more deeply by using it to produce elements for events.
    Coupled with "weird" targets, its semantics become totally unclear.

    So it's a single class that acts completely differently based on an
    argument. This is very unpythonic. But things get worse. XMLParser is not
    supposed to be an isolated entity. Users should be able to implement custom
    parsers that provide its API. When the API is simple - feed/close and some
    callbacks, replacing/ehnancing/faking it is easy. When the API is complex
    -- two different modes of operation -- it's considerably more difficult.

    That's the gist of it. Sticking to the current design of separation of
    concerns between the parser and the target/treebuilder is IMO the better
    way looking forward. In the mean-time, a helper class for asynchronous
    parsing is useful and timely. By restricting its API, changing its
    underlying implementation will be easy in the future.

    @scoder
    Copy link
    Contributor

    scoder commented Aug 27, 2013

    Putting _setevents aside for the moment,

    Agreed, obviously.

    XMLParser is a clean and simple API. Its output is only "push" (by calling callbacks on the target). It doesn't deal with Elements at all.

    We already agreed on that, too. Keep things simple.

    It decomposes the input XML and produces push events. That's it. It doesn't even have to interact with the target to a larger extent than pushing callback calls to it.

    It does, though. That's essentially how iterparse works. I.e. the API between parser and target that I'm using here is already there. And has been for years. And I think that the feature in question shows that this was a particularly good design. Thank you Fredrik!

    You propose to add a completely different mode of operation to it, controlled by a constructor flag. In "collect_events mode", XMLParser undergoes a transformation.

    We clearly disagree here.

    Coupled with "weird" targets, its semantics become totally unclear.

    Try to come up with a patch for your own proposal that avoids that. You will notice that you will be using the existing protocol, thus the target will continue to be able to do what it wants. If it's a weird target, then it's a weird target because the user said so. But it's the target's very own concern to do expected things (build a tree) or do weird things (whatever an example for this is).

    Except when you disallow using user defined targets, but then, that's an arbitrary restriction that ignores what is there already.

    As long as you want to keep up the existing separation of concern between parser and target (and we agreed on that being a good thing), and as long as you don't impose arbitrary (and arbitrarily complex) restrictions on user code, you will not be able to prevent the target from doing "weird" things that impact the events being collected by whatever interface you choose for this. And why should you?

    XMLParser is not supposed to be an isolated entity. Users should be able to implement custom parsers that provide its API. When the API is simple - feed/close and some callbacks, replacing/ehnancing/faking it is easy. When the API is complex -- two different modes of operation -- it's considerably more difficult.

    That's only true when you use it for incremental event parsing. If you don't, that part of the API doesn't have to bother you. If you want to implement a custom parser that doesn't support event collection, then simply don't support the "collect_events" keyword argument and you're done. Implement read_events() by returning an empty iterator, if you like, that's up to you. If you want to write a target that prevents parsers from keeping tree fragments alive in memory, simply return None from your callback methods and you're done. If you want to write a target that changes the Elements that the parser collects, then you can do that. Yes, those are actually features. And you could already do most of this before.

    Remember that it's the users using the parser and providing (or implementing) the target. We should let them.

    I'm sorry if I sound offensive, but most of what I read as counter arguments in this ticket so far was either incomplete or (sometimes) even plain wrong. Some seemed pretty close to FUD, but maybe that's just me (and I'm guilty here too, I guess - sorry for that). Now, how could this discussion possibly give the me impression that everyone has the same level of understanding? Even your comment about "weird" targets, right above, after going through all of this discussion, makes me wonder if you really know how all of this works.

    Please take a closer look at the original code before Antoine changed it. Play with the parser targets a bit. That will help you understand this issue better.

    Also, if you want to come up with a patch that implements approach 3), i.e. a class between parser and target, then please do. I tried it (the incomplete and partially fake patch is attached to this ticket), and my take on it is that it's too complex and requires some rather non-trivial changes, including user visible ones. So, based on that experience, I decided that it would be better to keep things simple and integrate the feature more closely with what is there anyway. I think it might help if you went through the same experience.

    @ncoghlan
    Copy link
    Contributor

    The whole point of the new API is not to replace XMLParser, but to
    provide a convenience API to set up a particular combination of an
    XMLParser with a particular kind of custom target. It just happens
    that actually *implementing it* that way doesn't work right with the
    current code. It is *not* necessary (or even appropriate) for it to
    serve as a new building block, but as a convenience API to compose
    *existing* building blocks in a particular way (or should be, except
    that composition doesn't currently work right).

    Once the underlying deficiencies are fixed, then the event capturing
    target can be exposed directly, but in the meantime, the *composed*
    version can be exposed by relying on internal APIs.

    We could call the new class XMLParserWithEventCapturingTarget, but
    that's a little clumsy :)

    @scoder
    Copy link
    Contributor

    scoder commented Aug 27, 2013

    The whole point of the new API is not to replace XMLParser, but to provide a convenience API to set up a particular combination of an XMLParser with a particular kind of custom target.

    Ok, but I'm saying that we don't need that. It's all there, it all comes together at the interfaces of the parser. The two-way communication between parser and target already exists and is used by iterparse().

    So, what I'm advocating is that we should not complicate the module interface with a new class (and eventually two classes) at all. Instead, we should just expose the *existing* interface of the parser in two ways, once as callbacks and once as collected events. I find that much easier to explain than any of the other proposals I've seen so far.

    This is really not about doing it this way because it's technically too difficult to do differently. It's about keeping things simple for users and well integrated with what's there.

    BTW, Eli asked for working code before we discuss. I've provided it. Now I would like to see working code for the proposed three-level design in order to make sure it actually works and feels right. I've already said that I've tried it and it didn't feel right. We can continue to discuss hot air for another month or two, or we can stick to discussing real code and real APIs.

    @elibendersky
    Copy link
    Mannequin

    elibendersky mannequin commented Aug 27, 2013

    I'm sorry if I sound offensive, but most of what I read as counter

    arguments in this ticket so far was either incomplete or (sometimes) even
    plain wrong. Some seemed pretty close to FUD, but maybe that's just me (and
    I'm guilty here too, I guess - sorry for that). Now, how could this
    discussion possibly give the me impression that everyone has the same level
    of understanding? Even your comment about "weird" targets, right above,
    after going through all of this discussion, makes me wonder if you really
    know how all of this works.

    Please take a closer look at the original code before Antoine changed it.
    Play with the parser targets a bit. That will help you understand this
    issue better.

    Ah, awesome. Well, you can't say I didn't warn you, can you?

    I am done discussing this issue with you, Stefan.

    @scoder
    Copy link
    Contributor

    scoder commented Aug 27, 2013

    Eli, I agree that we've put way more than enough time into the discussion by now. We all know each other's arguments and failed to convince each other.

    Please come up with working code that shows that the approach you are advocating for the final implementation is doable and also helpful from an API user's point of view.

    @ncoghlan
    Copy link
    Contributor

    Stefan, your proposed merged design isn't going to happen. Two alternative
    ways of using the one class is far more complicated to explain to users
    than exposing a helper that composes the existing API components to address
    a particular use case (particularly when the helper is hiding the fact that
    the theoretical underlying composition doesn't currently work the way it
    should according to the public APIs).

    As a higher level helper, users that are comfortable with and prefer the
    lower level API are then clearly free to ignore. This recasting of the
    nature of the API as a composition of the lower level components is a
    *direct* response to your concern about it being a capability that the
    lower level API should support directly.

    You clearly don't like that design choice, but this isn't a democracy.
    Arguing further on this point will just be ranting for the sake of ranting,
    rather than having any chance of changing Eli's mind as module maintainer.

    As far as naming goes, I still think it's better to drop "Parser" from the
    name entirely. Call it "XMLEventReader" or something like that. The point
    is to use it when you want to get the events out without blocking when no
    events have been triggered and don't want to set up your own class to
    receive event notifications through the target push API. The docs should
    make it clear that this API is "pull only" - if people need callbacks or
    other push triggers, then that's still what custom targets are for.

    @ncoghlan
    Copy link
    Contributor

    Sorry, Stefan, I missed your last comment before posting mine. It appears
    you had already reached the same conclusion I had regarding further high
    level design discussion being pointless :)

    @elibendersky
    Copy link
    Mannequin

    elibendersky mannequin commented Aug 27, 2013

    W.r.t. the summary in http://bugs.python.org/msg196177, after some further chat with Nick, a name that sounds good for us for the class is XMLPullParser. It has both XML and Parser in it, so the intention is clear. On the other hand, Pull works well for analogy with the existing pull API in xml.dom.pulldom, *and* suggests that the push API is inactive and gets redirected to a pull API (thanks Nick for this insight).

    The method names remain as suggested there.

    @scoder
    Copy link
    Contributor

    scoder commented Aug 28, 2013

    the push API is inactive and gets redirected to a pull API

    Given that this is aimed to become a redundant 'convenience' wrapper around something else at a point, I assume that you are aware that the above is just an arbitrary restriction due to the fact that the parser does not accept arbitrary targets as input.

    Also, the refrence to pulldom is a bad one, because pulldom reads data as you request it ("pull"). This class creates events based on the data you *push* into it. So it's really a push API from the POV of the user, not so much a pull API.

    Anyway, it should be possible to emulate both this class and the pie-in-the-sky three-level API based on what I suggested, so I can't see this addition being a problem. I may even end up making the same visual split in lxml.etree deliberately, in order to make it clear that this feature only applies to the push parser API, i.e. when using feed() and close(), as opposed to what parse() and fromstring() use. That would mean that there'd also be a corresponding class for the HTMLParser then, and both would inherit from the base parsers and thus obviously (and conveniently) accept arbitrary parser targets.

    I think I'm ok with adding that class. I liked Greg Ewing's suggestion of calling it AsyncParser, though, i.e. AsyncXMLParser. It makes it obvious what the use case is and avoids any references to push or pull that will always confuse some users, based on what part of the feature they are focussing on in their code.

    @elibendersky
    Copy link
    Mannequin

    elibendersky mannequin commented Aug 28, 2013

    This patch implements the renaming and updates the documentation.

    @scoder
    Copy link
    Contributor

    scoder commented Aug 28, 2013

    Any comments regarding my naming suggestion?

    Calling it a "push" parser is just too ambiguous.

    @scoder
    Copy link
    Contributor

    scoder commented Aug 28, 2013

    Erm, "pull" parser, but you see what I mean.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Aug 30, 2013

    New changeset 8fd72b1bb262 by Eli Bendersky in branch 'default':
    Issue bpo-17741: Rename IncrementalParser and its methods.
    http://hg.python.org/cpython/rev/8fd72b1bb262

    @elibendersky
    Copy link
    Mannequin

    elibendersky mannequin commented Aug 30, 2013

    This issue has become too large and discusses a few different things; hence I'm inclined to close it as fixed and open a new one as a placeholder for discussing the new design of the internals.

    I'll do this in a couple of days if there are no objections.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Aug 31, 2013

    New changeset 1faaec66c73d by Eli Bendersky in branch 'default':
    Fix XMLPullParser documentation to say "non-blocking" instead of "asynchronous".
    http://hg.python.org/cpython/rev/1faaec66c73d

    @elibendersky
    Copy link
    Mannequin

    elibendersky mannequin commented Sep 2, 2013

    I'm closing this issue as fixed because the feature was implemented. As discussed, opened a new issue (bpo-18902) to discuss re-designing some parts of the current code that would allow a nicer implementation of this feature. Anyone interested in the subject - feel free to subscribe yourself there.

    @elibendersky elibendersky mannequin closed this as completed Sep 2, 2013
    @scoder
    Copy link
    Contributor

    scoder commented Sep 8, 2013

    While refactoring the iterparse() implementation in lxml to support this new interface, I noticed that the close() method of the XMLPullParser does not behave like the close() method of the XMLParser. Instead of setting some .root attribute on the parser instance, the method should return the root element that the tree builder generated.

    @scoder
    Copy link
    Contributor

    scoder commented Sep 9, 2013

    Looks like we missed the alpha2 release for the close() API fix. I recommend not letting yet another deadline go by.

    @scoder
    Copy link
    Contributor

    scoder commented Sep 9, 2013

    Created separate bpo-18990 to keep this one closed as is.

    @scoder
    Copy link
    Contributor

    scoder commented Sep 13, 2013

    The way the XMLPullParser is implemented in lxml.etree now is that it simply inherits from XMLParser. This would also make sense for ElementTree, even before supporting arbitrary targets. The patch in ticket bpo-18990 makes this simple to do.

    For reference, here is the implementation in lxml.

    Iterparse:

    https://github.com/lxml/lxml/blob/master/src/lxml/iterparse.pxi

    XMLPullParser:

    https://github.com/lxml/lxml/blob/d9f7cd8d12a27cafc4d65c6e280ea36156e3b837/src/lxml/parser.pxi#L1357

    SAX based parser that collects events and/or maps parser callbacks to callbacks on the target object:

    https://github.com/lxml/lxml/blob/master/src/lxml/saxparser.pxi

    @bitdancer
    Copy link
    Member

    As far as I can tell from scanning this ticket, there was no agreement on deprecating the parser argument to iterparse, but it has been documented as deprecated with no documentation about what to replace it with, nor any DeprecationWarning in the code that I can see. Should I remove the deprecation documentation and someone can figure out what to do about it in 3.5, or was the decision just not recorded here (or I missed it)? (It doesn't look like there's a programatic deprecation for the html argument either.)

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Mar 10, 2014

    New changeset 31e6adf5bfba by R David Murray in branch 'default':
    whatsnew: deprecation of ElementTree XMLParser *html* and iterparse *parser*.
    http://hg.python.org/cpython/rev/31e6adf5bfba

    @scoder
    Copy link
    Contributor

    scoder commented Mar 11, 2014

    My latest status is that a decision on the future of the "parser" argument is still pending. See bpo-20219.

    It's correctly deprecated in the sense that passing any previously existing parser isn't going to be supported anymore, but passing an XMLPullParser (which is new in Py3.4 and thus can't have been used in existing code) now makes perfect sense (IMHO). So, in a way, it might end up as a sort of argument recycling rather than argument deletion.

    @ncoghlan
    Copy link
    Contributor

    Yeah, I'd agree with Stefan here - documented deprecation is reasonable at this point (so that new users avoid it for now), but we may still undeprecate it later if we decide it makes sense to do so.

    @bitdancer
    Copy link
    Member

    I think the current status of whatsnew is OK for this, then.

    @bitdancer
    Copy link
    Member

    Someone should add programatic deprecation for the html argument, though, since people need to switch to keyword-based calls to XMLParser to prepare for that going away.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    stdlib Python modules in the Lib dir topic-XML type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    5 participants