classification
Title: Remove root attribute from XMLPullParser
Type: behavior Stage: resolved
Components: Library (Lib), XML Versions: Python 3.4
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: eli.bendersky, ncoghlan, python-dev, scoder, serhiy.storchaka
Priority: normal Keywords: patch

Created on 2013-09-09 19:32 by scoder, last changed 2013-09-28 14:02 by scoder. This issue is now closed.

Files
File name Uploaded Description Edit
xmlpullparser_close.patch scoder, 2013-09-09 19:32 review
xmlpullparser_remove_public_root_attr.patch scoder, 2013-09-27 05:40 minimal patch to remove non-public 'root' attribute review
Messages (24)
msg197383 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2013-09-09 19:32
The .close() method of the new XMLPullParser (see issue17741) in Py3.4 shows an unnecessarily complicated behaviour that is inconsistent with the .close() method of the existing XMLParser.

The attached patch removes some code to fix this.
msg197536 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2013-09-12 16:37
ping - I would like to see this fixed for alpha3, which is due in two weeks.
msg197638 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2013-09-13 20:05
I've updated the lxml documentation to reflect the new implementation. It also has a couple of examples that show how the API works now.

https://github.com/lxml/lxml/blob/74ab4b10176750e7797eb0eff3c7c91901ab6adb/doc/parsing.txt#L509
msg198127 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2013-09-20 04:42
Inviting some more people to get this patch reviewed in a timely fashion. Given that it's a cleanup of a newly introduced API, it must go in before the first beta release of 3.4.

The even wider cleanup would be to make XMLPullParser inherit from XMLParser and simply drop most of its code (as done in lxml.etree, see ticket #19010), but this patch is all that is needed to keep that path open.
msg198128 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-09-20 05:27
I clarified the issue title to actually explain the concern and the proposed fix (i.e. close() on XMLPullParser doesn't currently return the root, which is inconsistent with the XMLParser API).

This sounds like a reasonable change to me, but the docs also need to be updated.
msg198129 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2013-09-20 05:59
I'm not entirely happy about the docs anyway. Most people just want to loop over iterparse() and be done (use case being to save memory). The new parser class is a rather special and more complex thing that we shouldn't let innocent users run into that easily. But that's a different ticket, I guess, and in no way time critical.

I can't see a reference to "parser.root" in the ElementTree doc page, BTW, so maybe there's just nothing to do there?
msg198131 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-09-20 06:59
It's the docs for XMLPullParser.close that need to be updated.

On 20 September 2013 15:59, Stefan Behnel <report@bugs.python.org> wrote:
>
> Stefan Behnel added the comment:
>
> I'm not entirely happy about the docs anyway. Most people just want to loop over iterparse() and be done (use case being to save memory). The new parser class is a rather special and more complex thing that we shouldn't let innocent users run into that easily. But that's a different ticket, I guess, and in no way time critical.
>
> I can't see a reference to "parser.root" in the ElementTree doc page, BTW, so maybe there's just nothing to do there?
>
> ----------
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <http://bugs.python.org/issue18990>
> _______________________________________
msg198132 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2013-09-20 07:08
Ah, right - I forgot that it currently only implements a part of the parser API in ElementTree. Adding "Returns the toplevel document element." behind the current sentence would do, I guess. That will need updating once the XMLPullParser implements support for parser targets, though. Also, the description could just be dropped if XMLPullParser inherited from XMLParser (ticket #19010), because it's just (or should just be) a redundant repitition. All that's really new is the read_events() method and the constructor's arguments. Keeping the rest out would simplify both the interface and the documentation.
msg198145 - (view) Author: Eli Bendersky (eli.bendersky) * (Python committer) Date: 2013-09-20 12:43
Hi Nick,

I disagree with this change. The way the APIs are currently defined, XMLParser and XMLPullParser are different animals. XMLParser can be considered to only have one "front" in the API - feed() and close(). You feed() until the document is done and then you close() and get the parsed tree. There's no other way to get the parsed tree (unless you use a custom builder, I guess).

On the other hand XMLPullParser has two clear "fronts" - an input front with feed() and close() and an output front with read_events(). For XMLPullParser, close() is just an input signal. The canonical way to get output from XMLPullParser is read_events(). close() has no better reason to return output than feed(). When we decided to change the method names (recall that Antoine's originals were completely different), we perhaps forgot this detail. Even though XMLPullParser's method is named close(), it's *not* like XMLParser's close(). If someone is using XMLPullParser for its close() he's likely using the class incorrectly.

Just as an example: consider that in a lot of use cases the programmer will want to discard parts of the tree that's parsed iteratively (similarly to the main use case of iterparse()), because the XML itself is too huge. It's a convenient streaming API, in other words. Now, if the reader discards parts of the tree (by deleting subtrees), then returning the root from close() becomes even more meaningless, because it's no longer the root and we have no idea what it actually is.

Finally, I will add that we can always err on the side of under-specifying. We plan to change the implementation of the class in the future, so we can always add to the API. Specifying that close() returns the root is something we can't back out of, but we can always add in a future release without hurting backwards compatibility, if we do reach the conclusion that it's better.

---

P.S. all of this makes me wonder if we should mark the API of XMLPullParser provisional for 3.4
msg198185 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2013-09-21 06:53
> Just as an example: consider that in a lot of use cases the programmer will want to discard parts of the tree that's parsed iteratively (similarly to the main use case of iterparse()), because the XML itself is too huge. It's a convenient streaming API, in other words. Now, if the reader discards parts of the tree (by deleting subtrees), then returning the root from close() becomes even more meaningless, because it's no longer the root and we have no idea what it actually is.

Well, that's how the parser API works. It's the target's responsibility to make something (really, anything) of the parser's callbacks. Why change this for XMLPullParser if it already works that way for both XMLParser and iterparse() ? It's an inherent part of the design, and it really fits together just fine.

I don't like the idea of deliberately adding surprises to an otherwise simple API, especially if they clearly restrict future developments.


> Finally, I will add that we can always err on the side of under-specifying. We plan to change the implementation of the class in the future, so we can always add to the API. Specifying that close() returns the root is something we can't back out of, but we can always add in a future release without hurting backwards compatibility, if we do reach the conclusion that it's better.

Eli, if you had taken a look at my patch (and also a closer look at some of my earlier patches), you would have noticed that it not only makes close() behave as expected, but also removes the ".root" attribute from the parser. This removes a very unfortunate design decision in iterparse() that has bugged me several times in the past already, because it really gets in the way of both the implementation and the usage. Please keep that a misfeature only of iterparse() and do not repeat this mistake in this newly introduced API.

In fact, my decision to make XMLPullParser.close() consistent with XMLParser.close() was mostly motivated by this change, not the other way round.
msg198192 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2013-09-21 10:56
Eli, seeing our discussion so far, ISTM that the parser-target interface is at the very heart of our disagreement.

For me, it's a good design that provides a clean separation of concerns between the parser that generates events, and the target (usually, but not necessarily a TreeBuilder) that processes them. The link back is then provided by the close() method (provided by both, BTW), which terminates the parsing process and passes the result through to the user. That result may be a tree, but that's completely up to the target, and whatever comes back is transparent to the parser (there's an example of that in the docs, BTW). That's a clean separation of concerns that I would like to copy in the design of the XMLPullParser (in fact, by inheriting from XMLParser). Extrapolating this exact design, whatever comes back from the callback methods should be transparent to the parser and should control the objects that the XMLPullParser collects for its events. I.e., it's the target's sole responsibility to convert the parser's calls into objects that make up both the collected events and the result returned by the parser's close() methods. Again, I consider this a very clean separation of concerns.

Essentially, I embrace the fact that the user is standing on both sides of the API and can control both the generation of the event objects (by providing a target object) and the processing of the events in front of the parser. To me, this is a very clean and extremely flexible design, and I hope it's clearer now why I consider it an excellent feature.

Now, several of your comments so far indicate an aversion towards the parser-target interface because it allows arbitrary objects to be returned by uncontrollable, user provided code, thus "tainting" the otherwise clean tree building interface of the new pull parser. On the other hand, you also stated that you would like to keep a separation between the parser and the TreeBuilder and do not want the parser to know about trees, so I guess that means that you also wouldn't want the parser to enforce any kind of tree building constraints on whatever the result of the parsing process is. As I said before, it's utterly unclear to me how both of these go together, but that's what I could extract from the discussion so far.

Would it be possible for you to clarify your own opinion on this topic, and, specifically, how you envision the role of the parser, the target and their separation of concerns during the event generation? I think this would help us get this discussion a little forward.
msg198196 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-09-21 12:52
Actually, I think it's reasonable to define the custom target nominally
abstracted by PullParser as always returning None from close(). As Eli
notes, it's designed to let you discard events as you go, so remembering
them internally to return from close() doesn't make sense.

That means the patch could be simplified to just removing the root
attribute without changing the result of calling close().
msg198199 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2013-09-21 13:19
> That means the patch could be simplified to just removing the root
> attribute without changing the result of calling close().

Absolutely.

Returning the parse result from close() would still make it both more
consistent and easier to use (also from within iterparse()), but I agree
that that's more of a feature. If the goal is really just to get the bare
minimum API out that allows future extensions without introducing future
consistency issues, then I'm ok with that change, too.

However, note that the next lxml release will almost certainly have been
out for a while when Py3.4 finally lands (I'm currently planning the
release for next month, maybe early November), which means that the
complete, well integrated API, including target support for iterparse()
etc., will already have an established, publicly available implementation
when Py3.4 comes out. If ElementTree wants to have a word in the final
design of this feature, and wants to avoid introducing deliberate
incompatibilities in the future, then I'd prefer finishing up this
discussion before the next lxml release gets into beta state.
msg198202 - (view) Author: Eli Bendersky (eli.bendersky) * (Python committer) Date: 2013-09-21 14:05
On Sat, Sep 21, 2013 at 5:52 AM, Nick Coghlan <report@bugs.python.org>wrote:

>
> Nick Coghlan added the comment:
>
> Actually, I think it's reasonable to define the custom target nominally
> abstracted by PullParser as always returning None from close(). As Eli
> notes, it's designed to let you discard events as you go, so remembering
> them internally to return from close() doesn't make sense.
>

IMHO the documentation is already sufficient w.r.t. this. By convention,
when a method does not return anything, we just don't mention its return
value. So:

  close()
      Signal the parser that the data stream is terminated.

Seems good.

> That means the patch could be simplified to just removing the root
> attribute without changing the result of calling close().
>

Unfortunately I don't have time to review refactoring patches now. In light
of a larger refactoring planned in this part of the module in the future, I
don't think it's very important to tweak things right now.
msg198203 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2013-09-21 14:14
> Unfortunately I don't have time to review refactoring patches now. In light of a larger refactoring planned in this part of the module in the future, I don't think it's very important to tweak things right now.

You misunderstood. The proposal was to remove the ".root" attribute, because it introduces a new API that hinders the future development and unnecessarily introduces an unprecedented inconsistency with existing APIs.

This is not a refactoring.
msg198204 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-09-21 14:26
Right, the minimum change needed is to prefix it with an underscore, but if
it isn't actually needed for anything, we may as well remove it entirely.
msg198208 - (view) Author: Eli Bendersky (eli.bendersky) * (Python committer) Date: 2013-09-21 14:47
On Sat, Sep 21, 2013 at 7:26 AM, Nick Coghlan <report@bugs.python.org>wrote:

>
> Nick Coghlan added the comment:
>
> Right, the minimum change needed is to prefix it with an underscore, but if
> it isn't actually needed for anything, we may as well remove it entirely.
>

Just to be clear - do we consider all non-prefixed-with-underscore
attributes of classes in the stdlib part of the public API, even when
they're explicitly undocumented? And while we're at it, all
non-prefixed-with-underscore methods as well?

I still consider this refactoring gratuitious at this point. The API is
well defined by the documentation. All the rest is implementation details.
msg198209 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2013-09-21 15:01
> I still consider this refactoring gratuitious at this point. The API is
> well defined by the documentation. All the rest is implementation details.

Famous last words.
msg198241 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-09-22 02:01
An undocumented name without a leading underscore is always a bug, as it
means introspection and the prose docs give conflicting definitions of the
public API.

Whether the fix is to remove the offending name entirely, add a leading
underscore to mark it as private or document it as officially public is a
judgement call that needs to be made on a case by case basis.
msg198463 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2013-09-27 05:40
Here's the obvious minimal patch that removes the non-public 'root' attribute. Please apply it for Py3.4 and then set the version tag of this ticket to Py3.5 (instead of closing it, because it's not resolved yet).

As I said, the expected thing to do (and what lxml.etree will do in its next release) is to behave like XMLParser. Reverting the new patch for Py3.5 and applying the old one will achieve that nicely, including the proper test adaptations.
msg198511 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2013-09-28 09:23
Eli didn't explicitly comment on the patch so far, but let me quote some of his earlier comments:

> if the reader discards parts of the tree (by deleting subtrees), then returning the root from close() becomes even more meaningless, because it's no longer the root and we have no idea what it actually is.

Meaning: ".root" is definitely the wrong name for this attribute.

> The API is well defined by the documentation. All the rest is implementation details.

Meaning: ".root" is not a public attribute, the existing (functionally equivalent and therefore redundant) "._root" is enough and makes it clear(er) for readers of the source code that the attribute is really non-public.

I read these as a clear expression of consensus that the public ".root" attribute is not supposed to exist at all. So, given that the next alpha release is due tomorrow, I'd like to see my latest patch applied by then in order to finally move at least a tiny step forward with this.
msg198513 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2013-09-28 13:50
New changeset c5e206b9df2e by Nick Coghlan in branch 'default':
Close #18990: remove root attribute from XMLPullParser
http://hg.python.org/cpython/rev/c5e206b9df2e
msg198514 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-09-28 13:54
We're not changing XMLPullParser.close() to return anything other than None, so the unit tests now check for this, and this behaviour is explicitly documented with users redirected towards read_events() instead.

This is consistent with the XMLParser.close() API, since what to return is up to the target, and the implicit custom target for XMLPullParser is designed for the events to be consumed in a stream.
msg198515 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2013-09-28 14:02
Thanks, Nick. Your changes look good to me.
History
Date User Action Args
2013-09-28 14:02:44scodersetmessages: + msg198515
2013-09-28 13:54:18ncoghlansetmessages: + msg198514
2013-09-28 13:50:56python-devsetstatus: open -> closed

nosy: + python-dev
messages: + msg198513

resolution: fixed
stage: resolved
2013-09-28 13:47:31ncoghlansettitle: Return root element from ElementTree.XMLPullParser.close() to match ElementTree.XMLParser -> Remove root attribute from XMLPullParser
2013-09-28 09:23:39scodersetmessages: + msg198511
2013-09-27 05:40:12scodersetfiles: + xmlpullparser_remove_public_root_attr.patch

messages: + msg198463
2013-09-22 02:01:13ncoghlansetmessages: + msg198241
2013-09-21 15:01:01scodersetmessages: + msg198209
2013-09-21 14:47:30eli.benderskysetmessages: + msg198208
2013-09-21 14:26:26ncoghlansetmessages: + msg198204
2013-09-21 14:14:18scodersetmessages: + msg198203
2013-09-21 14:05:28eli.benderskysetmessages: + msg198202
2013-09-21 13:19:57scodersetmessages: + msg198199
2013-09-21 12:52:58ncoghlansetmessages: + msg198196
2013-09-21 10:56:47scodersetmessages: + msg198192
2013-09-21 06:53:15scodersetmessages: + msg198185
2013-09-20 12:43:40eli.benderskysetmessages: + msg198145
2013-09-20 07:08:53scodersetmessages: + msg198132
2013-09-20 06:59:47ncoghlansetmessages: + msg198131
2013-09-20 05:59:23scodersetmessages: + msg198129
2013-09-20 05:27:31ncoghlansetmessages: + msg198128
title: Remove unnecessary API inconsistency from ElementTree.XMLPullParser.close() -> Return root element from ElementTree.XMLPullParser.close() to match ElementTree.XMLParser
2013-09-20 04:42:15scodersetnosy: + ncoghlan, serhiy.storchaka
messages: + msg198127
2013-09-13 20:05:10scodersetmessages: + msg197638
2013-09-12 16:37:48scodersetmessages: + msg197536
2013-09-09 19:33:39scodersetnosy: + eli.bendersky
2013-09-09 19:32:33scodercreate