classification
Title: ElementTree not preserving attribute order
Type: enhancement Stage: resolved
Components: Library (Lib) Versions: Python 3.8
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Mariatta, dfrojas, eli.bendersky, lukasz.langa, matrixise, mdk, nedbat, rhettinger, scoder, serhiy.storchaka, sivert, taleinat, vstinner
Priority: Keywords: easy, patch

Created on 2018-07-20 00:42 by rhettinger, last changed 2019-04-27 07:17 by scoder. This issue is now closed.

Files
File name Uploaded Description Edit
test_xml_compare.py matrixise, 2019-03-18 22:18
Pull Requests
URL Status Linked Edit
PR 10163 merged rhettinger, 2018-10-28 04:29
PR 10190 merged serhiy.storchaka, 2018-10-28 18:50
PR 10219 merged dfrojas, 2018-10-29 16:31
PR 10452 closed serhiy.storchaka, 2018-11-10 20:07
PR 12335 merged dfrojas, 2019-03-14 17:07
PR 12354 closed matrixise, 2019-03-15 18:29
Messages (81)
msg321973 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-07-20 00:42
Starting with Python3.6, the order of keyword arguments has been guaranteed.  Something in ElementTree is not respecting that order.

    $ python3.7
    Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 26 2018, 23:26:24)
    [Clang 6.0 (clang-600.0.57)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from xml.etree.ElementTree import Element, dump
    >>> dump(Element('cirriculum', status='public', company='example'))
    <cirriculum company="example" status="public" />
    >>> #         ^-----------------^-------------------- These are swapped
msg321976 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-07-20 04:46
Attributes are sorted by name when converted an element to the text representation for making it stable. Changing this may break existing software.

I think it makes sense now to add an optional keyword-only parameter sort_attrs (True by default) for functions tostring() and tostringlist() and method ElementTree.write(). If it is False, attributes will keep the original order.

Since dump() is used for debugging only, it could change the behavior by default, even in maintained releases. But currently it just uses ElementTree.write(), which should sort attributes by default for compatibility.
msg321988 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2018-07-20 07:56
At least for lxml, attributes were never specified to have a sorted order (although attribute dicts are sorted on *input*, to give a *predictable* order as in ET), and the tutorial says: "Attributes are just unordered name-value pairs".

However, I still think that Serhiy is right: this change would break code, and in particular test code that compares XML output. Having to deal with two different "correct" serialisations in tests depending on the Python version is annoying at best. But OTOH, comparing XML output should always be based on C14N and not on an arbitrary serialisation. XML C14N serialisation was specifically designed to provide a deterministic output byte sequence, regardless of how the XML document was created. (This is also used for cryptographic fingerprinting.)

It is interesting that ET sorts the attributes on output. lxml does it only on API *input*, because the underlying tree model is order preserving. Parsed attributes are always written in the original document order, followed by the attributes set through the API in the order in which they were set.

For lxml, a serialisation flag is not going to help, because the serialisation order is always deterministic from the time where an attribute is added to the tree. Given that dicts are order preserving in Python now, ET could follow this path as well. The attributes are already in parse order and could easily be serialised that way, and attributes that were added through the API would end up being serialised last, also preserving the order. This would leave lxml and ET with the same serialisation order, which seems attractive.

In short, I like this change, it's just difficult to see a way how this could preserve full backwards compatibility. And it's difficult to say how important backwards compatibility really is here.
msg328429 - (view) Author: Tal Einat (taleinat) * (Python committer) Date: 2018-10-25 12:17
This doesn't strike me as a bug, or even clearly being a potential improvement.

If this about round-trip parsing/serializing not preserving order, that would be significant. As it is, if this is only intended as a debugging tool, why change it?

Is there a concrete reason to change this and break backwards compatibility? If not I suggest we close this as "won't fix".
msg328456 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-10-25 17:10
Consider this as a feature request.  It is perfectly reasonable for a user to want to generate specific XML output and to want to control the order that the attributes are listed.  It is perfectly reasonable for the API to preserve the order that the user specified rather than change it into some other order that they either don't expect or can't control.  Starting in Python 3.6, the language guarantees that the order of keyword arguments is preserved.  Knowing that, it will be natural for users to start viewing as buggy APIs that discard that order when everything else in the eco-system respects the ordering.

When I teach ElementTree in Python courses, students notice this behavior and ask why the order was swapped.  My only answer is that we didn't used to have a way to control it, so the input order was ignored (i.e. a reason that no longer makes sense).
msg328469 - (view) Author: Tal Einat (taleinat) * (Python committer) Date: 2018-10-25 19:25
Thanks for the clarification, Raymond.

I've spent a few minutes searching for uses of dump(), and have only come up with uses for debugging or printing helpful info in command line tools.  So, it seems that changing this as suggested wouldn't break much existing code, since I couldn't easily find even one such example.
msg328588 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-10-26 18:38
This is an easy issue and I left it for beginning contributors. The PR should include the following changes:

* Add an optional sort_attrs parameter (True by default) in the write() method.

* Make the dump() function passing sort_attrs=False to write().

* Add tests for write() (with sort_attrs=False and sort_attrs=True) and dump() and/or modify existing tests.

* Document change (in the module docs, in What's New, news entry).
msg328632 - (view) Author: Diego Rojas (dfrojas) * Date: 2018-10-27 03:09
I'm working on this issue, but I have some questions:

1. If dump() function is only for debugging purpose, and since from dump() is where we will pass the sort_attrs keyword in False in order to conserve the order, I don't see how from dump() we can handle the order of the keyword arguments. The sorted occurs in line 926 (https://github.com/python/cpython/blob/master/Lib/xml/etree/ElementTree.py#L926) and 982(https://github.com/python/cpython/blob/master/Lib/xml/etree/ElementTree.py#L982) when xml or html are serialized. Or could we pass the sort_attrs to the functions _serialize_html/_serialize_xml and there handle this?

2. Just a comment, maybe sort_attrs is not ambiguous? For a user would be clear that sort_attrs will order the arguments in lexical order or maybe he/she could be confused and take it as that will respect the order of the keywords passed?
msg328633 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-10-27 03:16
Though it is compatible with earlier versions, I don't see any reason to keep sorting at all, especially as a default.  The reasonable default is that whatever order the user specifies is what the tool does.

The backwards compatible argument is specious.  The ordering was never guaranteed.  The only reason that we had sorting was to make the XML generation deterministic, and that is something we would still have.

I vote against the permanent API expansion with an unhelpful default.  The sorting is no longer necessary or desirable.  Appropriately, we never guaranteed it, leaving us the implementation flexibility to make exactly this change.

AFAICT, no XML user has ever expressed an interest in having attributes appear in alphabetical order -- for that matter, I can't recall any HTML user having expressed this preference (instead, they prefer to put class and id tags ahead of other tags for example).
msg328634 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-10-27 03:21
Diego, it would be premature to write an implementation until we've agreed on an API.  

I object to the Serhiy's proposal, and if no one agrees to my proposal to preserve the user's specific actions, then it  would be better to do nothing at all, leaving the current API (deterministic, but with no particular order being guaranteed).
msg328657 - (view) Author: Tal Einat (taleinat) * (Python committer) Date: 2018-10-27 18:12
There is also a middle ground: Keep .write() as it is *and* have .dump() preserve the order in which the parameters were supplied in the constructor.  They could both use a common *internal* method with a flag to select the desired ordering.
msg328659 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-10-27 20:26
Okay, lets just remove sorting.

The method for creating XML file in the minidom module also sorts attributes. It should be changed as well.

XMLGenerator in xml.sax.saxutils doesn't sort attributes.
msg328720 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-10-28 18:17
Diego, I've taken care of the ElementTree case.  Would you like to make a PR for the other occurrences (minidom, html, etc)?
msg328721 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-10-28 18:18
New changeset e3685fd5fdd8808acda81bfc12fb9702d4b59a60 by Raymond Hettinger in branch 'master':
bpo-34160: Preserve user specified order of Element attributes (GH-10163)
https://github.com/python/cpython/commit/e3685fd5fdd8808acda81bfc12fb9702d4b59a60
msg328723 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-10-28 18:28
Nice enhancement! I was always annoyed by XML libraries in PHP and Python which don't respect the "insertion order".
msg328763 - (view) Author: Diego Rojas (dfrojas) * Date: 2018-10-28 22:03
Raymond, sure. I could do the rest, I'll start tomorrow Monday.
msg328845 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-10-29 17:31
New changeset 3b05ad7be09af1d4510eb698b0a70d36387f296e by Serhiy Storchaka in branch 'master':
bpo-34160: Preserve user specified order of Element attributes in html. (GH-10190)
https://github.com/python/cpython/commit/3b05ad7be09af1d4510eb698b0a70d36387f296e
msg328846 - (view) Author: Diego Rojas (dfrojas) * Date: 2018-10-29 17:36
Serhiy Storchaka, Raymond. I already made the 10219 PR. Preserves order in html Element attributes and minidom.
msg328881 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-10-30 00:51
Thanks for doing the additional work to finish this up.
msg329416 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-11-07 14:09
New changeset 5598cc90c745dab827e55fadded42dbe85e31d33 by Serhiy Storchaka (Diego Rojas) in branch 'master':
bpo-34160: Preserve order of attributes in minidom. (GH-10219)
https://github.com/python/cpython/commit/5598cc90c745dab827e55fadded42dbe85e31d33
msg329612 - (view) Author: Ned Batchelder (nedbat) * (Python triager) Date: 2018-11-10 14:46
Hi, this broke my tests, just as earlier comments predicted.

Can we get an optional argument to sort the keys?  The json module lets us sort keys even though it's irrelevant to JSON.  Being able to sort attributes would be a very useful addition to the prettyxml method.
msg329613 - (view) Author: Ned Batchelder (nedbat) * (Python triager) Date: 2018-11-10 14:46
(sorry, to sort the attributes.)
msg329614 - (view) Author: Ned Batchelder (nedbat) * (Python triager) Date: 2018-11-10 14:53
To provide a little more context: coverage.py has tests that the XML reports it generates are correct.  It does this by comparing the result to saved XML files.  On Python versions up to 3.7, the file compares correctly.  It has sorted attributes generated by minidom.  Now on Python 3.8, the comparison fails.
msg329616 - (view) Author: Diego Rojas (dfrojas) * Date: 2018-11-10 14:54
Ned, exactly what test fails? I ran all the test suite in local and all was ok, even all passed in the CI build.
msg329618 - (view) Author: Ned Batchelder (nedbat) * (Python triager) Date: 2018-11-10 15:02
Diego, they are my tests in the coverage.py test suite.  I'm checking that my XML generating code is producing the right XML.
msg329619 - (view) Author: Ned Batchelder (nedbat) * (Python triager) Date: 2018-11-10 15:04
This is exactly the situation Stefan was talking about: "However, I still think that Serhiy is right: this change would break code, and in particular test code that compares XML output. Having to deal with two different "correct" serialisations in tests depending on the Python version is annoying at best."

He says comparison should be done with true canonicalization (C14N).  Does that exist in the standard library?  With sorted keys, we had a form of good-enough canonicalization.  Now we don't.
msg329624 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-11-10 16:05
> I'm checking that my XML generating code is producing the right XML.

ISTM that the coverage tests as currently written aren't good tests.  Why not update the tests to check the generated XML is what the code asked it to generate?  Otherwise, the tests are relying on a non-guaranteed implementation detail. 

Alternatively, use c14n.Canonicalize[1] which implements standard compliant cross-language canonicalization (as opposed to non-guaranteed ad-hoc comparison which just happens to work).

[1] https://www.ibm.com/developerworks/xml/library/x-c14n/

Note with attribute order preservation, code now has the ability to produce exactly the XML it wants to produce.  The explicit intention of the user is honored.
msg329626 - (view) Author: Ned Batchelder (nedbat) * (Python triager) Date: 2018-11-10 16:24
I can see that there are ways that I can change my tests.  I see that there are third-party libraries that can help me with this.

But changing the behavior of the standard library, without a way to retain the old behavior, and asking people to adapt by changing their test suites, seems a bit cavalier.

Why not add an option to keep the sorting that the standard library has always had?  I'll even be OK with changing the default behavior to unsorted, just give me a way to explicitly keep the behavior its always had in the past.

The discussion so far has 1) hypothesized that test suites would be broken, and 2) claimed that no one needs the output in sorted order.  I can tell you that my test suite is broken, and that removing the sorted order is what broke it.

The json module has the option to sort keys.  toprettyxml already has options well outside the needs of XML semantics (indent and newl).

Let's honor Python's tradition of standard library stability, and give me a way to keep the behavior I used to have.
msg329627 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-11-10 17:12
> I can see that there are ways that I can change my tests. 
> I see that there are third-party libraries that can help me with this.

Please do that.  That is good design and uses all the tools the way they are meant to be used.


> But changing the behavior of the standard library, without a 
> way to retain the old behavior, and asking people to adapt 
> by changing their test suites, seems a bit cavalier.

Several thoughts:

* We are not cavalier.  We generally work very hard to preserve guaranteed behaviors.  For example, see all the work done on version numbering random seeding or pickle versions.  That said, we cannot be handcuffed by over-reliance on implementation details; otherwise, we wound never be able to improve the library.  The whole point of encapsulation and abstraction is to provide freedom to improve the internals and add capabilities without breaking the API contract.

* It is reasonable for folks to change tests which were flawed from the outset (just like when hash randomization was introduced, it scrambled key order and broke any tests that relied on key order).  FWIW, we remove or fix fragile or ill conceived tests in our own test suite all the time.

* We could add an option to sort attributes but this just clutters the API and turns an old quirk into a permanent fixture.  Instead, I recommend adding a documentation link to a standards compliant canonicalization tool.

> Let's honor Python's tradition of standard library stability, 
> and give me a way to keep the behavior I used to have.

The library IS stable with respect to documented guaranteed behaviors.  With respect to everything else, our tradition is make improvements where we can.  We don't backport these changes so that mirco-releases retain their quirks.  But point releases are allowed and encouraged to make improvements (i.e. this is for 3.8 and later).

In this case, there was never an API contract to sort attributes.  In a way, this is no different than when we improve a repr (see match objects for example).  That would break tests that incorrectly relied on an exact output match.  

When a test stops matching what a tool does, the one that is incorrect is the one that should be changed.  If there is an API contract, the tool needs to change to match.  However, if the test is fundamentally flawed, it should be one to change.
msg329633 - (view) Author: Mariatta Wijaya (Mariatta) * (Python committer) Date: 2018-11-10 18:20
> ISTM that the coverage tests as currently written aren't good tests.

Hi, I'd like to remind everyone to be open, respectful, and considerate. There are ways to describe hos things that can be improved. There is no need to denigrate other people's work.
msg329634 - (view) Author: Mariatta Wijaya (Mariatta) * (Python committer) Date: 2018-11-10 18:30
IMO adding optional keyword argument to make it compatible with previous version is reasonable request, and it seems early on several other core devs also agree with it.
msg329635 - (view) Author: Tal Einat (taleinat) * (Python committer) Date: 2018-11-10 18:37
>> ISTM that the coverage tests as currently written aren't good tests.

> Hi, I'd like to remind everyone to be open, respectful, and considerate. There are ways to describe hos things that can be improved. There is no need to denigrate other people's work.

I find this to be an overreaction in this case.  Sure, it could have been worded more positively, but the negativity was very mild; the tests weren't even being called "bad", not to mention overly negative wording e.g. "horrible".

Further, you omitted the followup explanation of *what about the tests isn't good*:

> Otherwise, the tests are relying on a non-guaranteed implementation detail.

IMO we shouldn't require ourselves to be overly careful in our wording, such as avoiding any negative wording entirely.
msg329637 - (view) Author: Ned Batchelder (nedbat) * (Python triager) Date: 2018-11-10 18:51
Words like flawed from the outset, fundamentally flawed, fragile, and ill-conceived are being thrown around, which does not help the discussion.  Can we focus on the question of whether it's reasonable to add sorted attributes as an option?

This idea exactly parallels the sort_keys option in the json module.  It precisely solves the problem I have.  The code change is small.  I'll even volunteer to make the pull request.

I'm not suggesting that we get rid of the new feature of preserving the user's attributes order. I'm not even suggesting that the new feature needs to be the non-default option. I'd just like a way to sort attributes in XML output.
msg329639 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-11-10 19:32
> Being able to sort attributes would be a very useful addition 
> to the prettyxml method.

+1 This seems reasonable to me.

I misread it and thought it was much broader than prettyxml().
msg329642 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-11-10 20:13
Raymond's words convinced me, but if it is needed for preserving compatibility in tests, here is PR 10452 which adds support for the sort_attrs argument in XML serializing methods.
msg329643 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-11-10 20:47
One other thought:  We should add a note to the docs for all of the serialization formats saying that we specifically disclaim that they will always generate exactly the same output byte-for-byte.  When Serhiy made some small optimizations to the encoding of pickles, it would have broken any test that checked byte-level equality rather than semantic level equality (checked by making sure the pickle/unpickle steps would round-trip without loss of information).
msg329698 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-11-12 00:07
I like the idea of having an opt-in option to get Python 3.7 behavior in Python 3.8, sort=True. It would be a new parameter in Python 3.8, right?

It makes sense to me to ask explicitly to sort attributes, especially since Python decided to no longer sort by default :-)

Maybe some people prefer sorting to get a more deterministic output for things like https://reproducible-builds.org/
msg329999 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2018-11-16 14:04
> Maybe some people prefer sorting to get a more deterministic output

Note that those people are much better off with C14N. It is the one tool designed for all of these use cases, even usable for cryptographic signatures. And it's trivial to use, it's just a different serialiser.

The new option would only be for people who want the old behaviour for backwards compatibility reasons, like Ned who wants to keep his existing byte level tests working that were created by the previous version. Understandable, although re-serialising the test samples with C14N would probably still be the better approach, because it guards against more than just changes in the attribute order.

Maybe it's time to think about adding the ET C14N module to the stdlib again, see issue13611.
msg330003 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-11-16 14:38
+1 for adding C14N support for ElementTree.

But the problem will still exist for minidom. We will need either add the sort_attrs parameter in prettyxml() or complete C14N support in minidom.
msg330004 - (view) Author: Ned Batchelder (nedbat) * (Python triager) Date: 2018-11-16 14:38
Stefan, just to clarify: it isn't that I don't feel like re-serializing my gold files with the latest version.  I run my test suite on 2.7, 3.4, 3.5, 3.6, 3.7, 3.8, PyPy2, and PyPy3.  Today, one gold file suffices for all those versions.  Without the sorted attributes, 3.8 fails the test.
msg330011 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-11-16 19:12
> I run my test suite on 2.7, 3.4, 3.5, 3.6, 3.7, 3.8, PyPy2, 
> and PyPy3.  Today, one gold file suffices for all those
> versions.  Without the sorted attributes, 3.8 fails the test.

Would it make sense to rewrite the test to just make sure the XML roundtrips instead of relying on the exact byte sequence that is generated?  IIRC, that is how we test pickling and various codecs -- rather than relying on implement specific details for a particular byte sequence -- the test is roughly "assert decode(encode(thing)) == thing".  That allows the test to survive on-going improvements to serialization.
msg330012 - (view) Author: Ned Batchelder (nedbat) * (Python triager) Date: 2018-11-16 19:21
It seemed like we were close to a consensus about adding back the option to sort the attributes, or did I misunderstand?
msg330013 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-11-16 19:52
> It seemed like we were close to a consensus about adding 
> back the option to sort the attributes, or did I 
> misunderstand?

Adding a sort() option to prettyxml() seems perfectly reasonable.  I wouldn't extend any of the other APIs though.

Also, it is reasonable to consider adding a standards compliant canonicalization tool.

> Without the sorted attributes, 3.8 fails the test.

It seems to me that the test should be improved as well so that it doesn't rely on implementation details.  The usual way to test serialization is to verify that it round trips.

Unless the test is improved to not rely on implementation details, I'm concerned that we'll end-up having this conversation again if any other serialization improvements are made.

So yes, I support adding a sort() option to prettyxml() and adding a canonicalization tool, but no I don't agree with the premise that Py 2.7, 3.4, 3.5, 3.6, 3.7, 3.8 were required to now and forever always produce byte-by-byte identical output -- afaict that was never a promised behavior, so any test that relies on that premise should be improved.

FWIW, when I described the existing test as being "fragile", it wasn't a pejorative, I meant it in the literal sense of "easily broken".
msg330372 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-11-24 08:53
I am working on adding a canonization support (issue13611 for ElementTree and perhaps separate issue for minidom), but prefer to keep this issue open for the case if this task turns out to be too difficult.

Ned, what module is used in your tests for serialization? minidom or ElementTree?
msg330375 - (view) Author: Ned Batchelder (nedbat) * (Python triager) Date: 2018-11-24 10:01
I happen to use xml.dom.minidom, though I didn't mean this to be, "fix Ned's personal problem" :)
msg332143 - (view) Author: Julian Sivertsen (sivert) Date: 2018-12-19 14:48
I don't understand why this library should go out of its way to support the old behavior when it seems like the only thing it breaks is tests that assume something that was never guaranteed and where you can get the old behavior in just two lines of Python:

for node in root.iter():
    node.attrib = dict(sorted(node.items()))

Kind regards from confused Pythoner that just wonted the attribute order to make sense
msg337914 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-03-14 13:56
This change broke docutils test suite: https://sourceforge.net/p/docutils/bugs/359/

Problems:

* I cannot see the change documented anywhere in https://docs.python.org/dev/whatsnew/3.8.html
* I don't see any simple workaround. It would be nice to add an opt-in option to sort attributes just to get Python 3.7 behavior.
msg337916 - (view) Author: Diego Rojas (dfrojas) * Date: 2019-03-14 14:12
Victor, I thought that the news changes were only for major changes. 

In another hand, I could retake this bug in a few days with the issues that you mentioned.
msg337917 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-03-14 14:17
> Victor, I thought that the news changes were only for major changes.

It's a backward incompatible change. It broke at least docutils. So it should be mentioned at least at:
https://docs.python.org/dev/whatsnew/3.8.html#changes-in-the-python-api
msg337937 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-03-14 15:35
Please do add an entry to WhatsNew.
msg338017 - (view) Author: Diego Rojas (dfrojas) * Date: 2019-03-15 19:04
Stéphane, According to you PR, I agree that is a good approach give the option to the user if wants sorted or not the attributes. But in this thread, is discussed that there is not necessary and leaves as default that the attributes preserve the order.
msg338018 - (view) Author: Stéphane Wirtel (matrixise) * (Python committer) Date: 2019-03-15 19:11
Diego,

Victor indicates that we could propose an option to the user

```
* I cannot see the change documented anywhere in https://docs.python.org/dev/whatsnew/3.8.html
* I don't see any simple workaround. It would be nice to add an opt-in option to sort attributes just to get Python 3.7 behavior.
```
msg338060 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2019-03-16 08:00
FWIW, I second Raymond's opposition against a new option. It's really not something that users should care about. Instead of us providing a new "sort or not" option, and people adding it (conditionally!) to their code to "fix" previous or future behaviour of the tool, they should better invest the time into fixing their code to not rely on behaviour that was never guaranteed.

Also, lxml cannot support such a serialisation option because it preserves the order of attributes as soon as they are *created*. It previously also sorted dicts on input, but only dicts and not other (sequential) ways to set attributes, which I always considered a necessary quirk to gain reproducible output. I'm planning to remove the sorting for Py3.6+ in the next release. It is better to leave the decision about attribute order entirely to the users.

It's not difficult to get sorted dict behaviour in Py3.6+ if you really want it (walk the tree and sort each el.attrib if you feel like it), but it's currently impossible to _not_ get sorted attributes but the "expected" order of creation. That's what we should fix.
msg338063 - (view) Author: Stéphane Wirtel (matrixise) * (Python committer) Date: 2019-03-16 08:09
@scoder

ok, we can close my PR and I will submit a patch to docutils.
msg338065 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-03-16 08:24
Is not PR 12354 a duplicate of PR 10452?
msg338098 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-03-16 20:49
FWIW, this issue arose from an end-user problem. She had a hard requirement to show a security clearance level as the first attribute.  We did find a work around but it was hack.
msg338100 - (view) Author: Ned Batchelder (nedbat) * (Python triager) Date: 2019-03-16 21:12
I'm a bit mystified why people are still opposed to providing sorting control, even after people are resorting to "hack work-arounds."  The (two) pull requests that provide the control are simple, easy to understand, and easy to test.

Raymond, it sounds like your end-user didn't agree that "sorting is no longer necessary or desirable."

How many people have to comment here on the difficulties they are having before we do the simple thing that will help them?
msg338101 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2019-03-16 21:18
Ned, would it solve your problem to write a helper function that walks the tree and sorts the attrib dicts? That would also work in all existing CPython versions (because they already sort attributes anyway), for both ElementTree and lxml, and you wouldn't have to pass a new option into the serialiser conditionally based on the running Python version.
msg338102 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2019-03-16 21:27
Something like this:

def sort_attributes(root):
    for el in root.iter():
        attrib = el.attrib
        if len(attrib) > 1:
            attribs = sorted(attrib.items())
            attrib.clear()
            attrib.update(attribs)
msg338103 - (view) Author: Ned Batchelder (nedbat) * (Python triager) Date: 2019-03-16 21:36
Thanks for the suggestion.  My problem has already been fixed with my own hack workaround.  Your idea is a good one to leave here so that other frustrated users will have code to copy for their hack workaround.
msg338104 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2019-03-16 21:40
:) The coolest thing about it is: it's not a hack at all. It's simply making use of the new feature in a suitable way.
msg338107 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-03-16 22:46
> Raymond, it sounds like your end-user didn't agree that
> "sorting is no longer necessary or desirable."

Sorry for being unclear.  The user needed to *overcome* the sorting behavior so that they could produce an ordering of their own choosing.  The problem then became that we had to work out a way to defeat the sorting to allow the requested attribute ordering to be preserved.

Roughly:

   casefile = Element('casefile', clearance="confidential",
              access="internal", valid_through="31 Mar 2022")

The desired result is that the attribute order be preserved.  By removing the sort, a user can specify an order they want and have that specification honored.
msg338112 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-03-16 23:44
New changeset 06e1e688228013f411aca6309562ef80df6ce5d3 by Raymond Hettinger (Diego Rojas) in branch 'master':
bpo-34160: Update news entry for XML order attributes (#12335)
https://github.com/python/cpython/commit/06e1e688228013f411aca6309562ef80df6ce5d3
msg338248 - (view) Author: Stéphane Wirtel (matrixise) * (Python committer) Date: 2019-03-18 15:48
in fact, there is no easy way for the fix for python-docutils.
msg338249 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-03-18 16:10
Please don't break the backward compatibility without an easy way to retrieve Python 3.7 behavior.

I set the priority to release blocker until a consensus can be found, and I add Lukasz (Python 3.8 release manager) in the loop.


Ned Batchelder: "I'm a bit mystified why people are still opposed to providing sorting control, even after people are resorting to "hack work-arounds."  The (two) pull requests that provide the control are simple, easy to understand, and easy to test."

I concur. I'm fine with sorting by default, but it breaks the backward compatibility on purpose without providing an option to opt-in for the old behavior :-(

Many XML parsers rely on the order of attributes. It's part of the XML "semantics". Well, it shouldn't, but I cannot fix all applications around the world to explain them that Python is right, and they are all wrong :-)

It's not straighforward to fix an application to get Python 3.7 behavior. I would prefer to not have to copy-paste Stefan Behnel's recipe in every project using XML who wants to sort attributes:

"""
Something like this:

def sort_attributes(root):
    for el in root.iter():
        attrib = el.attrib
        if len(attrib) > 1:
            attribs = sorted(attrib.items())
            attrib.clear()
            attrib.update(attribs)
"""

This recipe does modify the document and so changes the behavior of the application when it iterates on attributes later, whereas in Python 3.7 attributes are only sorted while writing the XML into a file.
msg338269 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2019-03-18 17:44
Victor, as much as I appreciate backwards compatibility, I really don't think it's a big deal in this case. In fact, it might not even apply.

My (somewhat educated) gut feeling is that most users simply won't care or won't even notice the change. Out of those who do (or have to) care, many are better off by fixing their code to not rely on an entirely arbitrary (sorted by name) attribute order than by getting the old behaviour back. And for those few who really need attributes to be sorted by name, there's the recipe I posted which works on all ElementTree implementations out there with all alive CPython versions.

> This recipe does modify the document and so changes the behaviour of the application when it iterates on attributes later

This is actually a very rare thing. If I were to make up numbers, I'd guess that some 99% of the applications do XML serialisation as the last thing and then throw away the tree afterwards, without touching it (or its attributes) again. And the remaining cases are most probably covered by the "don't need to care" type of users. I don't think we should optimise for 0.05% of our user base by providing a new API option for them. Especially in ElementTree, which decidedly aims to be simple.

The example that Ned gave refers to a very specific and narrow case: comparing serialised XML, at the byte level, in tests. He was very lucky that ElementTree was so stable over the last 10 Python releases that the output did not change at all. That is not something that an XML library needs to guarantee. There is some ambiguity in XML for everything that's outside of the XML Information set, and there is a good reason why the W3C has tackled this ambiguity with an explicit and *separate* specification: C14N. So, when you write:

> Many XML parsers rely on the order of attributes

It's certainly not many parsers, and could even be close to none. The order of attributes is explicitly excluded from the XML Information set:

https://www.w3.org/TR/xml-infoset/#omitted

Despite this, cases where the order of the attributes matters to the *application* are not unheard of. But for them, attributes sorted by their name are most likely the problem and not a solution. Raymond mentioned one such example. Sorting attributes by their name really only fulfils a single purpose: to get reproducible output in cases where the order does *not* matter. For all the (few) cases where the order *does* matter, it gets in the way.

But by removing the sorting, as this change does, we still get predictable output due to dict ordering. So this use case is still covered. It's just not necessarily the same output as before, because now the ordering is entirely in the hands of the users. Meaning, those users who *do* care can now actually influence the ordering, which was very difficult and hackish to achieve before. We are allowing users to remove these hacks, not forcing them to add new ones.
msg338291 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-03-18 21:55
> Victor, as much as I appreciate backwards compatibility, I really don't think it's a big deal in this case.

In short, the XML serialization is no longer deterministic. It depends in which order attributes are defined. Example:
---
from xml.etree import ElementTree as ET

document = ET.Element('document')
document.attrib['a'] = '1'
document.attrib['b'] = '2'
ET.dump(document)

document = ET.Element('document')
document.attrib['b'] = '2'
document.attrib['a'] = '1'
ET.dump(document)
---

Python 3.7 output:
---
<document a="1" b="2" />
<document a="1" b="2" />
---

Python 3.8 output:
---
<document a="1" b="2" />
<document b="2" a="1" />
---

On this example, it's obvious that the attributes are defined in a different order, but the code which generates the XML document can be way more complex and use unordered data structures like set().

Why does it matter? Most programs compare XML using basic string comparison, they don't implement smart XML comparison which ignore attributes order.

Concrete example:

* A program which rewrites the XML on disk if attribute order changes (useless disk write).
* Version Control System (Git, hg, svn, whatever) sees the file as modified and produces a change which can be unexpected and annoy users (use more disk space, more network bandwidth, etc.)
* https://reproducible-builds.org/
* It breaks docutils unit tests which uses straightfoward assertEqual(). docutils is just one example: it's not hard to imagine many other applications using XML and which use a similar check.
* etc.

It's not just a matter of parsers.


> My (somewhat educated) gut feeling is that most users simply won't care or won't even notice the change.

... Really? Why do you think that so many people are involved in this issue if nobody cares of XML attribute order? :-)
msg338292 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-03-18 22:07
I'll make a post to python-dev so that we can escalate the discussion and get broader participation.  Personally, I'm not wedded to any one particular outcome and just want to do what is best for the users in the long run.
msg338294 - (view) Author: Stéphane Wirtel (matrixise) * (Python committer) Date: 2019-03-18 22:18
@raymond

Here is a basic comparison between two xml streams.

Maybe we could propose this solution when we want to compare an XML with xml.dom.minidom instead of the byte-to-byte comparison.
msg338296 - (view) Author: Julien Palard (mdk) * (Python committer) Date: 2019-03-18 22:57
We found two occurrences of external code breaking due to this change, and one looks tricky to fix (docutils), this indicates there are other library that will break.

> My (somewhat educated) gut feeling is that most users simply won't care or won't even notice the change.

I concur with this. Why changing the default behavior? The costs of changing this looks high, and the benefits looks very low (people not noticing).

As already proposed, a keyword-only `sorted` parameter defaulting to True (the current behavior) would allow one to give False and get control of ordering, and won't break any existing library.

On the plus side `sorted=True` makes the current behavior (sorted by default) more discoverable.
msg338299 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-03-19 00:01
> one looks tricky to fix (docutils)

Actually, it is really easy to fix just by updating the target comparison files (that is the procedure described in the comments for the test).

What is taking more thought is coming up with a better test that verifies intended semantic content rather than the exact serialization.  I'm working with Stéphane to figure out how to do this.
msg338327 - (view) Author: Julien Palard (mdk) * (Python committer) Date: 2019-03-19 08:25
> Actually, it is really easy to fix

By answering this, it looks like you're currently going this way, and obviously you'll succeed fixing docutils, this still mean voluntarily leaving some (or many?) other code broken (open source and closed source), how that's acceptable?
msg338388 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-03-19 16:58
> how that's acceptable?

For docutils, we'll most likely propose some variant of Stéphane Wirtel's script to test semantic equivalence for docutils.  For other cases, Serhiy is working on a C14N canonicalization tool which is specifically designed for the task of creating reproducible output, in a cross-language standards compliant way.

As Stefan Behnel clearly articulated, there are multiple reasons why Python should not guarantee byte-for-byte serialization across point releases.  That said, we'll likely make the guarantee across micro-releases.  That will make it possible a third mitigation strategy of generating new baseline files for a new point releases and adding a version check to decide which baseline to test against.

FWIW, we had a similar discussion regarding hash randomization.  While there are a number of significant differences, the outcome is relevantL  User tests that depended on non-guaranteed implementation details had to be fixed.
msg339167 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-03-30 02:14
I set the priority to release blocker to make sure that the issue got enough attention. Raymond started a thread on python-dev, so I reset the priority:
https://mail.python.org/pipermail/python-dev/2019-March/156709.html

Moreover, the change is now documented in "What's New in Python 3.8?" documentation ("Changes in the Python API" section):
https://github.com/python/cpython/commit/06e1e688228013f411aca6309562ef80df6ce5d3
msg339252 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2019-03-31 09:47
Thanks to the discussion that rhettinger triggered on python-dev, it seems that there is a majority accordance for considering the previous ordering "undefined but deterministic" rather than "sorted", and for making the change from "arbitrarily sorted" to "user defined" in Py3.8. That matches the current status after the merge of the four pull requests, for both ElementTree and MiniDOM.

There also seems to be a large fraction of participants that would like to see a stdlib way to safely/correctly/easily compare XML output. IMHO, that is covered by issue13611, adding a C14N module to ElementTree. There is also a (Py2) C14N implementation for minidom, from the now unmaintained PyXML package¹, which, according to the license, is already distributed under Python copyright. I will try to bring at least the ET module in shape for integration before 3.8 beta1.

During the discussions, several ways were shown to achieve deterministic behaviour, also across Python releases. My proposal is therefore to *not* add a new "sort_attrs" option to the API, with the reasoning that it is a) not required for any of the use cases, b) not maintaining compatibility without code changes (neither backwards nor forward), and c) not what most people actually want who strive for deterministic output. Instead, I think we should concentrate on providing and documenting better ways to compare XML output.

¹ https://github.com/actmd/PyXML/blob/97f144152878a67cd95dd229fe72e86f19bce706/xml/dom/ext/c14n.py
msg339278 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-04-01 00:53
Stefan, as the module maintainer, you get to make the final decision on this.  What you've proposed makes sense in light of the prior discussion and the python-dev discussion.  Unless you think something new will come along to change your mind, just mark as closed/fixed, then continue work on c14n in another tracker issue.
msg339798 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2019-04-09 18:13
This is done now. Thanks everyone who helped in discussing and implementing this change.

I will leave Serhiy's last PR (adding the "sort_attrs" flag option) open for a while until I'm sure we have a better solution for comparing XML in 3.8, at which point I would reject it.
msg340342 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-04-16 13:46
FYI pungi project is also broken by this change and it blocks Fedora Rawhide to upgrade to Python 3.8:
https://bugzilla.redhat.com/show_bug.cgi?id=1698514
msg340345 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-04-16 14:05
PR 10452 is still open. Should it be closed if you consider that we must not add an optional sort_attrs=False attribute?
msg340363 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2019-04-16 20:08
I rejected the (now conflicting) PR that adds a sorting option.
I also sent Victor a tentative (and trivial) patch for the pungi package.
msg340970 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2019-04-27 07:17
I implemented (most of) C14N 2.0 in issue 13611. Please give it a try if you are interested in the canonical serialisation feature. I would like to include it in Py3.8.
History
Date User Action Args
2019-04-27 15:10:38scoderlinkissue28460 superseder
2019-04-27 07:17:22scodersetmessages: + msg340970
2019-04-16 20:08:54scodersetmessages: + msg340363
2019-04-16 14:05:48vstinnersetmessages: + msg340345
2019-04-16 13:46:44vstinnersetnosy: + vstinner
messages: + msg340342
2019-04-09 18:13:16scodersetstatus: open -> closed
resolution: fixed
messages: + msg339798

stage: patch review -> resolved
2019-04-01 00:53:27rhettingersetmessages: + msg339278
2019-03-31 09:47:58scodersetmessages: + msg339252
2019-03-30 02:15:17vstinnersetnosy: - vstinner
2019-03-30 02:14:55vstinnersetpriority: release blocker ->
nosy: rhettinger, scoder, vstinner, taleinat, nedbat, eli.bendersky, lukasz.langa, serhiy.storchaka, matrixise, sivert, mdk, Mariatta, dfrojas
messages: + msg339167
2019-03-19 16:58:02rhettingersetmessages: + msg338388
2019-03-19 08:25:25mdksetmessages: + msg338327
2019-03-19 00:01:38rhettingersetmessages: + msg338299
2019-03-18 22:57:07mdksetnosy: + mdk
messages: + msg338296
2019-03-18 22:18:44matrixisesetfiles: + test_xml_compare.py

messages: + msg338294
2019-03-18 22:07:48rhettingersetmessages: + msg338292
2019-03-18 21:55:35vstinnersetmessages: + msg338291
2019-03-18 17:44:30scodersetmessages: + msg338269
2019-03-18 16:10:03vstinnersetpriority: normal -> release blocker
nosy: + lukasz.langa
messages: + msg338249

2019-03-18 15:48:09matrixisesetmessages: + msg338248
2019-03-16 23:44:58rhettingersetmessages: + msg338112
2019-03-16 22:46:24rhettingersetmessages: + msg338107
2019-03-16 21:40:12scodersetmessages: + msg338104
2019-03-16 21:36:22nedbatsetmessages: + msg338103
2019-03-16 21:27:17scodersetmessages: + msg338102
2019-03-16 21:18:50scodersetmessages: + msg338101
2019-03-16 21:12:51nedbatsetmessages: + msg338100
2019-03-16 20:49:19rhettingersetmessages: + msg338098
2019-03-16 08:24:36serhiy.storchakasetmessages: + msg338065
2019-03-16 08:09:20matrixisesetmessages: + msg338063
2019-03-16 08:00:48scodersetmessages: + msg338060
2019-03-15 19:11:12matrixisesetnosy: + matrixise
messages: + msg338018
2019-03-15 19:04:14dfrojassetmessages: + msg338017
2019-03-15 18:29:55matrixisesetpull_requests: + pull_request12319
2019-03-14 17:07:27dfrojassetpull_requests: + pull_request12306
2019-03-14 15:35:30rhettingersetmessages: + msg337937
2019-03-14 14:17:04vstinnersetmessages: + msg337917
2019-03-14 14:12:02dfrojassetmessages: + msg337916
2019-03-14 13:56:51vstinnersetmessages: + msg337914
2018-12-19 14:48:47sivertsetnosy: + sivert
messages: + msg332143
2018-11-24 10:01:06nedbatsetmessages: + msg330375
2018-11-24 08:53:15serhiy.storchakasetmessages: + msg330372
2018-11-16 19:52:14rhettingersetmessages: + msg330013
2018-11-16 19:21:30nedbatsetmessages: + msg330012
2018-11-16 19:12:49rhettingersetmessages: + msg330011
2018-11-16 14:38:54nedbatsetmessages: + msg330004
2018-11-16 14:38:25serhiy.storchakasetmessages: + msg330003
2018-11-16 14:04:22scodersetmessages: + msg329999
2018-11-12 00:07:13vstinnersetmessages: + msg329698
2018-11-10 20:47:03rhettingersetmessages: + msg329643
2018-11-10 20:13:38serhiy.storchakasetmessages: + msg329642
2018-11-10 20:07:36serhiy.storchakasetpull_requests: + pull_request9727
2018-11-10 19:32:16rhettingersetmessages: + msg329639
2018-11-10 18:51:35nedbatsetmessages: + msg329637
2018-11-10 18:37:10taleinatsetmessages: + msg329635
2018-11-10 18:30:08Mariattasetmessages: + msg329634
2018-11-10 18:20:17Mariattasetnosy: + Mariatta
messages: + msg329633
2018-11-10 17:12:52rhettingersetmessages: + msg329627
2018-11-10 16:24:57nedbatsetmessages: + msg329626
2018-11-10 16:05:07rhettingersetmessages: + msg329624
2018-11-10 15:04:56nedbatsetmessages: + msg329619
2018-11-10 15:02:34nedbatsetmessages: + msg329618
2018-11-10 14:54:57dfrojassetmessages: + msg329616
2018-11-10 14:53:27nedbatsetmessages: + msg329614
2018-11-10 14:46:48nedbatsetmessages: + msg329613
2018-11-10 14:46:27nedbatsetnosy: + nedbat
messages: + msg329612
2018-11-07 14:09:16serhiy.storchakasetmessages: + msg329416
2018-10-30 00:51:53rhettingersetmessages: + msg328881
2018-10-29 17:36:41dfrojassetmessages: + msg328846
2018-10-29 17:31:10serhiy.storchakasetmessages: + msg328845
2018-10-29 16:31:26dfrojassetpull_requests: + pull_request9535
2018-10-28 22:03:37dfrojassetmessages: + msg328763
2018-10-28 18:50:16serhiy.storchakasetpull_requests: + pull_request9508
2018-10-28 18:28:37vstinnersetnosy: + vstinner
messages: + msg328723
2018-10-28 18:18:26rhettingersetmessages: + msg328721
2018-10-28 18:17:54rhettingersetmessages: + msg328720
2018-10-28 04:29:10rhettingersetkeywords: + patch
stage: patch review
pull_requests: + pull_request9486
2018-10-27 20:26:35serhiy.storchakasetmessages: + msg328659
2018-10-27 18:12:49taleinatsetmessages: + msg328657
2018-10-27 03:21:12rhettingersetmessages: + msg328634
2018-10-27 03:16:47rhettingersetmessages: + msg328633
2018-10-27 03:09:43dfrojassetnosy: + dfrojas
messages: + msg328632
2018-10-26 18:38:48serhiy.storchakasetkeywords: + easy

messages: + msg328588
2018-10-25 19:25:01taleinatsetmessages: + msg328469
2018-10-25 17:10:00rhettingersettype: behavior -> enhancement
messages: + msg328456
versions: - Python 3.6, Python 3.7
2018-10-25 12:17:05taleinatsetnosy: + taleinat
messages: + msg328429
2018-07-20 07:56:42scodersetmessages: + msg321988
2018-07-20 04:46:12serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg321976
2018-07-20 00:42:07rhettingercreate