Serialiser in ElementTree returns unicode strings in Py3k #52295

scoder · 2010-03-03T07:15:24Z

BPO	8047
Nosy	@malemburg, @birkenfeld, @scoder, @bitdancer, @florentx
Files	issue8047_etree_encoding_v2.diff: Patch, apply to 3.x

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = <Date 2011-10-29.02:37:45.468>
created_at = <Date 2010-03-03.07:15:23.877>
labels = ['expert-XML', 'easy', 'type-bug', 'library', 'docs']
title = 'Serialiser in ElementTree returns unicode strings in Py3k'
updated_at = <Date 2011-10-29.02:37:45.467>
user = 'https://github.com/scoder'

bugs.python.org fields:

activity = <Date 2011-10-29.02:37:45.467>
actor = 'flox'
assignee = 'none'
closed = True
closed_date = <Date 2011-10-29.02:37:45.468>
closer = 'flox'
components = ['Documentation', 'Library (Lib)', 'XML']
creation = <Date 2010-03-03.07:15:23.877>
creator = 'scoder'
dependencies = []
files = ['18286']
hgrepos = []
issue_num = 8047
keywords = ['easy']
message_count = 47.0
messages = ['100333', '100342', '100345', '100349', '100350', '100513', '100572', '100582', '100633', '100634', '100649', '100846', '100857', '100868', '100877', '100880', '100883', '100884', '100887', '100890', '100891', '100895', '100896', '100898', '100900', '100902', '100903', '100907', '100915', '100916', '100919', '100923', '100929', '100930', '100931', '100932', '100936', '101050', '101052', '101427', '101487', '101488', '101490', '112165', '113296', '113307', '146593']
nosy_count = 6.0
nosy_names = ['lemburg', 'effbot', 'georg.brandl', 'scoder', 'r.david.murray', 'flox']
pr_nums = []
priority = 'normal'
resolution = 'out of date'
stage = 'needs patch'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue8047'
versions = ['Python 3.1']

scoder · 2010-03-03T07:15:22Z

The xml.etree.ElementTree package in the Python 3.x standard library breaks compatibility with existing ET 1.2 code. The serialiser returns a unicode string when no encoding is passed. Previously, the serialiser was guaranteed to return a byte string. By default, the string was 7-bit ASCII compatible.

This behavioural change breaks all code that relies on the default behaviour of ElementTree. Since there is no longer a default encoding in Python 3, unicode strings are incompatible with byte strings, which means that the result of the serialisation can no longer be written to a file, for example.
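
The failure mode described here can be reproduced without ElementTree at all. The following is a minimal sketch; `serialised` is a stand-in value for a text result, not output captured from Python 3.1, and `io.BytesIO` stands in for a file opened in binary mode.

```python
import io

# Stand-in for a serialiser result that is text (str) rather than bytes.
serialised = "<tag>hell\u00f6</tag>"

binary_file = io.BytesIO()  # behaves like a file opened with mode "wb"
try:
    binary_file.write(serialised)
    write_error = None
except TypeError as exc:
    # Python 3 refuses to pick an encoding implicitly.
    write_error = exc

# Choosing the encoding explicitly makes the write succeed.
binary_file.write(serialised.encode("utf-8"))
```

The point of the sketch is only that text and bytes no longer mix silently; which encoding is the right one is exactly what the rest of this thread argues about.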

XML is well defined as a stream of bytes. Redefining it as a unicode string *by default* is hard to understand at best.

Finally, it would have been good to look at the other ET implementation before introducing such a change. The lxml.etree package has had support for serialising XML into a unicode string for years, and does so in a clear, safe and explicit way. It requires the user to pass the 'unicode' (Py3 'str') type as encoding parameter, e.g.

    tree.tostring(encoding=str)

which is explicit enough to make it clear that this is different from a normal encoding.

bitdancer · 2010-03-03T13:44:40Z

I'm not an ElementTree user, but that spelling (etree.tostring(encode=str), or even etree.tostring(encode=unicode)) strikes me as horrible. You don't encode to unicode, you *decode* to unicode. Thus the current Python3 interface works the way I'd expect: if I don't specify an encoding, I get unicode. If I do specify an encoding, I get encoded bytes. In the general case the fact that you can no longer get away with being sloppy about what encoding a byte stream is in, the way you could in Python2, is a feature of Python3, not a bug.

If anything, having 'tostring' return bytes is broken, given its name. But I think we fudge that by claiming it is returning a 'byte string' when given an encoding.

That said, I'm not sure how much, if at all, my opinion counts :)

scoder · 2010-03-03T14:33:42Z

I agree that the lxml API is somewhat clumsy here. I just mentioned it to show that there are already ways to do it in a backwards compatible way, so this change does two things: it breaks existing code, and it does so in a way that is incompatible with other existing implementations. That's what *I* would call horrible.

Also, this is absolutely not a feature that is restricted to Py3, so what's the equivalent feature in the standard library of Py2 going to be, and how much code will it break for the Py2 series?

bitdancer · 2010-03-03T17:24:48Z

My understanding is that backward compatibility, while nice to retain, was not considered a stopper for cleaning up interfaces in py3. Exactly how considered this change was, I have no idea, but as I said it does make sense to me. As for 2.x, what's there is what's there, as far as I can see. Florent could speak to whether or not that API is likely to change in 2.7, but I doubt it will.

florentx · 2010-03-03T19:10:55Z

With ET 1.3, the serializer ElementTree.write() should output bytes only. And the default encoding is still US-ASCII.

The new behaviour is specific to the 3.x branch (since 3.0, r56841).
Even if it is not fully backward compatible, I don't find this behavior shocking: it is a rule of Python 3 to avoid implicit encoding/decoding.

pitrou · 2010-03-06T02:57:40Z

I don't know what compatibility you are talking about. Py3k deliberately breaks compatibility with many 2.x behaviours that were considered defective or suboptimal.

scoder · 2010-03-07T10:56:43Z

It has been brought up several times that ET is special in the stdlib in that it is an externally maintained package. Correct me if I'm wrong, but the rules seem to be: features come outside, adaptation to Py3 can happen inside. What we are talking about here is a new feature that makes sense for both Py2 and Py3. We are not talking about a bug fix, neither is this an adaptation to Py3. It is a new feature that was added inside of the standard library and that is not compatible with the external libraries that are supposed to implement the same interface, namely, ElementTree and lxml.etree.

pitrou · 2010-03-07T14:41:08Z

As Florent said, it is a rule of py3k to avoid implicit encoding/decoding. The fact that it could have made sense for 2.x as well is not relevant, since the change was only done in py3k (and for good reason: we normally try not to break compatibility without prior notice).

In any case, I have trouble understanding your concern here. Do you think the change is bad? Is it really that difficult to support it in lxml?

scoder · 2010-03-08T09:01:16Z

Antoine, in the same comment, you say that it was not backported to Py2 in order to prevent breaking existing code, and then you ask if it's difficult to support in lxml. ;-)

Supporting the same behaviour in lxml would either mean that it breaks existing code in Py2 (when making the API consistent), or that you can safely (and correctly) write the return value to a file in Py2, but that you can't do the same in Py3 (when adopting the change only in Py3).

Previously, in ElementTree, serialising without an explicit encoding was a way to get a byte encoded serialisation without an XML declaration header, so I expect there to be code that depends on this. Since ElementTree 1.3 uses the same keyword argument as lxml for this feature, I assume that Florent's patches provide at least an alternative here, even if it requires users to adapt their code.

I just wish this backwards incompatible feature had been advertised at the time, or at least *documented* in any way. Even the latest 3.2-dev docs still state that the default encoding of the serialiser is US-ASCII, not a word about *ever* returning a unicode string, especially not by default, and totally not the required big fat warning that writing to a file will fail with mysterious errors if no encoding is specified.

florentx · 2010-03-08T09:19:12Z

With ET 1.3, you should have an explicit keyword argument "xml_declaration":

    # ----
    if xml_declaration or (xml_declaration is None and
                           encoding not in ("utf-8", "us-ascii")):
        if method == "xml":
            write("<?xml version='1.0' encoding='%s'?>\n" % encoding)
    # ----

In ET 1.2.6, the same snippet looks like:

    # ----
    if encoding != "utf-8" and encoding != "us-ascii":
        file.write("<?xml version='1.0' encoding='%s'?>\n" % encoding)
    # ----
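
As a rough illustration, the ET 1.3 condition quoted above can be isolated into a standalone function. `needs_declaration` is a hypothetical helper name used only for this sketch; it is not part of ElementTree.

```python
def needs_declaration(encoding, xml_declaration=None, method="xml"):
    """Hypothetical helper mirroring the ET 1.3 condition quoted above."""
    if method != "xml":
        # The declaration is only ever written for the "xml" method.
        return False
    if xml_declaration is not None:
        # An explicit True/False from the caller wins.
        return bool(xml_declaration)
    # Default: emit a declaration only for encodings other than the two
    # that an XML parser can assume without being told.
    return encoding not in ("utf-8", "us-ascii")
```

With `xml_declaration` left at `None` this reduces to the ET 1.2.6 test, which is why the older behaviour stays the default.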

pitrou · 2010-03-08T15:05:45Z

On Mon, 08 Mar 2010 09:01:19 +0000,
Stefan Behnel <report@bugs.python.org> wrote:

> Antoine, in the same comment, you say that it was not backported to
> Py2 in order to prevent breaking existing code, and then you ask if
> it's difficult to support in lxml. ;-)

I meant breaking existing *user* code. Besides, the fact that
compatibility is broken doesn't mean third-party code is difficult to fix;
hence my question.

> Supporting the same behaviour in lxml would either mean that it
> breaks existing code in Py2 (when making the API consistent), or that
> you can safely (and correctly) write the return value to a file in
> Py2, but that you can't do the same in Py3 (when adopting the change
> only in Py3).

Sorry, I don't understand this. Are you saying it's impossible
for you to define two different behaviours based on the current Python
version? What's bad with
"""if sys.version_info >= (3, 0, 0): # blah"""

> Previously, in ElementTree, serialising without an explicit encoding
> was a way to get a byte encoded serialisation without an XML
> declaration header, so I expect there to be code that depends on
> this.

This doesn't seem to be documented. The doc simply says
"""encoding is the output encoding (default is US-ASCII)""".

In other words, undocumented (and untested) behaviour has been "broken"
when porting to 3.0, which is the version which deliberately broke
compatibility for documented things. I guess we can live with it ;)

> Even the latest
> 3.2-dev docs still state that the default encoding of the serialiser
> is US-ASCII, not a word about *ever* returning a unicode string,
> especially not by default, and totally not the required big fat
> warning that writing to a file will fail with mysterious errors if no
> encoding is specified.

Ok, perhaps some documentation changes are in order :-)
(I wonder why the default was US-ASCII, though. Sounds a bit braindead)

effbot · 2010-03-11T12:37:14Z

The "no header" thing is very much done on purpose, and it's documented in the upstream ElementTree documentation.

I suggest dropping this "Python 3 exists in its own universe" nonsense; it's not very professional, and it's hurting Python, its users, and all third party developers. The "things I don't understand are braindead" stuff is less of a problem; that only hurts yourself.

pitrou · 2010-03-11T14:45:48Z

> The "no header" thing is very much done on purpose, and it's
> documented in the upstream ElementTree documentation.

I'm sorry, where is that?
I can't find it either at
http://effbot.org/zone/pythondoc-elementtree-ElementTree.htm#elementtree.ElementTree.tostring-function
or
http://effbot.org/zone/pythondoc-elementtree-ElementTree.htm#elementtree.ElementTree.ElementTree.write-method

> I suggest dropping this "Python 3 exists in its own universe"
> nonsense; it's not very professional, and it's hurting Python, its
> users, and all third party developers.

Ha. There has been a very long temporal window (until 3.1, probably)
during which things were very much in flux and anyone with a
professional knowledge of elementtree and XML APIs could chime in and
point out any nonsense in py3k.

Now Python 3.1 is out and as a result py3k also has to ensure upwards
compatibility for its own APIs. Of course we can still make exceptions
if the alleged breakage is truly major. To me, it doesn't /seem/ to be
the case here.

scoder · 2010-03-11T15:49:02Z

Sorry, Antoine, but you can't possibly mean what you say here. The culprit in question is clearly one of the best hidden features of the new Py3 ET API. The only existing reference to it that I can find is the SVN commit comment when it was applied. How is that supposed to be any reason for keeping up "backwards compatibility" within the Py3 series?

bitdancer · 2010-03-11T16:53:26Z

I suspect that what Antoine is referring to is the fact that Python 3.1 has this behavior. Whether or not it is explicitly documented is a secondary issue.

We're having a similar issue in the unittest package, where there's a new function, assertSameElements, that has an unfortunate and poorly documented API. But changing that API now that the function exists in a released version (3.1) is not something to be done lightly, if it is done at all.

This is definitely an unfortunate state of affairs no matter how you look at it.

effbot · 2010-03-11T19:01:10Z

> if I don't specify an encoding, I get unicode. If I do specify an encoding, I get encoded bytes.

You're confusing the XML document encoding with character set encoding.

A serialized (unparsed) XML document is a byte stream, not a string of Unicode characters. And the character set encoding is both embedded in that byte stream and affects how it's generated in more than one way; you cannot just recode XML documents nilly willy and expect things to work.

A parsed XML document (an infoset) -- for ET, that's the tree of Element objects -- does indeed contain Unicode strings, but the transformation from the byte stream to the Unicode string doesn't just involve character set decoding; there are several other constructs that are handled by the XML parser.
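
A small sketch of why recoding a serialised document is unsafe: the bytes below declare ISO-8859-1 in their XML declaration, so transcoding them to UTF-8 without also rewriting the declaration silently corrupts the text. The document content is made up for illustration.

```python
import xml.etree.ElementTree as ET

# A serialised document: bytes, with the encoding named inside the stream.
original = (
    "<?xml version='1.0' encoding='iso-8859-1'?><tag>hell\u00f6</tag>"
).encode("iso-8859-1")

# Naive recoding treats the document as plain text and re-encodes it,
# leaving the embedded declaration untouched.
recoded = original.decode("iso-8859-1").encode("utf-8")

text_before = ET.fromstring(original).text  # decoded as declared
text_after = ET.fromstring(recoded).text    # UTF-8 bytes misread as ISO-8859-1
```

Here the parser raises no error at all; it follows the declaration and returns mojibake, which is arguably worse than a failure.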

> Ha. There has been a very long temporal window

You should have had plenty of time to fix it, then, right?

pitrou · 2010-03-11T20:22:51Z

>> Ha. There has been a very long temporal window
>
> You should have had plenty of time to fix it, then, right?

Under the condition that someone would have actually reported it, yes.
We don't magically fix bugs if nobody (including us) detects and reports
them.

scoder · 2010-03-11T20:43:45Z

Then I would call that a clear sign that no-one actually stumbled over this feature in Py3 before I did, well hidden as it was. Still time to fix it.

bitdancer · 2010-03-11T21:14:39Z

You may well be correct. But just because no one reported a bug does not mean that no one is using the API. The person using it may find it perfectly logical (and may be writing py3 only code, not porting py2 code).

However, regardless of whether we decide it is acceptable to change the behavior, it seems to me that having an interface named 'tostring' that returns bytes by default in Python3 would be a broken API. I don't see any way around that terminology problem.

effbot · 2010-03-11T22:03:34Z

>>> import array
>>> array.array("i", [1, 2, 3]).tostring()
b'\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00'
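
For readers trying this on a current interpreter: `array.tostring()` was later deprecated in favour of `array.tobytes()` and removed in Python 3.9. A minimal equivalent, with the exact byte values left unasserted because they depend on the platform's item size and byte order:

```python
import array

values = array.array("i", [1, 2, 3])

# tobytes() is the modern spelling of the tostring() call shown above.
raw = values.tobytes()

# Round trip back into an array of the same type code.
restored = array.array("i")
restored.frombytes(raw)
```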

pitrou · 2010-03-11T23:01:20Z

On Thu, 11 Mar 2010 22:03:37 +0000,
Fredrik Lundh <report@bugs.python.org> wrote:
> 
> >>> import array
> >>> array.array("i", [1, 2, 3]).tostring()
> b'\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00'

The fact that array is old, rusty and slightly broken doesn't mean we
should propagate that brokenness to other Python modules.
Also, as David said, the fact that you think there is a bug
here doesn't mean everyone would agree.
Finally, the behaviour you seem to be looking for could be added
as a separate API or an optional method argument. Patches welcome.

effbot · 2010-03-12T00:10:40Z

So now it's the domain experts against some hypothetical people that might exist? Tricky.

bitdancer · 2010-03-12T00:14:23Z

Well, Benjamin pointed out to me that it would be a bad thing if array.tostring produced a string. True, the method is named wrong, but it is less broken than returning a string. I suspect that that is the same argument Fredrik is making: that returning the XML as a byte string is less broken than returning it as a string when it in fact may contain other encoded stuff. The email package has some of the same problems, and there we are retooling the API to deal with this.

Presumably ET needs to have a retooled API for Python3 as well. Then the question becomes what do we do in the meantime? For email, we are just living with the breakage until we can get something better in place, because no one has come up with any good short term solutions for email.

gvanrossum · 2010-03-12T00:35:32Z

Hey, can we all try to get along?

For anyone who didn't follow the link to r56841, that was mine (though Christian Heimes provided the basis for much of the patch apart from elementtree), and I wrote at the time:

"""I had to fix a few tests and modules beyond what Christian did, and
invent a few conventions. E.g. in elementtree, I chose to
write/return Unicode strings when no encoding is given, but bytes when
an explicit encoding is given."""

I am not a user of elementtree, so this may well have been a mistake -- at the time (in 2007) we were so busy making zillions of tests pass that some mistakes were made. Some of those were caught in time, others apparently not.

My thinking was that since an XML document looks like text, it should probably be considered text, at least by default. (There may have been some unittests that appeared to require this -- of course this was probably just the confusion between byte strings and 8-bit text strings inherent in Python 2.)

Regarding backwards compatibility, there are now two backwards compatibility problems: with 2.x, and with 3.1. It seems we cannot easily be backwards compatible with both (though if someone figures out a way that would be best of course).

If I were to propose an API for returning a Unicode string, I would probably add a new method (e.g. tounicode()) rather than using a "magical" argument (tostring(encoding=str)), but given that that exists in another supposedly-compatible implementation I'm not against it. Maybe tostring(encoding=None) could also be made to work? That would at least make it *possible* to write code that receives a text object and that works in 3.1 and 3.2 both. In 2.x I think neither of these should work, and there probably isn't a need -- apps needing full compatibility will just have to refrain from calling tostring() without arguments.
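
A sketch of the "separate function" idea, written against the `encoding="unicode"` pseudo-encoding that CPython releases from 3.2 onwards accept. The names `tounicode` and `tobytes` are hypothetical helpers for illustration; neither exists in `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET


def tounicode(element):
    """Hypothetical helper: always return text (str)."""
    return ET.tostring(element, encoding="unicode")


def tobytes(element, encoding="us-ascii"):
    """Hypothetical helper: always return an encoded byte string."""
    return ET.tostring(element, encoding=encoding)


element = ET.Element("tag")
element.text = "hell\u00f6"

as_text = tounicode(element)
as_bytes = tobytes(element)
```

Splitting the two return types across two names keeps each function's result type independent of its arguments, which is the property the discussion keeps circling back to.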

ISTM that the behavior of write() is just fine -- the contents of the file will be correct after all.

pitrou · 2010-03-12T04:42:16Z

Not wanting to waste my time anymore on this.

scoder · 2010-03-12T06:48:59Z

Hi Guido, your comment was long overdue in this discussion.

Guido van Rossum, 12.03.2010 01:35:

> My thinking was that since an XML document looks like text, it should
> probably be considered text, at least by default. (There may have
> been some unittests that appeared to require this -- of course this
> was probably just the confusion between byte strings and 8-bit text
> strings inherent in Python 2.)

Well, well, XML...

It does look like text, but it's encoded text that is defined as a stream of bytes, and that's the only safe way of dealing with it.

There certainly *is* a use case for treating the serialised result as text, that's why lxml has this feature. A minor one is for debug output (which certainly doesn't merit being the default), but another one is when dealing with HTML, where encoding information is certainly less well defined and *much* less often seen in the wild. So users tend to be happy when they get their real-world HTML input fixed up into proper Unicode, still happier when they see that lxml can parse that correctly and even serialise the result back into a Unicode string directly, that they can post-process as text if they need to.

However, the main part here is the input, i.e. getting HTML data properly decoded into Unicode. The output part is a lot less important, and it's often easier to let lxml.html do the correct serialisation into bytes with proper encoding meta information, rather than dealing with it yourself.

Those are the two use cases I see for lxml. Their impact on ElementTree is relatively low as it doesn't support *parsing* from a Unicode string, so the most important HTML feature isn't there in the first place. The lack of major use cases in ElementTree is one of the reasons I'm so opposed to making this feature the backwards incompatible default for the output side.

> Regarding backwards compatibility, there are now two backwards
> compatibility problems: with 2.x, and with 3.1. It seems we cannot
> easily be backwards compatible with both (though if someone figures
> out a way that would be best of course).
>
> If I were to propose an API for returning a Unicode string, I would
> probably add a new method (e.g. tounicode()) rather than using a
> "magical" argument (tostring(encoding=str)), but given that that
> exists in another supposedly-compatible implementation I'm not
> against it.

Actually, lxml.etree originally had a tounicode() function for this purpose, and I deprecated it in favour of tostring(encoding=unicode) to avoid having a separate interface for this, while staying just as explicit as before. I'm aware that this wasn't an all-win decision, but I found passing the unicode type to be explicit enough, and separate enough from an encoding /name/ to make it clear what happens. It's certainly less beautiful in Py3, where you write "tostring(encoding=str)".

I still didn't remove the function from the API, but it's been deprecated for years. Reactivating it in lxml.etree, and duplicating it in ET would save lxml.etree from having to break user code (as "tostring(encoding=str)" could simply continue to work, but disappear from the docs). It wouldn't save ET-Py3 from breaking backwards compatibility to itself, though.

> Maybe tostring(encoding=None) could also be made to work? That would
> at least make it *possible* to write code that receives a text object
> and that works in 3.1 and 3.2 both. In 2.x I think neither of these
> should work, and there probably isn't a need -- apps needing full
> compatibility will just have to refrain from calling tostring()
> without arguments.

It could be made to work, and it doesn't even read that bad. I can't imagine anyone using this explicitly to get the default behaviour, although you never know how people put together their keyword argument dicts programmatically. 'None' has always been the documented default for the encoding parameter, so I'm sure there's at least a tiny bit of code that uses it to say "I'm not overriding the default here".

Actually, the encoding has been a keyword-only parameter in lxml.etree for ages, which was ok with the original default and conform with the official ET documentation. So it would be easy to switch here, although not beautiful in the implementation. Same for ElementTree, where the current default None in the signature could simply be replaced by the 'real' default 'us-ascii'. Within the Py3 series, this change would not keep up backwards compatibility either.

So, as a solution, I do prefer separating this feature out into a separate function, so that we can simplify the interface of tostring() into always returning a byte string serialisation, as it always was in ET. The rather distinct use case of serialising to an unencoded text string can well be handled by a tounicode() function.

> ISTM that the behavior of write() is just fine -- the contents of the
> file will be correct after all.

Not according to the Py3.2 dev docs of open():

"""
'encoding' is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent (whatever locale.getpreferredencoding() returns)
"""

So if a user's "preferred encoding" is not UTF-8 compatible, then writing out the Unicode serialisation will result in an incorrect XML serialisation, as an XML byte stream without encoding declaration is assumed to be in UTF-8 by specification.

Stefan
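
The encoding mismatch described in this comment can be simulated in memory. This sketch assumes a platform whose preferred encoding is cp1252 and mimics the text-mode write by encoding by hand; no file or locale setting is actually involved.

```python
import xml.etree.ElementTree as ET

# Text serialisation without an XML declaration.
text = "<tag>hell\u00f6</tag>"

# What a text-mode file would store if the platform default were cp1252.
on_disk = text.encode("cp1252")

try:
    ET.fromstring(on_disk)
    parse_error = None
except ET.ParseError as exc:
    # Without a declaration the parser assumes UTF-8, and the lone
    # 0xF6 byte is not valid UTF-8.
    parse_error = exc

# The same text stored as UTF-8 parses fine.
parsed_text = ET.fromstring(text.encode("utf-8")).text
```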

scoder · 2010-03-12T07:00:57Z

One more thing: given that many web-frameworks are still not available for Py3 at this time, and that there are still tons of third-party libraries missing on that platform, I would be surprised if there was any ElementTree based XML/HTML processing code written specifically and only for Py3 by now. So I cannot imagine any noticeable body of code being available that relies on this new Py3 feature.

effbot · 2010-03-12T09:00:45Z

"'None' has always been the documented default for the encoding parameter"

That's probably mostly by accident at least in original ET, but the 1.3 draft docs at effbot.org/elementtree does spell it out explicitly for the 'write' method:

Output encoding. If omitted or set to None, defaults to US-ASCII.

Not sure I'd consider this text binding in itself, though (even if I'd argue that it's preferred to have the same interpretation of encoding everywhere).

"writing out the Unicode serialisation will result in an incorrect XML serialisation"

I think Guido meant the ElementTree.write method; is that broken too?

The file.write(et.tostring()) issue is probably my most pressing concern here; that's a common use case (e.g. when using "iterparse" to cut pieces from a big document), and the defaults were chosen to increase the chance that this automatically does the right thing for non-ASCII even if the programmer never tests it. In 3.X, that construct is suddenly dependent on the interpreter's default encoding.
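
A minimal sketch of the `file.write(et.tostring())` pattern in a form that does not depend on any default encoding. It assumes a CPython release in which `tostring()` returns bytes by default (3.2 or later); the temporary path is only for illustration.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

element = ET.Element("tag")
element.text = "hell\u00f6"

path = os.path.join(tempfile.mkdtemp(), "out.xml")

# Binary mode plus a bytes serialisation: no locale-dependent step.
with open(path, "wb") as handle:
    handle.write(ET.tostring(element))  # US-ASCII with character references

with open(path, "rb") as handle:
    stored = handle.read()

round_tripped = ET.parse(path).getroot().text
```

Because the default output is pure ASCII, the non-ASCII character survives as a numeric character reference regardless of how the platform is configured.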

I think I'd prefer old "tostring" behaviour and a separate "tounicode" function, and I'm still not convinced that the latter is required for the XML use case (which implies that maybe it should live in lxml.html for the HTML case, even if it ends up calling the same internal implementation).

Or should that be "tobytes" and "tounicode" to eliminate all ambiguity?

effbot · 2010-03-12T09:34:48Z

(what's the Python 3 replacement for the array module, btw?)

scoder · 2010-03-12T09:38:26Z

"'None' has always been the documented default for the encoding parameter"

What I meant here was that "help(ET.tostring)" will show you that as the default. Also, in the docs, the signature is "tostring(tree, encoding=None)", so None is the documented default value for the argument, regardless of the internal handling.

>> "writing out the Unicode serialisation will result in an incorrect
>> XML serialisation"
>
> I think Guido meant the ElementTree.write method; is that broken too?

Yes, the feature has been implemented deep down in the _encode() helper function, so it impacts the entire serialiser, not only its API.

> I think I'd prefer old "tostring" behaviour and a separate "tounicode" function, and I'm still not convinced that the latter is required for the XML use case (which implies that maybe it should live in lxml.html for the HTML case, even if it ends up calling the same internal implementation).

I obviously agree that the use case for XML is feeble, but that alone doesn't make this a convincing argument to move it into lxml.html when the implementation will stay in lxml.etree anyway. Besides, that's pretty off-topic for this bug tracker.

> Or should that be "tobytes" and "tounicode" to eliminate all ambiguity?

That might be the clean break-all-bridges solution, but I don't think the name tostring() is so inherently broken in Py3 that it needs fixing. It's not "tostr()", for example.

I wouldn't raise much opposition against tobytes() as an alias for tostring(), although that sounds more like duplicating an otherwise simple API.

Stefan

effbot · 2010-03-12T10:14:06Z

"Yes, the feature has been implemented deep down in the _encode() helper function, so it impacts the entire serialiser, not only its API"

Ouch.

>>> import locale
>>> locale.getpreferredencoding() == "utf-8"
False
>>> from xml.etree.ElementTree import *
>>> e = Element("tag")
>>> e.text = "hellö"
>>> tostring(e)
'<tag>hellö</tag>'
>>> ElementTree(e).write("out.xml")
>>> tree = parse("out.xml")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python31\lib\xml\etree\ElementTree.py", line 843, in parse
    tree.parse(source, parser)
  File "C:\Python31\lib\xml\etree\ElementTree.py", line 581, in parse
    parser.feed(data)
  File "C:\Python31\lib\xml\etree\ElementTree.py", line 1221, in feed
    self._parser.Parse(data, 0)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 9

effbot · 2010-03-12T10:25:33Z

"I wouldn't raise much opposition against tobytes() as an alias for tostring(), although that sounds more like duplicating an otherwise simple API."

Adding an alias would be a way to address the 2.X/3.X terminology overlap; string traditionally implies 8-bit in 2.X, and apparently now Unicode in 3.X. That's likely to cause a lot of confusion for people switching from 2 to 3 (and to people writing 3.X documentation, apparently; the array module's documentation is an example of that).

(And once everyone has switched over, we can deprecate the tostring spelling... :)

ET isn't the only thing with tostring functionality, of course -- it's pretty much the standard name for "serialize data structure to byte string for later transmission" -- so it probably wouldn't hurt with a python-dev pronouncement here.

florentx · 2010-03-12T12:24:30Z

I plan to merge ET 1.3 in the 3.x branch tomorrow (See bpo-6472)
Currently, the patch is consistent with 3.1 behaviour.
It could be changed later, depending on the pronouncement on this compatibility issue.

> Previously, in ElementTree, serialising without an explicit encoding
> was a way to get a byte encoded serialisation without an XML
> declaration header.

Now you can pass keyword argument "xml_declaration=False" to skip the header explicitly.

> xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 9

Now it works better.

~ $ ./python 
Python 3.2a0 (py3k:78865M, Mar 12 2010, 13:05:30) 
[GCC 4.3.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.getpreferredencoding() == "utf-8"
False
>>> from xml.etree.ElementTree import *
>>> e = Element("tag")
>>> e.text = "hellö"
>>> tostring(e)
'<tag>hellö</tag>'
>>> ElementTree(e).write("out.xml")
>>> tree = parse("out.xml")
>>> dump(tree)
<tag>hellö</tag>

effbot · 2010-03-12T12:38:57Z

Interesting. But isn't the problem with 3.1 that it relies on the standard encoding, which results in code that may or may not work depending on a global platform setting? Who's doing the encoding in the new version? And what ends up in the file?

florentx · 2010-03-12T12:40:28Z

>>> tree = parse("out.xml")

Actually the test in my previous message does not prove anything.
locale.getpreferredencoding() returns "UTF-8" != "utf-8".

:)

effbot · 2010-03-12T12:52:43Z

Oops :) Yeah, that was a pretty lousy way to show what encoding I was using for that test:

>>> import locale
>>> locale.getpreferredencoding()
'cp1252'
>>>

(Somewhat related, it would be nice if Python actually normalized defaultencoding/preferredencoding to some canonical name for the codec in use, i.e. preferred MIME name or at least IANA; we had a rather nice little bug recently that wouldn't have happened if that had been the case...)
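
The `"UTF-8" != "utf-8"` pitfall from the previous comments can be avoided by comparing codec names after normalisation. This sketch uses `codecs.lookup()`, which resolves aliases to one canonical codec name; `same_encoding` is a hypothetical helper name.

```python
import codecs
import locale


def same_encoding(first, second):
    """Hypothetical helper: compare two encoding names by their codec."""
    return codecs.lookup(first).name == codecs.lookup(second).name


# Compare the platform's preferred encoding against UTF-8 without
# depending on how the platform happens to spell it.
preferred = locale.getpreferredencoding()
preferred_is_utf8 = same_encoding(preferred, "utf-8")
```

The canonical name returned by `codecs.lookup()` is Python's own spelling, not necessarily the IANA or preferred MIME name that the comment above asks for.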

gvanrossum · 2010-03-12T15:32:07Z

I propose that we continue to see Fredrik as elementtree's "BDFL". If Fredrik wants the API in 3.2 to be changed to be backwards compatible with 2.x, we should do that, and damn the torpedoes (um, 3.1 compatibility).

I would do this ASAP; if you can, fix it *before* merging 1.3.

Since I hate XML equally whether it's text or bytes, please leave me out of this in the future; I apologize for having caused the problem in the first place (but note that apparently nobody cared or noticed until a week ago).

florentx · 2010-03-14T10:59:39Z

Currently "tree.write(file)" writes Unicode in 3.1 (and 3.x).
I would propose the following change:

>>> tree.write(file)
#  ==>  encode to ASCII without xml declaration (compatible 2.x)
>>> tree.write(file, encoding="utf-8")
#  ==>  encode to UTF-8 without xml declaration (compatible 2.x + 3.1)
>>> tree.write(file, encoding=False)
#  ==>  output Unicode, without xml declaration (compatible 3.1)

The "xml_declaration" keyword argument can be set to True explicitly.

For compatibility with lxml.etree, "encoding=str" returns the same as "encoding=False".

Functions tostring() and tostringlist() will inherit the same behavior.
This change could be backported to 2.7, because it is backward compatible.

See proposed patch for implementation details.

scoder · 2010-03-14T11:28:38Z

That's a funny idea. I like that. +1

effbot · 2010-03-21T14:38:39Z

Hmm. I'm not entirely sure about giving False a meaning when None has traditionally had a different (and documented) meaning. And sleeping on it hasn't convinced me in either direction :-(

(well, I'd say no, but the compatibility argument is somewhat tempting)

I'm not that concerned by changing the default for write -- 3.x users with utf-8 as the default output encoding will get different output, but still perfectly valid XML. 3.x users with non-utf-8 default encodings will get valid XML also in cases where it didn't work before.

tostring() is more problematic, but I'm leaning towards Guido's torpedoes approach there -- changing the default output to bytestrings is more likely to cause code to blow up than cause bad output, and you can trivially make your program backwards compatible by adding an extra check/decode after the call. Supporting unicode for lxml.etree compatibility is fine with me, but I think it might make sense to support the string "unicode" as well (as a pseudo-encoding -- it's pretty clear to me that nobody will ever define a real character encoding with that name :-).
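A minimal sketch of the "extra check/decode after the call" idea, assuming the default output is either a text string (3.1 behaviour) or US-ASCII bytes (2.x and the behaviour proposed for 3.2). The helper name `tostring_text` is hypothetical:

```python
import xml.etree.ElementTree as ET


def tostring_text(elem):
    """Serialise `elem` and always hand back a text string."""
    data = ET.tostring(elem)
    # If the serialiser returned bytes, decode them; the default
    # byte output is assumed to be US-ASCII here.
    if isinstance(data, bytes):
        data = data.decode("ascii")
    return data


root = ET.Element("root")
root.text = "caf\u00e9"
text = tostring_text(root)
print(type(text).__name__)
```

Code written this way does not care which default the installed version uses.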

Have you posted/can you post the patch to Rietveld, btw? I have some questions about the code that are independent of the encoding decision.

florentx · 2010-03-22T08:53:59Z

http://codereview.appspot.com/664043 (patch against 3.x)

IIUC, the changes proposed (for 3.2) are:

default encoding or bool(encoding) == False
==> fallback to 'US-ASCII' encoding (instead of Unicode)
encoding=str or encoding='unicode'
==> serialize to Unicode

And it changes the behavior of:

ET.write()
tostring()
tostringlist()

For 2.x we could add the options for Unicode output:

encoding=unicode
and encoding='unicode'

scoder · 2010-03-22T09:09:28Z

> Supporting unicode for lxml.etree compatibility is fine with me, but I
> think it might make sense to support the string "unicode" as well (as
> a pseudo-encoding -- it's pretty clear to me that nobody will ever
> define a real character encoding with that name :-).

The reason I chose the unicode type over a 'unicode' string name at the time was that I wanted to make a clear distinction to show that this is not just selecting a different codec but that it changes the output type.

I don't really care either way, though, given that this reads a lot less well in Py3. If ET supports both, lxml will follow.

Stefan

malemburg · 2010-03-22T09:36:30Z

Stefan Behnel wrote:

> Stefan Behnel <scoder@users.sourceforge.net> added the comment:
>
>> Supporting unicode for lxml.etree compatibility is fine with me, but I
>> think it might make sense to support the string "unicode" as well (as
>> a pseudo-encoding -- it's pretty clear to me that nobody will ever
>> define a real character encoding with that name :-).
>
> The reason I chose the unicode type over a 'unicode' string name at the time was that I wanted to make a clear distinction to show that this is not just selecting a different codec but that it changes the output type.
>
> I don't really care either way, though, given that this reads a lot less well in Py3. If ET supports both, lxml will follow.

There's always the possibility of adding a new official codec
called 'unicode' which converts Unicode to Unicode as no-op.

This may also be useful to have in other situations where you
want to signal a special case for Unicode input or output.

florentx · 2010-07-31T16:55:40Z

Patch updated here, and on Rietveld too.
http://codereview.appspot.com/664043

Rules (as discussed):

tree.tostring(encoding=None) => encodes to "US-ASCII"
(compatible with 2.7 and lxml.etree)
tree.tostring(encoding="unicode") => outputs Unicode
tree.tostring(encoding=str) => outputs Unicode
(compatible with lxml.etree)

For 2.7, no change planned.
For 3.1, do we keep the current behavior?

tree.tostring(encoding=None) => outputs Unicode

scoder · 2010-08-08T18:29:24Z

I would suggest fixing the tostring() behaviour also in a future 3.1.x bug fix release. After all, the current behaviour means that 3.0 and 3.1 would behave different from any other (released or future) Python version here.

florentx · 2010-08-08T20:08:10Z

Done for 3.2 with r83851.

Still open, if someone wants to propose a patch for 3.1.
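A minimal sketch of the resulting behaviour, assuming Python 3.2 or later: the default serialisation is a byte string that is pure ASCII, and `encoding="unicode"` asks for text instead.

```python
import xml.etree.ElementTree as ET

elem = ET.Element("root")
elem.text = "caf\u00e9"

# Default: bytes, with non-ASCII characters written as character references.
as_bytes = ET.tostring(elem)

# Pseudo-encoding "unicode": a text string, non-ASCII characters kept as-is.
as_text = ET.tostring(elem, encoding="unicode")

# tostringlist() follows the same rule for each chunk it returns.
chunks = ET.tostringlist(elem)

print(type(as_bytes).__name__, type(as_text).__name__)
```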

florentx · 2011-10-29T02:37:45Z

3.1 is no longer in scope for this issue.

scoder added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Mar 3, 2010

pitrou added the docs Documentation in the Doc dir label Mar 8, 2010

pitrou assigned birkenfeld Mar 8, 2010

florentx mannequin assigned effbot and unassigned birkenfeld Mar 22, 2010

florentx mannequin added the topic-XML label Jul 31, 2010

florentx mannequin added the easy label Aug 8, 2010

florentx mannequin unassigned effbot Aug 8, 2010

florentx mannequin closed this as completed Oct 29, 2011

ezio-melotti transferred this issue from another repository Apr 10, 2022
