classification
Title: Serialiser in ElementTree returns unicode strings in Py3k
Type: behavior Stage: needs patch
Components: Documentation, Library (Lib), XML Versions: Python 3.1
process
Status: closed Resolution: out of date
Dependencies: Superseder:
Assigned To: Nosy List: effbot, flox, georg.brandl, lemburg, r.david.murray, scoder
Priority: normal Keywords: easy

Created on 2010-03-03 07:15 by scoder, last changed 2011-10-29 02:37 by flox. This issue is now closed.

Files
File name: issue8047_etree_encoding_v2.diff
Uploaded: flox, 2010-07-31 16:55
Description: Patch, apply to 3.x
Messages (47)
msg100333 - (view) Author: Stefan Behnel (scoder) * Date: 2010-03-03 07:15
The xml.etree.ElementTree package in the Python 3.x standard library breaks compatibility with existing ET 1.2 code. The serialiser returns a unicode string when no encoding is passed. Previously, the serialiser was guaranteed to return a byte string. By default, the string was 7-bit ASCII compatible.

This behavioural change breaks all code that relies on the default behaviour of ElementTree. Since there is no longer a default encoding in Python 3, unicode strings are incompatible with byte strings, which means that the result of the serialisation can no longer be written to a file, for example.

XML is well defined as a stream of bytes. Redefining it as a unicode string *by default* is hard to understand at best.

Finally, it would have been good to look at the other ET implementation before introducing such a change. The lxml.etree package has had support for serialising XML into a unicode string for years, and does so in a clear, safe and explicit way. It requires the user to pass the 'unicode' (Py3 'str') type as encoding parameter, e.g.

    tree.tostring(encoding=str)

which is explicit enough to make it clear that this is different from a normal encoding.
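
To make the file-writing complaint concrete, here is a minimal sketch. It assumes a current CPython 3 (3.2 or later), where the standard library spells text output as encoding="unicode"; io.BytesIO stands in for a file opened in binary mode, and all variable names are illustrative only.

```python
import io
import xml.etree.ElementTree as ET

elem = ET.Element("tag")
elem.text = "hello"

# Text serialisation (str) versus byte serialisation (bytes).
as_text = ET.tostring(elem, encoding="unicode")
as_bytes = ET.tostring(elem, encoding="us-ascii")

# Stand-in for a file opened with mode "wb".
binary_file = io.BytesIO()
binary_file.write(as_bytes)  # bytes are accepted

try:
    binary_file.write(as_text)  # str is rejected by a binary stream
    text_write_failed = False
except TypeError:
    text_write_failed = True
```

The byte serialisation can be written as-is, while the text serialisation has to be encoded by someone first, which is the step the report argues should not be left implicit.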
msg100342 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-03-03 13:44
I'm not an ElementTree user, but that spelling (etree.tostring(encoding=str), or even etree.tostring(encoding=unicode)) strikes me as horrible.  You don't encode to unicode, you *decode* to unicode.  Thus the current Python3 interface works the way I'd expect: if I don't specify an encoding, I get unicode.  If I do specify an encoding, I get encoded bytes.  In the general case, the fact that you can no longer get away with being sloppy about what encoding a byte stream is in, the way you could in Python2, is a feature of Python3, not a bug.

If anything, having 'tostring' return bytes is broken, given its name.  But I think we fudge that by claiming it is returning a 'byte string' when given an encoding.

That said, I'm not sure how much, if at all, my opinion counts :)
msg100345 - (view) Author: Stefan Behnel (scoder) * Date: 2010-03-03 14:33
I agree that the lxml API is somewhat clumsy here. I just mentioned it to show that there are already ways to do it in a backwards compatible way, so this change does two things: it breaks existing code, and it does so in a way that is incompatible with other existing implementations. That's what *I* would call horrible.

Also, this is absolutely not a feature that is restricted to Py3, so what's the equivalent feature in the standard library of Py2 going to be, and how much code will it break for the Py2 series?
msg100349 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-03-03 17:24
My understanding is that backward compatibility, while nice to retain, was not considered a stopper for cleaning up interfaces in py3.  Exactly how considered this change was, I have no idea, but as I said it does make sense to me.  As for 2.x, what's there is what's there, as far as I can see.  Florent could speak to whether or not that API is likely to change in 2.7, but I doubt it will.
msg100350 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-03-03 19:10
With ET 1.3, the serializer ElementTree.write() should output bytes only. And the default encoding is still US-ASCII.

The new behaviour is specific to the 3.x branch (since 3.0, r56841).
Even if it is not fully backward compatible, I don't find this behavior shocking: it is a rule of Python 3 to avoid implicit encoding/decoding.
msg100513 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-03-06 02:57
I don't know what compatibility you are talking about. Py3k deliberately breaks compatibility with many 2.x behaviours that were considered defective or suboptimal.
msg100572 - (view) Author: Stefan Behnel (scoder) * Date: 2010-03-07 10:56
It has been brought up several times that ET is special in the stdlib in that it is an externally maintained package. Correct me if I'm wrong, but the rules seem to be: features come outside, adaptation to Py3 can happen inside. What we are talking about here is a new feature that makes sense for both Py2 and Py3. We are not talking about a bug fix, neither is this an adaptation to Py3. It is a new feature that was added inside of the standard library and that is not compatible with the external libraries that are supposed to implement the same interface, namely, ElementTree and lxml.etree.
msg100582 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-03-07 14:41
As Florent said, it is a rule of py3k to avoid implicit encoding/decoding. The fact that it could have made sense for 2.x as well is not relevant, since the change was only done in py3k (and for good reason: we normally try not to break compatibility without prior notice).

In any case, I have trouble understanding your concern here. Do you think the change is bad? Is it really that difficult to support it in lxml?
msg100633 - (view) Author: Stefan Behnel (scoder) * Date: 2010-03-08 09:01
Antoine, in the same comment, you say that it was not backported to Py2 in order to prevent breaking existing code, and then you ask if it's difficult to support in lxml. ;-)

Supporting the same behaviour in lxml would either mean that it breaks existing code in Py2 (when making the API consistent), or that you can safely (and correctly) write the return value to a file in Py2, but that you can't do the same in Py3 (when adopting the change only in Py3).

Previously, in ElementTree, serialising without an explicit encoding was a way to get a byte encoded serialisation without an XML declaration header, so I expect there to be code that depends on this. Since ElementTree 1.3 uses the same keyword argument as lxml for this feature, I assume that Florent's patches provide at least an alternative here, even if it requires users to adapt their code.

I just wish this backwards incompatible feature had been advertised at the time, or at least *documented* in any way. Even the latest 3.2-dev docs still state that the default encoding of the serialiser is US-ASCII, not a word about *ever* returning a unicode string, especially not by default, and totally not the required big fat warning that writing to a file will fail with mysterious errors if no encoding is specified.
msg100634 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-03-08 09:19
With ET 1.3, you should have an explicit keyword argument "xml_declaration":

# ----
        if xml_declaration or (xml_declaration is None and
                               encoding not in ("utf-8", "us-ascii")):
            if method == "xml":
                write("<?xml version='1.0' encoding='%s'?>\n" % encoding)
# ----

In ET 1.2.6, the same snippet looks like:
# ----
        if encoding != "utf-8" and encoding != "us-ascii":
            file.write("<?xml version='1.0' encoding='%s'?>\n" % encoding)
# ----
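
The ET 1.3 condition quoted above can be restated as a small standalone predicate. This is only a sketch: needs_xml_declaration is a hypothetical helper name, not part of ElementTree, and it mirrors the quoted snippet rather than any particular release.

```python
def needs_xml_declaration(encoding, xml_declaration=None, method="xml"):
    """Hypothetical helper mirroring the ET 1.3 condition quoted above."""
    # The declaration is only ever written for the "xml" output method.
    if method != "xml":
        return False
    # An explicit True always forces the declaration.
    if xml_declaration:
        return True
    # With the default (None), only "unusual" encodings get a declaration;
    # an explicit False suppresses it entirely.
    return xml_declaration is None and encoding not in ("utf-8", "us-ascii")
```

Under this reading, xml_declaration=None reproduces the ET 1.2.6 behaviour, and the new keyword only adds the ability to override it in either direction.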
msg100649 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-03-08 15:05
Le Mon, 08 Mar 2010 09:01:19 +0000,
Stefan Behnel <report@bugs.python.org> a écrit :
> 
> Antoine, in the same comment, you say that it was not backported to
> Py2 in order to prevent breaking existing code, and then you ask if
> it's difficult to support in lxml. ;-)

I meant breaking existing *user* code. Besides, the fact that
compatibility is broken doesn't mean third-party code is difficult to fix;
hence my question.

> Supporting the same behaviour in lxml would either mean that it
> breaks existing code in Py2 (when making the API consistent), or that
> you can safely (and correctly) write the return value to a file in
> Py2, but that you can't do the same in Py3 (when adopting the change
> only in Py3).

Sorry, I don't understand this. Are you saying it's impossible
for you to define two different behaviours based on the current Python
version? What's bad with
"""if sys.version_info >= (3, 0, 0): # blah"""

> Previously, in ElementTree, serialising without an explicit encoding
> was a way to get a byte encoded serialisation without an XML
> declaration header, so I expect there to be code that depends on
> this.

This doesn't seem to be documented. The doc simply says
"""encoding is the output encoding (default is US-ASCII)""".

In other words, undocumented (and untested) behaviour has been "broken"
when porting to 3.0, which is the version which deliberately broke
compatibility for documented things. I guess we can live with it ;)

> Even the latest
> 3.2-dev docs still state that the default encoding of the serialiser
> is US-ASCII, not a word about *ever* returning a unicode string,
> especially not by default, and totally not the required big fat
> warning that writing to a file will fail with mysterious errors if no
> encoding is specified.

Ok, perhaps some documentation changes are in order :-)
(I wonder why the default was US-ASCII, though. Sounds a bit braindead)
msg100846 - (view) Author: Fredrik Lundh (effbot) * (Python committer) Date: 2010-03-11 12:37
The "no header" thing is very much done on purpose, and it's documented in the upstream ElementTree documentation.

I suggest dropping this "Python 3 exists in its own universe" nonsense; it's not very professional, and it's hurting Python, its users, and all third party developers.  The "things I don't understand are braindead" stuff is less of a problem; that only hurts yourself.
msg100857 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-03-11 14:45
> The "no header" thing is very much done on purpose, and it's
> documented in the upstream ElementTree documentation.

I'm sorry, where is that?
I can't find it either at
http://effbot.org/zone/pythondoc-elementtree-ElementTree.htm#elementtree.ElementTree.tostring-function
or
http://effbot.org/zone/pythondoc-elementtree-ElementTree.htm#elementtree.ElementTree.ElementTree.write-method

> I suggest dropping this "Python 3 exists in its own universe"
> nonsense; it's not very professional, and it's hurting Python, its
> users, and all third party developers.

Ha. There has been a very long temporal window (until 3.1, probably)
during which things were very much in flux and anyone with a
professional knowledge of elementtree and XML APIs could chime in and
point out any nonsense in py3k.

Now Python 3.1 is out and as a result py3k also has to ensure upwards
compatibility for its own APIs. Of course we can still make exceptions
if the alleged breakage is truly major. To me, it doesn't /seem/ to be
the case here.
msg100868 - (view) Author: Stefan Behnel (scoder) * Date: 2010-03-11 15:49
Sorry, Antoine, but you can't possibly mean what you say here. The culprit in question is clearly one of the best hidden features of the new Py3 ET API. The only existing reference to it that I can find is the SVN commit comment when it was applied. How is that supposed to be any reason for keeping up "backwards compatibility" within the Py3 series?
msg100877 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-03-11 16:53
I suspect that what Antoine is referring to is the fact that Python 3.1 has this behavior.  Whether or not it is explicitly documented is a secondary issue.

We're having a similar issue in the unittest package, where there's a new function, assertSameElements, that has an unfortunate and poorly documented API.  But changing that API now that the function exists in a released version (3.1) is not something to be done lightly, if it is done at all.

This is definitely an unfortunate state of affairs no matter how you look at it.
msg100880 - (view) Author: Fredrik Lundh (effbot) * (Python committer) Date: 2010-03-11 19:01
> if I don't specify an encoding, I get unicode.  If I do specify an encoding, I get encoded bytes.

You're confusing the XML document encoding with character set encoding.

A serialized (unparsed) XML document is a byte stream, not a string of Unicode characters.  And the character set encoding is both embedded in that byte stream and affects how it's generated in more than one way; you cannot just recode XML documents willy-nilly and expect things to work.

A parsed XML document (an infoset) -- for ET, that's the tree of Element objects -- does indeed contain Unicode strings, but the transformation from the byte stream to the Unicode string doesn't just involve character set decoding; there are several other constructs that are handled by the XML parser.

> Ha. There has been a very long temporal window

You should have had plenty of time to fix it, then, right?
msg100883 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-03-11 20:22
> > Ha. There has been a very long temporal window
> 
> You should have had plenty of time to fix it, then, right?

Under the condition that someone would have actually reported it, yes.
We don't magically fix bugs if nobody (including us) detects and reports
them.
msg100884 - (view) Author: Stefan Behnel (scoder) * Date: 2010-03-11 20:43
Then I would call that a clear sign that no-one actually stumbled over this feature in Py3 before I did, well hidden as it was. Still time to fix it.
msg100887 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-03-11 21:14
You may well be correct.  But just because no one reported a bug does not mean that no one is using the API.  The person using it may find it perfectly logical (and may be writing py3 only code, not porting py2 code).

However, regardless of whether we decide it is acceptable to change the behavior, it seems to me that having an interface named 'tostring' that returns bytes by default in Python3 would be a broken API.  I don't see any way around that terminology problem.
msg100890 - (view) Author: Fredrik Lundh (effbot) * (Python committer) Date: 2010-03-11 22:03
>>> import array
>>> array.array("i", [1, 2, 3]).tostring()
b'\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00'
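
For readers on newer interpreters: array.tostring() was later deprecated in favour of array.tobytes() and removed in Python 3.9. A minimal sketch of the same call under that assumption (the exact bytes depend on the platform's item size and byte order, so none are shown):

```python
import array

values = array.array("i", [1, 2, 3])

# tobytes() is the later spelling of the tostring() call shown above.
raw = values.tobytes()

# Round-trip the raw bytes back into an array of the same type code.
restored = array.array("i")
restored.frombytes(raw)
```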
msg100891 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-03-11 23:01
Le Thu, 11 Mar 2010 22:03:37 +0000,
Fredrik Lundh <report@bugs.python.org> a écrit :
> 
> >>> import array
> >>> array.array("i", [1, 2, 3]).tostring()
> b'\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00'

The fact that array is old, rusty and slightly broken doesn't mean we
should propagate that brokenness to other Python modules.
Also, as David said, the fact that you think there is a bug
here doesn't mean everyone would agree.
Finally, the behaviour you seem to be looking for could be added
as a separate API or an optional method argument. Patches welcome.
msg100895 - (view) Author: Fredrik Lundh (effbot) * (Python committer) Date: 2010-03-12 00:10
So now it's the domain experts against some hypothetical people that might exist?  Tricky.
msg100896 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-03-12 00:14
Well, Benjamin pointed out to me that it would be a bad thing if array.tostring produced a string.  True, the method is named wrong, but it is less broken than returning a string.  I suspect that that is the same argument Fredrik is making: that returning the XML as a byte string is less broken than returning it as a string when it in fact may contain other encoded stuff.  The email package has some of the same problems, and there we are retooling the API to deal with this.

Presumably ET needs to have a retooled API for Python3 as well.  Then the question becomes what do we do in the meantime?  For email, we are just living with the breakage until we can get something better in place, because no one has come up with any good short term solutions for email.
msg100898 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2010-03-12 00:35
Hey, can we all try to get along?

For anyone who didn't follow the link to r56841, that was mine (though Christian Heimes provided the basis for much of the patch apart from elementtree), and I wrote at the time:

"""I had to fix a few tests and modules beyond what Christian did, and
invent a few conventions.  E.g. in elementtree, I chose to
write/return Unicode strings when no encoding is given, but bytes when
an explicit encoding is given."""

I am not a user of elementtree, so this may well have been a mistake -- at the time (in 2007) we were so busy making zillions of tests pass that some mistakes were made.  Some of those were caught in time, others apparently not.

My thinking was that since an XML document looks like text, it should probably be considered text, at least by default.  (There may have been some unittests that appeared to require this -- of course this was probably just the confusion between byte strings and 8-bit text strings inherent in Python 2.)

Regarding backwards compatibility, there are now two backwards compatibility problems: with 2.x, and with 3.1.  It seems we cannot easily be backwards compatible with both (though if someone figures out a way that would be best of course).

If I were to propose an API for returning a Unicode string, I would probably add a new method (e.g. tounicode()) rather than using a "magical" argument (tostring(encoding=str)), but given that that exists in another supposedly-compatible implementation I'm not against it.  Maybe tostring(encoding=None) could also be made to work? That would at least make it *possible* to write code that receives a text object and that works in 3.1 and 3.2 both.  In 2.x I think neither of these should work, and there probably isn't a need -- apps needing full compatibility will just have to refrain from calling tostring() without arguments.

ISTM that the behavior of write() is just fine -- the contents of the file will be correct after all.
msg100900 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-03-12 04:42
Not wanting to waste my time anymore on this.
msg100902 - (view) Author: Stefan Behnel (scoder) * Date: 2010-03-12 06:48
Hi Guido, your comment was long overdue in this discussion.

Guido van Rossum, 12.03.2010 01:35:
> My thinking was that since an XML document looks like text, it should
> probably be considered text, at least by default.  (There may have
> been some unittests that appeared to require this -- of course this
> was probably just the confusion between byte strings and 8-bit text
> strings inherent in Python 2.)

Well, well, XML...

It does look like text, but it's encoded text that is defined as a stream of bytes, and that's the only safe way of dealing with it.

There certainly *is* a use case for treating the serialised result as text, that's why lxml has this feature. A minor one is for debug output (which certainly doesn't merit being the default), but another one is when dealing with HTML, where encoding information is certainly less well defined and *much* less often seen in the wild. So users tend to be happy when they get their real-world HTML input fixed up into proper Unicode, still happier when they see that lxml can parse that correctly and even serialise the result back into a Unicode string directly, that they can post-process as text if they need to.

However, the main part here is the input, i.e. getting HTML data properly decoded into Unicode. The output part is a lot less important, and it's often easier to let lxml.html do the correct serialisation into bytes with proper encoding meta information, rather than dealing with it yourself.

Those are the two use cases I see for lxml. Their impact on ElementTree is relatively low as it doesn't support *parsing* from a Unicode string, so the most important HTML feature isn't there in the first place. The lack of major use cases in ElementTree is one of the reasons I'm so opposed to making this feature the backwards incompatible default for the output side.


> Regarding backwards compatibility, there are now two backwards
> compatibility problems: with 2.x, and with 3.1.  It seems we cannot
> easily be backwards compatible with both (though if someone figures
> out a way that would be best of course).
> 
> If I were to propose an API for returning a Unicode string, I would
> probably add a new method (e.g. tounicode()) rather than using a
> "magical" argument (tostring(encoding=str)), but given that that
> exists in another supposedly-compatible implementation I'm not
> against it.

Actually, lxml.etree originally had a tounicode() function for this purpose, and I deprecated it in favour of tostring(encoding=unicode) to avoid having a separate interface for this, while staying just as explicit as before.  I'm aware that this wasn't an all-win decision, but I found passing the unicode type to be explicit enough, and separate enough from an encoding /name/ to make it clear what happens. It's certainly less beautiful in Py3, where you write "tostring(encoding=str)".

I still didn't remove the function from the API, but it's been deprecated for years. Reactivating it in lxml.etree, and duplicating it in ET would save lxml.etree from having to break user code (as "tostring(encoding=str)" could simply continue to work, but disappear from the docs). It wouldn't save ET-Py3 from breaking backwards compatibility with itself, though.


> Maybe tostring(encoding=None) could also be made to work? That would
> at least make it *possible* to write code that receives a text object
> and that works in 3.1 and 3.2 both.  In 2.x I think neither of these
> should work, and there probably isn't a need -- apps needing full
> compatibility will just have to refrain from calling tostring()
> without arguments.

It could be made to work, and it doesn't even read that bad. I can't imagine anyone using this explicitly to get the default behaviour, although you never know how people put together their keyword argument dicts programmatically. 'None' has always been the documented default for the encoding parameter, so I'm sure there's at least a tiny bit of code that uses it to say "I'm not overriding the default here".

Actually, the encoding has been a keyword-only parameter in lxml.etree for ages, which was ok with the original default and conforms to the official ET documentation. So it would be easy to switch here, although not beautiful in the implementation. Same for ElementTree, where the current default None in the signature could simply be replaced by the 'real' default 'us-ascii'. Within the Py3 series, this change would not keep up backwards compatibility either.

So, as a solution, I do prefer separating this feature out into a separate function, so that we can simplify the interface of tostring() into always returning a byte string serialisation, as it always was in ET. The rather distinct use case of serialising to an unencoded text string can well be handled by a tounicode() function.


> ISTM that the behavior of write() is just fine -- the contents of the
> file will be correct after all.

Not according to the Py3.2 dev docs of open():

"""
'encoding' is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent (whatever locale.getpreferredencoding() returns)
"""

So if a user's "preferred encoding" is not UTF-8 compatible, then writing out the Unicode serialisation will result in an incorrect XML serialisation, as an XML byte stream without encoding declaration is assumed to be in UTF-8 by specification.

Stefan
msg100903 - (view) Author: Stefan Behnel (scoder) * Date: 2010-03-12 07:00
One more thing: given that many web-frameworks are still not available for Py3 at this time, and that there are still tons of third-party libraries missing on that platform, I would be surprised if there was any ElementTree based XML/HTML processing code written specifically and only for Py3 by now. So I cannot imagine any noticeable body of code being available that relies on this new Py3 feature.
msg100907 - (view) Author: Fredrik Lundh (effbot) * (Python committer) Date: 2010-03-12 09:00
"'None' has always been the documented default for the encoding parameter"

That's probably mostly by accident, at least in original ET, but the 1.3 draft docs at effbot.org/elementtree do spell it out explicitly for the 'write' method:

   Output encoding. If omitted or set to None, defaults to US-ASCII.

Not sure I'd consider this text binding in itself, though (even if I'd argue that it's preferred to have the same interpretation of encoding everywhere).

"writing out the Unicode serialisation will result in an incorrect XML serialisation"

I think Guido meant the ElementTree.write method; is that broken too?

The file.write(et.tostring()) issue is probably my most pressing concern here; that's a common use case (e.g. when using "iterparse" to cut pieces from a big document), and the defaults were chosen to increase the chance that this automatically does the right thing for non-ASCII even if the programmer never tests it.  In 3.X, that construct is suddenly dependent on the interpreter's default encoding.

I think I'd prefer old "tostring" behaviour and a separate "tounicode" function, and I'm still not convinced that the latter is required for the XML use case (which implies that maybe it should live in lxml.html for the HTML case, even if it ends up calling the same internal implementation).

Or should that be "tobytes" and "tounicode" to eliminate all ambiguity?
msg100915 - (view) Author: Fredrik Lundh (effbot) * (Python committer) Date: 2010-03-12 09:34
(what's the Python 3 replacement for the array module, btw?)
msg100916 - (view) Author: Stefan Behnel (scoder) * Date: 2010-03-12 09:38
"'None' has always been the documented default for the encoding parameter"

What I meant here was that "help(ET.tostring)" will show you that as the default. Also, in the docs, the signature is "tostring(tree, encoding=None)", so None is the documented default value for the argument, regardless of the internal handling.


> "writing out the Unicode serialisation will result in an incorrect
> XML serialisation"
> I think Guido meant the ElementTree.write method; is that broken too?

Yes, the feature has been implemented deep down in the _encode() helper function, so it impacts the entire serialiser, not only its API.


> I think I'd prefer old "tostring" behaviour and a separate "tounicode" function, and I'm still not convinced that the latter is required for the XML use case (which implies that maybe it should live in lxml.html for the HTML case, even if it ends up calling the same internal implementation).

I obviously agree that the use case for XML is feeble, but that alone doesn't make this a convincing argument to move it into lxml.html when the implementation will stay in lxml.etree anyway. Besides, that's pretty off-topic for this bug tracker.


> Or should that be "tobytes" and "tounicode" to eliminate all ambiguity?

That might be the clean break-all-bridges solution, but I don't think the name tostring() is so inherently broken in Py3 that it needs fixing. It's not "tostr()", for example.

I wouldn't raise much opposition against tobytes() as an alias for tostring(), although that sounds more like duplicating an otherwise simple API.

Stefan
msg100919 - (view) Author: Fredrik Lundh (effbot) * (Python committer) Date: 2010-03-12 10:14
"Yes, the feature has been implemented deep down in the _encode() helper function, so it impacts the entire serialiser, not only its API"

Ouch.

>>> import locale
>>> locale.getpreferredencoding() == "utf-8"
False
>>> from xml.etree.ElementTree import *
>>> e = Element("tag")
>>> e.text = "hellö"
>>> tostring(e)
'<tag>hellö</tag>'
>>> ElementTree(e).write("out.xml")
>>> tree = parse("out.xml")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python31\lib\xml\etree\ElementTree.py", line 843, in parse
    tree.parse(source, parser)
  File "C:\Python31\lib\xml\etree\ElementTree.py", line 581, in parse
    parser.feed(data)
  File "C:\Python31\lib\xml\etree\ElementTree.py", line 1221, in feed
    self._parser.Parse(data, 0)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 9
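
The failure does not need a particular locale to reproduce. The sketch below is an illustration rather than the code path inside ElementTree.write: it encodes the same text by hand, once as cp1252 (assumed here as the platform's preferred encoding) and once as UTF-8, and feeds the resulting bytes to the parser. It assumes Python 3.2 or later, where the parser raises ET.ParseError.

```python
import xml.etree.ElementTree as ET

text = "<tag>hell\u00f6</tag>"

# What ends up on disk if a text-mode file encodes with cp1252 ...
cp1252_bytes = text.encode("cp1252")
# ... versus what an XML parser assumes when there is no declaration.
utf8_bytes = text.encode("utf-8")

try:
    ET.fromstring(cp1252_bytes)
    cp1252_parsed = True
except ET.ParseError:
    # The lone 0xF6 byte is not valid UTF-8, so the document is rejected.
    cp1252_parsed = False

utf8_text = ET.fromstring(utf8_bytes).text
```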
msg100923 - (view) Author: Fredrik Lundh (effbot) * (Python committer) Date: 2010-03-12 10:25
"I wouldn't raise much opposition against tobytes() as an alias for tostring(), although that sounds more like duplicating an otherwise simple API."

Adding an alias would be a way to address the 2.X/3.X terminology overlap; string traditionally implies 8-bit in 2.X, and apparently now Unicode in 3.X.  That's likely to cause a lot of confusion for people switching from 2 to 3 (and to people writing 3.X documentation, apparently; the array module's documentation is an example of that).

(And once everyone has switched over, we can deprecate the tostring spelling... :)

ET isn't the only thing with tostring functionality, of course -- it's  pretty much the standard name for "serialize data structure to byte string for later transmission" -- so it probably wouldn't hurt with a python-dev pronouncement here.
msg100929 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-03-12 12:24
I plan to merge ET 1.3 in the 3.x branch tomorrow (See #6472)
Currently, the patch is consistent with 3.1 behaviour.
It could be changed later, depending on the pronouncement on this compatibility issue.


> Previously, in ElementTree, serialising without an explicit encoding
> was a way to get a byte encoded serialisation without an XML
> declaration header.

Now you can pass keyword argument "xml_declaration=False" to skip the header explicitly.


> xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 9

Now it works better.


~ $ ./python 
Python 3.2a0 (py3k:78865M, Mar 12 2010, 13:05:30) 
[GCC 4.3.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.getpreferredencoding() == "utf-8"
False
>>> from xml.etree.ElementTree import *
>>> e = Element("tag")
>>> e.text = "hellö"
>>> tostring(e)
'<tag>hellö</tag>'
>>> ElementTree(e).write("out.xml")
>>> tree = parse("out.xml")
>>> dump(tree)
<tag>hellö</tag>
msg100930 - (view) Author: Fredrik Lundh (effbot) * (Python committer) Date: 2010-03-12 12:38
Interesting.  But isn't the problem with 3.1 that it relies on the standard encoding, which results in code that may or may not work depending on a global platform setting?  Who's doing the encoding in the new version?  And what ends up in the file?
msg100931 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-03-12 12:40
>>> tree = parse("out.xml")

Actually the test in my previous message does not prove anything.
locale.getpreferredencoding() returns "UTF-8" != "utf-8".

:)
msg100932 - (view) Author: Fredrik Lundh (effbot) * (Python committer) Date: 2010-03-12 12:52
Oops :)  Yeah, that was a pretty lousy way to show what encoding I was using for that test:

>>> import locale
>>> locale.getpreferredencoding()
'cp1252'
>>>

(Somewhat related, it would be nice if Python actually normalized defaultencoding/preferredencoding to some canonical name for the codec in use, i.e. preferred MIME name or at least IANA; we had a rather nice little bug recently that wouldn't have happened if that had been the case...)
msg100936 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2010-03-12 15:32
I propose that we continue to see Fredrik as elementtree's "BDFL". If Fredrik wants the API in 3.2 to be changed to be backwards compatible with 2.x, we should do that, and damn the torpedoes (um, 3.1 compatibility).

I would do this ASAP; if you can, fix it *before* merging 1.3.

Since I hate XML equally whether it's text or bytes, please leave me out of this in the future; I apologize for having caused the problem in the first place (but note that apparently nobody cared or noticed until a week ago).
msg101050 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-03-14 10:59
Currently "tree.write(file)" outputs Unicode in 3.1 (and 3.x).
I would propose the following change:

>>> tree.write(file)
#  ==>  encode to ASCII without xml declaration (compatible 2.x)
>>> tree.write(file, encoding="utf-8")
#  ==>  encode to UTF-8 without xml declaration (compatible 2.x + 3.1)
>>> tree.write(file, encoding=False)
#  ==>  output Unicode, without xml declaration (compatible 3.1)

The "xml_declaration" keyword argument can be set to True explicitly.

For compatibility with lxml.etree, "encoding=str" returns the same as "encoding=False".

Functions tostring() and tostringlist() will inherit the same behavior.
This change could be backported to 2.7, because it is backward compatible.

See proposed patch for implementation details.
msg101052 - (view) Author: Stefan Behnel (scoder) * Date: 2010-03-14 11:28
That's a funny idea. I like that. +1
msg101427 - (view) Author: Fredrik Lundh (effbot) * (Python committer) Date: 2010-03-21 14:38
Hmm.  I'm not entirely sure about giving False a meaning when None has traditionally had a different (and documented) meaning.  And sleeping on it hasn't convinced me in either direction :-(

(well, I'd say no, but the compatibility argument is somewhat tempting)

I'm not that concerned by changing the default for write -- 3.x users with utf-8 as the default output encoding will get different output, but still perfectly valid XML.  3.x users with non-utf-8 default encodings  will get valid XML also in cases where it didn't work before.

tostring() is more problematic, but I'm leaning towards Guido's torpedoes approach there -- changing the default output to bytestrings is more likely to cause code to blow up than cause bad output, and you can trivially make your program backwards compatible by adding an extra check/decode after the call.  Supporting unicode for lxml.etree compatibility is fine with me, but I think it might make sense to support the string "unicode" as well (as a pseudo-encoding -- it's pretty clear to me that nobody will ever define a real character encoding with that name :-).
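
The "extra check/decode after the call" could look roughly like the sketch below. `tostring_text` is a hypothetical wrapper name, and decoding as ASCII assumes the byte output uses the default US-ASCII serialisation discussed in this thread.

```python
from xml.etree.ElementTree import Element, tostring


def tostring_text(element):
    """Return the serialised element as text, whichever type tostring() gives."""
    data = tostring(element)
    if isinstance(data, bytes):
        # Default byte output is US-ASCII, with other characters escaped
        # as numeric character references, so this decode cannot fail.
        data = data.decode("ascii")
    return data


element = Element("tag")
element.text = "hellö"
text = tostring_text(element)
```

Code written this way keeps working both where tostring() returns text by default and where it returns bytes.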

Have you posted/can you post the patch to Rietveld, btw?  I have some questions about the code that are independent of the encoding decision.
msg101487 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-03-22 08:53
http://codereview.appspot.com/664043 (patch against 3.x)

IIUC, the changes proposed (for 3.2) are:
 - default encoding or bool(encoding) == False
   ==> fallback to 'US-ASCII' encoding (instead of Unicode)
 - encoding=str or encoding='unicode'
   ==> serialize to Unicode

And it changes the behavior of:
 - ET.write()
 - tostring()
 - tostringlist()

For 2.x we could add the options for Unicode output:
 - encoding=unicode
 - and encoding='unicode'
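
To make the 3.2 rules listed above concrete, here is a small sketch of how the `encoding` argument might be normalised. It only illustrates the rules as summarised in this message and is not taken from the patch; `resolve_encoding` and its return convention are invented for the example.

```python
def resolve_encoding(encoding=None):
    """Map the user's `encoding` argument to (codec_name, wants_text).

    Hypothetical helper illustrating the proposed rules only.
    """
    # encoding=str or encoding='unicode'  ==> serialize to Unicode text
    if encoding is str or (
        isinstance(encoding, str) and encoding.lower() == "unicode"
    ):
        return None, True
    # default or any false value  ==> fall back to US-ASCII bytes
    if not encoding:
        return "us-ascii", False
    # anything else is treated as a real codec name
    return encoding, False
```

The text check comes first because the `str` type itself is truthy, so it would otherwise be mistaken for a codec name.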
msg101488 - (view) Author: Stefan Behnel (scoder) * Date: 2010-03-22 09:09
> Supporting unicode for lxml.etree compatibility is fine with me, but I
> think it might make sense to support the string "unicode" as well (as
> a pseudo-encoding -- it's pretty clear to me that nobody will ever
> define a real character encoding with that name :-).

The reason I chose the unicode type over a 'unicode' string name at the time was that I wanted to make a clear distinction to show that this is not just selecting a different codec but that it changes the output type.

I don't really care either way, though, given that this reads a lot less well in Py3. If ET supports both, lxml will follow.

Stefan
msg101490 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-03-22 09:36
Stefan Behnel wrote:
> 
> Stefan Behnel <scoder@users.sourceforge.net> added the comment:
> 
>> Supporting unicode for lxml.etree compatibility is fine with me, but I
>> think it might make sense to support the string "unicode" as well (as
>> a pseudo-encoding -- it's pretty clear to me that nobody will ever
>> define a real character encoding with that name :-).
> 
> The reason I chose the unicode type over a 'unicode' string name at the time was that I wanted to make a clear distinction to show that this is not just selecting a different codec but that it changes the output type.
> 
> I don't really care either way, though, given that this reads a lot less well in Py3. If ET supports both, lxml will follow.

There's always the possibility of adding a new official codec
called 'unicode' which converts Unicode to Unicode as no-op.

This may also be useful to have in other situations where you
want to signal a special case for Unicode input or output.
msg112165 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-07-31 16:55
Patch updated here, and on Rietveld too.
http://codereview.appspot.com/664043

Rules (as discussed):
 - tree.tostring(encoding=None)  => encodes to "US-ASCII"
   (compatible with 2.7 and lxml.etree)
 - tree.tostring(encoding="unicode") => outputs Unicode
 - tree.tostring(encoding=str) => outputs Unicode
   (compatible with lxml.etree)
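
For reference, the first two rules can be exercised as below, assuming an interpreter where this change has landed (3.2 or later) and using the module-level tostring() function. The `encoding=str` spelling is left out because this sketch does not assume the standard library accepts it.

```python
from xml.etree.ElementTree import Element, tostring

element = Element("tag")
element.text = "hellö"

# No encoding given: bytes in US-ASCII, non-ASCII characters escaped.
as_bytes = tostring(element)

# The "unicode" pseudo-encoding: a text string, no encoding step applied.
as_text = tostring(element, encoding="unicode")
```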

For 2.7, no change planned.
For 3.1, do we keep the current behavior?
  - tree.tostring(encoding=None)  => outputs Unicode
msg113296 - (view) Author: Stefan Behnel (scoder) * Date: 2010-08-08 18:29
I would suggest fixing the tostring() behaviour also in a future 3.1.x bug fix release. After all, the current behaviour means that 3.0 and 3.1 would behave differently from any other (released or future) Python version here.
msg113307 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-08-08 20:08
Done for 3.2 with r83851.

Still open, if someone wants to propose a patch for 3.1.
msg146593 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2011-10-29 02:37
3.1 is no longer in scope for this issue.
History
Date User Action Args
2011-10-29 02:37:45  flox  set  status: open -> closed
    resolution: out of date
    messages: + msg146593
2010-08-08 22:03:17  gvanrossum  set  nosy: - gvanrossum
2010-08-08 20:08:09  flox  set  versions: - Python 3.2
    messages: + msg113307
    assignee: effbot ->
    keywords: + easy, - patch
    stage: commit review -> needs patch
2010-08-08 18:29:23  scoder  set  messages: + msg113296
2010-07-31 16:55:43  flox  set  files: + issue8047_etree_encoding_v2.diff
    nosy: lemburg, gvanrossum, effbot, georg.brandl, scoder, r.david.murray, flox
    messages: + msg112165
    components: + XML
    stage: patch review -> commit review
2010-07-31 16:44:25  flox  set  files: - issue8047_etree_encoding.diff
2010-03-22 09:36:30  lemburg  set  nosy: + lemburg
    messages: + msg101490
2010-03-22 09:09:28  scoder  set  messages: + msg101488
2010-03-22 08:53:59  flox  set  assignee: georg.brandl -> effbot
    messages: + msg101487
    stage: test needed -> patch review
2010-03-21 14:38:39  effbot  set  messages: + msg101427
2010-03-14 11:28:38  scoder  set  messages: + msg101052
2010-03-14 10:59:41  flox  set  files: + issue8047_etree_encoding.diff
    keywords: + patch
    messages: + msg101050
2010-03-12 15:32:07  gvanrossum  set  messages: + msg100936
2010-03-12 12:52:43  effbot  set  messages: + msg100932
2010-03-12 12:40:28  flox  set  messages: + msg100931
2010-03-12 12:38:57  effbot  set  messages: + msg100930
2010-03-12 12:24:30  flox  set  messages: + msg100929
2010-03-12 10:25:33  effbot  set  messages: + msg100923
2010-03-12 10:23:17  effbot  set  messages: - msg100922
2010-03-12 10:22:40  effbot  set  messages: + msg100922
2010-03-12 10:14:06  effbot  set  messages: + msg100919
2010-03-12 09:38:26  scoder  set  messages: + msg100916
2010-03-12 09:34:48  effbot  set  messages: + msg100915
2010-03-12 09:00:46  effbot  set  messages: + msg100907
2010-03-12 07:00:58  scoder  set  messages: + msg100903
2010-03-12 06:49:03  scoder  set  messages: + msg100902
2010-03-12 04:42:36  pitrou  set  nosy: - pitrou
2010-03-12 04:42:16  pitrou  set  nosy: gvanrossum, effbot, georg.brandl, pitrou, scoder, r.david.murray, flox
    messages: + msg100900
2010-03-12 00:35:32  gvanrossum  set  nosy: + gvanrossum
    messages: + msg100898
2010-03-12 00:14:23  r.david.murray  set  messages: + msg100896
2010-03-12 00:10:40  effbot  set  messages: + msg100895
2010-03-11 23:01:20  pitrou  set  messages: + msg100891
2010-03-11 22:03:34  effbot  set  messages: + msg100890
2010-03-11 21:14:39  r.david.murray  set  messages: + msg100887
2010-03-11 20:43:45  scoder  set  messages: + msg100884
2010-03-11 20:22:51  pitrou  set  messages: + msg100883
2010-03-11 19:01:10  effbot  set  messages: + msg100880
2010-03-11 16:53:27  r.david.murray  set  messages: + msg100877
2010-03-11 15:49:02  scoder  set  messages: + msg100868
2010-03-11 14:45:49  pitrou  set  messages: + msg100857
2010-03-11 12:37:14  effbot  set  messages: + msg100846
2010-03-08 15:07:44  pitrou  set  assignee: georg.brandl
    components: + Documentation
    nosy: + georg.brandl
2010-03-08 15:05:48  pitrou  set  messages: + msg100649
2010-03-08 09:19:12  flox  set  messages: + msg100634
2010-03-08 09:01:17  scoder  set  messages: + msg100633
2010-03-07 14:41:08  pitrou  set  messages: + msg100582
2010-03-07 10:56:43  scoder  set  messages: + msg100572
2010-03-06 02:57:39  pitrou  set  nosy: + pitrou
    messages: + msg100513
2010-03-03 19:10:56  flox  set  messages: + msg100350
    stage: test needed
2010-03-03 17:24:48  r.david.murray  set  messages: + msg100349
2010-03-03 14:33:42  scoder  set  messages: + msg100345
2010-03-03 13:44:40  r.david.murray  set  priority: normal
    nosy: + effbot, r.david.murray, flox
    messages: + msg100342
2010-03-03 07:15:24  scoder  create