Serialiser in ElementTree returns unicode strings in Py3k #52295

Closed
scoder opened this issue Mar 3, 2010 · 47 comments
Labels
docs (Documentation in the Doc dir), easy, stdlib (Python modules in the Lib dir), topic-XML, type-bug (An unexpected behavior, bug, or error)

Comments

@scoder
Contributor

scoder commented Mar 3, 2010

BPO 8047
Nosy @malemburg, @birkenfeld, @scoder, @bitdancer, @florentx
Files
  • issue8047_etree_encoding_v2.diff: Patch, apply to 3.x
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2011-10-29.02:37:45.468>
    created_at = <Date 2010-03-03.07:15:23.877>
    labels = ['expert-XML', 'easy', 'type-bug', 'library', 'docs']
    title = 'Serialiser in ElementTree returns unicode strings in Py3k'
    updated_at = <Date 2011-10-29.02:37:45.467>
    user = 'https://github.com/scoder'

    bugs.python.org fields:

    activity = <Date 2011-10-29.02:37:45.467>
    actor = 'flox'
    assignee = 'none'
    closed = True
    closed_date = <Date 2011-10-29.02:37:45.468>
    closer = 'flox'
    components = ['Documentation', 'Library (Lib)', 'XML']
    creation = <Date 2010-03-03.07:15:23.877>
    creator = 'scoder'
    dependencies = []
    files = ['18286']
    hgrepos = []
    issue_num = 8047
    keywords = ['easy']
    message_count = 47.0
    messages = ['100333', '100342', '100345', '100349', '100350', '100513', '100572', '100582', '100633', '100634', '100649', '100846', '100857', '100868', '100877', '100880', '100883', '100884', '100887', '100890', '100891', '100895', '100896', '100898', '100900', '100902', '100903', '100907', '100915', '100916', '100919', '100923', '100929', '100930', '100931', '100932', '100936', '101050', '101052', '101427', '101487', '101488', '101490', '112165', '113296', '113307', '146593']
    nosy_count = 6.0
    nosy_names = ['lemburg', 'effbot', 'georg.brandl', 'scoder', 'r.david.murray', 'flox']
    pr_nums = []
    priority = 'normal'
    resolution = 'out of date'
    stage = 'needs patch'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue8047'
    versions = ['Python 3.1']

    @scoder
    Contributor Author

    scoder commented Mar 3, 2010

    The xml.etree.ElementTree package in the Python 3.x standard library breaks compatibility with existing ET 1.2 code. The serialiser returns a unicode string when no encoding is passed. Previously, the serialiser was guaranteed to return a byte string. By default, the string was 7-bit ASCII compatible.

    This behavioural change breaks all code that relies on the default behaviour of ElementTree. Since there is no longer a default encoding in Python 3, unicode strings are incompatible with byte strings, which means that the result of the serialisation can no longer be written to a file, for example.

    XML is well defined as a stream of bytes. Redefining it as a unicode string *by default* is hard to understand at best.

    Finally, it would have been good to look at the other ET implementation before introducing such a change. The lxml.etree package has had support for serialising XML into a unicode string for years, and does so in a clear, safe and explicit way. It requires the user to pass the 'unicode' (Py3 'str') type as encoding parameter, e.g.

        tree.tostring(encoding=str)

    which is explicit enough to make it clear that this is different from a normal encoding.
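A minimal sketch of the bytes-versus-text distinction at issue, assuming a current CPython 3 rather than the 3.1 behaviour reported above. From 3.2 onward, `xml.etree.ElementTree.tostring()` returns bytes by default, and text output has to be requested with the special value `encoding="unicode"`:

```python
import xml.etree.ElementTree as ET

elem = ET.Element("tag")
elem.text = "hell\u00f6"  # "hellö"

# Default: an encoded byte string (US-ASCII unless another encoding is named).
as_bytes = ET.tostring(elem)

# Explicit request for an unencoded text string.
as_text = ET.tostring(elem, encoding="unicode")

print(type(as_bytes).__name__, as_bytes)
print(type(as_text).__name__, as_text)
```

The byte form can be written to a file opened in binary mode; the text form still needs an encoding step before it becomes a valid XML byte stream.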

    @scoder scoder added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Mar 3, 2010
    @bitdancer
    Member

    I'm not an ElementTree user, but that spelling (etree.tostring(encoding=str), or even etree.tostring(encoding=unicode)) strikes me as horrible. You don't encode to unicode, you *decode* to unicode. Thus the current Python3 interface works the way I'd expect: if I don't specify an encoding, I get unicode. If I do specify an encoding, I get encoded bytes. In the general case, the fact that you can no longer get away with being sloppy about what encoding a byte stream is in, the way you could in Python2, is a feature of Python3, not a bug.

    If anything, having 'tostring' return bytes is broken, given its name. But I think we fudge that by claiming it is returning a 'byte string' when given an encoding.

    That said, I'm not sure how much, if at all, my opinion counts :)

    @scoder
    Contributor Author

    scoder commented Mar 3, 2010

    I agree that the lxml API is somewhat clumsy here. I just mentioned it to show that there are already ways to do it in a backwards compatible way, so this change does two things: it breaks existing code, and it does so in a way that is incompatible with other existing implementations. That's what *I* would call horrible.

    Also, this is absolutely not a feature that is restricted to Py3, so what's the equivalent feature in the standard library of Py2 going to be, and how much code will it break for the Py2 series?

    @bitdancer
    Member

    My understanding is that backward compatibility, while nice to retain, was not considered a stopper for cleaning up interfaces in py3. Exactly how considered this change was, I have no idea, but as I said it does make sense to me. As for 2.x, what's there is what's there, as far as I can see. Florent could speak to whether or not that API is likely to change in 2.7, but I doubt it will.

    @florentx
    Mannequin

    florentx mannequin commented Mar 3, 2010

    With ET 1.3, the serializer ElementTree.write() should output bytes only. And the default encoding is still US-ASCII.

    The new behaviour is specific to the 3.x branch (since 3.0, r56841).
    Even if it is not fully backward compatible, I don't find this behavior shocking: it is a rule of Python 3 to avoid implicit encoding/decoding.

    @pitrou
    Member

    pitrou commented Mar 6, 2010

    I don't know what compatibility you are talking about. Py3k deliberately breaks compatibility with many 2.x behaviours that were considered defective or suboptimal.

    @scoder
    Contributor Author

    scoder commented Mar 7, 2010

    It has been brought up several times that ET is special in the stdlib in that it is an externally maintained package. Correct me if I'm wrong, but the rules seem to be: features come outside, adaptation to Py3 can happen inside. What we are talking about here is a new feature that makes sense for both Py2 and Py3. We are not talking about a bug fix, neither is this an adaptation to Py3. It is a new feature that was added inside of the standard library and that is not compatible with the external libraries that are supposed to implement the same interface, namely, ElementTree and lxml.etree.

    @pitrou
    Member

    pitrou commented Mar 7, 2010

    As Florent said, it is a rule of py3k to avoid implicit encoding/decoding. The fact that it could have made sense for 2.x as well is not relevant, since the change was only done in py3k (and for good reason: we normally try not to break compatibility without prior notice).

    In any case, I have trouble understanding your concern here. Do you think the change is bad? Is it really that difficult to support it in lxml?

    @scoder
    Contributor Author

    scoder commented Mar 8, 2010

    Antoine, in the same comment, you say that it was not backported to Py2 in order to prevent breaking existing code, and then you ask if it's difficult to support in lxml. ;-)

    Supporting the same behaviour in lxml would either mean that it breaks existing code in Py2 (when making the API consistent), or that you can safely (and correctly) write the return value to a file in Py2, but that you can't do the same in Py3 (when adopting the change only in Py3).

    Previously, in ElementTree, serialising without an explicit encoding was a way to get a byte encoded serialisation without an XML declaration header, so I expect there to be code that depends on this. Since ElementTree 1.3 uses the same keyword argument as lxml for this feature, I assume that Florent's patches provide at least an alternative here, even if it requires users to adapt their code.

    I just wish this backwards incompatible feature had been advertised at the time, or at least *documented* in any way. Even the latest 3.2-dev docs still state that the default encoding of the serialiser is US-ASCII, not a word about *ever* returning a unicode string, especially not by default, and totally not the required big fat warning that writing to a file will fail with mysterious errors if no encoding is specified.

    @florentx
    Mannequin

    florentx mannequin commented Mar 8, 2010

    With ET 1.3, you should have an explicit keyword argument "xml_declaration":

        # ----
        if xml_declaration or (xml_declaration is None and
                               encoding not in ("utf-8", "us-ascii")):
            if method == "xml":
                write("<?xml version='1.0' encoding='%s'?>\n" % encoding)
        # ----

    In ET 1.2.6, the same snippet looks like:

        # ----
        if encoding != "utf-8" and encoding != "us-ascii":
            file.write("<?xml version='1.0' encoding='%s'?>\n" % encoding)
        # ----
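The effect of the `xml_declaration` keyword can be sketched against the standard library as shipped in current CPython 3 (an assumption; the snippets above are from the ET 1.3 and 1.2.6 sources). `serialise` is a hypothetical helper introduced only for this illustration:

```python
import io
import xml.etree.ElementTree as ET


def serialise(root, **kwargs):
    """Write the tree to an in-memory binary file and return the bytes."""
    buf = io.BytesIO()
    ET.ElementTree(root).write(buf, **kwargs)
    return buf.getvalue()


root = ET.Element("tag")
root.text = "hello"

# UTF-8 with xml_declaration left at None: no header is written.
no_header = serialise(root, encoding="utf-8")

# The header can be forced explicitly.
forced_header = serialise(root, encoding="utf-8", xml_declaration=True)

# Any encoding other than UTF-8 or US-ASCII gets a header by default.
latin1_header = serialise(root, encoding="iso-8859-1")

print(no_header)
print(forced_header)
print(latin1_header)
```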

    @pitrou
    Member

    pitrou commented Mar 8, 2010

    On Mon, 08 Mar 2010 09:01:19 +0000,
    Stefan Behnel <report@bugs.python.org> wrote:

    > Antoine, in the same comment, you say that it was not backported to
    > Py2 in order to prevent breaking existing code, and then you ask if
    > it's difficult to support in lxml. ;-)

    I meant breaking existing *user* code. Besides, the fact that
    compatibility is broken doesn't mean third-party code is difficult to fix;
    hence my question.

    > Supporting the same behaviour in lxml would either mean that it
    > breaks existing code in Py2 (when making the API consistent), or that
    > you can safely (and correctly) write the return value to a file in
    > Py2, but that you can't do the same in Py3 (when adopting the change
    > only in Py3).

    Sorry, I don't understand this. Are you saying it's impossible
    for you to define two different behaviours based on the current Python
    version? What's bad with
    """if sys.version_info >= (3, 0, 0): # blah"""?

    > Previously, in ElementTree, serialising without an explicit encoding
    > was a way to get a byte encoded serialisation without an XML
    > declaration header, so I expect there to be code that depends on
    > this.

    This doesn't seem to be documented. The doc simply says
    """encoding is the output encoding (default is US-ASCII)""".

    In other words, undocumented (and untested) behaviour has been "broken"
    when porting to 3.0, which is the version which deliberately broke
    compatibility for documented things. I guess we can live with it ;)

    > Even the latest
    > 3.2-dev docs still state that the default encoding of the serialiser
    > is US-ASCII, not a word about *ever* returning a unicode string,
    > especially not by default, and totally not the required big fat
    > warning that writing to a file will fail with mysterious errors if no
    > encoding is specified.

    Ok, perhaps some documentation changes are in order :-)
    (I wonder why the default was US-ASCII, though. Sounds a bit braindead)

    @pitrou pitrou added the docs Documentation in the Doc dir label Mar 8, 2010
    @effbot
    Mannequin

    effbot mannequin commented Mar 11, 2010

    The "no header" thing is very much done on purpose, and it's documented in the upstream ElementTree documentation.

    I suggest dropping this "Python 3 exists in its own universe" nonsense; it's not very professional, and it's hurting Python, its users, and all third party developers. The "things I don't understand are braindead" stuff is less of a problem; that only hurts yourself.

    @pitrou
    Member

    pitrou commented Mar 11, 2010

    > The "no header" thing is very much done on purpose, and it's
    > documented in the upstream ElementTree documentation.

    I'm sorry, where is that?
    I can't find it either at
    http://effbot.org/zone/pythondoc-elementtree-ElementTree.htm#elementtree.ElementTree.tostring-function
    or
    http://effbot.org/zone/pythondoc-elementtree-ElementTree.htm#elementtree.ElementTree.ElementTree.write-method

    > I suggest dropping this "Python 3 exists in its own universe"
    > nonsense; it's not very professional, and it's hurting Python, its
    > users, and all third party developers.

    Ha. There has been a very long temporal window (until 3.1, probably)
    during which things were very much in flux and anyone with a
    professional knowledge of elementtree and XML APIs could chime in and
    point out any nonsense in py3k.

    Now Python 3.1 is out and as a result py3k also has to ensure upwards
    compatibility for its own APIs. Of course we can still make exceptions
    if the alleged breakage is truly major. To me, it doesn't /seem/ to be
    the case here.

    @scoder
    Contributor Author

    scoder commented Mar 11, 2010

    Sorry, Antoine, but you can't possibly mean what you say here. The culprit in question is clearly one of the best hidden features of the new Py3 ET API. The only existing reference to it that I can find is the SVN commit comment when it was applied. How is that supposed to be any reason for keeping up "backwards compatibility" within the Py3 series?

    @bitdancer
    Member

    I suspect that what Antoine is referring to is the fact that Python 3.1 has this behavior. Whether or not it is explicitly documented is a secondary issue.

    We're having a similar issue in the unittest package, where there's a new function, assertSameElements, that has an unfortunate and poorly documented API. But changing that API now that the function exists in a released version (3.1) is not something to be done lightly, if it is done at all.

    This is definitely an unfortunate state of affairs no matter how you look at it.

    @effbot
    Mannequin

    effbot mannequin commented Mar 11, 2010

    > if I don't specify an encoding, I get unicode. If I do specify an encoding, I get encoded bytes.

    You're confusing the XML document encoding with character set encoding.

    A serialized (unparsed) XML document is a byte stream, not a string of Unicode characters. And the character set encoding is both embedded in that byte stream and affects how it's generated in more than one way; you cannot just recode XML documents willy-nilly and expect things to work.

    A parsed XML document (an infoset) -- for ET, that's the tree of Element objects -- does indeed contain Unicode strings, but the transformation from the byte stream to the Unicode string doesn't just involve character set decoding; there are several other constructs that are handled by the XML parser.
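The claim that the encoding shapes the byte stream itself, and that naive recoding breaks documents, can be illustrated with a small sketch. It assumes a current CPython 3, where `tostring()` returns bytes when given an encoding name:

```python
import xml.etree.ElementTree as ET

elem = ET.Element("tag")
elem.text = "hell\u00f6"  # "hellö"

# US-ASCII cannot represent the character, so it becomes a character reference.
ascii_doc = ET.tostring(elem, encoding="us-ascii")

# ISO-8859-1 stores it as one raw byte and announces itself in the declaration.
latin1_doc = ET.tostring(elem, encoding="iso-8859-1")

# Naive recoding: the bytes are now UTF-8, but the declaration still says
# iso-8859-1, so the parser decodes the text incorrectly.
recoded = latin1_doc.decode("iso-8859-1").encode("utf-8")
mojibake = ET.fromstring(recoded).text

print(ascii_doc)
print(latin1_doc)
print(repr(mojibake))
```

The recoded document still parses, which is what makes this kind of damage easy to miss.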

    > Ha. There has been a very long temporal window

    You should have had plenty of time to fix it, then, right?

    @pitrou
    Member

    pitrou commented Mar 11, 2010

    >> Ha. There has been a very long temporal window
    >
    > You should have had plenty of time to fix it, then, right?

    Under the condition that someone would have actually reported it, yes.
    We don't magically fix bugs if nobody (including us) detects and reports
    them.

    @scoder
    Contributor Author

    scoder commented Mar 11, 2010

    Then I would call that a clear sign that no-one actually stumbled over this feature in Py3 before I did, well hidden as it was. Still time to fix it.

    @bitdancer
    Member

    You may well be correct. But just because no one reported a bug does not mean that no one is using the API. The person using it may find it perfectly logical (and may be writing py3 only code, not porting py2 code).

    However, regardless of whether we decide it is acceptable to change the behavior, it seems to me that having an interface named 'tostring' that returns bytes by default in Python3 would be a broken API. I don't see any way around that terminology problem.

    @effbot
    Mannequin

    effbot mannequin commented Mar 11, 2010

    >>> import array
    >>> array.array("i", [1, 2, 3]).tostring()
    b'\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00'
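For reference, later Python 3 releases resolved this particular naming clash in the array module: `tobytes()` and `frombytes()` were added in 3.2, and the `tostring()` spelling was removed in 3.9. A short sketch, assuming Python 3.2 or later:

```python
import array

values = array.array("i", [1, 2, 3])

# Raw machine representation; the exact bytes depend on the platform's
# int size and byte order.
raw = values.tobytes()

# Round-trip back into an array of the same type code.
restored = array.array("i")
restored.frombytes(raw)

print(type(raw).__name__, len(raw), restored)
```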

    @pitrou
    Member

    pitrou commented Mar 11, 2010

    On Thu, 11 Mar 2010 22:03:37 +0000,
    Fredrik Lundh <report@bugs.python.org> wrote:

    > >>> import array
    > >>> array.array("i", [1, 2, 3]).tostring()
    > b'\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00'

    The fact that array is old, rusty and slightly broken doesn't mean we
    should propagate that brokenness to other Python modules.
    Also, as David said, the fact that you think there is a bug
    here doesn't mean everyone would agree.
    Finally, the behaviour you seem to be looking for could be added
    as a separate API or an optional method argument. Patches welcome.

    @effbot
    Mannequin

    effbot mannequin commented Mar 12, 2010

    So now it's the domain experts against some hypothetical people that might exist? Tricky.

    @bitdancer
    Member

    Well, Benjamin pointed out to me that it would be a bad thing if array.tostring produced a string. True, the method is named wrong, but it is less broken than returning a string. I suspect that that is the same argument Fredrik is making: that returning the XML as a byte string is less broken than returning it as a string when it in fact may contain other encoded stuff. The email package has some of the same problems, and there we are retooling the API to deal with this.

    Presumably ET needs to have a retooled API for Python3 as well. Then the question becomes what do we do in the meantime? For email, we are just living with the breakage until we can get something better in place, because no one has come up with any good short term solutions for email.

    @gvanrossum
    Member

    Hey, can we all try to get along?

    For anyone who didn't follow the link to r56841, that was mine (though Christian Heimes provided the basis for much of the patch apart from elementtree), and I wrote at the time:

    """I had to fix a few tests and modules beyond what Christian did, and
    invent a few conventions. E.g. in elementtree, I chose to
    write/return Unicode strings when no encoding is given, but bytes when
    an explicit encoding is given."""

    I am not a user of elementtree, so this may well have been a mistake -- at the time (in 2007) we were so busy making zillions of tests pass that some mistakes were made. Some of those were caught in time, others apparently not.

    My thinking was that since an XML document looks like text, it should probably be considered text, at least by default. (There may have been some unittests that appeared to require this -- of course this was probably just the confusion between byte strings and 8-bit text strings inherent in Python 2.)

    Regarding backwards compatibility, there are now two backwards compatibility problems: with 2.x, and with 3.1. It seems we cannot easily be backwards compatible with both (though if someone figures out a way that would be best of course).

    If I were to propose an API for returning a Unicode string, I would probably add a new method (e.g. tounicode()) rather than using a "magical" argument (tostring(encoding=str)), but given that that exists in another supposedly-compatible implementation I'm not against it. Maybe tostring(encoding=None) could also be made to work? That would at least make it *possible* to write code that receives a text object and that works in 3.1 and 3.2 both. In 2.x I think neither of these should work, and there probably isn't a need -- apps needing full compatibility will just have to refrain from calling tostring() without arguments.

    ISTM that the behavior of write() is just fine -- the contents of the file will be correct after all.
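The "separate method" idea can be sketched as a pair of thin wrappers. Both `tobytes` and `tounicode` are hypothetical helper names, not part of `xml.etree.ElementTree`, and the sketch assumes Python 3.2 or later, where `encoding="unicode"` is the supported way to ask for text:

```python
import xml.etree.ElementTree as ET


def tobytes(elem, encoding="us-ascii"):
    """Hypothetical helper: always return an encoded byte string."""
    return ET.tostring(elem, encoding=encoding)


def tounicode(elem):
    """Hypothetical helper: always return unencoded text (str)."""
    return ET.tostring(elem, encoding="unicode")


elem = ET.Element("tag")
elem.text = "hell\u00f6"  # "hellö"

data = tobytes(elem, encoding="utf-8")
text = tounicode(elem)

print(data)
print(text)
```

With two names, the return type no longer depends on whether an argument was passed, which is the ambiguity the thread is arguing about.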

    @pitrou
    Member

    pitrou commented Mar 12, 2010

    Not wanting to waste my time anymore on this.

    @scoder
    Contributor Author

    scoder commented Mar 12, 2010

    Hi Guido, your comment was long overdue in this discussion.

    Guido van Rossum, 12.03.2010 01:35:

    > My thinking was that since an XML document looks like text, it should
    > probably be considered text, at least by default. (There may have
    > been some unittests that appeared to require this -- of course this
    > was probably just the confusion between byte strings and 8-bit text
    > strings inherent in Python 2.)

    Well, well, XML...

    It does look like text, but it's encoded text that is defined as a stream of bytes, and that's the only safe way of dealing with it.

    There certainly *is* a use case for treating the serialised result as text, that's why lxml has this feature. A minor one is for debug output (which certainly doesn't merit being the default), but another one is when dealing with HTML, where encoding information is certainly less well defined and *much* less often seen in the wild. So users tend to be happy when they get their real-world HTML input fixed up into proper Unicode, still happier when they see that lxml can parse that correctly and even serialise the result back into a Unicode string directly, that they can post-process as text if they need to.

    However, the main part here is the input, i.e. getting HTML data properly decoded into Unicode. The output part is a lot less important, and it's often easier to let lxml.html do the correct serialisation into bytes with proper encoding meta information, rather than dealing with it yourself.

    Those are the two use cases I see for lxml. Their impact on ElementTree is relatively low as it doesn't support *parsing* from a Unicode string, so the most important HTML feature isn't there in the first place. The lack of major use cases in ElementTree is one of the reasons I'm so opposed to making this feature the backwards incompatible default for the output side.

    > Regarding backwards compatibility, there are now two backwards
    > compatibility problems: with 2.x, and with 3.1. It seems we cannot
    > easily be backwards compatible with both (though if someone figures
    > out a way that would be best of course).
    >
    > If I were to propose an API for returning a Unicode string, I would
    > probably add a new method (e.g. tounicode()) rather than using a
    > "magical" argument (tostring(encoding=str)), but given that that
    > exists in another supposedly-compatible implementation I'm not
    > against it.

    Actually, lxml.etree originally had a tounicode() function for this purpose, and I deprecated it in favour of tostring(encoding=unicode) to avoid having a separate interface for this, while staying just as explicit as before. I'm aware that this wasn't an all-win decision, but I found passing the unicode type to be explicit enough, and separate enough from an encoding /name/ to make it clear what happens. It's certainly less beautiful in Py3, where you write "tostring(encoding=str)".

    I still didn't remove the function from the API, but it's been deprecated for years. Reactivating it in lxml.etree, and duplicating it in ET, would save lxml.etree from having to break user code (as "tostring(encoding=str)" could simply continue to work, but disappear from the docs). It wouldn't save ET-Py3 from breaking backwards compatibility with itself, though.

    > Maybe tostring(encoding=None) could also be made to work? That would
    > at least make it *possible* to write code that receives a text object
    > and that works in 3.1 and 3.2 both. In 2.x I think neither of these
    > should work, and there probably isn't a need -- apps needing full
    > compatibility will just have to refrain from calling tostring()
    > without arguments.

    It could be made to work, and it doesn't even read that bad. I can't imagine anyone using this explicitly to get the default behaviour, although you never know how people put together their keyword argument dicts programmatically. 'None' has always been the documented default for the encoding parameter, so I'm sure there's at least a tiny bit of code that uses it to say "I'm not overriding the default here".

    Actually, the encoding has been a keyword-only parameter in lxml.etree for ages, which was ok with the original default and conforms to the official ET documentation. So it would be easy to switch here, although not beautiful in the implementation. Same for ElementTree, where the current default None in the signature could simply be replaced by the 'real' default 'us-ascii'. Within the Py3 series, this change would not keep up backwards compatibility either.

    So, as a solution, I do prefer separating this feature out into a separate function, so that we can simplify the interface of tostring() into always returning a byte string serialisation, as it always was in ET. The rather distinct use case of serialising to an unencoded text string can well be handled by a tounicode() function.

    > ISTM that the behavior of write() is just fine -- the contents of the
    > file will be correct after all.

    Not according to the Py3.2 dev docs of open():

    """
    'encoding' is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent (whatever locale.getpreferredencoding() returns)
    """

    So if a user's "preferred encoding" is not UTF-8 compatible, then writing out the Unicode serialisation will result in an incorrect XML serialisation, as an XML byte stream without encoding declaration is assumed to be in UTF-8 by specification.
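The failure mode described here can be reproduced without touching the locale. In this sketch, assuming a current CPython 3, the codec name `"cp1252"` merely stands in for a non-UTF-8 result of `locale.getpreferredencoding()`:

```python
import xml.etree.ElementTree as ET

elem = ET.Element("tag")
elem.text = "hell\u00f6"  # "hellö"

# Text serialisation: a str with no XML declaration.
text = ET.tostring(elem, encoding="unicode")

# Assumption: this is what a text-mode file would store on a platform
# whose preferred encoding is cp1252.
on_disk = text.encode("cp1252")

# Without a declaration the parser assumes UTF-8, and the lone 0xF6 byte
# is not valid UTF-8.
try:
    ET.fromstring(on_disk)
    parsed_ok = True
except ET.ParseError as exc:
    parsed_ok = False
    print("parse failed:", exc)
```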

    Stefan

    @scoder
    Contributor Author

    scoder commented Mar 12, 2010

    One more thing: given that many web-frameworks are still not available for Py3 at this time, and that there are still tons of third-party libraries missing on that platform, I would be surprised if there was any ElementTree based XML/HTML processing code written specifically and only for Py3 by now. So I cannot imagine any noticeable body of code being available that relies on this new Py3 feature.

    @effbot
    Mannequin

    effbot mannequin commented Mar 12, 2010

    "'None' has always been the documented default for the encoding parameter"

    That's probably mostly by accident at least in original ET, but the 1.3 draft docs at effbot.org/elementtree do spell it out explicitly for the 'write' method:

    Output encoding. If omitted or set to None, defaults to US-ASCII.

    Not sure I'd consider this text binding in itself, though (even if I'd argue that it's preferred to have the same interpretation of encoding everywhere).

    "writing out the Unicode serialisation will result in an incorrect XML serialisation"

    I think Guido meant the ElementTree.write method; is that broken too?

    The file.write(et.tostring()) issue is probably my most pressing concern here; that's a common use case (e.g. when using "iterparse" to cut pieces from a big document), and the defaults were chosen to increase the chance that this automatically does the right thing for non-ASCII even if the programmer never tests it. In 3.X, that construct is suddenly dependent on the interpreter's default encoding.

    I think I'd prefer old "tostring" behaviour and a separate "tounicode" function, and I'm still not convinced that the latter is required for the XML use case (which implies that maybe it should live in lxml.html for the HTML case, even if it ends up calling the same internal implementation).

    Or should that be "tobytes" and "tounicode" to eliminate all ambiguity?
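The `file.write(et.tostring())` use case with `iterparse` might look like the following sketch, assuming a current CPython 3 where `tostring()` returns US-ASCII bytes by default. The in-memory `BytesIO` objects stand in for real files opened in binary mode, and the element names are made up for the example:

```python
import io
import xml.etree.ElementTree as ET

# Stand-in for a large input document on disk.
source = io.BytesIO(
    "<items><item>caf\u00e9</item><item>na\u00efve</item></items>".encode("utf-8")
)

# Stand-in for an output file opened with open(path, "wb").
out = io.BytesIO()

for event, elem in ET.iterparse(source, events=("end",)):
    if elem.tag == "item":
        # Bytes output: non-ASCII characters become character references,
        # so the pieces are valid regardless of any platform default.
        out.write(ET.tostring(elem))
        out.write(b"\n")
        elem.clear()  # release the text already written

pieces = out.getvalue()
print(pieces.decode("ascii"))
```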

    @effbot
    Mannequin

    effbot mannequin commented Mar 12, 2010

    (what's the Python 3 replacement for the array module, btw?)

    @scoder
    Contributor Author

    scoder commented Mar 12, 2010

    "'None' has always been the documented default for the encoding parameter"

    What I meant here was that "help(ET.tostring)" will show you that as the default. Also, in the docs, the signature is "tostring(tree, encoding=None)", so None is the documented default value for the argument, regardless of the internal handling.

    "writing out the Unicode serialisation will result in an incorrect
    XML serialisation"
    > I think Guido meant the ElementTree.write method; is that broken too?

    Yes, the feature has been implemented deep down in the _encode() helper function, so it impacts the entire serialiser, not only its API.

    > I think I'd prefer old "tostring" behaviour and a separate "tounicode" function, and I'm still not convinced that the latter is required for the XML use case (which implies that maybe it should live in lxml.html for the HTML case, even if it ends up calling the same internal implementation).

    I obviously agree that the use case for XML is feeble, but that alone doesn't make this a convincing argument to move it into lxml.html when the implementation will stay in lxml.etree anyway. Besides, that's pretty off-topic for this bug tracker.

    > Or should that be "tobytes" and "tounicode" to eliminate all ambiguity?

    That might be the clean break-all-bridges solution, but I don't think the name tostring() is so inherently broken in Py3 that it needs fixing. It's not "tostr()", for example.

    I wouldn't raise much opposition against tobytes() as an alias for tostring(), although that sounds more like duplicating an otherwise simple API.

    Stefan

    @effbot
    Mannequin

    effbot mannequin commented Mar 12, 2010

    "Yes, the feature has been implemented deep down in the _encode() helper function, so it impacts the entire serialiser, not only its API"

    Ouch.

    >>> import locale
    >>> locale.getpreferredencoding() == "utf-8"
    False
    >>> from xml.etree.ElementTree import *
    >>> e = Element("tag")
    >>> e.text = "hellö"
    >>> tostring(e)
    '<tag>hellö</tag>'
    >>> ElementTree(e).write("out.xml")
    >>> tree = parse("out.xml")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\Python31\lib\xml\etree\ElementTree.py", line 843, in parse
        tree.parse(source, parser)
      File "C:\Python31\lib\xml\etree\ElementTree.py", line 581, in parse
        parser.feed(data)
      File "C:\Python31\lib\xml\etree\ElementTree.py", line 1221, in feed
        self._parser.Parse(data, 0)
    xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 9

    @effbot
    Mannequin

    effbot mannequin commented Mar 12, 2010

    "I wouldn't raise much opposition against tobytes() as an alias for tostring(), although that sounds more like duplicating an otherwise simple API."

    Adding an alias would be a way to address the 2.X/3.X terminology overlap; string traditionally implies 8-bit in 2.X, and apparently now Unicode in 3.X. That's likely to cause a lot of confusion for people switching from 2 to 3 (and to people writing 3.X documentation, apparently; the array module's documentation is an example of that).

    (And once everyone has switched over, we can deprecate the tostring spelling... :)

    ET isn't the only thing with tostring functionality, of course -- it's pretty much the standard name for "serialize data structure to byte string for later transmission" -- so it probably wouldn't hurt with a python-dev pronouncement here.
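
To make the alias idea concrete, here is a minimal sketch. `tobytes` is a hypothetical name taken from this discussion, not an existing `xml.etree.ElementTree` function. The sketch assumes a Python 3 release in which `tostring()` returns bytes unless `encoding="unicode"` is requested.

```python
import xml.etree.ElementTree as ET


def tobytes(element, encoding="us-ascii"):
    """Hypothetical alias for tostring() that promises a bytes result."""
    data = ET.tostring(element, encoding=encoding)
    if not isinstance(data, bytes):
        # encoding="unicode" would yield str, which contradicts the name.
        raise TypeError("tobytes() requires a real byte encoding")
    return data


elem = ET.Element("tag")
elem.text = "hell\u00f6"
print(tobytes(elem))
```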

    @florentx
    Mannequin

    florentx mannequin commented Mar 12, 2010

    I plan to merge ET 1.3 in the 3.x branch tomorrow (See bpo-6472)
    Currently, the patch is consistent with 3.1 behaviour.
    It could be changed later, depending on the pronouncement on this compatibility issue.

    Previously, in ElementTree, serialising without an explicit encoding
    was a way to get a byte encoded serialisation without an XML
    declaration header.

    Now you can pass the keyword argument "xml_declaration=False" to skip the header explicitly.
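
A short sketch of that keyword, assuming a Python 3 release with the ET 1.3 `write()` signature. The `serialise` helper is hypothetical, and an in-memory buffer stands in for a real file:

```python
import io
import xml.etree.ElementTree as ET


def serialise(elem, encoding, xml_declaration):
    """Serialise to bytes through ElementTree.write() using an in-memory file."""
    buf = io.BytesIO()
    ET.ElementTree(elem).write(buf, encoding=encoding,
                               xml_declaration=xml_declaration)
    return buf.getvalue()


elem = ET.Element("tag")
elem.text = "hell\u00f6"

with_decl = serialise(elem, "iso-8859-1", True)      # header requested
without_decl = serialise(elem, "iso-8859-1", False)  # header suppressed
print(with_decl)
print(without_decl)
```

Without the header, a consumer has to know the encoding by other means, so suppressing it is safest for ASCII or UTF-8 output.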

    xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 9

    Now it works better.

    ~ $ ./python 
    Python 3.2a0 (py3k:78865M, Mar 12 2010, 13:05:30) 
    [GCC 4.3.4] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import locale
    >>> locale.getpreferredencoding() == "utf-8"
    False
    >>> from xml.etree.ElementTree import *
    >>> e = Element("tag")
    >>> e.text = "hellö"
    >>> tostring(e)
    '<tag>hellö</tag>'
    >>> ElementTree(e).write("out.xml")
    >>> tree = parse("out.xml")
    >>> dump(tree)
    <tag>hellö</tag>

    @effbot
    Mannequin

    effbot mannequin commented Mar 12, 2010

    Interesting. But isn't the problem with 3.1 that it relies on the standard encoding, which results in code that may or may not work depending on a global platform setting? Who's doing the encoding in the new version? And what ends up in the file?

    @florentx
    Mannequin

    florentx mannequin commented Mar 12, 2010

    >> tree = parse("out.xml")

    Actually the test in my previous message does not prove anything.
    locale.getpreferredencoding() returns "UTF-8" != "utf-8".

    :)

    @effbot
    Mannequin

    effbot mannequin commented Mar 12, 2010

    Oops :) Yeah, that was a pretty lousy way to show what encoding I was using for that test:

    >>> import locale
    >>> locale.getpreferredencoding()
    'cp1252'
    >>>

    (Somewhat related, it would be nice if Python actually normalized defaultencoding/preferredencoding to some canonical name for the codec in use, i.e. preferred MIME name or at least IANA; we had a rather nice little bug recently that wouldn't have happened if that had been the case...)
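
The case-sensitivity pitfall above can be avoided by comparing codec names after lookup rather than comparing the raw strings. A minimal sketch using `codecs.lookup()`; note that this normalizes to Python's own codec name, which is not necessarily the MIME or IANA name wished for above, and `same_encoding` is a hypothetical helper:

```python
import codecs


def same_encoding(a, b):
    """Return True if two encoding names resolve to the same Python codec."""
    # codecs.lookup() raises LookupError for names Python does not know.
    return codecs.lookup(a).name == codecs.lookup(b).name


print(same_encoding("UTF-8", "utf-8"))
print(same_encoding("utf8", "UTF-8"))
print(same_encoding("cp1252", "utf-8"))
```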

    @gvanrossum
    Member

    I propose that we continue to see Fredrik as elementtree's "BDFL". If Fredrik wants the API in 3.2 to be changed to be backwards compatible with 2.x, we should do that, and damn the torpedoes (um, 3.1 compatibility).

    I would do this ASAP; if you can, fix it *before* merging 1.3.

    Since I hate XML equally whether it's text or bytes, please leave me out of this in the future; I apologize for having caused the problem in the first place (but note that apparently nobody cared or noticed until a week ago).

    @florentx
    Copy link
    Mannequin

    florentx mannequin commented Mar 14, 2010

    Currently "tree.write(file)" writes Unicode in 3.1 (and 3.x).
    I would propose the following change:

    >>> tree.write(file)
    #  ==>  encode to ASCII without xml declaration (compatible 2.x)
    >>> tree.write(file, encoding="utf-8")
    #  ==>  encode to UTF-8 without xml declaration (compatible 2.x + 3.1)
    >>> tree.write(file, encoding=False)
    #  ==>  output Unicode, without xml declaration (compatible 3.1)

    The "xml_declaration" keyword argument can be set to True explicitly.

    For compatibility with lxml.etree, "encoding=str" returns the same as "encoding=False".

    Functions tostring() and tostringlist() will inherit the same behavior.
    This change could be backported to 2.7, because it is backward compatible.

    See proposed patch for implementation details.
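
A minimal sketch of the three cases, assuming a Python 3 release that includes this change (3.2 or later). It uses the `encoding="unicode"` spelling discussed later in this thread in place of `encoding=False`, and in-memory buffers instead of real files:

```python
import io
import xml.etree.ElementTree as ET

elem = ET.Element("tag")
elem.text = "hell\u00f6"
tree = ET.ElementTree(elem)

# Default: US-ASCII bytes, non-ASCII text becomes character references.
ascii_buf = io.BytesIO()
tree.write(ascii_buf)

# Explicit byte encoding: the text is encoded directly.
utf8_buf = io.BytesIO()
tree.write(utf8_buf, encoding="utf-8")

# "unicode" pseudo-encoding: str output, so a text stream is required.
text_buf = io.StringIO()
tree.write(text_buf, encoding="unicode")

print(ascii_buf.getvalue())
print(utf8_buf.getvalue())
print(text_buf.getvalue())
```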

    @scoder
    Contributor Author

    scoder commented Mar 14, 2010

    That's a funny idea. I like that. +1

    @effbot
    Mannequin

    effbot mannequin commented Mar 21, 2010

    Hmm. I'm not entirely sure about giving False a meaning when None has traditionally had a different (and documented) meaning. And sleeping on it hasn't convinced me in either direction :-(

    (well, I'd say no, but the compatibility argument is somewhat tempting)

    I'm not that concerned by changing the default for write -- 3.x users with utf-8 as the default output encoding will get different output, but still perfectly valid XML. 3.x users with non-utf-8 default encodings will get valid XML also in cases where it didn't work before.

    tostring() is more problematic, but I'm leaning towards Guido's torpedoes approach there -- changing the default output to bytestrings is more likely to cause code to blow up than cause bad output, and you can trivially make your program backwards compatible by adding an extra check/decode after the call. Supporting unicode for lxml.etree compatibility is fine with me, but I think it might make sense to support the string "unicode" as well (as a pseudo-encoding -- it's pretty clear to me that nobody will ever define a real character encoding with that name :-).
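
The "extra check/decode after the call" could look like the following sketch. `tostring_text` is a hypothetical helper name; it returns `str` whether the installed `tostring()` hands back bytes or text:

```python
import xml.etree.ElementTree as ET


def tostring_text(element, encoding="us-ascii"):
    """Serialise an element and always return str."""
    data = ET.tostring(element, encoding=encoding)
    if isinstance(data, bytes):
        # Bytes-returning versions: decode with the encoding we asked for.
        data = data.decode(encoding)
    return data


elem = ET.Element("tag")
elem.text = "hell\u00f6"
print(tostring_text(elem))
print(tostring_text(elem, encoding="utf-8"))
```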

    Have you posted/can you post the patch to Rietveld, btw? I have some questions about the code that are independent of the encoding decision.

    @florentx
    Mannequin

    florentx mannequin commented Mar 22, 2010

    http://codereview.appspot.com/664043 (patch against 3.x)

    IIUC, the changes proposed (for 3.2) are:

    • default encoding or bool(encoding) == False
      ==> fallback to 'US-ASCII' encoding (instead of Unicode)
    • encoding=str or encoding='unicode'
      ==> serialize to Unicode

    And it changes the behavior of:

    • ET.write()
    • tostring()
    • tostringlist()

    For 2.x we could add the options for Unicode output:

    • encoding=unicode
    • and encoding='unicode'

    @florentx florentx mannequin assigned effbot and unassigned birkenfeld Mar 22, 2010
    @scoder
    Contributor Author

    scoder commented Mar 22, 2010

    > Supporting unicode for lxml.etree compatibility is fine with me, but I
    > think it might make sense to support the string "unicode" as well (as
    > a pseudo-encoding -- it's pretty clear to me that nobody will ever
    > define a real character encoding with that name :-).

    The reason I chose the unicode type over a 'unicode' string name at the time was that I wanted to make a clear distinction to show that this is not just selecting a different codec but that it changes the output type.

    I don't really care either way, though, given that this reads a lot less well in Py3. If ET supports both, lxml will follow.

    Stefan

    @malemburg
    Member

    Stefan Behnel wrote:

    Stefan Behnel <scoder@users.sourceforge.net> added the comment:

    > Supporting unicode for lxml.etree compatibility is fine with me, but I
    > think it might make sense to support the string "unicode" as well (as
    > a pseudo-encoding -- it's pretty clear to me that nobody will ever
    > define a real character encoding with that name :-).

    > The reason I chose the unicode type over a 'unicode' string name at the time was that I wanted to make a clear distinction to show that this is not just selecting a different codec but that it changes the output type.
    >
    > I don't really care either way, though, given that this reads a lot less well in Py3. If ET supports both, lxml will follow.

    There's always the possibility of adding a new official codec
    called 'unicode' which converts Unicode to Unicode as no-op.

    This may also be useful to have in other situations where you
    want to signal a special case for Unicode input or output.

    @florentx
    Mannequin

    florentx mannequin commented Jul 31, 2010

    Patch updated here, and on Rietveld too.
    http://codereview.appspot.com/664043

    Rules (as discussed):

    • tostring(elem, encoding=None) => encodes to "US-ASCII"
      (compatible with 2.7 and lxml.etree)
    • tostring(elem, encoding="unicode") => outputs Unicode
    • tostring(elem, encoding=str) => outputs Unicode
      (compatible with lxml.etree)
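
A minimal sketch of these rules as they apply to the module-level `tostring()` and `tostringlist()` functions, assuming a Python 3 release that includes the change. The `encoding=str` spelling is left out here because support for it may vary between releases:

```python
import xml.etree.ElementTree as ET

elem = ET.Element("tag")
elem.text = "hell\u00f6"

# No encoding given: US-ASCII bytes with a character reference.
as_bytes = ET.tostring(elem)

# "unicode" pseudo-encoding: the result is str, not bytes.
as_text = ET.tostring(elem, encoding="unicode")

# tostringlist() follows the same rule and returns the pieces as a list.
chunks = ET.tostringlist(elem, encoding="unicode")

print(type(as_bytes).__name__, as_bytes)
print(type(as_text).__name__, as_text)
print("".join(chunks) == as_text)
```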

    For 2.7, no change planned.
    For 3.1, do we keep the current behavior?

    • tostring(elem, encoding=None) => outputs Unicode

    @florentx florentx mannequin added the topic-XML label Jul 31, 2010
    @scoder
    Contributor Author

    scoder commented Aug 8, 2010

    I would suggest fixing the tostring() behaviour also in a future 3.1.x bug fix release. After all, the current behaviour means that 3.0 and 3.1 would behave differently from any other (released or future) Python version here.

    @florentx
    Mannequin

    florentx mannequin commented Aug 8, 2010

    Done for 3.2 with r83851.

    Still open, if someone wants to propose a patch for 3.1.

    @florentx florentx mannequin added the easy label Aug 8, 2010
    @florentx florentx mannequin unassigned effbot Aug 8, 2010
    @florentx
    Mannequin

    florentx mannequin commented Oct 29, 2011

    3.1 is no longer in scope for this issue.

    @florentx florentx mannequin closed this as completed Oct 29, 2011
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022