classification
Title: ElementTree (py3k) doesn't properly encode characters that can't be represented in the specified encoding
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.1, Python 3.2
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Neil Muller, benjamin.peterson, effbot, hodgestar, jcsalterego, loewis, pitrou, rhettinger
Priority: critical Keywords: patch

Created on 2009-06-07 21:30 by Neil Muller, last changed 2010-02-09 16:56 by pitrou. This issue is now closed.

Files
File name Uploaded Description Edit
issue6233_py3k.diff Neil Muller, 2009-06-07 21:47 Simple patch
issue6233_py3k_with_test.diff Neil Muller, 2009-06-18 13:12 Combined patch with test case
issue6233-escape_entities.diff jcsalterego, 2009-06-23 05:37 Excp handling in _encode + prev submitted test
issue6233-encode_cdata.diff jcsalterego, 2009-06-24 23:56 _encode & effbot's _escape_cdata, w/ test
Messages (16)
msg89058 - (view) Author: Neil Muller (Neil Muller) Date: 2009-06-07 21:30
In py3k, ElementTree no longer correctly converts characters to entities
when they can't be represented in the requested output encoding.

Python 2:

>>> import xml.etree.ElementTree as ET
>>> e = ET.XML("<?xml version='1.0'
encoding='iso-8859-1'?><body>t\xe3t</body>")
>>> ET.tostring(e, 'ascii')
"<?xml version='1.0' encoding='ascii'?>\n<body>t&#227;t</body>"

Python 3:

>>> import xml.etree.ElementTree as ET
>>> e = ET.XML("<?xml version='1.0'
encoding='iso-8859-1'?><body>t\xe3t</body>")
>>> ET.tostring(e, 'ascii')
.....
UnicodeEncodeError: 'ascii' codec can't encode characters in position
1-2: ordinal not in range(128)


It looks like _encode_entity isn't ever called inside ElementTree
anymore - it probably should be called as part of _encode for characters
that can't be represented.
msg89059 - (view) Author: Neil Muller (Neil Muller) Date: 2009-06-07 21:47
Simple possible patch uploaded

This doesn't give the expected answer for the test above, but does work
when starting from an XML file in utf-8 encoding. I still need to
determine why this happens.
msg89276 - (view) Author: Neil Muller (Neil Muller) Date: 2009-06-12 12:50
> This doesn't give the expected answer for the test above

Which is obviously due to not comparing apples with apples, as I should
be using a byte-string in the py3k example.

>>> import xml.etree.ElementTree as ET
>>> e = ET.XML(b"<?xml version='1.0'
encoding='iso-8859-1'?><body>t\xe3t</body>")
>>> ET.tostring(e, 'ascii')


Fails without the patch, behaves as expected with the patch.
msg89504 - (view) Author: Neil Muller (Neil Muller) Date: 2009-06-18 13:12
Updated patch - adds a test for this.
msg89505 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-06-18 13:25
This regression is probably annoying enough to make it a blocker.
msg89583 - (view) Author: Fredrik Lundh (effbot) * (Python committer) Date: 2009-06-21 21:33
Umm.  Isn't _encode used to encode tags and attribute names?  The charref 
syntax is only valid in CDATA sections and attribute values, which are 
encoded by the corresponding _escape functions.  I suspect this patch will 
make things blow up on a non-ASCII tag/attribute name.
msg89585 - (view) Author: Fredrik Lundh (effbot) * (Python committer) Date: 2009-06-21 21:42
Did you look at the 1.3 alpha code base when you came up with this idea?  
Unfortunately, 1.3's _encode is used for a different purpose...

I don't have time to test it tonight, but I suspect that 1.3's 
escape_data/escape_attrib functions might work better under 3.X; they do 
the text.replace dance first, and then an explicit text.encode(encoding, 
"xmlcharrefreplace") at the end.  E.g.

def _escape_cdata(text, encoding):
    # escape character data
    try:
        # it's worth avoiding do-nothing calls for strings that are
        # shorter than 500 character, or so.  assume that's, by far,
        # the most common case in most applications.
        if "&" in text:
            text = text.replace("&", "&amp;")
        if "<" in text:
            text = text.replace("<", "&lt;")
        if ">" in text:
            text = text.replace(">", "&gt;")
        return text.encode(encoding, "xmlcharrefreplace")
    except (TypeError, AttributeError):
        _raise_serialization_error(text)
msg89623 - (view) Author: Jerry Chen (jcsalterego) Date: 2009-06-23 05:37
The attached patch includes Neil's original additions to test_xml_etree.py.


I also noticed that _encode_entity wasn't being called in ElementTree in
py3k, with the important bit being the nested function
escape_entities(), in conjunction with _escape and _escape_map.

In 2.x, _encode_entity() is used after _encode() throws Unicode
exceptions [1], so I figured it would make sense to take the core
functionality of _escape_entities() and integrate it into _encode in the
same fashion -- when an exception is thrown.

Basically, I:
- changed _escape regexp from using "[\x0080-\uffff]" to "[\x80-xff]"
- extracted _encode_entity.escape_entities() and made it
_escape_entities of module scope
- removed _encode_entity()
- added UnicodeEncodeError exception in _encode()

I'm not sure what the expected outcome is supposed to be when the text
is not type bytes but str. With this patch, the output has
b"t&#195;&#163;t" rather than b"t&#227;t".

Hope this is a step in the right direction.

[1] ElementTree.py:814, ElementTree.py:829, python 2.7 HEAD r50941
msg89684 - (view) Author: Fredrik Lundh (effbot) * (Python committer) Date: 2009-06-24 21:51
That's backwards, unless I'm missing something here: charrefs represent 
Unicode characters, not UTF-8 byte values.  The character "LATIN SMALL 
LETTER A WITH TILDE" with the character value 227 should be represented as 
"&#227;" if serialized to an encoding that doesn't support non-ASCII 
characters.

And there's no need to use RE:s to filter things under 3.X; those parts of 
ET 1.2 are there for pre-2.0 compatibility.

Did you try running the tests with the escape function I posted?
msg89690 - (view) Author: Jerry Chen (jcsalterego) Date: 2009-06-24 23:56
Thanks for the explanation -- looks like I was way off base on that one.

I took a look at the code you provided but it doesn't work as a drop-in
replacement for _escape_cdata, since that function returns a string
rather than bytes.

However taking your code, calling it _encode_cdata and then refactoring
all calls _encode(_escape_cdata(x), encoding) to _encode_cdata(x,
encoding) seems to do the trick and passes the tests.

Specific example:

- file.write(_encode(_escape_cdata(node.text), encoding))
+ file.write(_encode_cdata(node.text, encoding))

One minor modification is to return the string as is if encoding=None,
just like _encode:

def _encode_cdata(text, encoding):
    # escape character data
    try:
        text = text.replace("&", "&amp;")
        text = text.replace("<", "&lt;")
        text = text.replace(">", "&gt;")
        if encoding:
            return text.encode(encoding, "xmlcharrefreplace")
        else:
            return text
    except (TypeError, AttributeError):
        _raise_serialization_error(text)
msg89715 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2009-06-25 21:26
effbot, do you have an opinion about the latest patch? It'd be nice to
not have to delay the release for this.
msg89718 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-06-26 01:12
I disagree with this report being classified as release-critical - it is
*not* a regression over 3.0 (i.e. 3.0 already behaved in the same way).
That it is a regression relative to 2.x should not make it
release-critical - we can still fix such regressions in 3.2.

In addition, there is an easy work-around for applications that run into
the problem - just use utf-8 as the output encoding always:

py> e = ET.XML(b"<?xml version='1.0'
encoding='iso-8859-1'?><body>t\xe3t</body>")
py> ET.tostring(e,encoding='utf-8')
b'<body>t\xc3\xa3t</body>'
msg89722 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2009-06-26 03:30
+1 for Py3.1.1
msg89728 - (view) Author: Jerry Chen (jcsalterego) Date: 2009-06-26 14:21
Either way, it would be nice to get feedback so we can iterate on the
patch or close out this issue already :-)
msg95045 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-11-08 20:43
The patch looks ok to me.
msg99125 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-02-09 16:56
Committed in r78123 (py3k) and r78124 (3.1). I've also removed _escape_cdata() since it wasn't used anymore. Thanks Jerry for the patch.
History
Date User Action Args
2010-02-09 16:56:25pitrousetstatus: open -> closed
resolution: fixed
messages: + msg99125

stage: resolved
2009-11-08 20:43:14pitrousetmessages: + msg95045
2009-06-26 14:21:19jcsalteregosetmessages: + msg89728
2009-06-26 11:13:39pitrousetpriority: release blocker -> critical
versions: + Python 3.2, - Python 3.0
2009-06-26 03:30:55rhettingersetnosy: + rhettinger
messages: + msg89722
2009-06-26 01:12:28loewissetnosy: + loewis
messages: + msg89718
2009-06-25 21:26:50benjamin.petersonsetnosy: + benjamin.peterson
messages: + msg89715
2009-06-24 23:56:04jcsalteregosetfiles: + issue6233-encode_cdata.diff

messages: + msg89690
2009-06-24 21:51:25effbotsetmessages: + msg89684
2009-06-23 05:37:25jcsalteregosetfiles: + issue6233-escape_entities.diff
nosy: + jcsalterego
messages: + msg89623

2009-06-21 21:42:03effbotsetmessages: + msg89585
2009-06-21 21:33:01effbotsetmessages: + msg89583
2009-06-18 13:25:14pitrousetpriority: release blocker
nosy: + pitrou
messages: + msg89505

2009-06-18 13:12:25Neil Mullersetfiles: + issue6233_py3k_with_test.diff

messages: + msg89504
2009-06-12 12:50:24Neil Mullersetmessages: + msg89276
2009-06-07 21:47:42Neil Mullersetfiles: + issue6233_py3k.diff
keywords: + patch
messages: + msg89059
2009-06-07 21:30:58Neil Mullercreate